A Comparative Study of Delta Parquet, Iceberg, and Hudi for Automotive Data Engineering Use Cases

International Journal of Computer Science and Engineering
© 2025 by SSRG - IJCSE Journal
Volume 12 Issue 7
Year of Publication : 2025
Authors : Dinesh Eswararaj, Ajay Babu Nellipudi, Vandana Kollati
How to Cite?
Dinesh Eswararaj, Ajay Babu Nellipudi, Vandana Kollati, "A Comparative Study of Delta Parquet, Iceberg, and Hudi for Automotive Data Engineering Use Cases," SSRG International Journal of Computer Science and Engineering, vol. 12, no. 7, pp. 31-42, 2025. Crossref, https://doi.org/10.14445/23488387/IJCSE-V12I17P104
Abstract:
The automotive industry faces growing challenges in managing and analyzing the vast volumes of data generated by vehicles, including sensor data, telemetry, diagnostics, and real-time operational insights. Unlocking value from this data requires efficient data engineering solutions that address latency, scalability, and data consistency. Modern data lakehouse formats, such as Delta Parquet, Apache Iceberg, and Apache Hudi, have emerged as promising solutions to these challenges, offering features such as ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema enforcement, and real-time ingestion capabilities. These technologies combine the best attributes of data lakes and data warehouses, providing a flexible and scalable architecture suitable for complex automotive use cases. This study presents a comparative analysis of Delta Parquet, Iceberg, and Hudi, with emphasis on real-world, time-series automotive telemetry data, including structured fields such as vehicle ID, timestamp, latitude/longitude, and event metrics. The evaluation covers data modeling approaches, partitioning strategies, and support for time-based analytics and Change Data Capture (CDC). The methodology evaluates these formats across several criteria, including performance, scalability, query support, data consistency, and ecosystem maturity. Key findings indicate that Delta Parquet excels in Machine Learning (ML) readiness and strong governance, Iceberg offers superior performance for batch analytics and cloud environments, and Hudi is optimized for real-time data ingestion and incremental processing. Each format demonstrates trade-offs in query efficiency, time-travel capabilities, and update semantics under time-series workloads.
The study provides insights into how automotive companies can select or combine these formats based on specific use cases such as fleet management, predictive maintenance, and route optimization. The structured dataset and realistic query scenarios used in this work ground the results in practical data engineering pipelines. The findings are particularly relevant for organizations looking to scale their data pipelines and integrate machine learning models into automotive applications.
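As a rough illustration of the kind of telemetry record and time-based partitioning the abstract refers to, the following minimal sketch models one event and derives a daily partition directory from its timestamp. The field names, the `speed_kmh` metric, and the `event_date=` layout are assumptions made for illustration only; they are not taken from the study's dataset or from any of the three formats' APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TelemetryRecord:
    """One telemetry event; field names are illustrative, not the study's schema."""
    vehicle_id: str
    event_time: datetime  # timezone-aware timestamp of the reading
    latitude: float
    longitude: float
    speed_kmh: float  # hypothetical stand-in for an "event metric"


def partition_path(record: TelemetryRecord) -> str:
    """Derive a Hive-style daily partition directory from the event time (UTC)."""
    day = record.event_time.astimezone(timezone.utc).date()
    return f"event_date={day.isoformat()}"


record = TelemetryRecord(
    vehicle_id="VIN-0001",
    event_time=datetime(2025, 7, 1, 23, 30, tzinfo=timezone.utc),
    latitude=42.33,
    longitude=-83.05,
    speed_kmh=88.0,
)
print(partition_path(record))
```

Normalizing to UTC before taking the date keeps events from vehicles in different time zones in consistent daily partitions; a real pipeline might partition at a different granularity depending on data volume and query patterns.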
Keywords:
Automotive data engineering, Data Lakehouse, Delta Parquet, Apache Iceberg, Apache Hudi.