An Efficient Spark-Based Parallel FP-Growth For Big Data Mining With Key-Value Pair Model

Baokui Liao, Mohd Nurul Hafiz Ibrahim, Mustafa Muwafak Alobaedy, S. B. Goyal

Citation:

Baokui Liao, Mohd Nurul Hafiz Ibrahim, Mustafa Muwafak Alobaedy, S. B. Goyal, "An Efficient Spark-Based Parallel FP-Growth For Big Data Mining With Key-Value Pair Model," International Journal of Computer Science and Engineering, vol. 12, no. 9, pp. 12-23, 2025. Crossref, https://doi.org/10.14445/23488387/IJCSE-V12I9P103

Abstract

In the era of big data, the exponential growth of information has made extracting valuable insights from massive datasets an urgent challenge. Association rule mining, particularly the FP-Growth algorithm, plays a crucial role in discovering high-frequency patterns. Traditional FP-Growth, however, faces significant challenges when processing large-scale data, including memory overflow and computational inefficiency. Existing improvements to FP-Growth have achieved some success in parallelization, with the PFP algorithm being a notable example. This paper proposes a Spark-based parallelization scheme, KVBFP, that improves mining efficiency by splitting FP trees using key-value pairs and optimizing the database scanning process. Unlike traditional methods that rely on a global FP tree, the algorithm leverages Spark's distributed in-memory computing model to eliminate time-consuming FP tree traversal operations and reduce inter-node communication. Tests demonstrate that KVBFP exhibits high stability, with approximately 55% lower communication overhead than PFP. It also reduces the mean variance of cluster CPU and memory utilization by 87.2% and 92.4%, respectively, while boosting overall mining efficiency by 44.7%.
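The general idea of partitioning frequent-pattern mining by key-value pairs can be illustrated with a minimal single-machine sketch. This is not the KVBFP algorithm itself, whose details are not given in the abstract; it is a generic illustration under stated assumptions. Each transaction is reordered by descending item frequency and emitted as (item, prefix) pairs, so that all prefixes sharing a key can be mined independently without a global FP tree. The function names (`frequent_items`, `to_key_value_pairs`, `mine_partition`, `mine`) are hypothetical, plain Python stands in for Spark's map and group-by-key stages, and brute-force subset counting stands in for conditional FP-tree construction.

```python
from collections import Counter, defaultdict
from itertools import combinations


def frequent_items(transactions, min_support):
    """First scan: count items and keep those meeting min_support."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= min_support}


def to_key_value_pairs(transactions, freq):
    """Map step (illustrative): emit (item, prefix) for every frequent item.

    Items are ordered by descending support (ties broken by name), and the
    prefix holds the items that precede the key in that order.
    """
    rank = {item: r for r, item in
            enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        for pos, item in enumerate(ordered):
            yield item, tuple(ordered[:pos])


def mine_partition(item, prefixes, min_support):
    """Reduce step (illustrative): count itemsets whose last item is `item`.

    Subsets are enumerated by brute force, which is exponential in the
    prefix length; a real implementation would build a conditional FP tree.
    """
    counts = Counter()
    for prefix in prefixes:
        for size in range(len(prefix) + 1):
            for combo in combinations(prefix, size):
                counts[combo + (item,)] += 1
    return {frozenset(k): v for k, v in counts.items() if v >= min_support}


def mine(transactions, min_support):
    """Group the key-value pairs by key and mine each group independently."""
    freq = frequent_items(transactions, min_support)
    groups = defaultdict(list)
    for key, value in to_key_value_pairs(transactions, freq):
        groups[key].append(value)
    result = {}
    for item, prefixes in groups.items():
        result.update(mine_partition(item, prefixes, min_support))
    return result


# Hypothetical toy data, not taken from the paper.
sample = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"],
          ["a", "b", "c", "d"]]
patterns = mine(sample, min_support=3)
```

Because every frequent itemset has exactly one lowest-ranked item, it is counted in exactly one key's group, so the groups can be processed on different workers without exchanging tree structures. In a Spark setting, the map and grouping stages would correspond to pair-RDD transformations, which is an assumption about the mapping and not a description of the paper's implementation.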

Keywords

FP-Growth, Spark, Parallel Mining, Big Data, Data Mining.
