A Semi-Supervised Ensemble Classification with Data Augmentation to Reduce Imbalanced Class Distribution in Big Data

International Journal of Electronics and Communication Engineering
© 2025 by SSRG - IJECE Journal
Volume 12, Issue 7
Year of Publication: 2025
Authors: S. Sindhu, S. Veni
How to Cite?
S. Sindhu, S. Veni, "A Semi-Supervised Ensemble Classification with Data Augmentation to Reduce Imbalanced Class Distribution in Big Data," SSRG International Journal of Electronics and Communication Engineering, vol. 12, no. 7, pp. 1-12, 2025. Crossref, https://doi.org/10.14445/23488549/IJECE-V12I7P101
Abstract:
In the age of digital information, managing, retrieving, and processing information in the form of big data has become a substantial and difficult task. Although several tools are available for data analysis, they may not perform well on high-dimensional data. Machine learning algorithms play a major role in extracting useful knowledge from such massive data. Consequently, a technique that can quickly handle a huge volume of high-dimensional data and process it in a distributed setting is indispensable. Because the data is gathered from various sources, preprocessing plays a vital role in transforming it into a form suitable for extracting useful insights. Even after the data is preprocessed, class imbalance remains a significant concern that must be addressed to improve outcomes. This work proposes a semi-supervised ensemble classifier that uses data augmentation and class-rebalancing sampling to mitigate uneven class distribution in large datasets. The model employs an ensemble of two heterogeneous classifiers: extreme gradient boosting as the base learner and random forest for pseudo-labelling and classification. The model is implemented on the Hadoop framework, making it suitable and effective for very large datasets. Experimental evaluation of the proposed framework shows that the approach achieves an average accuracy of 84.3% over 16 datasets, an improvement of around 17.65% in prediction performance.
Keywords:
Class imbalance, Data augmentation, Ensemble learning, Extreme gradient boosting, Semi-supervised learning.
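The pipeline outlined in the abstract (base learner → confidence-based pseudo-labelling → class rebalancing → final ensemble classifier) can be sketched in a few lines. This is an illustrative sketch only, not the authors' Hadoop implementation: scikit-learn's GradientBoostingClassifier stands in for XGBoost, simple random minority oversampling stands in for the paper's data augmentation step, and the 0.95 confidence threshold is an assumed parameter.

```python
# Hedged sketch of semi-supervised pseudo-labelling with class rebalancing.
# Stand-ins: GradientBoostingClassifier for XGBoost, random oversampling
# for data augmentation; the 0.95 threshold is an assumed parameter.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced data (90% / 10% class split).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
# Treat 30% as labelled, 70% as unlabelled.
X_lab, X_unlab, y_lab, _ = train_test_split(X, y, test_size=0.7,
                                            stratify=y, random_state=0)

# 1) Base learner trained on the labelled pool.
base = GradientBoostingClassifier(random_state=0).fit(X_lab, y_lab)

# 2) Pseudo-label only the unlabelled samples the base learner is sure about.
proba = base.predict_proba(X_unlab)
sure = proba.max(axis=1) >= 0.95
X_pseudo, y_pseudo = X_unlab[sure], proba[sure].argmax(axis=1)

# 3) Class rebalancing: randomly oversample the minority class until
#    both classes have equal counts.
X_all = np.vstack([X_lab, X_pseudo])
y_all = np.concatenate([y_lab, y_pseudo])
counts = np.bincount(y_all)
minority = counts.argmin()
idx = np.flatnonzero(y_all == minority)
extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
X_bal = np.vstack([X_all, X_all[extra]])
y_bal = np.concatenate([y_all, y_all[extra]])

# 4) Final classifier trained on the rebalanced, pseudo-labelled set.
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print(np.bincount(y_bal))  # class counts are now equal
```

In the paper's setting, steps 1–4 would run over Hadoop with XGBoost producing the confidence scores; the sketch only shows the control flow of the semi-supervised rebalancing loop.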