Influence of Data Scaling and Train/Test Split Ratios on LightGBM Efficacy for Obesity Rate Prediction

NUR FITRIANTI FAHRUDIN; KURNIA RAMADHAN PUTRA; SOFIA UMAROH; GAMAS BLOORY LAUTAN

doi:10.26760/mindjournal.v9i2.220-234

Abstract

Normalization is an essential process in data mining that rescales the values of data attributes to a common scale. Differences in scale between attributes can lead to errors in modeling or in interpreting results. Whether normalization belongs in preprocessing is still debated, particularly for algorithms from the decision tree family. This study compares models trained on normalized and non-normalized data, using normalization methods such as MinMaxScaler, MaxAbsScaler, and RobustScaler. The results show that the LightGBM model without normalization achieved an accuracy of 96.6% in classifying obesity levels on the dataset studied. Classification results are affected not only by normalization but also by the ratio between training and testing data: the larger the percentage of data used for training, the higher the accuracy. On the obesity dataset, an 80:20 ratio resulted in an accuracy of up to 97%.

Keywords: Decision Tree, LightGBM, Obesity, Data Mining, Classification
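
One reason the value of normalization is debated for tree-based models can be illustrated with a small sketch. It is not the study's code and does not use LightGBM: it fits a single-threshold decision stump to synthetic data (the weights, labels, and all function names below are hypothetical) and checks that min-max scaling leaves the chosen partition unchanged, since an order-preserving rescaling does not alter which rows fall on each side of a split. It also shows an 80:20 index split of the kind the abstract refers to.

```python
import random


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def best_stump_partition(xs, ys):
    """Find the best single-threshold split of xs for binary labels ys.

    Returns (left_indices, errors), where errors counts the rows that
    disagree with the majority label on their side of the split.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_left, best_errors = None, None
    for k in range(1, len(order)):
        # A threshold can only sit between two distinct feature values.
        if xs[order[k - 1]] == xs[order[k]]:
            continue
        errors = 0
        for side in (order[:k], order[k:]):
            positives = sum(ys[i] for i in side)
            errors += min(positives, len(side) - positives)
        if best_errors is None or errors < best_errors:
            best_left, best_errors = frozenset(order[:k]), errors
    return best_left, best_errors


def split_indices(n, train_fraction, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]


# Synthetic, illustrative data only: body weight in kg and a noisy binary label.
rng = random.Random(0)
weights = [round(rng.uniform(45, 140), 1) for _ in range(200)]
labels = [1 if w + rng.gauss(0, 10) > 90 else 0 for w in weights]

raw_left, raw_errors = best_stump_partition(weights, labels)
scaled_left, scaled_errors = best_stump_partition(min_max_scale(weights), labels)
same_partition = raw_left == scaled_left

train_idx, test_idx = split_indices(len(weights), 0.8)
print(same_partition, raw_errors == scaled_errors, len(train_idx), len(test_idx))
```

A full gradient-boosting library bins features and applies regularization, so small numerical differences between scaled and unscaled runs are still possible in practice; the sketch only shows why large differences would be unexpected for threshold-based splits.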

