Influence of Data Scaling and Train/Test Split Ratios on LightGBM Efficacy for Obesity Rate Prediction
Abstract
Normalization is an essential preprocessing step in data mining that brings the values of data attributes onto a common scale. Differences in scale between attributes can lead to errors in modeling or in the interpretation of results. Whether normalization is needed in preprocessing is still debated, particularly for algorithms from the decision tree family. This study compares models trained on normalized and non-normalized data, using the MinMaxScaler, MaxAbsScaler, and RobustScaler normalization methods. The results show that the LightGBM model without normalization achieved an accuracy of 96.6% in classifying obesity levels on the dataset used. Classification results are affected not only by normalization but also by the ratio between training and testing data. The study indicates that the larger the percentage of data used for training, the higher the accuracy; on the obesity dataset, an 80:20 ratio reached an accuracy of up to 97%.
Keywords: Decision Tree, LightGBM, Obesity, Data Mining, Classification
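The debate over normalization for decision-tree algorithms rests on a general property: a threshold split depends only on the ordering of feature values, and the three scalers named in the abstract preserve that ordering. The following is a minimal sketch of that property, not the study's code. The toy weights and labels are hypothetical, the scaler functions are simplified standard-library versions of the formulas behind MinMaxScaler, MaxAbsScaler, and RobustScaler, and a single-threshold stump stands in for LightGBM.

```python
from collections import Counter
from statistics import median, quantiles


def min_max_scale(xs):
    """Map values linearly onto [0, 1] (the MinMaxScaler formula)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def max_abs_scale(xs):
    """Divide by the largest absolute value (the MaxAbsScaler formula)."""
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]


def robust_scale(xs):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


def minority_count(ys, idx):
    """Rows in idx that a majority-class prediction would get wrong."""
    counts = Counter(ys[i] for i in idx)
    return len(idx) - max(counts.values())


def best_stump_partition(xs, ys):
    """Return the row indices sent left by the best single-threshold split.

    "Best" means the fewest misclassified rows when each side of the
    threshold predicts its own majority class.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_errors, best_left = None, None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold fits between two equal values
        left, right = order[:k], order[k:]
        errors = minority_count(ys, left) + minority_count(ys, right)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left


# Hypothetical toy data: body weight in kg and a binary obesity label.
weights = [48.0, 55.0, 61.0, 70.0, 83.0, 95.0, 110.0, 124.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

raw_partition = best_stump_partition(weights, labels)
scalers = {
    "MinMax": min_max_scale,
    "MaxAbs": max_abs_scale,
    "Robust": robust_scale,
}
partitions = {
    name: best_stump_partition(scale(weights), labels)
    for name, scale in scalers.items()
}

for name, part in partitions.items():
    print(name, "same partition as raw data:", part == raw_partition)
```

The threshold value changes with each scaler, but the rows on each side of it do not. LightGBM groups feature values into histogram bins before searching for splits, so its results on scaled and unscaled data can differ slightly in practice; this sketch does not reproduce the accuracies reported in the abstract.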
DOI: https://doi.org/10.26760/mindjournal.v9i2.220-234
____________________________________________________________
ISSN (print): 2338-8323 | ISSN (online): 2528-0902
Published by:
Informatika Institut Teknologi Nasional Bandung
Address: Gedung 2 Jl. PHH. Mustofa 23 Bandung 40124
Contact: Tel. 7272215 (ext. 181) Fax. 7202892
Email: mind.journal@itenas.ac.id
____________________________________________________________
This journal is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.