Influence of Data Scaling and Train/Test Split Ratios on LightGBM Efficacy for Obesity Rate Prediction




Normalization is an essential step in data mining that brings the values of data attributes onto a common scale. Differences in scale between attributes can cause errors in modeling or in the interpretation of results. Whether normalization is needed in preprocessing is still debated, particularly for algorithms from the decision tree family. This study compares LightGBM models trained on non-normalized data with models trained on data normalized with MinMaxScaler, MaxAbsScaler, and RobustScaler. The results show that the LightGBM model without normalization achieved an accuracy of 96.6% in classifying obesity levels on the dataset used. Classification results are affected not only by normalization but also by the ratio of training to testing data: the larger the share of data used for training, the higher the accuracy. On the obesity dataset, an 80:20 split reached an accuracy of up to 97%.

Keywords: Decision Tree, LightGBM, Obesity, Data Mining, Classification
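
The two ideas in the abstract, scaling and the train/test ratio, can be illustrated with a minimal standard-library sketch. It is not the study's pipeline and uses neither LightGBM nor scikit-learn: the three scaling functions only follow the general idea of MinMaxScaler, MaxAbsScaler, and RobustScaler, the one-split "stump" stands in for a tree learner, and the weights, labels, row count, and all function names are hypothetical. It shows that a threshold split groups the rows identically before and after an order-preserving rescaling, which is one reason normalization is debated for tree-based models, and how an 80:20 ratio divides the rows.

```python
import random
from collections import Counter
from statistics import median, quantiles


def min_max_scale(xs):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def max_abs_scale(xs):
    """Divide by the largest absolute value, mapping into [-1, 1]."""
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]


def robust_scale(xs):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


def misclassified(rows, ys):
    """Errors made when every row in `rows` gets the majority label."""
    counts = Counter(ys[i] for i in rows)
    return len(rows) - max(counts.values())


def best_stump_partition(xs, ys):
    """Left-side row indices and error count of the best single threshold split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_errors, best_left = None, None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold fits between equal values
        left, right = order[:k], order[k:]
        errors = misclassified(left, ys) + misclassified(right, ys)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left, best_errors


def train_test_indices(n_rows, train_fraction, seed=0):
    """Shuffle row indices and cut them at the requested training share."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = round(n_rows * train_fraction)
    return idx[:cut], idx[cut:]


# Hypothetical toy data: body weight in kg and a binary label.
weights = [48.0, 55.5, 61.0, 67.2, 74.8, 83.1, 95.0, 110.4]
labels = [0, 0, 0, 0, 1, 0, 1, 1]

raw_partition = best_stump_partition(weights, labels)
scaled_partitions = {
    scaler.__name__: best_stump_partition(scaler(weights), labels)
    for scaler in (min_max_scale, max_abs_scale, robust_scale)
}
for name, partition in scaled_partitions.items():
    print(name, "same split as raw data:", partition == raw_partition)

# An 80:20 ratio on a hypothetical 1000-row table.
train_rows, test_rows = train_test_indices(1000, 0.8)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

The sketch covers a single feature and a single split. Gradient-boosted trees such as LightGBM also bin feature values into histograms, so small differences between scaled and unscaled runs remain possible in practice, which is what the study measures empirically.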







ISSN (print): 2338-8323 | ISSN (electronic): 2528-0902

Published by:

Informatika Institut Teknologi Nasional Bandung

Address: Gedung 2, Jl. PHH. Mustofa 23, Bandung 40124

Contact: Tel. 7272215 (ext. 181), Fax. 7202892


This journal is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
