RESULTANT: Data Preparation Techniques to Improve XGBoost Algorithm Performance

KURNIA RAMADHAN PUTRA, SOFIA UMAROH, NUR FITRIANTI, SATRIA NUGRAHA

ABSTRACT

Credit scoring predictions are widely used by financial technology companies in peer-to-peer lending services. One technology used for credit scoring is data mining with the XGBoost machine learning algorithm, which is known for its high accuracy. We present RESULTANT, a technique for getting the most out of one stage of data mining: data preparation. The dataset is Lending Club data with 2,260,701 records and 151 variables. RESULTANT consists of four stages: feature selection, missing-value handling, outlier handling, and imbalanced-data handling. These stages yield 44 final variables, which are then used to build a model with the XGBoost algorithm. The results show that RESULTANT improves the performance of the XGBoost algorithm, reaching an accuracy of 99.17%, precision of 99.28%, recall of 99.05%, specificity of 99.29%, ROC/AUC of 99.94%, and F1-score of 99.17%.

Keywords: XGBoost, Data Preparation, Feature Selection, Missing Value, Outlier
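
The four preparation stages named in the abstract can be illustrated with a minimal sketch. The abstract does not say which method is used at each stage, so every choice below is an assumption made for illustration only: a Pearson-correlation filter for feature selection, median imputation for missing values, IQR clipping for outliers, and random oversampling for class imbalance. The toy columns `dti` and `row_id`, the threshold of 0.1, and all function names are hypothetical, and no XGBoost model is trained here.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(columns, target, threshold=0.1):
    """Keep columns whose |r| with the target reaches the (assumed) threshold."""
    kept = {}
    for name, col in columns.items():
        # Use only rows where the feature is observed.
        pairs = [(x, y) for x, y in zip(col, target) if x is not None]
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if abs(r) >= threshold:
            kept[name] = col
    return kept

def impute_median(col):
    """Replace None with the median of the observed values."""
    med = statistics.median(v for v in col if v is not None)
    return [med if v is None else v for v in col]

def clip_iqr(col, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(col, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, lo), hi) for v in col]

def oversample(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    n_max = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(n_max - len(members))]
        out_rows += members + extra
        out_labels += [y] * n_max
    return out_rows, out_labels

# Toy data (hypothetical): 1 marks the minority class.
target = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
columns = {
    "dti": [10, 12, None, 11, 13, 12, 10, 11, 30, 500],  # one gap, one outlier
    "row_id": [5, 1, 4, 8, 2, 7, 3, 6, 4, 5],            # unrelated to target
}

selected = select_features(columns, target)           # stage 1
prepared = clip_iqr(impute_median(selected["dti"]))   # stages 2 and 3
balanced_rows, balanced_labels = oversample([[v] for v in prepared], target)  # stage 4
```

In practice, resampling is normally applied to the training split only, so that evaluation metrics are computed on data with the original class distribution.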


DOI: https://doi.org/10.26760/mindjournal.v8i1.42-51


____________________________________________________________

ISSN (print): 2338-8323  |  ISSN (electronic): 2528-0902

Published by:

Informatika Institut Teknologi Nasional Bandung

Address: Gedung 2 Jl. PHH. Mustofa 23 Bandung 40124

Contact: Tel. 7272215 (ext. 181)  Fax. 7202892

Email: mind.journal@itenas.ac.id

____________________________________________________________

This journal is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.