Machine Learning-Based SEO Page Classification Using Mutual Information and Random Forest Feature Importance

SITI NURADILLA; KUSMAN SADIK; CICI SUHAENI; AGUS M SOLEH

doi:10.26760/mindjournal.v10i1.114-129

Abstract

Search engine optimization (SEO) involves many interrelated factors, which makes it difficult for SEO teams to determine which pages need further improvement. This study develops a machine learning-based model that is not only accurate in classifying pages but also efficient in selecting the most informative features. Feature selection uses Mutual Information (MI) and Random Forest Feature Importance (RFFI) to identify the factors that matter most for SEO, and the selected features are then modeled with Random Forest and a Weighted Voting Ensemble (WVE). The models are evaluated on Accuracy, Precision, Recall, and ROC AUC. The Random Forest model with 20 features selected by RFFI performs best, with a ROC AUC of 75.87%, Accuracy of 77.74%, Precision of 60.51%, and Recall of 71.29%. The model effectively distinguishes pages that require SEO work from those that do not.

Keywords: Feature Importance, Random Forest, SEO, Variable Selection, WVE
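
As a rough illustration of the ingredients named in the abstract, the following sketch computes Mutual Information between a discrete feature and a binary label, combines two models' predicted probabilities by weighted soft voting, and derives Accuracy, Precision, and Recall from the resulting predictions. It is not the authors' implementation: the feature names, toy data, model weights, and the 0.5 threshold are all hypothetical, the MI estimator assumes discrete features, and the way the paper derives its ensemble weights is not stated here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # I(X;Y) = sum over (x, y) of p(x,y) * log( p(x,y) / (p(x) * p(y)) )
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def weighted_vote(prob_rows, weights, threshold=0.5):
    """Soft weighted voting over per-model probabilities of the positive class.

    prob_rows[m][i] is model m's estimate of P(y=1) for sample i.
    """
    total = sum(weights)
    n_samples = len(prob_rows[0])
    averaged = [
        sum(w * probs[i] for w, probs in zip(weights, prob_rows)) / total
        for i in range(n_samples)
    ]
    return [1 if p >= threshold else 0 for p in averaged]


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical toy data: 1 = page needs SEO work, 0 = page does not.
needs_seo = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "missing_meta_description": [1, 1, 1, 0, 0, 0, 1, 0],  # mirrors the label
    "title_length_bucket": [0, 1, 0, 1, 0, 1, 1, 0],       # unrelated to the label
}

# Rank features by MI with the label, most informative first.
ranking = sorted(
    features,
    key=lambda name: mutual_information(features[name], needs_seo),
    reverse=True,
)

# Two hypothetical models; the first is trusted twice as much as the second.
model_a_probs = [0.9, 0.3, 0.2]
model_b_probs = [0.6, 0.8, 0.1]
ensemble_pred = weighted_vote([model_a_probs, model_b_probs], weights=[2, 1])

metrics = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

In practice, a library such as scikit-learn would typically supply the MI estimator, the Random Forest with its impurity-based feature importances, and ROC AUC; the pure-Python version above only makes the underlying arithmetic explicit.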
