Development of a Data Cleaning System for Consumer Master Data using Sorted Neighborhood and N-Gram Methods
Abstract
This study develops a data cleaning system for consumer master data that combines the Sorted Neighborhood Method (SNM) with N-gram similarity to detect and eliminate duplicate records and to standardize name and address formats. The SNM stage handles pre-cleaning: it removes specified characters and titles and forms the tokens used for comparison. The N-gram stage computes the similarity between records using a user-defined N-gram value and threshold. Effectiveness was evaluated with recall, precision, and F-measure on two datasets, one small and one large. The highest F-measure scores were obtained with a threshold of 0.7, a token length of 5, and an N-gram value of 2. The results confirm that the implementation succeeded and improved data quality. The identified optimal parameters provide a benchmark for future data cleaning efforts, potentially streamlining the process and reducing the resources needed for data maintenance.
Keywords: Data cleaning, Duplicate detection, Sorted Neighborhood, N-gram, Data quality.
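The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the title list, the way the sorting key is built from the first `token_length` characters of each field, the window size, and the use of the Dice coefficient over character N-gram sets are all assumptions made for illustration. Only the parameter values (threshold 0.7, token length 5, N-gram value 2) come from the abstract.

```python
import re
from typing import List, Set, Tuple

# Hypothetical title list; the abstract does not enumerate the titles removed.
TITLES = {"mr", "mrs", "ms", "dr", "ir", "hj"}


def preclean(text: str, strip_titles: bool = False) -> str:
    """Lowercase, replace non-alphanumeric characters, optionally drop titles."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    words = text.split()
    if strip_titles:
        words = [w for w in words if w not in TITLES]
    return " ".join(words)


def sort_key(name: str, address: str, token_length: int = 5) -> str:
    """Assumed key: first `token_length` characters of each cleaned field."""
    return name.replace(" ", "")[:token_length] + address.replace(" ", "")[:token_length]


def ngrams(text: str, n: int = 2) -> Set[str]:
    """Set of character n-grams; short strings yield themselves."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram sets (one common choice, assumed here)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def find_duplicates(
    records: List[Tuple[str, str]],
    window: int = 3,          # assumed window size, not given in the abstract
    threshold: float = 0.7,
    token_length: int = 5,
    n: int = 2,
) -> Set[Tuple[int, int]]:
    """Sort records by key, then compare each one only with its window neighbors."""
    cleaned = [(preclean(name, strip_titles=True), preclean(addr)) for name, addr in records]
    order = sorted(range(len(records)), key=lambda i: sort_key(*cleaned[i], token_length))
    pairs: Set[Tuple[int, int]] = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            a, b = " ".join(cleaned[i]), " ".join(cleaned[j])
            if ngram_similarity(a, b, n) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def precision_recall_f(
    predicted: Set[Tuple[int, int]], actual: Set[Tuple[int, int]]
) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure of predicted duplicate pairs."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Calling `find_duplicates` on a list of `(name, address)` tuples returns index pairs of suspected duplicates, and `precision_recall_f` compares them with a hand-labeled set of true pairs. Sorting first means each record is compared only with its nearest neighbors rather than with every other record, which is the main reason to use SNM on larger datasets.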
DOI: https://doi.org/10.26760/elkomika.v13i1.57
ISSN (print) : 2338-8323 | ISSN (electronic) : 2459-9638
Publisher:
Department of Electrical Engineering, Institut Teknologi Nasional Bandung, Indonesia
Address: 20th Building, Institut Teknologi Nasional Bandung, PHH. Mustofa Street No. 23, Bandung 40124, Indonesia
Contact: +627272215 (ext. 206)
Email: jte.itenas@itenas.ac.id
This journal is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.