Development of a Data Cleaning System for Consumer Master Data using Sorted Neighborhood and N-Gram Methods

YUSUF LESTANTO, RAHMA MUALIFA

Abstract



ABSTRACT

This study developed a data cleaning system for consumer master data that combines the Sorted Neighborhood Method (SNM) with N-gram similarity to detect and eliminate duplicate records and to standardize name and address formats. The proposed SNM algorithm handles the pre-cleaning tasks: it removes specific characters and titles and forms the tokens used for comparison. The N-gram algorithm then calculates record similarity using a user-defined N-gram value and threshold. Effectiveness was evaluated with recall, precision, and F-measure on two datasets, one small and one large. A threshold of 0.7, a token length of 5, and an N-gram value of 2 yielded the highest F-measure scores. The results confirm that the implementation succeeded and improved data quality. The identified optimal parameters provide a benchmark for future data-cleaning efforts, potentially streamlining the process and reducing the resources needed for data maintenance.

Keywords: Data cleaning, Duplicate detection, Sorted Neighborhood, N-Gram, Data quality.
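
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the title list, how the sorting key is built, the window size, or the exact similarity formula, so the following are all assumptions made for illustration: the `TITLES` set, a key formed from the first `token_len` characters of each field, a sliding window of 3, and the Dice coefficient over character bigrams. Only the parameter values 0.7, 5, and 2 come from the abstract.

```python
import re

# Hypothetical title list; the abstract does not enumerate the titles removed.
TITLES = {"mr", "mrs", "ms", "dr", "ir", "hj", "h"}


def preclean(text):
    """Lowercase, replace punctuation with spaces, and drop assumed title words."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(w for w in words if w not in TITLES)


def sort_key(record, token_len=5):
    """Assumed key: first token_len characters of each cleaned field, concatenated."""
    return "".join(preclean(f).replace(" ", "")[:token_len] for f in record)


def ngrams(text, n=2):
    """Return the set of character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, n=2):
    """Dice coefficient over n-gram sets (an assumed measure, not the paper's)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def find_duplicates(records, window=3, threshold=0.7, token_len=5, n=2):
    """Sort records by key, then compare each one with its window - 1 predecessors."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i], token_len))
    cleaned = [" ".join(preclean(f) for f in rec) for rec in records]
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(cleaned[i], cleaned[j], n) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def evaluate(found, truth):
    """Return (precision, recall, F-measure) for sets of duplicate index pairs."""
    tp = len(found & truth)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up sample records (name, address) for illustration only.
records = [
    ("Dr. Budi Santoso", "Jl. Merdeka No. 10 Bandung"),
    ("Siti Aminah", "Jl. Sudirman 5 Jakarta"),
    ("Budi Santoso", "Jl Merdeka No 10, Bandung"),
]
found = find_duplicates(records)
print(found)
print(evaluate(found, {(0, 2)}))
```

Sorting first and comparing only within a small window is what keeps the number of comparisons roughly linear in the number of records, instead of comparing every pair. The threshold then decides which of the windowed pairs count as duplicates, which is why it directly trades precision against recall.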







DOI: https://doi.org/10.26760/elkomika.v13i1.57



 

_______________________________________________________________________________________________________________________

ISSN (print) : 2338-8323 | ISSN (electronic) : 2459-9638

Publisher:

Department of Electrical Engineering Institut Teknologi Nasional Bandung, Indonesia

Address: 20th Building, Institut Teknologi Nasional Bandung, PHH. Mustofa Street No. 23, Bandung 40124, Indonesia

Contact: +627272215 (ext. 206)

Email: jte.itenas@itenas.ac.id



This journal is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
