Optimizing WAV2Vec 2.0 with Spectral Masking to Improve the Quality of Video Text Transcription for Deaf People
ABSTRACT
Automatic Speech Recognition (ASR) technology has evolved rapidly as a tool for improving access to information for deaf people, particularly through video content. WAV2Vec 2.0, a leading ASR model, transcribes speech effectively, but its performance degrades in the presence of noise. This study aims to optimize WAV2Vec 2.0 by applying Spectral Masking to reduce noise without compromising the clarity of the main speech signal. The evaluation covered three types of video: podcasts, videos with background noise, and videos with background music. The results show a significant reduction in Word Error Rate (WER): 78.06% for podcasts and 53.85% for videos with background noise. These findings indicate that Spectral Masking improves transcription accuracy, offering an innovative solution for accessibility for deaf people in complex audio conditions.
Keywords: deaf, noise reduction, spectral masking, WAV2Vec 2.0
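The abstract reports its results as percentage decreases in Word Error Rate. As a minimal sketch of how such a figure could be computed, the Python below implements WER as word-level edit distance divided by the reference length, plus a relative-reduction helper. The function names are hypothetical, the sample transcripts are made up for illustration, and the abstract does not say whether its percentages are relative or absolute, so the relative formula here is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, enhanced_wer: float) -> float:
    """Percentage by which WER dropped relative to the baseline (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - enhanced_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative transcripts only; these are not data from the study.
    reference = "the cat sat on the mat"
    noisy_output = "the cat sit on mat"
    denoised_output = "the cat sat on the mat"

    baseline = word_error_rate(reference, noisy_output)
    enhanced = word_error_rate(reference, denoised_output)
    print(baseline, enhanced, relative_wer_reduction(baseline, enhanced))
```

In a real evaluation, the two hypotheses would be the model's transcripts of the same audio before and after noise reduction, each scored against a manually prepared reference transcript.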
DOI: https://doi.org/10.26760/elkomika.v12i4.877
ISSN (print): 2338-8323 | ISSN (online): 2459-9638
Published by:
Teknik Elektro Institut Teknologi Nasional Bandung
Address: Gedung 20, Jl. PHH. Mustofa 23, Bandung 40124
Contact: Tel. 7272215 (ext. 206), Fax 7202892
Email: jte.itenas@itenas.ac.id
This journal is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.