Optimizing WAV2Vec 2.0 with Spectral Masking to Improve the Quality of Video Text Transcription for the Hearing Impaired

ACHMAD NOERCHOLIS, TITANIA DWIANDINI, FRANSISKA SISILIA MUKTI


ABSTRACT

Automatic Speech Recognition (ASR) has advanced rapidly as a tool for making information, particularly video content, more accessible to the hearing impaired. WAV2Vec 2.0, a leading ASR model, transcribes speech effectively, but its performance degrades in the presence of noise. This study aims to optimize WAV2Vec 2.0 by applying Spectral Masking to reduce noise without compromising the clarity of the main signal. The evaluation covered three types of video: podcasts, videos with background noise, and videos with background music. The results show that Word Error Rate (WER) decreased by 78.06% on podcasts and by 53.85% on videos with background noise. These findings indicate that Spectral Masking improves transcription accuracy and offers an innovative way to improve accessibility for the hearing impaired in complex audio conditions.

Keywords: noise reduction, spectral masking, hearing impaired, WAV2Vec 2.0
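
The abstract reports its results as decreases in Word Error Rate. As a reading aid, the sketch below shows how WER is conventionally computed (word-level edit distance divided by the number of reference words) and how a relative decrease between two WER values is obtained. It is an illustration only: the function names and the sample sentences are hypothetical, and the abstract does not state whether the reported percentages are relative or absolute decreases.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(wer_before: float, wer_after: float) -> float:
    """Percentage decrease of WER relative to the baseline value."""
    if wer_before <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (wer_before - wer_after) / wer_before


# Hypothetical transcripts, not taken from the paper.
reference = "the quick brown fox jumps"
noisy_output = "the quick brow fox jump over"   # transcript of the noisy audio
denoised_output = "the quick brown fox jump"    # transcript after noise reduction

wer_noisy = word_error_rate(reference, noisy_output)
wer_denoised = word_error_rate(reference, denoised_output)
print(f"WER before: {wer_noisy:.2f}, after: {wer_denoised:.2f}")
print(f"Relative reduction: {relative_reduction(wer_noisy, wer_denoised):.2f}%")
```

Under this reading, a 78.06% decrease would mean the WER after noise reduction is roughly one fifth of the baseline WER, not that the WER fell by 78.06 percentage points.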


DOI: https://doi.org/10.26760/elkomika.v12i4.877

ISSN (print): 2338-8323 | ISSN (online): 2459-9638

Published by:

Teknik Elektro Institut Teknologi Nasional Bandung

Address: Gedung 20 Jl. PHH. Mustofa 23 Bandung 40124

Contact: Tel. 7272215 (ext. 206) Fax. 7202892

Email: jte.itenas@itenas.ac.id

This journal is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.