HiVAD: A Voice Activity Detection Application Based on Deep Learning

MUHAMMAD HILMI FARIDH; ULIL SURTIA ZULPRATITA

doi:10.26760/elkomika.v9i4.856

Abstract

This paper presents real-time voice activity detection (VAD) on smartphones using a convolutional neural network (CNN). Reducing computation time was a problem in previous studies, and despite their use of machine learning approaches, many shortcomings remained. In HiVAD, a log-mel energy spectrogram generates an image of the sound signal, which is then fed into a deep learning CNN that classifies it as human voice or noise. In tests, HiVAD outperformed the other VAD methods G729B, Sohn, and RF in average SHR accuracy by 15.89%, 28.98%, and 42.13% at 0 dB; 8.67%, 16.29%, and 17.63% at 5 dB; and 1.35%, 7.72%, and 5.14% at 10 dB, respectively. In addition, a multi-threading mechanism enables computation efficient enough for real-time operation. These results show that the CNN architecture in HiVAD significantly improves the accuracy of voice activity detection.

Keywords: VAD app, voice detection, deep learning, CNN
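As an illustration of the log-mel energy spectrogram step named in the abstract, the following is a minimal sketch only. It is not the HiVAD implementation: the sample rate, frame length, hop size, number of mel bands, Hann window, and all function names are assumptions chosen for the example, and a naive DFT from the standard library stands in for the FFT a real-time app would use.

```python
import cmath
import math


def hz_to_mel(freq_hz):
    # Common HTK-style mel scale formula.
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_bins, sample_rate):
    """Triangular filters evenly spaced on the mel scale (n_mels rows, n_bins columns)."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        bank.append([
            max(0.0, min((f - left) / (center - left),
                         (right - f) / (right - center)))
            for f in bin_hz
        ])
    return bank


def power_spectrum(frame):
    """Naive DFT power for bins 0..N/2 (assumption: an FFT would replace this in practice)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def log_mel_spectrogram(signal, sample_rate=8000, frame_len=64, hop=32, n_mels=10):
    """Return a list of frames, each a list of n_mels log mel-band energies."""
    # Hann window to reduce spectral leakage (illustrative choice).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]
    bank = mel_filterbank(n_mels, frame_len // 2 + 1, sample_rate)
    image = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        power = power_spectrum(frame)
        # Small constant avoids log(0) on silent frames.
        image.append([
            math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
            for row in bank
        ])
    return image


def sine(freq_hz, n_samples, sample_rate=8000, amplitude=1.0):
    # Hypothetical helper that generates a test tone.
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]
```

Calling `log_mel_spectrogram(sine(440.0, 512))` yields a small two-dimensional array (frames by mel bands) of the kind that could be passed to a CNN classifier as an image.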

ISSN (print) : 2338-8323 | ISSN (electronic) : 2459-9638

Publisher:

Department of Electrical Engineering Institut Teknologi Nasional Bandung, Indonesia

Address: 20th Building Institut Teknologi Nasional Bandung PHH. Mustofa Street No. 23 Bandung 40124, Indonesia

Contact: +627272215 (ext. 206)

Email: jte.itenas@itenas.ac.id

This journal is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
