A Hybrid Wavelet Scattering and Mel Spectrogram Feature with Deep Convolutional Neural Network for Robust Spoken Digit Recognition


Irmawan Irmawan
Suci Dwijayanti
Bhakti Yudho Suprapto

Keywords

Spoken digit recognition, Deep CNN, Wavelet Time Scattering, MFCC, Biometric

Abstract

Spoken digit recognition (SDR) plays a critical role in biometric authentication and human–computer interaction, yet existing approaches often rely on small datasets, limited feature representations, or architectures prone to overfitting. To address these limitations, this study proposes a robust end-to-end pipeline that integrates Wavelet Time Scattering (WTS), Mel-Frequency Cepstral Coefficients (MFCC), and a 2D Deep Convolutional Neural Network (2D-CNN) to enhance the accuracy and generalization of SDR systems in realistic environments. The Free Spoken Digit Dataset (FSDD), consisting of 3000 audio samples from speakers with diverse accents, was pre-processed using zero-padding normalization and transformed into high-resolution time–frequency spectrograms via WTS. The proposed CNN architecture, optimized through systematic experimentation on batch size and learning rate, demonstrated stable convergence and superior discriminative capability. Using a learning rate of 0.001 and a batch size of 50, the model achieved the highest performance with 99.2% accuracy, outperforming established methods including SVM, MFCC-LSTM, and Multiple RNN architectures. Comparative evaluations further revealed that the combined WTS–MFCC feature extraction significantly enhances spectral–temporal representation quality, contributing to improved classification precision across all digit classes. These findings demonstrate that the proposed WTS–MFCC–CNN framework not only advances SDR accuracy but also provides a scalable and computationally efficient approach suitable for real-world biometric, financial, and voice-controlled applications. The results highlight the potential of hybrid time–frequency representations integrated with deep architectures to set a new benchmark for robust spoken digit recognition.
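The zero-padding normalization step mentioned above can be sketched as follows. This is an illustrative sketch, not the authors' code: the 8000-sample target length assumes FSDD's 8 kHz sampling rate and a nominal 1 s clip duration, and the function name `zero_pad` is hypothetical.

```python
import numpy as np

TARGET_LEN = 8000  # assumed: 1 s at FSDD's 8 kHz sampling rate


def zero_pad(signal, target_len=TARGET_LEN):
    """Normalize clip length: truncate longer clips, zero-pad shorter ones,
    so every sample has an identical shape before feature extraction."""
    sig = np.asarray(signal, dtype=np.float32)[:target_len]
    out = np.zeros(target_len, dtype=np.float32)
    out[:sig.size] = sig
    return out


# Usage: a 0.5 s clip (4000 samples) is padded to a fixed 1 s length
clip = np.random.randn(4000)
padded = zero_pad(clip)
assert padded.shape == (TARGET_LEN,)
```

Fixing the length this way lets the subsequent WTS/MFCC transforms produce equally sized time–frequency images, which a 2D-CNN requires as input.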

References

S. Nasr, M. Quwaider, and R. Qureshi, "Text-independent Speaker Recognition using Deep Neural Networks," in 2021 International Conference on Information Technology (ICIT), 2021, pp. 517-52, doi: 10.1109/ICIT52682.2021.9491705.

A. Boles and P. Rad, "Voice biometrics: Deep learning-based voiceprint authentication system," in 2017 12th System of Systems Engineering Conference (SoSE), 2017, pp. 1-6, doi: 10.1109/SYSOSE.2017.7994971.

A. Irum and A. Salman, "Speaker verification using deep neural networks: A," vol. 9, no. 1, 2019, doi: 10.18178/ijmlc.2019.9.1.760

R. Qureshi, M. Nawaz, F. Y. Khuhawar, N. Tunio, and M. Uzair, "Analysis of ECG signal processing and filtering algorithms," Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 3, 2019, doi: 10.14569/ijacsa.2019.0100370.

R. Qureshi, S. A. R. Rizvi, S. H. A. Musavi, S. Khan, and K. Khurshid, "Performance analysis of adaptive algorithms for removal of low frequency noise from ECG signal," in 2017 International Conference on Innovations in Electrical Engineering and Computational Technologies (ICIEECT), 2017, pp. 1-5, doi: 10.1109/ICIEECT.2017.7916551.

D. Stoyanov et al., Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings. Springer, 2018.

R. Qureshi, M. Uzair, K. Khurshid, and H. Yan, "Hyperspectral document image processing: Applications, challenges and future prospects," Pattern Recognit., vol. 90, pp. 12-22, 2019, doi: 10.1016/j.patcog.2019.01.026.

M. Sajjad and S. Kwon, "Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM," IEEE Access, vol. 8, pp. 79861-79875, 2020, doi: 10.1109/ACCESS.2020.2990405.

M. Zhang, M. Diao, and L. Guo, "Convolutional neural networks for automatic cognitive radio waveform recognition," IEEE Access, vol. 5, pp. 11074-11082, 2017, doi: 10.1109/ACCESS.2017.2716191.

O. Krestinskaya, I. Dolzhikova, and A. P. James, "Hierarchical temporal memory using memristor networks: A survey," IEEE Trans. Emerg. Topics Comput. Intell., vol. 2, no. 5, pp. 380-395, 2018, doi: 10.1109/TETCI.2018.2838124.

S. Becker, M. Ackermann, S. Lapuschkin, K.-R. Müller, and W. Samek, "Interpreting and explaining deep neural networks for classification of audio signals," arXiv preprint, 2018.

Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, and A. Courville, "Towards end-to-end speech recognition with deep convolutional neural networks," arXiv preprint, 2017, doi: 10.48550/arXiv.1701.02720.

T. B. Mokgonyane, T. J. Sefara, T. I. Modipa, M. M. Mogale, M. J. Manamela, and P. J. Manamela, "Automatic speaker recognition system based on machine learning algorithms," in 2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA), 2019, pp. 141-146, doi: 10.1109/RoboMech.2019.8704837.

F. A. R. Rahman Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5359-5363, doi: 10.1109/ICASSP.2018.8461587.

S. Dey, P. Motlicek, S. Madikeri, and M. Ferras, "Template-matching for text-dependent speaker verification," Speech Commun., vol. 88, pp. 96-105, 2017, doi: 10.1016/j.specom.2017.01.009.

W. Feng, N. Guan, Y. Li, X. Zhang, and Z. Luo, "Audio visual speech recognition with multimodal recurrent neural networks," in 2017 International Joint Conference on Neural Networks (IJCNN), 2017, pp. 681-688, doi: 10.1109/IJCNN.2017.7965918.

Y. Yu, X. Si, C. Hu, and J. Zhang, "A review of recurrent neural networks: LSTM cells and network architectures," Neural Comput., vol. 31, no. 7, pp. 1235-1270, 2019, doi: 10.1162/neco_a_01199.

W. Zhang, M. Zhai, Z. Huang, C. Liu, W. Li, and Y. Cao, "Towards end-to-end speech recognition with deep multipath convolutional neural networks," in International Conference on Intelligent Robotics and Applications, 2019, pp. 332-341.

S. Basu, J. Chakraborty, and M. Aftabuddin, "Emotion recognition from speech using convolutional neural network with recurrent neural network architecture," in 2017 2nd International Conference on Communication and Electronics Systems (ICCES), 2017, pp. 333-336, doi: 10.1109/CESYS.2017.8321292.

Z. Jackson, "Free Spoken Digit Dataset (FSDD)," 2016.

S. Otte, P. Rubisch, and M. V. Butz, "Gradient-based learning of compositional dynamics with modular RNNs," in International Conference on Artificial Neural Networks, 2019, pp. 484-496.

F. M. Bayer, A. J. Kozakevicius, and R. J. Cintra, "An iterative wavelet threshold for signal denoising," Signal Process., vol. 162, pp. 10-20, 2019, doi: 10.1016/j.sigpro.2019.04.005.

W. Liu and W. Chen, "Recent advancements in empirical wavelet transform and its applications," IEEE Access, vol. 7, pp. 103770-103780, 2019, doi: 10.1109/ACCESS.2019.2930529.

R. V. Sharan and T. J. Moir, "Time-frequency image resizing using interpolation for acoustic event recognition with convolutional neural networks," in 2019 IEEE International Conference on Signals and Systems (ICSigSys), 2019, pp. 8-11, doi: 10.1109/ICSIGSYS.2019.8811088.

K.-L. Chung, T.-C. Leung, T.-Y. Liu, and Y.-C. Tseng, "A Cubic Convolution Interpolation-Based Chroma Subsampling Method for Bayer and RGBW CFA Raw Images," IEEE Access, vol. 10, pp. 22687-22699, 2022, doi: 10.1109/ACCESS.2022.3154487.

A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143-19165, 2019, doi: 10.1109/ACCESS.2019.2896880.

Q. Li et al., "MSP-MFCC: Energy-efficient MFCC feature extraction method with mixed-signal processing architecture for wearable speech recognition applications," IEEE Access, vol. 8, pp. 48720-48730, 2020, doi: 10.1109/ACCESS.2020.2979799.

N. Naka and V. Ruoppila, "Linear prediction coefficient conversion device and linear prediction coefficient conversion method," Google Patents, 2018.

E. Tatulli and T. Hueber, "Feature extraction using multimodal convolutional neural networks for visual speech recognition," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2971-2975, doi: 10.1109/ICASSP.2017.7952701.

S. Roy, N. Das, M. Kundu, and M. Nasipuri, "Handwritten isolated Bangla compound character recognition: A new benchmark using a novel deep learning approach," Pattern Recognit. Lett., vol. 90, pp. 15-21, 2017, doi: 10.1016/j.patrec.2017.03.004.

T. J. Jun, H. M. Nguyen, D. Kang, D. Kim, D. Kim, and Y.-H. Kim, "ECG arrhythmia classification using a 2-D convolutional neural network," arXiv preprint, 2018, doi: 10.48550/arXiv.1804.06812.

O. F. Reyes-Galaviz, W. Pedrycz, Z. He, and N. J. Pizzi, "A supervised gradient-based learning algorithm for optimized entity resolution," Data Knowl. Eng., vol. 112, pp. 106-129, 2017, doi: 10.1016/j.datak.2017.10.004.

Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia, "Image inpainting via generative multi-column convolutional neural networks," Adv. Neural Inf. Process. Syst., vol. 31, 2018.

W. Yin et al., "Self-adjustable domain adaptation in personalized ECG monitoring integrated with IR-UWB radar," Biomed. Signal Process. Control, vol. 47, pp. 75-87, 2019, doi: 10.1016/j.bspc.2018.08.002.

Y. F. Utomo, E. C. Djamal, F. Nugraha, and F. Renaldi, "Spoken word and speaker recognition using MFCC and multiple recurrent neural networks," in 2020 7th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), 2020, pp. 192-197, doi: 10.23919/EECSI50503.2020.9251870.

M. Jain, S. Narayan, P. Balaji, A. Bhowmick, and R. K. Muthu, "Speech emotion recognition using support vector machine," arXiv preprint, 2020, doi: 10.48550/arXiv.2002.07590.

T. Zia and U. Zahid, "Long short-term memory recurrent neural network architectures for Urdu acoustic modeling," Int. J. Speech Technol., vol. 22, no. 1, pp. 21-30, 2019.

R. V. Sharan, "Spoken digit recognition using wavelet scalogram and convolutional neural networks," in 2020 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2020, pp. 101-105, doi: 10.1109/RAICS51191.2020.9332505.

H. Ba, "Spoken Digit Classification: A Method Using Convolutional Neural Network and Mixed Feature," doi: 10.18178/wcse.2021.02.002.

A. S. M. B. Wazir and J. H. Chuah, "Spoken Arabic digits recognition using deep learning," in 2019 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), 2019, pp. 339-344, doi: 10.1109/I2CACIS.2019.8825004.
