Query-by-Example Spoken Term Detection: A Systematic Review

Manisha Naik Gaonkar, Veena Thenakanidiyoor, A. D. Dileep

Citation:

Manisha Naik Gaonkar, Veena Thenakanidiyoor, A. D. Dileep, "Query-by-Example Spoken Term Detection: A Systematic Review," International Journal of Electronics and Communication Engineering, vol. 12, no. 7, pp. 119-136, 2025. Crossref, https://doi.org/10.14445/23488549/IJECE-V12I7P110

Abstract

The Internet’s recent exponential growth has produced massive collections of multimedia data. Managing these repositories effectively requires matching spoken queries against their audio content. Query-by-Example Spoken Term Detection (QbE-STD) has emerged as a powerful technique for this task. Nevertheless, QbE-STD faces several difficulties, including sensitivity to speaker age, dialect differences, and computational complexity. Review publications in this area are scarce and mostly lack an in-depth study of feature representations and matching techniques. This paper examines these issues through a detailed analysis of approaches and developments in QbE-STD, covering feature representations, similarity metrics, matching methods, datasets, evaluation measures, and benchmarking platforms. It studies the various feature representations in detail and analyzes the advantages and disadvantages of each similarity metric for computing a matching matrix between a query and an utterance. Furthermore, the paper highlights how machine learning and deep learning architectures are increasingly integrated into QbE-STD. Finally, it discusses a few open challenges associated with QbE-STD, which provide opportunities for future research in this field.

Keywords

Convolutional Neural Network, Spoken term detection, Query-by-Example Spoken Term Detection, Keyword spotting, Audio search.
