Automatic Speech Recognition and Translation for Low-Resource Languages

Edited by L. Ashok Kumar, D. Karthika Renuka, Bharathi Raja Chakravarthi and Thomas Mandl
Copyright: 2024 | Expected Pub Date: 2023/12/30
ISBN: 9781394213580 | Hardcover | 492 pages

One Line Description
This book is a comprehensive exploration of cutting-edge research, methodologies, and advancements addressing the unique challenges of ASR and translation for low-resource languages.

Audience
The book targets researchers and professionals in the fields of natural language processing, computational linguistics, and speech technology. It will also be of interest to engineers, linguists, and individuals in industries and organizations working on cross-lingual communication, accessibility, and global connectivity.

Description
Automatic Speech Recognition and Translation for Low-Resource Languages contains groundbreaking research from experts and researchers sharing innovative solutions that address language challenges in low-resource environments. The book begins by delving into the fundamental concepts of ASR and translation, providing readers with a solid foundation for understanding the subsequent chapters. It then explores the intricacies of low-resource languages, analyzing the factors that contribute to their challenges and the significance of developing tailored solutions to overcome them.
The chapters cover a wide range of topics, spanning both the theoretical and practical aspects of ASR and translation for low-resource languages. The book discusses data augmentation techniques, transfer learning, and multilingual training approaches that leverage the power of existing linguistic resources to improve accuracy and performance. Additionally, it investigates the possibilities offered by unsupervised and semi-supervised learning, as well as the benefits of active learning and crowdsourcing in enriching the training data (a small illustrative sketch of one such technique follows this description). Throughout the book, emphasis is placed on the importance of considering the cultural and linguistic context of low-resource languages, recognizing the unique nuances and intricacies that influence accurate ASR and translation. Furthermore, the book explores the potential impact of these technologies in various domains, such as healthcare, education, and commerce, empowering individuals and communities by breaking down language barriers.
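
To make the data-scarcity theme concrete, the following is a minimal, hypothetical sketch of waveform-level data augmentation for low-resource ASR training, using the open-source torchaudio library. It is not drawn from any chapter of the book; the file names, speed factors, and signal-to-noise ratio are illustrative assumptions.

# A minimal, hypothetical sketch of waveform-level data augmentation for
# low-resource ASR. Paths and parameter values are illustrative assumptions.
import torch
import torchaudio

def augment(waveform: torch.Tensor, sample_rate: int) -> list:
    """Return several perturbed copies of one training utterance."""
    copies = []

    # Speed perturbation: resample, then treat the result as if it were
    # still at the original rate, stretching or compressing the utterance.
    for factor in (0.9, 1.1):
        resample = torchaudio.transforms.Resample(
            orig_freq=sample_rate,
            new_freq=int(sample_rate * factor),
        )
        copies.append(resample(waveform))

    # Additive Gaussian noise at a fixed signal-to-noise ratio (20 dB).
    snr_db = 20.0
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    copies.append(waveform + torch.randn_like(waveform) * noise_power.sqrt())

    return copies

# Hypothetical usage; "utterance.wav" is a placeholder path.
waveform, sr = torchaudio.load("utterance.wav")
for i, copy in enumerate(augment(waveform, sr)):
    torchaudio.save(f"utterance_aug{i}.wav", copy, sr)

Each perturbed copy is paired with the original transcript, effectively multiplying scarce labeled data; transfer learning and multilingual training, also discussed above, attack the same scarcity from the model side instead.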

Author / Editor Details
L. Ashok Kumar, PhD, is a professor in the Department of Electrical and Electronics Engineering, PSG College of Technology, Tamil Nadu, India. He has published more than 175 papers in international and national journals and has received 26 national and international awards for his PhD project on wearable electronics. He has created eight Centres of Excellence at PSG College of Technology in collaboration with government agencies and industries, including the Centre for Audio Visual Speech Recognition and the Centre for Excellence in Solar Thermal Systems. Twenty-three of his 27 products have been technologically transferred to government funding agencies.

D. Karthika Renuka, PhD, is a professor at PSG College of Technology, Tamil Nadu, India. Her research focuses on data mining, evolutionary algorithms, and machine learning. She is a recipient of the Indo-U.S. Fellowship for Women in STEMM. She has organized two international conferences, on the Innovation of Computing Techniques and on Information Processing and Remote Computing.

Bharathi Raja Chakravarthi, PhD, is an assistant professor in the School of Computer Science, University of Galway, Ireland. His research focuses on multimodal machine learning, abusive/offensive language detection, bias in natural language processing tasks, inclusive language detection, and multilingualism. He has published many papers in international journals and conferences. He is an associate editor of the journal Expert Systems with Applications and an editorial board member for Computer Speech & Language.

Thomas Mandl, PhD, is a professor of Information Science and Language Technology at the University of Hildesheim, Germany. His research interests include information retrieval, human-computer interaction, and the internationalization of information technology, and he has published more than 300 papers on these topics. He has coordinated tracks at the Cross-Language Evaluation Forum (CLEF), the European information retrieval evaluation initiative. Since 2020 he has been a co-chair at FIRE, the evaluation initiative for Indian languages, where he coordinates the HASOC track on hate speech detection.


Table of Contents
Foreword
Preface
Acknowledgement
1. A Hybrid Deep Learning Model for Emotion Conversion in Tamil Language
Satrughan Kumar Singh, Muniyan Sundararajan and Jainath Yadav
1.1 Introduction
1.2 Dataset Collection and Database Preparation
1.3 Pre-Trained CNN Architectural Models
1.3.1 VGG16
1.3.2 ResNet50
1.4 Proposed Method for Emotion Transformation
1.4.1 Architecture of Emotion Conversion Framework
1.4.2 Feature Alignment
1.4.3 Mapping of Excitation Source Parameter
1.5 Synthesized Speech Evaluation
1.5.1 Objective Evaluation
1.5.2 Subjective Evaluation
1.6 Conclusion
References
2. Attention-Based End-to-End Automatic Speech Recognition System for Vulnerable Individuals in Tamil
S. Suhasini, B. Bharathi and Bharathi Raja Chakravarthi
2.1 Introduction
2.2 Related Work
2.3 Dataset Description
2.4 Implementation
2.4.1 ASR—Transformer Based
2.4.2 ASR—Attention Based
2.4.3 Steps Involved to Develop End-to-End Speech Recognizer Using SpeechBrain
2.4.3.1 Data Preparation
2.4.3.2 Tokenizer Training
2.4.3.3 Language Model Training
2.4.3.4 Speech Recognizer Training with Attention-Based Mechanism
2.4.3.5 ASR Model Built
2.5 Results and Discussion
2.6 Conclusion
References
3. Speech-Based Dialect Identification for Tamil
Archana J.P. and B. Bharathi
3.1 Introduction
3.2 Literature Survey
3.3 Proposed Methodology
3.3.1 Architecture Diagram
3.3.2 Dataset Creation
3.3.3 Feature Extraction
3.3.3.1 Mel-Frequency Cepstral Coefficients (MFCC)
3.3.3.2 Steps for MFCC Extraction
3.3.4 Classification
3.3.4.1 Gaussian Mixture Model
3.4 Experimental Setup and Results
3.5 Conclusion
References
4. Language Identification Using Speech Denoising Techniques: A Review
Amal Kumar, Piyush Kumar Singh and Jainath Yadav
4.1 Introduction
4.1.1 Background and Context
4.1.2 Related Works and Motivation
4.2 Speech Denoising and Language Identification
4.2.1 Speech Denoising
4.2.2 Language Identification
4.3 The Noisy Speech Signal is Denoised Using Temporal and Spectral Processing
4.4 The Denoised Signal is Classified to Identify the Language Spoken Using Recent Machine Learning Algorithm
4.5 Conclusion
References
5. Domain Adaptation-Based Self-Supervised ASR Models for Low-Resource Target Domain
L. Ashok Kumar, D. Karthika Renuka, Naveena K. S. and Sree Resmi S.
5.1 Introduction
5.2 Literature Survey
5.3 Dataset Description
5.3.1 LibriSpeech
5.3.2 NPTEL Sample Dataset
5.3.3 Real Time Dataset
5.4 Self-Supervised ASR Model
5.4.1 wav2vec 2.0 Model
5.5 Domain Adaptation for Low-Resource Target Domain
5.5.1 Data Augmentation
5.5.2 Data Preprocessing
5.5.3 Freeze Feature Encoder
5.5.4 Fine Tuning
5.6 Implementation of Domain Adaptation on wav2vec2 Model for Low-Resource Target Domain
5.6.1 Dataset Preparation
5.6.2 Preprocessing and Tokenizing
5.6.3 Freeze Feature Encoder
5.6.4 Fine Tuning
5.7 Results Analysis
5.7.1 E-Content
5.7.2 NPTEL
5.7.3 E-Content with NPTEL
5.8 Conclusion
Acknowledgements
References
6. ASR Models from Conventional Statistical Models to Transformers and Transfer Learning
Elizabeth Sherly, Leena G. Pillai and Kavya Manohar
6.1 Introduction
6.2 Preprocessing
6.3 Feature Extraction
6.3.1 Mel-Frequency Cepstral Coefficients (MFCCs)
6.3.2 Linear Predictive Coding (LPC)
6.3.3 Perceptual Linear Prediction (PLP)
6.4 Generative Models for ASR
6.4.1 Components in Statistical Generative ASR
6.4.2 GMM-HMM Acoustic Models
6.4.3 N-Gram Language Model
6.4.4 Pronunciation Lexicon
6.4.5 Decoder
6.5 Discriminative Models for ASR
6.5.1 Support Vector Machine (SVM)
6.5.2 Deep Learning in ASR
6.6 Deep Architectures for Low-Resource Languages
6.6.1 Seq2seq Deep Neural Network
6.6.1.1 Attention Mechanism
6.6.2 Transformer Model with Attention
6.6.3 Transfer Learning System
6.7 The DNN-HMM Hybrid System
6.8 Summary
References
7. Syllable-Level Morphological Segmentation of Kannada and Tulu Words
Asha Hegde and Hosahalli Lakshmaiah Shashirekha
7.1 Introduction
7.1.1 Challenges in MS of Kannada and Tulu Words
7.1.2 Motivation
7.2 Related Work
7.3 Corpus Construction and Annotation
7.4 Methodology
7.4.1 Feature Extraction
7.4.2 Model Construction
7.5 Experiments and Results
7.6 Conclusion and Future Work
References
8. A New Robust Deep Learning-Based Automatic Speech Recognition and Machine Transition Model for Tamil and Gujarati
Monesh Kumar M. K., Valliammai V., Geraldine Bessie Amali D. and Mathew M. Noel
8.1 Introduction
8.2 Literature Survey
8.3 Proposed Architecture
8.3.1 Speech Recognition
8.3.2 Text Translation Model
8.3.3 Text-to-Speech Conversion
8.4 Experimental Setup
8.5 Results
8.6 Conclusion
References
9. Forensic Voice Comparison Approaches for Low-Resource Languages
Kruthika S.G., Trisiladevi C. Nagavi and P. Mahesha
9.1 Introduction
9.1.1 Aim and Objectives of Forensic Voice Comparison for Low-Resource Languages
9.1.2 Forensic Voice Comparison Scenario on Low-Resource Language Datasets
9.1.3 Forensic Voice Comparison Methodology
9.2 Challenges of Forensic Voice Comparison
9.3 Motivation
9.4 Review on Forensic Voice Comparison Approaches
9.4.1 Tools and Frameworks for Forensic Voice Comparison
9.5 Low-Resource Language Datasets
9.5.1 Kannada Corpus
9.5.2 Malayalam Corpus
9.5.3 Tamil Corpus
9.5.4 Telugu Corpus
9.5.5 Low-Resource Language Dataset Research Lab Centers
9.6 Applications of Forensic Voice Comparison
9.7 Future Research Scope
9.8 Conclusion
References
10. CoRePooL—Corpus for Resource-Poor Languages: Badaga Speech Corpus
Barathi Ganesh H.B., Jyothish Lal G., Jairam R., Soman K.P., Kamal N.S. and Sharmila B.
10.1 Introduction
10.1.1 Motivation and Related Literature
10.1.1.1 Corpus
10.1.1.2 Foundation Models
10.2 CoRePooL
10.3 Benchmarking
10.3.1 Speech-to-Text
10.3.2 Text-to-Speech
10.3.3 Gender Identification
10.3.4 Speaker Identification
10.3.5 Translation
10.4 Conclusion
Acknowledgment
References
11. Bridging the Linguistic Gap: A Deep Learning-Based Image-to-Text Converter for Ancient Tamil with Web Interface
S. Umamaheswari, G. Gowtham and K. Harikumar
11.1 Introduction
11.2 The Historical Significance of Ancient Tamil Scripts
11.3 Realization Process
11.4 Dataset Preparation
11.4.1 Dataset Augmentation Process
11.4.2 Image Pre-Processing
11.4.3 Character Segmentation Process
11.4.3.1 Bounding Box Method
11.5 Convolution Neural Network
11.5.1 Convolution Layer
11.5.2 Pooling Layer
11.5.3 Fully Connected Layer
11.6 Webpage with Multilingual Translator
11.6.1 Why is the Web Interface Used Here?
11.6.2 Google Translator API
11.6.3 Build the Website for Deploying the CNN Model
11.7 Results and Discussions
11.8 Conclusion and Future Work
References
12. Voice Cloning for Low-Resource Languages: Investigating the Prospects for Tamil
Vishnu Radhakrishnan, Aadharsh Aadhithya A., Jayanth Mohan, Visweswaran M., Jyothish Lal G. and Premjith B.
12.1 Introduction
12.2 Literature Review
12.3 Dataset
12.3.1 Tamil Mozilla Common Voice Corpus
12.4 Methodology
12.4.1 Architecture of the Method
12.5 Results and Discussion
12.6 Conclusion
References
13. Transformer-Based Multilingual Automatic Speech Recognition (ASR) Model for Dravidian Languages
Divi Eswar Chowdary, Rahul Ganesan, Harsha Dabbara, G. Jyothish Lal and Premjith B.
13.1 Introduction
13.2 Literature Review
13.3 Dataset Description
13.4 Methodology
13.4.1 Data Preprocessing
13.4.2 Proposed Model Architecture
13.4.3 Word Error Rate
13.5 Experimentation Results and Analysis
13.6 Conclusion
References
14. Language Detection Based on Audio for Indian Languages
Amogh A. M., A. Hari Priya, Thanvitha Sai Kanchumarti, Likhitha Ram Bommilla and Rajeshkannan Regunathan
14.1 Introduction
14.2 Literature Review
14.3 Language Detector System
14.3.1 Workflow
14.3.1.1 Data Collection
14.3.1.2 Data Preprocessing
14.3.1.3 Feature Extraction
14.3.1.4 Model Creation/Training
14.3.1.5 Evaluation and Analyzing Results
14.3.2 Overview of a Sample CNN Architecture
14.3.3 Proposed CNN Architecture
14.3.4 Pseudocode
14.3.4.1 Traversing Dataset
14.3.4.2 Methodology Applied on Sample Data
14.3.4.3 Function feature_mfcc_extract (audio file_name,lang)
14.3.4.4 Feature Extraction
14.3.4.5 Train Test Split
14.3.4.6 Model
14.4 Experiments and Outcomes
14.4.1 Dataset
14.4.2 Applied Methodology on Sample Data
14.4.3 Sample Output
14.4.3.1 Features Extracted
14.4.3.2 Model Fitting
14.4.3.3 Model Predicting
14.5 Conclusion
References
15. Strategies for Corpus Development for Low-Resource Languages: Insights from Nepal
Bal Krishna Bal, Balaram Prasain, Rupak Raj Ghimire and Praveen Acharya
15.1 Low-Resource Languages and the Constraints
15.2 Language Resources Map for the Languages of Nepal
15.2.1 Language Situation of Nepal
15.2.2 The Constitutional Status of the Languages of Nepal
15.2.3 Summary of Linguistic Activities
15.2.4 Potential Linguistic Tasks for Low-Resourced Languages
15.3 Unicode Inception and Advent in Nepal
15.4 Speech and Translation Initiatives
15.5 Corpus Development Efforts—Sharing Our Experiences
15.5.1 Speech
15.5.1.1 Speech Corpus
15.5.1.2 Phonetically Balanced Corpus
15.5.2 Machine Translation
15.5.2.1 Corpus
15.5.2.2 Corpus Used in Machine Translation Systems
15.5.2.3 Challenges and Opportunities
15.6 Constraints to Competitive Language Technology Research for Nepali and Nepal’s Languages
15.6.1 Lack of Consolidation of Works
15.6.2 Lack of Funding Support From the Government for Language Technologies
15.6.3 Lack of Primary Datasets/Corpora on Language Technologies
15.6.4 Lack of Natural Language Processing Tools
15.7 Roadmap for the Future
15.7.1 Short Term
15.7.2 Medium Term and Long Term
15.8 Conclusion
References
16. Deep Neural Machine Translation (DNMT): Hybrid Deep Learning Architecture-Based English-to-Indian Language Translation
Nivaashini M., Priyanka G. and Aarthi S.
16.1 Introduction
16.2 Literature Survey
16.2.1 Problem Statements
16.2.2 Novel Contribution of the Proposed Model
16.3 Background
16.3.1 Corpora Collection
16.3.2 Language Description
16.3.3 Neuronal Machine Translation (NMT)
16.3.4 Deep Learning Architecture
16.4 Proposed System
16.4.1 Corpora Used
16.4.2 Preprocessing
16.4.3 Feature Extraction Using DBN
16.4.4 Hybrid Deep Neural Machine Translation Model
16.5 Experimental Setup and Results Analysis
16.5.1 Building Baseline Models
16.5.2 Creating Back Translation Models
16.5.3 MT Model Parameters
16.5.4 Evaluation Metrics
16.5.5 Experimental Architecture
16.5.6 Automatic Evaluation
16.5.7 Manual Evaluation
16.5.8 Output Analysis
16.5.9 Comparison with Google Translate
16.6 Conclusion and Future Work
References
17. Multiview Learning-Based Speech Recognition for Low-Resource Languages
Aditya Kumar and Jainath Yadav
17.1 Introduction
17.1.1 Automatic Speech Recognition
17.1.2 Issues Related to Low-Resource Languages in ASR
17.1.3 Multiview Learning
17.2 Approaches of Information Fusion in ASR
17.2.1 Acoustic with Lip Visual Fusion Approach for LRLs
17.2.2 Cross-Lingual Transfer Learning Approach for LRLs
17.2.3 Acoustic and Linguistic Encoder Fusion for LRLs
17.2.4 LRL Recognition Using a Combination of Different Features
17.2.5 Joint Decoding of Tandem and Combined Systems Approach for Low-Resource Setting
17.2.6 High-Resource Pre-Training Approach to Improve Low-Resource ASR
17.2.7 Multilingual Representations and Transfer Learning Approach for Low-Resource ASR
17.2.8 Multimodal Fusion Approach for LRL
17.2.9 Multimodal Fusion Approach for Machine Translation in Low-Resource Settings
17.2.10 Multilingual Acoustic Model Fusion Approach for Low-Resource Setting
17.3 Partition-Based Multiview Learning
17.3.1 Partitioning Methods for Audio Data
17.3.2 Ensemble Process of the Partitioned Audio Data
17.3.3 Feature Set Partition-Based Multiview Learning
17.4 Data Augmentation Techniques
17.5 Conclusion
References
18. Automatic Speech Recognition Based on Improved Deep Learning
Kingston Pal Thamburaj and Kartheges Ponniah
18.1 Introduction
18.2 Literature Review
18.3 Proposed Methodology
18.3.1 Preprocessing Using LMS Adaptive Filter
18.3.2 Feature Extraction
18.3.3 Classification Using Improved Deep Learning (IDL)
18.3.3.1 Recurrent Neural Network (RNN)
18.3.3.2 Modified Social Spider Optimization Algorithm (MSSOA)
18.4 Results and Discussion
18.5 Conclusion
References
19. Comprehensive Analysis of State-of-the-Art Approaches for Speaker Diarization
Trisiladevi C. Nagavi, Samanvitha S., Shreya Sudhanva, Sukirth Shivakumar and Vibha Hullur
19.1 Introduction
19.2 Generic Model of Speaker Diarization System
19.2.1 Preprocessing
19.2.2 Feature Extraction
19.2.3 Segmentation
19.2.4 Embedding
19.2.5 Clustering
19.3 Review of Existing Speaker Diarization Techniques
19.3.1 Datasets
19.3.2 Review of Existing Speaker Diarization Systems
19.4 Challenges
19.4.1 Dynamic Speaker Diarization
19.4.2 Speaker Overlap
19.4.3 Domain Mismatch
19.4.4 When Speakers are Partially Known
19.4.5 Audiovisual Modeling
19.5 Applications
19.5.1 Law Enforcement
19.5.1.1 Judicial System
19.5.1.2 Criminal Investigation
19.5.2 Education
19.5.3 Business
19.5.3.1 Marketing
19.5.3.2 Finance
19.6 Conclusion
References
20. Spoken Language Translation in Low-Resource Language
S. Shoba, Sasithradevi A. and S. Deepa
20.1 Introduction
20.2 Related Work
20.2.1 Challenges
20.2.2 Overview of End-to-End Speech Translation System
20.3 MT Algorithms
20.3.1 Deep Learning MT System
20.3.2 GAN Network MT System
20.3.3 word2vec Predictive Deep Learning Model
20.3.4 Universal MT System
20.4 Dataset Collection
20.4.1 Parallel Data
20.4.2 Monolingual Data
20.4.3 Bilingual Data
20.4.4 Real Data
20.5 Conclusion
References
Index
