ASR for Documenting Endangered Languages

Publications

Zoey Liu, Crystal Richardson, Richard Hatcher, and Emily Prud'hommeaux. 2022. Not always about you: Prioritizing community needs when developing endangered language technology. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 3933-3944.

Ronit Damania†, Christopher Homan, and Emily Prud'hommeaux. 2022. Combining Simple but Novel Data Augmentation Methods for Improving Conformer ASR. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 4890-4894.

Zoey Liu*, Justin Spence, and Emily Prud'hommeaux. 2022. Enhancing Documentation of Hupa with Automatic Speech Recognition. In Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-5), 187-192.

Zoey Liu* and Emily Prud'hommeaux. 2022. Data-driven model generalizability in crosslinguistic low-resource morphological segmentation. Transactions of the Association for Computational Linguistics (TACL) 10:393-413.

Emily Prud'hommeaux, Robbie Jimerson, Richard Hatcher, and Karin Michelson. 2021. Automatic speech recognition for supporting endangered language documentation. Language Documentation and Conservation, 15:491-513.

Ethan Morris, Robbie Jimerson, and Emily Prud'hommeaux. 2021. One size does not fit all in resource-constrained ASR. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 4354-4358.

Zoey Liu, Robbie Jimerson, and Emily Prud'hommeaux. 2021. Morphological Segmentation for Seneca. In Proceedings of the First Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), 90-101.

Bao Thai, Robbie Jimerson, Raymond Ptucha, and Emily Prud'hommeaux. 2020. Fully Convolutional ASR for Less-Resourced Endangered Languages. In Proceedings of the Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU-CCURL), 126-130.

Bao Thai, Robbie Jimerson, Dominic Arcoraci, Emily Prud'hommeaux, and Raymond Ptucha. 2019. Synthetic data augmentation for improving low-resource ASR. In Proceedings of the IEEE Western New York Image and Signal Processing Workshop, 1-9. Best paper award, runner-up.

Robbie Jimerson, Kruthika Simha, Raymond Ptucha, and Emily Prud'hommeaux. 2018. Improving ASR output for endangered language documentation. In Proceedings of the 6th Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU), 182-186.

Robbie Jimerson and Emily Prud'hommeaux. 2018. ASR for documenting acutely under-resourced indigenous languages. In Proceedings of the 2018 Language Resources and Evaluation Conference (LREC), 4161-4166.

Refereed Conference Presentations

Ethan Morris, Robbie Jimerson, and Emily Prud'hommeaux. 2021. Investigating the utility of custom ASR architectures for existing African language corpora. The Second Workshop on African Natural Language Processing (AfricaNLP), virtual.

Richard Hatcher, Robbie Jimerson, Emily Prud'hommeaux, and Karin Michelson. 2021. Corpus Phonetic Investigation into Seneca Accentuation. International Conference on Language Documentation and Conservation (ICLDC-7), virtual.

Robbie Jimerson. 2020. Automatic speech recognition with subword language models for an under-resourced indigenous language. NeurIPS Indigenous in Machine Learning Workshop, virtual.

Richard Hatcher and Robbie Jimerson. 2020. 19th Century Seneca in the works of Asher Wright. Winter Meeting of the Society for the Study of the Indigenous Languages of the Americas (SSILA), New Orleans, Louisiana.

Emily Prud'hommeaux, Robert Jimerson, Richard Hatcher, Raymond Ptucha, and Karin Michelson. 2019. On the promise and pitfalls of repurposing existing language technologies for endangered language documentation. Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, UNESCO, Paris.

Robbie Jimerson, Richard Hatcher, Raymond Ptucha, and Emily Prud'hommeaux. 2019. Speech technology for supporting community-based endangered language documentation. International Conference on Language Documentation & Conservation (ICLDC), Honolulu.

Richard Hatcher, Robbie Jimerson, Whitney Nephew, Mike Jones, Julia Cordani, Linnea Cremean, and Emily Prud'hommeaux. 2019. Additional ways of integrating community-based language documentation and language revitalization. International Conference on Language Documentation & Conservation (ICLDC), Honolulu.

Invited Talks

Robbie Jimerson. 2022. ACL Keynote Panelist: "How Can We Support Linguistic Diversity?" 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Dublin, Ireland.

Emily Prud'hommeaux. 2022. Integrating speech technology in the endangered language documentation pipeline. CLingDing Colloquium, Indiana University, virtual.

Emily Prud'hommeaux. 2022. Integrating machine learning in the language documentation pipeline. Seminaire du LLACAN, Centre National de la Recherche Scientifique (CNRS), Paris.

Robbie Jimerson and Emily Prud'hommeaux. 2021. Special Interest Group on Endangered Languages (SIGEL) Speaker Series. Panel on Automatic Speech Recognition in Native American Languages, virtual.

Robbie Jimerson. 2021. Automatic Speech Recognition for the Seneca Language. Symposium on American Indian Languages (SAIL), virtual.

Emily Prud'hommeaux. 2019. ASR-assisted endangered language documentation: Possibilities and challenges. Workshop on Language Technology for Language Documentation and Revitalization, Carnegie Mellon University, Pittsburgh.

Emily Prud'hommeaux. 2018. Strengthening and connecting communities with automatic speech recognition. IEEE Western NY Image and Signal Processing Workshop, Rochester, NY.

Theses and Dissertations

Richard Hatcher. 2022. Doctoral dissertation, Linguistics, University at Buffalo.

Vigneshwar Lakshminarayanan. 2022. Impact of Noise in Automatic Speech Recognition for Low-Resourced Languages. Master's thesis, Computer Engineering, Rochester Institute of Technology.

Ronit Damania. 2021. Data augmentation for automatic speech recognition for low resource languages. Master's thesis, Computer Science, Rochester Institute of Technology.

Ethan Morris. 2021. Automatic speech recognition for low-resource and morphologically complex languages. Master's thesis, Computer Engineering, Rochester Institute of Technology.

Bao Thai. 2019. Deepfake detection and low-resource language speech recognition using deep learning. Master's thesis, Computer Engineering, Rochester Institute of Technology.

Data

To request access to the audio clips and transcriptions used in Prud'hommeaux et al. 2021, please email Dr. Emily Prud'hommeaux (prudhome@bc.edu) stating your request and including documentation that you have completed your institution's human subjects research training.

A subset of the data used to train our ASR and morphological segmentation models will be archived upon completion of the project in the Native American Language Collection at the Sam Noble Museum at the University of Oklahoma. Instructions on accessing this data will be provided here after archiving is complete.