NSF: Estimating Articulatory Constriction Place and Timing from Speech Acoustics

Faculty

Carol Espy-Wilson, Suzanne Boyce (U Cincinnati), Mark Tiede (Haskins Laboratories)

Funding Agency

National Science Foundation

Year

2022

Descriptions

Professor Carol Espy-Wilson (ECE/ISR) is the principal investigator for a new three-year, $500K National Science Foundation collaborative research award, “Estimating Articulatory Constriction Place and Timing from Speech Acoustics.” Espy-Wilson will be working with Suzanne Boyce of the University of Cincinnati and Mark Tiede of Haskins Laboratories, Inc.

This project focuses on a new approach for using speech recordings to study speaker pronunciation habits—that is, the way speakers systematically coordinate the articulatory movements of their lips, jaw, tongue, glottis and soft palate to produce words and sentences. These articulatory habits differ between individuals, and across languages and dialects of the same language, accounting for many aspects of foreign accents, speech disorders and speaking styles.

Previous studies of these habits have required specialized equipment to observe articulator movements. The aim of this project is to further develop and improve the researchers’ existing “speech inversion” tool so that it can accurately read acoustic recordings of speech and, using machine learning, recover details of the magnitude and timing of articulatory movements directly from the speech signal.

To date, the researchers' tool has successfully recovered movements of the tongue and lips. The current project extends the tool’s functionality to encompass nasality (soft palate) and voicing (glottis).

Specialized neural network models will be trained to relate features of the acoustic signal to separately acquired ground-truth nasal vs. oral outflow signals and concurrent electroglottography. The team will train the neural network models using a new collection of acoustic and articulatory data drawn from speakers of American English. The data include co-collected audio, nasal, voicing, and articulatory movement recordings, and will serve as “ground truth” for training and assessing the capabilities of the fully trained speech inversion system. The tool will be validated and tested using data from speakers of languages with patterns of articulatory habits known to differ from English, including Canadian French and Russian.

When successfully validated, the speech inversion tool will be useful for identifying medical issues that affect speech movement organization, such as dysarthria, a disruption of oral/laryngeal timing associated with brain damage. In addition, incorporating estimates of articulation may aid in tracking changes resulting from medical conditions such as depression and schizophrenia.

More generally, the ability to rapidly and easily analyze articulatory movements obtained from audio recordings alone has the potential to substantially improve Automated Speech Recognition (ASR) systems; to assist scholars, forensic scientists, and clinical professionals studying the speech of communities in rural or under-resourced areas; and to help document endangered languages.
