Meta's newest dataset will train speech recognition engines on 'clusters' of speakers

It's shown to improve ASR performance by 10 percent, especially for Boomers.


It is 2023 and, sorry, Siri somehow still didn’t catch that. Despite the tsunami of advancements generative AI systems have enjoyed in recent months, the synthetic assistants on our mobile devices remain nearly as hard of hearing as they were in 2011. A newly developed dataset from Meta AI, however, promises to improve the performance of such automatic speech recognition (ASR) tools by clustering speech at the “utterance level.”

Meta has long sought to improve its ASRs’ performance, teaching them to train without the aid of transcripts, recognize more than 4,000 spoken languages and even read lips at a higher proficiency than human experts. However, many of the datasets used to train ASR models are organized by demographic — age group, gender, nationality, English accent — an arrangement that limits the variety of pronunciations a model is exposed to, ultimately hindering its ability to understand a broad cross section of users.

To get around this, Meta AI has developed a dataset that instead relies on an utterance clustering method. “Instead of dividing a dataset based on speakers’ demographic information … our proposed algorithm clusters speech at the utterance level,” the Meta AI team explained in Wednesday’s blog post. “A single cluster will contain similar utterances from a diverse group of speakers. We can then train our model using the various clusters and use fairness datasets to measure how the model impacts outcomes across different demographic groups.”
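Meta hasn't published implementation details in the post, but the idea of utterance-level clustering can be sketched in miniature: represent each utterance as a fixed-size embedding vector (the toy 2-D points below are a stand-in for real acoustic embeddings), then group them by similarity with an off-the-shelf method such as k-means. Everything here — the embeddings, the cluster count, the plain k-means loop — is an illustrative assumption, not Meta's actual algorithm.

```python
import numpy as np

def kmeans(X, k, init_idx, iters=25):
    """Plain k-means over utterance embeddings; returns one cluster label per row."""
    centers = X[np.asarray(init_idx)].copy()
    for _ in range(iters):
        # distance from every utterance to every center, then nearest-center assignment
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():  # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy stand-in for utterance embeddings: 60 points in three tight blobs.
rng = np.random.default_rng(1)
X = np.concatenate([
    rng.normal(loc=(0, 0), scale=0.1, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.1, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.1, size=(20, 2)),
])
labels = kmeans(X, k=3, init_idx=[0, 20, 40])

# Each cluster groups acoustically similar utterances, whoever spoke them,
# so a training batch drawn from one cluster mixes speakers across demographics.
clusters = {j: np.where(labels == j)[0] for j in range(3)}
```

The key property mirrors the quote above: clusters are formed from how utterances sound, not who said them, so each cluster can contain speakers from many demographic groups.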

Meta’s resulting dataset includes just over 27,000 command utterances collected from 595 paid US volunteers. The utterances revolve around seven main themes — music, capture, utilities, notification control, messaging, calling and dictation — that other researchers can use to train their own models and digital assistants. Prompts included asking speakers how they’d voice-search for a song or make plans with friends, such as deciding where to meet up.

To evaluate this new system, Meta first trained a model on publicly available, English-language Facebook videos. Researchers then evaluated that model using two other datasets: Casual Conversations v1, which Meta released in 2021, and a “de-identified dataset collected from a data supplier for ASR,” which includes 48,000 spoken utterances from 867 individuals.

The initial results proved promising, with model performance improvements “on all demographic groups in our evaluation datasets, though by far the largest gains are with respect to more inclusivity of accents,” per the blog. Overall, ASR performance increased by 10 percent using the clustering method, with large gains coming from the age 66-85 crowd as well, a traditionally underrepresented demographic in the voice command space.
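The blog's "10 percent" figure is best read as a relative gain: ASR quality is conventionally measured by word error rate (WER), and improvements are reported as the percentage drop in that rate, not a 10-point change. The sketch below shows the standard WER computation and the relative-gain arithmetic; the baseline and improved error rates are made-up illustrative numbers, not Meta's.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits needed to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[len(r)][len(h)] / len(r)

# "workout" heard as "work out": one substitution plus one insertion
# against a four-word reference.
example_wer = wer("play my workout playlist", "play my work out playlist")

# Illustrative numbers only: a 10 percent *relative* gain means the new
# error rate is 10 percent lower than the baseline's.
baseline_wer, clustered_wer = 0.20, 0.18
relative_gain = (baseline_wer - clustered_wer) / baseline_wer
```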

“Our proposed algorithm is part of Meta’s long-term focus on responsible AI and just one part of our holistic approach to address fairness issues,” the researchers wrote. Looking ahead, the team is exploring adapting the system to other languages.