
Image credit: America's Got Talent

YouTube automates sound effect captions with AI

Its AI can detect laughter, applause and music for the deaf or hard of hearing.
Steve Dent, @stevetdent
03.24.17 in AV


YouTube has used algorithms to automatically caption speech for eight years now in an effort to make its billions of videos more accessible for the deaf and hard of hearing. While the feature was pretty rough at first, the company has significantly improved it over time, getting "closer and closer to human transcription error rates," Google said on its developers blog. Since speech is just one part of the audio picture, though, YouTube has launched automatic sound effect captioning for the first time.

For now, the system can show just three classes of sounds: applause, music and laughter. "These were among the most frequent manually captioned sounds, and they can add meaningful context for viewers who are deaf and hard of hearing," the company wrote.

As with the automatic captions, Google uses machine learning to pick out sounds and display them as text. It developed a "deep neural network (DNN)" model for ambient sound, and trained it with "thousands of hours of videos" to get the best results. The toughest part, it wrote in a technical blog, was separating and displaying events that tend to occur at the same time, like laughter and applause.
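
To make the overlap problem concrete, here is a minimal sketch of one way concurrent sounds could be pulled apart. It is not Google's method. It assumes a hypothetical model that outputs a per-frame confidence score for each sound class, and it thresholds each class independently, so laughter and applause can be active over the same frames. The function name, the threshold and the minimum duration are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# (label, start_frame, end_frame_exclusive)
Event = Tuple[str, int, int]


def detect_events(
    scores: Dict[str, List[float]],
    threshold: float = 0.5,   # assumed cutoff, not a published value
    min_frames: int = 2,      # assumed minimum run length to count as an event
) -> List[Event]:
    """Turn per-class, per-frame scores into labeled intervals.

    Each class is scanned on its own, so events from different
    classes are allowed to overlap in time.
    """
    events: List[Event] = []
    for label, frames in scores.items():
        start = None
        # A trailing -inf sentinel closes any run that reaches the last frame.
        for i, p in enumerate(list(frames) + [float("-inf")]):
            if p >= threshold and start is None:
                start = i
            elif p < threshold and start is not None:
                if i - start >= min_frames:
                    events.append((label, start, i))
                start = None
    # Order by start frame, then label, for stable display.
    return sorted(events, key=lambda e: (e[1], e[0]))
```

Given scores where laughter is high on frames 1 to 3 and applause is high on frames 2 to 4, the sketch reports both events, with the two intervals overlapping on frames 2 and 3.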

You can see what that looks like in the clip from America's Got Talent below. The sound effects are merged with the automatic speech recognition and "shown as part of the standard automatic captions," much as you'd see in a closed-captioned TV show.
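
As a rough illustration of that merging step, the sketch below combines timed speech cues and sound effect cues into a single caption track. The cue layout, the bracketed uppercase labels and the function name are assumptions made for the example, not details YouTube has published.

```python
from typing import List, Tuple

# (start_seconds, end_seconds, text)
Cue = Tuple[float, float, str]


def merge_captions(speech: List[Cue], sounds: List[Cue]) -> List[Cue]:
    """Merge speech cues and sound effect cues into one time-ordered track.

    Sound labels are wrapped in brackets and uppercased, a common
    closed-captioning convention assumed here for illustration.
    """
    bracketed = [(start, end, f"[{label.upper()}]") for start, end, label in sounds]
    # Sort by start time, then end time, so the track reads chronologically.
    return sorted(speech + bracketed, key=lambda cue: (cue[0], cue[1]))
```

With one speech cue running from 0.0 to 2.0 seconds and an applause cue from 1.5 to 4.0 seconds, the merged track lists the speech first and then "[APPLAUSE]".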

YouTube's team said it's aware that the captions are "simplistic," but adding features will be easier now that it has built a solid back-end foundation. In the future, it'll introduce common sounds like barking, knocking or ringing. That will pose new challenges, as the AI will need to figure out if a ringing sound is coming from an alarm, phone or doorbell, for example.

It'll be worth the effort, though, as Google says that two-thirds of participants in a study found that sound effect captions enhance the video experience. And while the system is bound to make mistakes no matter how good it gets (even humans are only about 95 percent accurate), users think that the odd error won't detract from the benefits.
