YouTube has used algorithms to automatically caption speech for eight years now in an effort to make its billions of videos more accessible for the deaf and hard of hearing. While the feature was pretty rough at first, YouTube has significantly improved it over time, getting "closer and closer to human transcription error rates," Google said in its developers blog. Since speech is just one part of the audio picture, though, YouTube has launched automatic sound effect captioning for the first time.
For now, the system can show only three classes of sounds: applause, music and laughter. "These were among the most frequent manually captioned sounds, and they can add meaningful context for viewers who are deaf and hard of hearing," the company wrote.
As with the automatic captions, Google uses machine learning to pick out sounds and display them as text. It developed a "deep neural network (DNN)" model for ambient sound, and trained it with "thousands of hours of videos" to get the best results. The toughest part, it wrote in a technical blog, was separating and displaying events that tend to occur at the same time, like laughter and applause.
You can see what that looks like in the clip from America's Got Talent below. The sound effects are merged with the automatic speech recognition and "shown as part of the standard automatic captions," much as you'd see in a closed-captioned TV show.
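To make the merging step concrete, here is a minimal Python sketch of how detected sound events might be interleaved with speech captions on a single timeline. It is purely illustrative and not YouTube's implementation: the `Cue` type, the `merge_captions` function, the bracketed label format and the sample timings are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One caption entry (hypothetical type, not a YouTube API)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def merge_captions(speech, sounds):
    """Interleave speech cues and sound-effect cues, ordered by start time.

    Sound labels are wrapped in brackets and upper-cased, an assumed
    convention borrowed from TV closed captions.
    """
    labeled = [Cue(s.start, s.end, f"[{s.text.upper()}]") for s in sounds]
    return sorted(speech + labeled, key=lambda c: (c.start, c.end))


# Made-up sample data: applause and laughter overlap in time.
speech = [
    Cue(0.0, 2.5, "Thank you so much."),
    Cue(6.0, 8.0, "That was incredible."),
]
sounds = [
    Cue(2.5, 5.5, "applause"),
    Cue(3.0, 4.0, "laughter"),
]

for cue in merge_captions(speech, sounds):
    print(f"{cue.start:5.1f} -> {cue.end:5.1f}  {cue.text}")
```

The sketch simply keeps both overlapping sound cues; deciding how to display simultaneous events like these is the hard part Google describes, and it is not modeled here.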
YouTube's team said it's aware that the captions are "simplistic," but adding features will be easier now that it has built a solid back-end foundation. In the future, it'll introduce captions for common sounds like barking, knocking or ringing. That will pose new challenges, as the AI will need to figure out if a ringing sound is coming from an alarm, phone or doorbell, for example.
It'll be worth the effort, though, as Google says that two-thirds of participants in a study found that sound effect captions enhance the video experience. And while it's bound to make mistakes no matter how good it gets (even humans are only about 95 percent accurate), users think that the odd error won't detract from the benefits.