Steve Claridge: Can automated captioning ever work?

Posted on November 9, 2012

Way back in 2009 I was ludicrously excited about Google’s effort to provide automatic captioning on YouTube videos. I thought this was going to usher in a new era of across-the-board captioning for all web videos. If Google could produce quality captions automatically, and was then prepared to offer that service to all publishers of online video, we would have the pick of any video on the Web.

Didn’t happen.

Here we are, almost three years later, still with poor captions on YouTube. Take this Obama video as an example. I can’t really follow it using the captions: there are too many missing words, and a fair share of wrong ones too. In 2012, captions still have to be created by humans.

I don’t want to get too down on Google, because I know how hard speech-to-text is. We’ve had good speech-to-text systems for a number of years that use a training phase: you talk to the software and it learns your voice. Once it has learned what you sound like, the computer does a very good job of converting your speech to text. But the Google/YouTube task is much harder. 72 hours of video are uploaded to YouTube every minute! There’s no way there’s enough time to teach a computer what all the people in all those videos sound like.

The Obama video should actually be one of the simpler videos to transcribe because, for most of it, there is only one person talking at a time. Things get even harder when several people are talking at once.

Could automated captioning ever work?

To be fair to Google, what they have right now is hugely impressive from a technical point of view: being able to interpret any speech without training on specific voices is a big step. But it’s clearly not enough, because it’s not making it easy for us to read along.

I don’t believe that truly automated captioning of sufficient quality will happen any time soon. The only way I can see this working is to train Google’s computers to understand speech on a grand scale, and to do that we would need a lot of volunteers to manually caption a lot of videos. If we took a sample of, say, one million videos and captioned them all, we would have a vast amount of training data for Google to use to automate captioning on other videos. You’d have to train on all the different languages, different accents, slang words, people shouting, people whispering, female voices, male voices – the list goes on and on.

Could that even work? I don’t know. Google has indexed the text of the Web, and it has also indexed a vast amount of images and videos – can it index sounds too?

Steve Claridge has been wearing hearing aids for over 30 years. What started off as a minor hearing loss at the age of five is now a severe one, but his hearing aids help a lot. He blogs about all this at

The Limping Chicken is supported by Deaf media company Remark!, provider of sign language services Deaf Umbrella, and the RAD Deaf Law Centre.

The Limping Chicken is the world's most popular Deaf blog, and is edited by Deaf journalist and filmmaker Charlie Swinbourne. 


Posted in: steve claridge