Steve Claridge: Can automated captioning ever work?

Posted on November 9, 2012

Way back in 2009 I was ludicrously excited about Google’s effort to provide automatic captioning on YouTube videos. I thought this was going to usher in a new era of across-the-board captioning for all web videos. If Google could produce quality captions automatically, and was then prepared to offer that service to all publishers of online video, we would have the pick of any video on the Web.

Didn’t happen.

Here we are, almost three years later, still with poor captions on YouTube. Take this Obama video as an example. I can’t really follow it using the captions: there are too many missing words, and a fair share of wrong ones too. In 2012, captions still have to be created by humans.

I don’t want to get too down on Google, because I know how hard speech-to-text is. We’ve had good speech-to-text systems for a number of years that use a training phase: you talk to the software and it learns your voice. Once it has learned what you sound like, the computer does a very good job of converting your speech to text. But the Google/YouTube task is much harder. 72 hours of video are uploaded to YouTube every minute! There’s no way there’s enough time to teach a computer what all the people in all those videos sound like.

The Obama video should actually be one of the simpler videos to transcribe because, for most of it, there is only one person talking at a time. Things get even harder when several people are talking at once.

Could automated captioning ever work?

To be fair to Google, what they have right now is hugely impressive from a technical point of view: being able to interpret any speech without training on specific voices is a big step. But it’s clearly not enough, because it’s not making it easy for us to read along.

I don’t believe that truly automated captioning of sufficient quality will happen any time soon. The only way I can see this working is to train Google’s computers to understand speech on a grand scale, and to do that we would need a lot of volunteers to manually caption a lot of videos. If we took a sample of, say, one million videos and captioned them all, we would have a vast amount of training data for Google to use to automate captioning on other videos. You’d have to train on all the different languages, different accents, slang words, people shouting, people whispering, female voices, male voices – the list goes on and on.

Could that even work? I don’t know. Google has indexed the text of the Web, and it has also indexed a vast amount of images and videos – can it index sounds too?

Steve Claridge has been wearing hearing aids for over 30 years. What started off as a minor hearing loss at the age of five is now a severe one, but his hearing aids help a lot. He blogs about all this at

The Limping Chicken is supported by Deaf media company Remark!, provider of sign language services Deaf Umbrella, and the RAD Deaf Law Centre.

The Limping Chicken is the world's most popular Deaf blog, and is edited by Deaf journalist and filmmaker Charlie Swinbourne. 


Posted in: steve claridge