Way back in 2009 I was ludicrously excited about Google’s effort to provide automatic captioning on YouTube videos. I thought this was going to usher in a new era of across-the-board captioning for all web videos. If Google could produce quality captions automatically, and were then prepared to offer that service to all publishers of online video, we would have the pick of any video on the Web.
Didn’t happen.
Here we are, almost three years later, still with poor captions on YouTube. Take this Obama video as an example. I can’t really follow it using the subtitles; there are too many missing words, and a fair share of wrong ones too. In 2012 captions still have to be created by humans.
I don’t want to get too down on Google, because I know how hard speech to text is. We’ve had good speech to text systems for a number of years that use a training phase: you talk to the software and it learns your voice. After learning what you sound like, a computer does a very good job of converting your text to speech, but the Google/YouTube task is much harder. 72 hours of video are uploaded to YouTube every minute! There’s no way there’s enough time to teach a computer what all the people in all those videos sound like.
The Obama video should actually be one of the simpler videos to transcribe because, for most of it, there is only one person talking at a time. Things get even harder when multiple people are talking together.
Could automated captioning ever work?
To be fair to Google, what they have right now is hugely impressive from a technical point of view: being able to interpret any speech without training on specific voices is a big step. But it’s clearly not enough; it’s not making it easy for us to read along.
I don’t believe that truly automated captioning of sufficient quality will happen any time soon. The only way I can see this working is to train Google’s computers to understand speech on a grand scale, and to do that we would need a lot of volunteers to manually caption a lot of videos. If we took a sample of, say, one million videos and captioned them all, we would have a vast amount of training data for Google to use to automate captioning on other videos. You’d have to train on all the different languages, different accents, slang words, people shouting, people whispering, female voices, male voices – the list goes on and on.
Could that even work? I don’t know. Google has indexed the text of the Web, it has also indexed a vast amount of images and videos, can they index sounds too?
Steve Claridge has been wearing hearing aids for over 30 years. What started off as a minor hearing loss at the age of five is now a severe one, but his hearing aids help a lot. He blogs about all this at www.hearingaidknow.com.
The Limping Chicken is supported by Deaf media company Remark!, provider of sign language services Deaf Umbrella, and the RAD Deaf Law Centre.
Andy
November 9, 2012
Accurate speech to text recognition is something of a Holy Grail for computer programmers. It is perfectly possible to render text into speech because we have the mechanics of synthesising speech well worked out. Common speech can be divided into 14 parts if I remember rightly. Complete speech is simply a joining up of all those parts to form sounds which are then interpreted as words.
This is possible on your own computer and when I was studying at University we had a “talking head” avatar called Baldi who could be programmed to reproduce very lifelike speech. He also had lifelike mouth movements and I have often wondered about his potential as a mechanised lipreading tutor. Baldi is a free download from the Combined Speech and Language Unit.
Research: http://reference.kfupm.edu.sa/content/t/o/tools_for_research_and_education_in_spee_104604.pdf
Note that I am discussing making computers talk. The problem is that doing it the other way round, making computers understand speech, is much, much harder. If you had two computers running Baldi they would be able to talk to and “understand” each other, because both have mechanised speech. Speech recognition systems are good at understanding robot speech but quite poor at recognising human speech, because human speech isn’t consistent. This is the HUGE obstacle to making speech recognition work. Crack this one and fame and fortune are yours.
As you have said, you can use things like Dragon Dictate quite successfully. I am part of a deaf club that experimented with Dragon Dictate last year. They had this idea of “subtitling” everything that was said at their meetings and projecting it on a screen. But it was very difficult to make it work, and although one member spent hours training it on her voice it still did not work too well: lots of errors. And that is Dragon Dictate, state-of-the-art software.
Will it ever be possible? Well the problem is that people don’t speak consistently. People have accents, speech impediments and a tendency to all shout at once. Current technology can’t deal with all that. The dreadful subtitles we get on live BBC are the result of the best voice recognition software possible and yet they are still a problem.
Given enough computing power it might be possible to find some way of accommodating these quirks but at the moment it can’t be done.
Robert Mandara
November 9, 2012
Thought-provoking article! There is one mistake where you have said “text to speech” instead of “speech to text”. But your mistake gave me a horrifying idea!
If speech to text conversion gets close to perfect (I think it will eventually), how long will it be before hearing aids are designed to work on a speech to speech basis? The hearing aid would work out what was being said (in any accent) and then repeat it in a clearer voice that the hearing aid wearer could easily understand. The probable time delay could be a massive obstacle to that.
More likely we will see speech to text apps on our phones first. That could be really cool and useful.
barakta
November 9, 2012
I’m one of those for whom Dragon Dictate does not work, despite my apparently “normal” speech. I find people don’t believe me when I say my partner can talk to a Dragon trained on my voice and get better results from it than I can. My speech is probably not very consistent, even if it sounds good; people just don’t notice.
I did my degree in information management and took modules on speech to text. It was ~96% accurate in 2004, and the field has kind of stalled because getting more accuracy than that is hard. Yet 95%-ish isn’t enough: it’ll mishear “it” and “its”, or “tree” and “trees”, and change the sense and context of everything. I sadly have to explain all this, a lot, to people who think they can record speakers and Dragon will magically transcribe it.
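That point about one wrong word changing everything can be put in numbers with a word error rate (WER) calculation. Here is a rough sketch using the standard edit-distance definition (my own illustration, not code from Dragon or any other product):

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic Levenshtein dynamic-programming table, over words not characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A single substituted word scores very well numerically
# but can still flip the meaning of the sentence:
ref = "cut down the old tree before it falls on the house"
hyp = "cut down the old trees before it falls on the house"
print(round(word_error_rate(ref, hyp), 2))  # prints 0.09
```

One error in eleven words is over 90% word accuracy, which is why a “96% accurate” transcript can still be unusable as captions.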
I wonder if, for famous people like Obama, it would be possible for someone to train on specific videos of his, tag them “Barack Obama”, and have Google’s YouTube extrapolate training data from them. If enough people could provide corrections, those could go into the correction collection. People could also train their own voice and see if captions were more accurate thereafter.
I find YouTube’s captions offensively annoying as SO many people tell me “Oh there are captions” not realising or believing that they are SO bad they are unusable. It’s that faux accessibility where when mean little deafies say “No, this is not accurate enough” people throw a flounce cos they believed it was.
Fil McIntyre (@BRITE_Fil)
November 9, 2012
I agree that YouTube’s auto captions are terrible. However there is an underused option which enables straightforward, accurate captioning.
You upload a transcript of the speech (so yes, unfortunately, someone still does need to type it out!). There is no need to add any timings.
YouTube will use its speech recognition to match the text to the audio. This means you get accurate captions without the need to manually time them to the video.
Further details in this blog post: http://briteblog.wordpress.com/2011/08/29/semi-automatic-captioning-using-youtube/
Editor
November 9, 2012
That’s very interesting, thank you! As well as editing this site, I’m also a filmmaker and I may very well use this advice on my next film! Best, Charlie
whatdoesthisdo
November 9, 2012
If you ever need a video subtitled, check out Universal Subtitles: http://www.universalsubtitles.org/en/teams/captions-requested/ Submit a video directly to this team or send your request to the Deaf HoH email list: http://groups.google.com/group/universal-subtitles-deaf-hoh and a human like me will do our best to get it subtitled 🙂
Andy M
November 12, 2012
Google do have access to training data for captioning. Most US television programmes have Closed Captioning associated with them. Many of these were stored in the (now defunct?) Google Video site – can’t see why they can’t use this data to train speech recognition.
Josh
April 17, 2014
It’s not ready yet, Google!! Put that shit back in the oven. It’s an unreasonable waste of people’s time, disabling them, especially when they never requested that the captions appear in the first place.
smart1
May 28, 2014
It would be nice if YouTube allowed an ‘open captions’ option alongside auto-captions. While you can upload a subtitle or transcript to your own video, you can’t do the same for videos that aren’t yours. If it was a YouTube-community-based effort, a person could make a subtitle for the entire video, or pick up where someone left off, or 100 people could edit it… kind of like how Wikipedia works.
luis@hollywoodtools.com
July 26, 2014
It will get better, but with so many variables, dialects, background noise etc., I can’t see it ever being perfect. Our approach is to offer the service from audio/video files while providing time stamps. Additionally, we give the user a simple editor to edit the text and timings. Lastly, we give the user the option of saving a text, SRT, TTML, or WebVTT file. http://www.uSubtitle.tv
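For readers unfamiliar with the formats mentioned above: SRT is the simplest of them, just numbered cues with start/end timestamps and the caption text. A quick sketch of producing one (my own illustration of the format, not uSubtitle’s actual code):

```python
def to_srt(cues):
    """Render (start_seconds, end_seconds, text) cues as an SRT string."""
    def ts(seconds):
        # SRT timestamps look like 00:01:02,500 (comma before the milliseconds)
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, text) in enumerate(cues, start=1):
        # Each cue: sequence number, time range, text, separated by blank lines.
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"

print(to_srt([(0.0, 2.5, "Hello, and welcome."),
              (2.5, 5.0, "Today we talk about captions.")]))
```

WebVTT is very similar but uses a full stop instead of a comma in the timestamps; TTML is an XML format and considerably more involved.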
RogerVoice (@iwantroger)
August 23, 2014
Hi. I’m a deaf American and I live in France. Sorry for entering the conversation so late, but stumbled across this post as I was looking for info on what people think of automatic captioning.
As an aside, and in response to smart1, you should check out http://www.amara.org they do an amazing job with crowdsourced captioning for online videos. And their online video editor tool is a breeze to use (I’ve used it myself).
Automatic captions will probably never match the accuracy that comes from human understanding of speech. But I’ve been pretty amazed at what they can do. In a controlled environment, automatic captioning works. Not for randomly subtitling anything, no, that doesn’t work too well. But automatic captioning can work fine for specific, well-enunciated conversations. And quite frankly, there are times when I’d rather have that than nothing. So I’m running a program where we use a smartphone app that subtitles your calls.
http://www.rogervoice.com
It would help in those countries where people don’t have the means to get transcribed phone calls. I would like as many people as possible to get in on the beta version when we release it, to test it and give feedback. The more of you who get in on it, the better we’ll get. So have a go and join!