Deep Speech 2: End-To-End Speech Recognition in English and Mandarin (arxiv.org)
105 points by sherjilozair on Dec 10, 2015 | hide | past | favorite | 19 comments


During the GPU Technology Conference (GTC) 2015, Andrew Ng showed a live demo of Deep Speech (1?) [0] (demo starts ~41 minute mark). There are other videos showing Deep Speech, but I found this one the most useful/interesting (of the ones I've seen).

[0] http://www.ustream.tv/recorded/60113824


Non-Flash version of the video:

https://www.youtube.com/watch?v=qP9TOX8T-kI


The results are comparable to human transcribers, they note -- which is more a testament to the low quality of Mechanical Turk work than the high quality of this system. Surely a word error rate of 8% (for clean speech) would be unacceptable for a paid transcription service?
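
For reference, WER is just word-level edit distance divided by the number of words in the reference transcript. A minimal sketch (the example sentences below are made up):

    # Word error rate: Levenshtein distance over words, divided by reference length.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits needed to turn the first i reference words
        # into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + sub)  # substitution
        return dp[len(ref)][len(hyp)] / len(ref)

    # One substitution in a ten-word sentence -> 10% WER
    print(wer("the quick brown fox jumps over the lazy sleeping dog",
              "the quick brown fox jumps over a lazy sleeping dog"))  # 0.1

So 8% means roughly one word in twelve is substituted, dropped, or inserted.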


I think you should consider all of the results: Tables 13-15 have comparisons to humans across a wide range of datasets. Human WER varies from 3.5 to 22.2 (on clean speech); some datasets are much harder than others, and different people are probably better with some accents than with others. On top of that, people aren't great spellers, especially when it comes to names and proper nouns. One example off the top of my head: Narendra Modi is in the WSJ dataset. I bet many people would spell that wrong if they only heard it and had never seen it spelled before. Or, even worse, Tchaikovsky.

For the Mandarin system the human performance was obtained from people in our office, not random Turkers. 4% WER for a group of 5 humans vs 3.7% for the system.

Perhaps you're raising the bar: human-level performance no longer means an average or median level, but the top 1% or better. I'm not sure that's fair.


I suffer from auditory dyslexia (not officially diagnosed, but it runs in my family and I have all the symptoms, etc). I have a slightly lower-than-average word recognition rate, especially if I can't see the speaker's lips. Yet I get by, because the context and the words I expect are usually enough.

This makes me ask two questions: #1 - Do systems like this need a court-reporter-level word recognition rate in order to be useful, or can they compensate for mistakes by using context? #2 - Could we improve these systems by also feeding them video of the speaker's lips?

Maybe I should go do a masters to figure out the answers.


Here is a paper that combines audio and video for speech recognition; they find that video helps, especially in noisy environments.

https://www.uni-ulm.de/fileadmin/website_uni_ulm/allgemein/2...
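
As a rough illustration of the fusion idea (not necessarily how that paper does it; the feature shapes here are invented), you can concatenate per-frame audio features with lip-region features and feed the result to whatever acoustic model you already have:

    import numpy as np

    # Hypothetical per-frame features; dimensions are made up for illustration.
    audio_feats = np.random.randn(100, 160)   # 100 frames x 160 spectrogram dims
    video_feats = np.random.randn(100, 64)    # 100 frames x 64 lip-region dims

    # Simple feature-level fusion: concatenate per frame.
    fused = np.concatenate([audio_feats, video_feats], axis=1)   # 100 x 224

    # Stand-in "acoustic model": one random projection to character scores
    # (a real system would be a deep network trained with CTC).
    W = np.random.randn(fused.shape[1], 29)    # 29 ~ alphabet plus CTC blank
    char_scores = fused @ W                    # 100 frames x 29 scores

In noisy audio the visual features carry relatively more of the information, which is presumably why the gains show up there.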


Awesome!

Noisy environments are exactly when seeing the lips is a huge deal for me. I have a friend who has a tendency to absent-mindedly place his hand in front of his mouth. In a quiet office or home, no issue. In a bar? It's as if he's pressed mute, as far as I'm concerned.


I personally think that context will improve these systems.

The current task they are performing is extremely difficult. I think it is analogous to a listener receiving anonymous phone calls from people they have never heard before, speaking in a random accent in a difficult, noisy environment, talking for a few seconds about a random topic that includes proper nouns the listener has never encountered, and then promptly hanging up, after which the listener is asked for an exact transcription with no mistakes.

I don't think it is surprising that the error rates produced by Mechanical Turk workers seem high for some of these tasks, and it actually seems like a big accomplishment that a speech recognition system can do nearly as well under such difficult conditions.

Context about the speaker or the current subject of conversation would clearly help, as would visual cues, and the ability to ask the user to repeat or clarify something.
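
The Deep Speech papers already do a version of this at decode time: candidate transcripts from the acoustic model are rescored with a language model, which is exactly the kind of textual context that pulls "chai cough ski" back to "Tchaikovsky". A toy sketch of that rescoring step (the candidates, scores, and weights below are all invented):

    # Hypothetical beam-search candidates with acoustic log-probabilities.
    candidates = [
        ("the composer tchaikovsky", -12.0),
        ("the composer chai cough ski", -11.5),  # acoustically plausible, textually unlikely
    ]

    def lm_log_prob(text):
        # Stand-in for a real n-gram or neural language model.
        return -4.0 if "tchaikovsky" in text else -15.0

    alpha, beta = 1.5, 0.5   # LM weight and word-insertion bonus, tuned on a dev set

    def score(text, acoustic_logp):
        return acoustic_logp + alpha * lm_log_prob(text) + beta * len(text.split())

    best = max(candidates, key=lambda c: score(c[0], c[1]))
    print(best[0])   # "the composer tchaikovsky" wins once the LM weighs in

Speaker-specific or topic-specific context would essentially be a better-conditioned version of that language model.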


That's a good idea, and it probably would help. You're talking about transfer learning or (probably more relevant) multi-modal machine learning. You don't need a master's; all the papers are online and everyone has email.


Depends on context, audio quality, size of the snippet, etc.

But I guess Mechanical Turk workers are probably not very committed to getting every detail right (for what they're paid, I guess I wouldn't worry too much either).

Sure, you can get a good transcription from a crappy audio recording, but that's going to cost you more.


It can't possibly be that machine transcription is becoming very good, can it?


What do court transcribers get? That's the standard I'd be inclined to use for "very good".


Searches indicate a 2-3% error rate is required for certification.


Will they publish code and learned weights for English and Mandarin?

One great impact of the Deep Dream team contributing tools and libraries to the community was a wealth of applications from a wide variety of people.


If they do, it might make projects like https://jasperproject.github.io/ a lot nicer. It sounds like they were using a top-of-the-line GPU to run the network in their demo, though.


Previous discussion of an article about work by this team:

https://news.ycombinator.com/item?id=10358072


About an hour in, near the end, Ng addresses the stupid digression that the industry recently had about evil AI destroying the world.


An hour into what?


The linked movie up there.



