Given the dynamic environment (background noise, differences in microphone quality) and both inter-speaker and intra-speaker variability, the same word (or sentence) produces quite different sound signals for machines to recognize. It is also hard for machines to recognize whispered speech, in which the vowels are not voiced.
Machines (systems dedicated to speech recognition) are unlike humans, who have multiple sensory perception systems, each of which is superb compared to its machine counterpart (image recognition, speech recognition, etc.). Other senses can aid the speech recognition process in the human brain by providing conversational context and adding information such as gestures and facial expressions. In noisy environments, we can often guess what another person is saying just by watching the shapes their mouth makes.
We, as humans, are also able to recognize different environmental sounds, e.g. a car engine or a knock on the door. Knowing what different types of sounds are helps us separate the speech component from the background noise. Machines handle this situation far more crudely: even after noise cancelling, any background interference that remains in the signal is treated as part of the speech, which can degrade recognition accuracy drastically.
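To make that concrete, here is a minimal Python sketch (assuming only NumPy; the signals are synthetic stand-ins, not real recordings) of mixing background noise into speech at a chosen signal-to-noise ratio. Once mixed, nothing in the waveform labels which samples are speech and which are noise, so whatever a noise canceller misses flows straight into recognition:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech/noise power ratio matches `snr_db`, then add it.

    SNR(dB) = 10 * log10(P_speech / P_noise), so the target noise power is
    P_speech / 10^(snr_db / 10).
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

# Synthetic stand-ins: a tone for "speech", white noise for "background".
sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 220 * t)
noise = np.random.default_rng(0).standard_normal(sr)

for snr in (20, 5, 0):
    noisy = mix_at_snr(speech, noise, snr)
    # Correlation with the clean signal drops as the SNR drops.
    corr = np.corrcoef(speech, noisy)[0, 1]
    print(f"SNR {snr:2d} dB -> correlation with clean speech: {corr:.2f}")
```

The point of the sketch is only that the recognizer receives a single mixed signal; separating the two components back out is exactly the part humans do effortlessly and machines do not.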
Another challenge for speech recognition is better viewed from a language processing perspective. Speech recognition for humans involves not just listening but also considerable cognitive effort: we make guesses and retrieve relevant knowledge to help the process. Based on my current knowledge, this is a missing link for machines doing speech recognition.
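One narrow slice of this guessing can be approximated in software with a language model that re-ranks acoustically similar hypotheses. The sketch below is a toy add-one-smoothed bigram model over a hypothetical hand-made corpus (chosen purely for illustration, not a real recognizer): two transcripts that sound nearly identical get quite different scores once word-sequence plausibility is taken into account:

```python
from collections import Counter
import math

# A toy corpus standing in for the background knowledge humans retrieve.
corpus = (
    "machines recognize speech . humans recognize speech . "
    "we recognize speech in noise . storms wreck a nice beach ."
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab_size = len(unigrams)

def avg_log_prob(words: list[str]) -> float:
    """Add-one-smoothed bigram log-probability, averaged per bigram
    so hypotheses of different lengths stay comparable."""
    scores = [
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in zip(words, words[1:])
    ]
    return sum(scores) / len(scores)

# Two hypotheses a purely acoustic model could find nearly identical.
for hyp in ("we recognize speech".split(), "we wreck a nice beach".split()):
    print(f"{' '.join(hyp):24s} -> {avg_log_prob(hyp):.2f}")
```

Even this toy model prefers the contextually plausible transcript. Real systems combine far richer language models with the acoustic score, but the gap to human-level use of context is exactly the missing link described above.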