Speech Recognition

This paper deals with the topic of SPEECH RECOGNITION, a technology that could be revolutionary in the years to come. Speech recognition acts as an interface between the user and the system. Its applications extend to the point that it can serve as a replacement for input devices such as the keyboard and mouse.

This paper contains information about Automatic Speech Recognition, which decodes speech signals into phones, the basic building blocks of words. Speech recognition systems are classified as dependent and independent systems. A dependent system recognizes the sound generated by a single speaker, whereas an independent system recognizes sounds generated by multiple speakers.

Speech recognition technologies allow computers equipped with a source of sound input, such as a microphone, to interpret human speech, e.g., for transcription or as an alternative method of interacting with a computer.

Automatic Speech Recognition

Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text. Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.
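
As a concrete illustration of the first mapping (signal to text), the sketch below transcribes a recorded utterance using the off-the-shelf SpeechRecognition package for Python. The package, the file name utterance.wav, and the use of Google's free web recognizer are assumptions made for the example, not part of the original text; the call requires a network connection.

    # Minimal sketch: map an acoustic speech signal (a WAV file) to text.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("utterance.wav") as source:      # hypothetical recording
        audio = recognizer.record(source)               # read the whole file

    try:
        text = recognizer.recognize_google(audio)       # send to the web recognizer
        print(text)
    except sr.UnknownValueError:
        print("speech was not understood")
    except sr.RequestError as e:
        print(f"recognition service unavailable: {e}")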

What does speaker dependent / adaptive / independent mean?

A speaker dependent system is developed to operate for a single speaker. These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker adaptive or speaker independent systems.

A speaker independent system is developed to operate for any speaker of a particular type (e.g. American English). These systems are the most difficult to develop and the most expensive, and their accuracy is lower than that of speaker dependent systems. However, they are more flexible.

A speaker adaptive system is developed to adapt its operation to the characteristics of new speakers. Its difficulty lies somewhere between speaker independent and speaker dependent systems.


What does small/medium/large/very-large vocabulary mean?

The size of the vocabulary of a speech recognition system affects the complexity, processing requirements and accuracy of the system. Some applications only require a few words (e.g. numbers only), while others require very large dictionaries (e.g. dictation machines). There are no established definitions, but the following is a rough guide:

• Small Vocabulary - tens of words

• Medium Vocabulary - hundreds of words

• Large Vocabulary - thousands of words

• Very-Large Vocabulary - tens of thousands of words.

What does continuous speech and isolated-word mean?

An isolated-word system operates on single words at a time, requiring a pause between saying each word. This is the simplest form of recognition to perform because the end points are easier to find and the pronunciation of a word tends not to affect others. Thus, because the occurrences of words are more consistent, they are easier to recognize.

A continuous speech system operates on speech in which words are connected together, i.e. not separated by pauses. Continuous speech is more difficult to handle because of a variety of effects. First, it is difficult to find the start and end points of words. Another problem is "coarticulation": the production of each phoneme is affected by the production of surrounding phonemes, and similarly the start and end of words are affected by the preceding and following words. The recognition of continuous speech is also affected by the rate of speech (fast speech tends to be harder).

The Process of Speech Recognition

There are several approaches to automatic speech recognition:

• Acoustic-Phonetic -- This approach is based on the idea that all spoken words can be split up into a finite group of phonetic units. If all of these phonetic units can be characterized computationally, one should be able to figure out what phonetic units have been spoken, and then decode them into words.

• Pattern Recognition -- This approach uses a training algorithm to teach a recognizer about the patterns present in specific words. It is similar to the acoustic-phonetic approach, but rather than defining the patterns explicitly (as phonetic units), a Hidden Markov Model (HMM) based pattern recognizer finds its own set of patterns (a minimal sketch follows this list).

• Artificial Intelligence -- This approach mixes the previous two approaches by combining phonetic, syntactic, lexical, and/or semantic based analysis with pattern recognition.
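
As a rough illustration of the pattern-recognition approach, the sketch below trains one Gaussian HMM per word on example feature sequences and recognizes an unknown utterance by picking the word whose model assigns it the highest likelihood. It assumes the Python hmmlearn package and uses random vectors as stand-ins for real acoustic features, so it shows only the training and scoring structure, not a working recognizer.

    # One HMM per word; recognition = highest log-likelihood model.
    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)

    def fake_features(n_frames, n_dims=13, offset=0.0):
        # Stand-in for real feature extraction: one vector per frame of speech.
        return rng.normal(loc=offset, scale=1.0, size=(n_frames, n_dims))

    # Training data: a few example utterances per word (random placeholders).
    training = {
        "yes": [fake_features(40, offset=0.0) for _ in range(5)],
        "no":  [fake_features(35, offset=2.0) for _ in range(5)],
    }

    # Train one HMM per word on its concatenated utterances.
    models = {}
    for word, utterances in training.items():
        X = np.concatenate(utterances)            # all frames stacked together
        lengths = [len(u) for u in utterances]    # frame count of each utterance
        model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[word] = model

    # Recognition: score an unknown utterance against every word model.
    test = fake_features(38, offset=2.0)          # should resemble "no"
    scores = {word: m.score(test) for word, m in models.items()}
    print(max(scores, key=scores.get))

In a real recognizer the random vectors would be replaced by acoustic features extracted from the signal, and each word model would typically use more states with left-to-right transitions.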

Speech Detection

The first task is to identify the presence of a speech signal. This task is easy if the signal is clean; however, the signal frequently contains background noise, and the signals obtained were in fact found to contain some noise. Two criteria are used to identify the presence of a spoken word: first, the total energy is measured, and second, the number of zero crossings is counted. Both of these were found to be necessary, as voiced sounds tend to have a high total energy but a low frequency, while unvoiced sounds were found to have a high frequency. Only background noise was found to have both low energy and low frequency. The method was found to successfully detect the beginning and end of the several words tested. Note that this is not sufficient for the general case, as fluent speech tends to have pauses, even in the middle of words (such as in the word 'acquire', between the 'c' and 'q'). In fact, reliable speech detection is a difficult problem and is an important part of speech recognition.
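
A minimal sketch of this two-criterion detector, assuming a 240-sample frame and illustrative thresholds (the text does not give specific values): a frame is flagged as speech if either its short-time energy or its zero-crossing count is high, and is treated as background noise only when both are low.

    import numpy as np

    def frame_energy(frame):
        return float(np.sum(frame ** 2))

    def zero_crossings(frame):
        signs = np.where(frame >= 0, 1, -1)
        return int(np.sum(signs[1:] != signs[:-1]))

    def detect_speech(signal, frame_len=240, energy_thresh=1.0, zc_thresh=60):
        """Return one True/False flag per non-overlapping frame."""
        flags = []
        for start in range(0, len(signal) - frame_len + 1, frame_len):
            frame = signal[start:start + frame_len]
            # Voiced speech: high energy.  Unvoiced speech: many zero crossings.
            # Background noise: low energy and few crossings, so it is rejected.
            flags.append(frame_energy(frame) > energy_thresh
                         or zero_crossings(frame) > zc_thresh)
        return flags

    # Demo: a low-frequency hum (background noise), a loud 200 Hz tone
    # (voiced-like) and quiet broadband noise (unvoiced-like), one second each.
    fs = 8000
    t = np.arange(fs) / fs
    hum      = 0.01 * np.sin(2 * np.pi * 50 * t)
    voiced   = 0.5 * np.sin(2 * np.pi * 200 * t)
    unvoiced = 0.02 * np.random.default_rng(0).normal(size=fs)
    flags = detect_speech(np.concatenate([hum, hum + voiced, hum + unvoiced]))
    print(f"frames flagged as speech: {sum(flags)} of {len(flags)}")

Because the two tests are combined with an OR, the loud low-frequency tone is caught by the energy criterion and the quiet broadband noise by the zero-crossing criterion, while the hum-only frames fail both tests and are rejected.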

Blocking

The second task is blocking. Older speech recognition systems first attempted to detect where each phone starts and finishes, and then blocked the signal by placing one phone in each block. However, phones can blend together in many circumstances, and this method generally could not reliably detect the correct boundaries. Most modern systems simply separate the signal into blocks of a fixed length. These blocks tend to overlap, so that phones which cross block boundaries will not be missed. Here is how a typical block might be obtained.
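
The sketch below is a minimal illustration, assuming a common 30 ms frame length and 10 ms hop (neither value is taken from the text), of cutting a signal into such fixed-length, overlapping blocks.

    import numpy as np

    def block_signal(signal, frame_len, hop):
        """Return an array of shape (n_frames, frame_len) of overlapping blocks."""
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.empty((n_frames, frame_len), dtype=signal.dtype)
        for i in range(n_frames):
            start = i * hop                       # advance by the hop size only,
            frames[i] = signal[start:start + frame_len]  # so neighbours overlap
        return frames

    fs = 8000
    signal = np.arange(2 * fs, dtype=np.float64)  # placeholder samples (2 seconds)
    frames = block_signal(signal, frame_len=int(0.030 * fs), hop=int(0.010 * fs))
    print(frames.shape)                           # (198, 240)

Each 240-sample block overlaps its neighbours by 160 samples, so a phone that straddles one block boundary still appears whole in an adjacent block.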