Transcript
Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report
Noise cancelling microphones for automatic speech recognition
H. Sohlström
journal: STL-QPSR, volume: 19, number: 4, year: 1978, pages: 030-038
http://www.speech.kth.se/qpsr
STL-QPSR 4/1978
III. SPEECH RECOGNITION
A. NOISE CANCELLING MICROPHONES FOR AUTOMATIC SPEECH RECOGNITION *
H. Sohlström
Abstract
Automatic speech recognition as well as man-to-man communications in noisy environments require noise cancelling microphones. A number of such microphones are studied. Special attention is given to a contact microphone. The test procedure is described and the results are discussed. The contact microphone is found to give better sound quality than expected.
Introduction
Automatic Speech Recognition systems are leaving the laboratory stage. Several systems are today commercially available (6). If a system is to be useful its operation must be unaffected by background noise.
This can be achieved in three ways. The first, and best, way is to reduce the noise level. The second is to use a noise cancelling microphone. The third is to extract the phonetic information from the waveform in a way that makes the system immune to noise (2).
Noise cancelling microphones have been used for a long time in man-to-man communications. The situation is, however, somewhat different in speech recognition systems, depending on the special set of parameters adopted (7).
In the present study several noise cancelling microphones were tried both for man-to-man communication and speech recognition. One of the microphones was of the contact type, i.e. it was to be fixed upon the speaker in order to pick up vibrations rather than sound. This type of microphone is used in very noisy environments, for example in aircraft. Because of the different principle used in this microphone, special interest was paid to it.
The speech recognition system was a phonetically oriented system developed by Mats Blomberg and Kjell Elenius. A short description of the system is given in Blomberg and Elenius, 1978 (1).
* Thesis work 1977 under supervision of Mats Blomberg and Kjell Elenius.
of measurements the microphones were tested in the way they are designed to work: at the right distance and with speech.
Speech is by no means a regular source of sound. To allow correct comparisons we made simultaneous recordings from two microphones of a person reading a short text. The recordings were made in an anechoic room.
One of the microphones was a pressure sensitive dynamic microphone from Sennheiser, MD 211. This microphone was the "reference" high quality microphone with which all the others were compared. The other recording was made with the microphone under test. The average amplitude distribution as a function of frequency for the two recordings was then computed. This was done with our CD 1700 computer and a 51-channel spectrum analyzer. The differences between the distributions could then be interpreted as deviations from ideal responses. To permit analysis of the separate speech sounds, as they were transduced by the microphones, recordings of VCV and CVC words were made.
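The comparison of average spectra described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses an FFT with a Hann window in place of the 51-channel spectrum analyzer, and the function names, frame length and hop size are hypothetical choices, not taken from the original setup.

```python
import numpy as np

def average_spectrum_db(signal, frame_len=1024, hop=512):
    """Long-term average power spectrum in dB, one value per FFT bin.

    Assumption: an FFT stands in for the 51-channel analyzer of the study.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    # Cut the recording into overlapping, windowed frames.
    frames = [
        signal[start:start + frame_len] * window
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
    # Average the power over all frames, then convert to dB.
    power = np.mean(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=0)
    return 10.0 * np.log10(power + 1e-12)

def response_deviation_db(test_signal, reference_signal, **kwargs):
    """Average spectrum of the microphone under test minus that of the
    reference microphone, in dB per bin. Both signals are assumed to be
    simultaneous recordings of the same utterance."""
    return (average_spectrum_db(test_signal, **kwargs)
            - average_spectrum_db(reference_signal, **kwargs))
```

Because both recordings capture the same utterance, the irregular spectrum of the speech itself cancels in the difference, leaving mainly the deviation between the two microphones.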
Also in this case each microphone was compared to the reference microphone. For the contact microphone the measurements in the second group proved much more relevant. Its response was very much dependent upon its position on the speaker. Several positions were tried. Two positions were found to be representative, each in its own way. The two positions were on the forehead and on the neck just under the chin and halfway towards the ear, Fig. III-A-1.
If the microphone had been positioned closer to the larynx it would only have picked up a signal dominated by the fundamental and much of the formant pattern would have been lost.
The third group of the measurements were performance tests using the speech recognition system. The recognition system works with isolated words. A standard vocabulary of 41 words was chosen. The words in this vocabulary are the words used in Swedish when spelling out words over the telephone (Adam, Bertil, Cesar, etc.), the numbers 0-9 in Swedish and the words "miss" and "mellanslag" (space). This vocabulary was
spoken five times with each microphone. A reference recording was made simultaneously with the MD 211 microphone mentioned earlier. Recordings were made both in an anechoic room and in a normal room where tapes with different kinds of noise were being played back.
The noise level was up to 90 dB (lin) (5).
It should be mentioned that the words were read by the author, unfortunately in a rather hoarse voice. This accounts for the overall low recognition rates.
Before the actual recognition test with each microphone, the system had to "listen" to a number of repetitions of the vocabulary in order to extract statistical information about formant frequencies, sound duration etc.
This information is used in the recognition process. This procedure will be referred to as "learning". When the system was to operate in noise, the learning could be done either in silence or with the noise used. Both cases were studied.
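The idea of "learning" statistics from repeated utterances and then using them for recognition can be illustrated with a toy sketch. It is not the Blomberg and Elenius system, whose internals are not described here. The feature layout (two formant frequencies in Hz and a duration in ms), the function names and the distance measure are all assumptions made for illustration.

```python
import numpy as np

def learn(training):
    """Estimate per-word statistics from repeated utterances.

    training: dict mapping word -> list of feature vectors, one per
    repetition (hypothetical features, e.g. [F1_Hz, F2_Hz, duration_ms]).
    Returns a dict mapping word -> (mean, std) per feature.
    """
    model = {}
    for word, repetitions in training.items():
        data = np.asarray(repetitions, dtype=float)
        # A small floor keeps the division in recognize() well defined.
        model[word] = (data.mean(axis=0), data.std(axis=0) + 1e-6)
    return model

def recognize(model, features):
    """Return the word whose learned statistics lie closest to `features`,
    using a variance-normalized squared distance (an assumed measure)."""
    x = np.asarray(features, dtype=float)

    def distance(word):
        mean, std = model[word]
        return float(np.sum(((x - mean) / std) ** 2))

    return min(model, key=distance)
```

In such a scheme, learning in silence and learning with the noise present differ only in which recordings supply the training vectors, which is why the two cases can give different recognition rates.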
Results and discussion
The measurements on the Sennheiser MD 421 did not give any surprising results. The frequency response for different directions to the sound source is shown in Fig. III-A-2. The rejection of sounds from the rear of the microphone is a bit uneven over the frequency range, but as this microphone is not designed for use in noisy environments this is of little importance. As can be seen from Fig. III-A-2 the response rises some 10 dB from the low end of the spectrum to the high. This changes as the microphone is moved closer to the sound source. At a distance of 1 dm the response is well balanced. The tests with speech do not give any more information about the microphone. The microphone is a good example of a dynamic cardioid microphone.
The Sennheiser headset microphone is a differential microphone, designed to be used very near the speaker's mouth. If it is used far from the sound source it has a very uneven frequency response, rising steeply with frequency. Sennheiser has published diagrams showing a rather flat response at a distance of 1 cm.
Fig. III-A-1. Microphone position on the neck.
Fig. III-A-2. Frequency response, long distance. MD 421.
For our microphone this could not be duplicated with the test setup used. The sound source was a Brüel & Kjær Artificial Voice 4215 modified to make the "mouth opening" smaller, more like the recent 4219. The sound pressure was held constant with the aid of a measuring microphone, controlling the output from the generator. The result obtained can be seen in Fig. III-A-3. The response is far from flat. There is apparently a strong resonance in the microphone at about 8 kHz. The rejection of sounds from distant sources is good. It varies from 10 to 30 dB through the audible range.
The average spectrum for the short text confirms the impression from Fig. III-A-3, see Fig. III-A-4. A closer examination of the different speech sounds revealed some breath noises but this is almost unavoidable with close talking microphones. The breath noises showed up as an increased low frequency level.
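Why a differential microphone favours a close source and has a rising response for distant ones can be illustrated with a textbook idealization. It is not a model of the Sennheiser capsule. The sketch assumes an ideal first-order pressure-gradient element, an on-axis point source at the mouth, and plane waves from distant sources. The function name and the speed-of-sound value are assumptions. For a spherical wave the pressure gradient has magnitude |p| sqrt(k^2 + 1/r^2), against |p| k for a plane wave, so the close source gains a factor sqrt(1 + 1/(kr)^2).

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def close_talk_advantage_db(frequency_hz, distance_m):
    """Gain in dB of an ideal first-order gradient microphone for a point
    source at `distance_m`, relative to a distant (plane-wave) source that
    produces the same sound pressure at the microphone."""
    kr = 2.0 * math.pi * frequency_hz * distance_m / SPEED_OF_SOUND
    # 20*log10(sqrt(1 + 1/kr^2)) written as 10*log10(1 + 1/kr^2).
    return 10.0 * math.log10(1.0 + 1.0 / kr ** 2)
```

With the source at 1 cm, this idealization gives roughly 35 dB at 100 Hz and 15 dB at 1 kHz, falling to a few dB near 8 kHz. A real capsule, with its resonances and finite port spacing, will deviate from these figures.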
The KUC 7001 microphone has a frequency response that looks far better than that of the Sennheiser microphone. Fig. III-A-5 shows the response 1.0 cm from the sound source. The rejection of distant sound sources is about the same as that of the Sennheiser headset microphone. More breath noises could be heard with this microphone than with the preceding one. There are two possible explanations for this. This microphone has a better low frequency response and the noises can therefore more easily be heard. The second possible reason is that it is in fact only a "naked" microphone capsule without any protective screening against the airstream from the mouth. The average spectrum confirms that this microphone gives a good reproduction of speech.
As mentioned above, the position of the contact microphone has a great influence on the results. The response for speech transmitted through the tissues of the neck or face is selectively frequency dependent. The damping seems to be greatest in the soft tissues, especially for high frequencies. The bone structure of the face seems to have a much lower damping. This agrees well with what has been reported by others.
In Fig. III-A-7 the average speech spectrum with the microphone on the forehead is compared with the reference condition.
The