MEASUREMENT OF A VOICE OF A DIFFERENT PITCH
(Analysis of the Japanese voice 4)

The Japanese vowel /a/ is measured at different pitches.

Date: 10:00, 30 Sep. 2002
Place: Nagoya, Japan
Microphone: SONY ECM-MS957
Microphone amplifier: SONY DAT WALKMAN TCD-D100
Personal computer: DELL INSPIRON 7500
OS: Windows 2000 Professional
Software: DSSF3
WAVE sound file: voice4.wav (44.1kHz / Stereo / 12.67sec / 2.13MB)

The Japanese vowel /a/ was pronounced following a melodic pitch and measured with the realtime analyzer. The integration interval and the running step were set to 0.2 s and 0.5 s, respectively. This is a rough real-time measurement; the recorded data are analyzed in detail later. As shown in the figure below, the vowel was uttered eight times, with the pitch raised at each utterance.

This is the power spectrum of the second utterance "a".

In the spectrum, peaks are seen at 210 Hz, 420 Hz, 640 Hz, 840 Hz, 1050 Hz, 1260 Hz, 1470 Hz, and so on. The spectrum is clear. In the high-frequency range the level falls off at roughly 6 dB/oct, which is a general characteristic of speech signals. The fundamental frequency is 210 Hz and the first formant (F1) is at 840 Hz.

This is the spectrum of the fourth utterance.

This spectrum is also clear. The fundamental frequency at 250 Hz and F1 at 750 Hz can be seen.

This is the spectrum of the sixth utterance.

The fundamental frequency at 330 Hz and F1 at 660 Hz can be seen.

This is the spectrum of the eighth utterance.

The fundamental frequency is 370 Hz and F1 is at 740 Hz. In every case so far, one of the harmonics coincides with the formant frequency, but which harmonic it is appears to depend on the fundamental frequency.
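The relation between the fundamental and the formant can be checked with a little arithmetic. The sketch below divides each F1 by its fundamental, using the values read from the four spectra above; the helper name `harmonic_number` is only illustrative and is not part of DSSF3.

```python
# (fundamental, F1) pairs in Hz, as read from the spectra of the
# 2nd, 4th, 6th and 8th utterances.
measurements = {
    "2nd": (210, 840),
    "4th": (250, 750),
    "6th": (330, 660),
    "8th": (370, 740),
}


def harmonic_number(f0_hz, f1_hz):
    """Return the harmonic index n for which n * f0 is closest to F1."""
    return round(f1_hz / f0_hz)


harmonic_numbers = {
    name: harmonic_number(f0, f1) for name, (f0, f1) in measurements.items()
}

for name, n in harmonic_numbers.items():
    f0, f1 = measurements[name]
    print(f"{name} utterance: F0 = {f0} Hz, F1 = {f1} Hz, harmonic number = {n}")
```

With these values F1 falls on the 4th, 3rd, 2nd and 2nd harmonic respectively, so the harmonic number goes down as the pitch goes up.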

Next, the same data are analyzed with the running ACF. Let us look at the ACF every 5 ms, in order, from the start of each utterance.

This is the running ACF analysis window. The file is loaded and the measurement is run with different settings: the integration time and the running step were set to 10 ms and 5 ms, respectively.
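To make the two settings concrete, here is a minimal sketch of a running autocorrelation with a 10 ms integration time and a 5 ms running step. It is not the DSSF3 implementation, whose internals are not described here. The function names, the 8 kHz sample rate and the synthetic 250 Hz tone are all assumptions chosen to keep the example small.

```python
import math


def normalized_acf(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag, divided by the lag-0 value."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return [0.0] * (max_lag + 1)
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        for lag in range(max_lag + 1)
    ]


def running_acf(signal, sample_rate, integration_s=0.010, step_s=0.005,
                max_lag_s=0.005):
    """Yield (frame_start_s, acf) for successive overlapping frames."""
    frame_len = int(round(integration_s * sample_rate))
    step = int(round(step_s * sample_rate))
    max_lag = int(round(max_lag_s * sample_rate))
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len]
        yield start / sample_rate, normalized_acf(frame, max_lag)


# Assumption: a 50 ms, 250 Hz sine stands in for the recorded voice.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 250 * n / sample_rate) for n in range(400)]

frames = list(running_acf(signal, sample_rate))
for start_s, acf in frames[:3]:
    print(f"frame at {start_s * 1000:.0f} ms: {len(acf)} lags")
```

Because the step is half the integration time, consecutive frames overlap by 5 ms, which is why one ACF is obtained every 5 ms.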

There are eight utterances, with the pitch raised at each one.

The ACFs measured around 10 ms after the start of each utterance are compared.

This is the ACF from 5 ms to 15 ms of the first utterance.

The te value is 4.69 ms. While the power rises from the start of the utterance to its peak level, te roughly decreases, which suggests a relation between te and the manner of utterance. The ACF has peaks at 0.73, 1.0, and 1.3 ms, and so on. The fundamental frequency cannot be identified yet.
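If each ACF peak is read as one period of a signal component, its delay converts to a frequency as f = 1/delay. The short sketch below applies this to the three peak delays quoted above; `lag_to_frequency_hz` is a hypothetical helper name.

```python
def lag_to_frequency_hz(lag_ms):
    """Convert an ACF peak delay in milliseconds to a frequency in hertz."""
    return 1000.0 / lag_ms


# Peak delays read from the ACF of the first utterance (5 ms to 15 ms).
peak_lags_ms = [0.73, 1.0, 1.3]
peak_freqs_hz = [lag_to_frequency_hz(lag) for lag in peak_lags_ms]

for lag, freq in zip(peak_lags_ms, peak_freqs_hz):
    print(f"{lag} ms -> {freq:.0f} Hz")
```

The three delays correspond to about 1370 Hz, 1000 Hz and 769 Hz, all well above the fundamental frequencies seen in the spectra.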

As the previous experiment showed, the first small peaks in the ACF correspond to the formant frequencies F1-F3 in the spectrum. This seems to happen while the low-frequency band has not yet been analyzed.

The data obtained in the previous test are as follows.

This is the ACF at 10 ms of the second utterance.

This is the ACF at 10 ms of the third utterance.

This is the ACF at 15 ms of the third utterance.

In the last two ACFs, the measured formant frequency changed by more than 100 Hz between 10 ms and 15 ms after the utterance began. I concluded that looking for the formant in this way is meaningless, because the fundamental and formant frequencies change greatly within only 5 ms.

When the pitch of an utterance is changed, all the frequencies change. Textbooks say that the fundamental frequency and the formant frequencies are constant during an utterance, but in fact they change greatly. It is easy to identify the formant from the ACF analysis, but it seems difficult to base voice recognition on the formant.

I wondered whether voice analysis could be performed directly from the ACF parameters. Below, all the ACF parameters are shown so that their changes over time can be compared.

The figure below shows all of the ACF and IACF parameters. It can be output with one click using the screen copy function of DSSF3.

In the t1 graph, the pitch change can be seen clearly. The longer t1 is, the lower the pitch, because t1 is the length of one cycle of the signal. Of the other parameters, f1 represents the pitch strength and te represents the reverberation of the signal. The question is whether a voice signal can be analyzed and identified from a combination of these ACF parameters. In the table below, three ACF parameters are summarized for the eight utterances.
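
| Utterance | Time (s) | Time in utterance (ms) | t1 (ms) | f1 | te (ms) |
|---|---|---|---|---|---|
| 1st | 0.615 | 25 | 1.02 | 0.36 | 41.39 |
| | 0.62 | 30 | 1.11 | 0.3 | 16.88 |
| | 0.625 | 35 | 1.13 | 0.27 | 6.8 |
| | 0.63 | 40 | 1.09 | 0.24 | 7.32 |
| 2nd | 2.035 | 5 | 1.25 | 0.37 | 15.67 |
| | 2.04 | 10 | 1.22 | 0.32 | 4.3 |
| | 2.045 | 15 | 1.2 | 0.26 | 42.67 |
| | 2.05 | 20 | 1.25 | 0.49 | 35.12 |
| | 2.06 | 25 | 1.25 | 0.41 | 31.22 |
| 3rd | 3.56 | 5 | 1.22 | 0.48 | 5.37 |
| | 3.565 | 10 | 1.2 | 0.41 | 6.4 |
| | 3.57 | 15 | 1.36 | 0.13 | 7.37 |
| | 3.575 | 20 | 1.2 | 0.32 | 6.27 |
| | 3.58 | 25 | 1.2 | 0.37 | 22.97 |
| 4th | 5.07 | 15 | 1.2 | 0.49 | 8.56 |
| | 5.075 | 20 | 1.22 | 0.38 | 4.94 |
| | 5.08 | 25 | 1.22 | 0.29 | 16.1 |
| | 5.085 | 30 | 1.2 | 0.42 | 18.51 |
| 5th | 6.595 | 15 | 1.29 | 0.28 | 14.03 |
| | 6.6 | 20 | 1.25 | 0.41 | 8.36 |
| | 6.605 | 25 | 1.2 | 0.66 | 7.76 |
| | 6.61 | 30 | 1.16 | 0.49 | 7.76 |
| | 6.615 | 35 | 1.18 | 0.58 | 12.48 |
| | 6.62 | 40 | 1.2 | 0.31 | 12.03 |
| 6th | 8.14 | 5 | 1.18 | 0.49 | 6.45 |
| | 8.145 | 10 | 1.29 | 0.82 | 33.04 |
| | 8.15 | 15 | 1.25 | 0.51 | 11.03 |
| | 8.155 | 20 | 1.25 | 0.5 | 8.11 |
| | 8.16 | 25 | 1.22 | 0.53 | 60.22 |
| | 8.165 | 30 | 1.22 | 0.53 | 27.77 |
| | 8.17 | 35 | 1.2 | 0.52 | 20.53 |
| | 8.175 | 40 | 1.18 | 0.59 | 84.69 |
| 7th | 9.65 | 10 | 1.3 | 0.3 | 4.91 |
| | 9.655 | 15 | 1.27 | 0.54 | 139.92 |
| | 9.66 | 20 | 1.22 | 0.48 | 50.91 |
| | 9.665 | 25 | 1.2 | 0.5 | 25.74 |
| | 9.67 | 30 | 1.16 | 0.45 | 37.18 |
| 8th | 11.24 | 5 | 1.3 | 0.15 | 18.8 |
| | 11.245 | 10 | 1.25 | 0.35 | 5.69 |
| | 11.25 | 15 | 1.25 | 0.6 | 6.77 |
| | 11.255 | 20 | 1.25 | 0.5 | 4.45 |
| | 11.275 | 40 | 1.27 | 0.53 | 29.25 |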
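As an illustration of how t1 and f1 might be read from a single ACF, the sketch below takes them to be the delay and the height of the largest peak after lag zero. This reading follows the description given above (t1 as one cycle length, f1 as pitch strength), but it is an assumption; the procedure actually used by DSSF3 is not documented here, and te is not computed. The function names and the synthetic 250 Hz test tone are hypothetical.

```python
import math


def normalized_acf(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag, divided by the lag-0 value."""
    energy = sum(x * x for x in frame)
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        for lag in range(max_lag + 1)
    ]


def first_peak(acf, sample_rate):
    """Return (t1_ms, f1): delay and height of the largest local maximum after lag 0."""
    best_lag, best_val = None, -1.0
    for lag in range(1, len(acf) - 1):
        is_peak = acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]
        if is_peak and acf[lag] > best_val:
            best_lag, best_val = lag, acf[lag]
    if best_lag is None:
        return None
    return 1000.0 * best_lag / sample_rate, best_val


# Assumption: a 20 ms frame of a 250 Hz sine sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 250 * n / sample_rate) for n in range(160)]

acf = normalized_acf(frame, max_lag=48)
t1_ms, f1 = first_peak(acf, sample_rate)
print(f"t1 = {t1_ms:.2f} ms, f1 = {f1:.2f}")
```

For this tone the peak lies at 4 ms, one period of 250 Hz. Its height is below 1 only because fewer sample pairs overlap at longer lags within a finite frame.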


You can find more information about the te value in "Architectural Acoustics" by Yoichi Ando, which says: "Since the minimum value of the moving te is the most active part of each piece of, containing important information and influencing subjective responses for the temporal criteria ..... ."

When a voice is analyzed over a very short time (every 5 ms in this experiment), high-frequency sound with a short wavelength can be analyzed first. Low-pitched sound with a long wavelength is analyzed later. When such a very short interval could be analyzed well, the formant was identified from the peaks of the autocorrelation.

It is important to analyze the time range that contains the most information. te reaches its minimum when the sound signal changes most, for example at the moment of strong pronunciation, or when one utterance stops and the next begins. That moment may be important for voice analysis, which is why I pay attention to te.
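One simple way to locate such a moment is to take, for each utterance, the frame with the smallest te. The sketch below does this with the (time, te) pairs copied from the table of ACF parameters above; the variable names are illustrative. A plain minimum need not reproduce every row of the summary that follows (for the 3rd utterance it selects the frame at 3.56 s), so the selection there may rest on criteria not stated here.

```python
# (time in s, te in ms) for every analyzed frame, grouped by utterance.
te_by_utterance = {
    "1st": [(0.615, 41.39), (0.62, 16.88), (0.625, 6.8), (0.63, 7.32)],
    "2nd": [(2.035, 15.67), (2.04, 4.3), (2.045, 42.67), (2.05, 35.12),
            (2.06, 31.22)],
    "3rd": [(3.56, 5.37), (3.565, 6.4), (3.57, 7.37), (3.575, 6.27),
            (3.58, 22.97)],
    "4th": [(5.07, 8.56), (5.075, 4.94), (5.08, 16.1), (5.085, 18.51)],
    "5th": [(6.595, 14.03), (6.6, 8.36), (6.605, 7.76), (6.61, 7.76),
            (6.615, 12.48), (6.62, 12.03)],
    "6th": [(8.14, 6.45), (8.145, 33.04), (8.15, 11.03), (8.155, 8.11),
            (8.16, 60.22), (8.165, 27.77), (8.17, 20.53), (8.175, 84.69)],
    "7th": [(9.65, 4.91), (9.655, 139.92), (9.66, 50.91), (9.665, 25.74),
            (9.67, 37.18)],
    "8th": [(11.24, 18.8), (11.245, 5.69), (11.25, 6.77), (11.255, 4.45),
            (11.275, 29.25)],
}

# For each utterance keep the frame whose te is smallest
# (the earliest frame wins when two frames tie).
min_te = {
    name: min(rows, key=lambda row: row[1])
    for name, rows in te_by_utterance.items()
}

for name, (time_s, te_ms) in min_te.items():
    print(f"{name}: minimum te = {te_ms} ms at {time_s} s")
```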

 

 

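| Utterance | Time (s) | Time in utterance (ms) | t1 (ms) | f1 | te (ms) |
|---|---|---|---|---|---|
| 1st | 0.625 | 35 | 1.13 | 0.27 | 6.8 |
| 2nd | 2.04 | 10 | 1.22 | 0.32 | 4.3 |
| 3rd | 3.575 | 20 | 1.2 | 0.32 | 6.27 |
| 4th | 5.075 | 20 | 1.22 | 0.38 | 4.94 |
| 5th | 6.605 | 25 | 1.2 | 0.66 | 7.76 |
| 6th | 8.14 | 5 | 1.18 | 0.49 | 6.45 |
| 7th | 9.65 | 10 | 1.3 | 0.3 | 4.91 |
| 8th | 11.255 | 20 | 1.25 | 0.5 | 4.45 |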

April 2003 by Masatsugu Sakurai

