Estimation of vocal-tract shape from speech spectrum and speech resynthesis based on a generative model

Research output: Contribution to journal (Conference article)

Abstract

Precise control of articulatory parameters is difficult and prevents a physical model from generating natural-sounding speech signals. To determine vocal-tract shape from speech, this paper presents an inversion method for simultaneously estimating the cross-sectional area and length of the vocal tract. In addition, we performed speech resynthesis from a time series of estimated vocal-tract shapes. The vocal-tract shape is determined through an iterative procedure that gradually optimizes the parameter values to produce the target speech spectrum. The vocal-tract shape is updated using a sensitivity function that represents the change in formant frequency caused by a small perturbation of the vocal-tract shape. When combined with a perturbation relationship of speech spectrum parameters (i.e., cepstrum parameters) and formants, our method effectively optimizes the vocal-tract shape. We quantitatively examined the accuracy using area function data for 10 isolated vowels. The results showed that the average area error was 0.43 cm² and the average length error was 0.23 cm. This indicates that the vocal-tract shape was determined with satisfactory accuracy. We also performed an estimation experiment for continuous speech and synthesized speech from the estimated vocal-tract shape.
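The iterative, sensitivity-driven update described in the abstract can be illustrated with a small sketch. This is not the paper's method: it assumes a lossless concatenated-tube model (closed at the glottis, ideally open at the lips), matches only the first three formants rather than cepstrum parameters, and approximates the sensitivity function by finite differences. The function names (`chain_d`, `formants`, `sensitivity`, `invert`), the speed of sound, the starting shape, and the test shape are all assumptions made for illustration.

```python
import numpy as np

C = 35000.0  # speed of sound in cm/s (assumed value)


def chain_d(freq, areas, length):
    """D element of the lossless chain matrix from glottis to lips.

    Formants of a tube closed at the glottis and ideally open at the lips
    are the frequencies where this element crosses zero.
    """
    freq = np.atleast_1d(np.asarray(freq, dtype=float))
    kl = 2.0 * np.pi * freq / C * (length / len(areas))
    cos, sin = np.cos(kl), np.sin(kl)
    # Real-valued form [[a, -b], [c, d]] of the complex chain matrix.
    a, b = np.ones_like(freq), np.zeros_like(freq)
    c, d = np.zeros_like(freq), np.ones_like(freq)
    for area in areas:  # sections ordered from glottis to lips
        a, b, c, d = (a * cos - b * area * sin, a * sin / area + b * cos,
                      c * cos + d * area * sin, d * cos - c * sin / area)
    return d


def formants(areas, length, n=3, fmax=5000.0, step=5.0):
    """First n resonances (Hz), found by grid search plus bisection."""
    grid = np.arange(0.5 * step, fmax, step)
    d = chain_d(grid, areas, length)
    roots = []
    for i in np.nonzero(d[:-1] * d[1:] < 0)[0][:n]:
        lo, hi = grid[i], grid[i + 1]
        for _ in range(40):
            mid = 0.5 * (lo + hi)
            if chain_d(mid, areas, length)[0] * d[i] > 0:
                lo = mid
            else:
                hi = mid
        roots.append(0.5 * (lo + hi))
    return np.array(roots)


def model(p):
    """Parameters are log-areas followed by the total length in cm."""
    return formants(np.exp(p[:-1]), p[-1])


def sensitivity(p, base, eps=1e-4):
    """Finite-difference formant change per unit change of each parameter."""
    cols = []
    for j in range(len(p)):
        q = p.copy()
        q[j] += eps
        cols.append((model(q) - base) / eps)
    return np.stack(cols, axis=1)


def invert(target, n_sections=8, area0=3.0, length0=17.5,
           iters=15, damping=1e-3, tol=1e-6):
    """Iteratively adjust areas and length until the formants match target."""
    p = np.append(np.full(n_sections, np.log(area0)), length0)
    for _ in range(iters):
        base = model(p)
        r = target - base
        if np.linalg.norm(r) < tol:
            break
        J = sensitivity(p, base)
        # Damped minimum-norm step: there are more unknowns than formants.
        delta = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(len(r)), r)
        for scale in (1.0, 0.5, 0.25, 0.125):  # accept only improving steps
            trial = p + scale * delta
            f = model(trial)
            if len(f) == len(target) and np.linalg.norm(target - f) < np.linalg.norm(r):
                p = trial
                break
    return np.exp(p[:-1]), p[-1]


# Made-up 8-section shape (cm^2) and length (cm), used only to create a target.
true_areas = np.array([2.0, 2.0, 1.5, 1.5, 3.0, 4.0, 4.0, 3.0])
target = formants(true_areas, 16.5)
est_areas, est_length = invert(target)
print("target formants :", np.round(target, 1))
print("matched formants:", np.round(formants(est_areas, est_length), 1))
print("estimated length:", round(float(est_length), 2), "cm")
```

Because three formants cannot pin down nine parameters, the estimated areas in this sketch need not equal `true_areas`; only the formants are matched. Using a fuller spectral description, as the abstract does with cepstrum parameters, is one way to constrain the shape further.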

Original language: English
Pages (from-to): 422-426
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - Jan 1 2014
Event: 15th Annual Conference of the International Speech Communication Association: Celebrating the Diversity of Spoken Languages, INTERSPEECH 2014 - Singapore, Singapore
Duration: Sep 14 2014 → Sep 18 2014

Fingerprint

Generative Models
Optimise
Cepstrum
Speech
Generative
Vocal Tract
Speech Signal
Iterative Procedure
Physical Model
Small Perturbations
Time series
Inversion
Perturbation
Target
parameter

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

@article{2eeebcb218b442e7a0ba9aa6d4fabc38,
title = "Estimation of vocal-tract shape from speech spectrum and speech resynthesis based on a generative model",
abstract = "Precise control of articulatory parameters is difficult and prevents a physical model from generating natural-sounding speech signals. To determine vocal-tract shape from speech, this paper presents an inversion method for simultaneously estimating the cross-sectional area and length of the vocal tract. In addition, we performed speech resynthesis from a time series of estimated vocal-tract shapes. The vocal-tract shape is determined through an iterative procedure that gradually optimizes the parameter values to produce the target speech spectrum. The vocal-tract shape is updated using a sensitivity function that represents the change in formant frequency caused by a small perturbation of the vocal-tract shape. When combined with a perturbation relationship of speech spectrum parameters (i.e., cepstrum parameters) and formants, our method effectively optimizes the vocal-tract shape. We quantitatively examined the accuracy using area function data for 10 isolated vowels. The results showed that the average area error was 0.43 cm$^2$ and the average length error was 0.23 cm. This indicates that the vocal-tract shape was determined with satisfactory accuracy. We also performed an estimation experiment for continuous speech and synthesized speech from the estimated vocal-tract shape.",
author = "Tokihiko Kaburagi",
year = "2014",
month = "1",
day = "1",
language = "English",
pages = "422--426",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}

TY - JOUR

T1 - Estimation of vocal-tract shape from speech spectrum and speech resynthesis based on a generative model

AU - Kaburagi, Tokihiko

PY - 2014/1/1

Y1 - 2014/1/1

N2 - Precise control of articulatory parameters is difficult and prevents a physical model from generating natural-sounding speech signals. To determine vocal-tract shape from speech, this paper presents an inversion method for simultaneously estimating the cross-sectional area and length of the vocal tract. In addition, we performed speech resynthesis from a time series of estimated vocal-tract shapes. The vocal-tract shape is determined through an iterative procedure that gradually optimizes the parameter values to produce the target speech spectrum. The vocal-tract shape is updated using a sensitivity function that represents the change in formant frequency caused by a small perturbation of the vocal-tract shape. When combined with a perturbation relationship of speech spectrum parameters (i.e., cepstrum parameters) and formants, our method effectively optimizes the vocal-tract shape. We quantitatively examined the accuracy using area function data for 10 isolated vowels. The results showed that the average area error was 0.43 cm² and the average length error was 0.23 cm. This indicates that the vocal-tract shape was determined with satisfactory accuracy. We also performed an estimation experiment for continuous speech and synthesized speech from the estimated vocal-tract shape.

UR - http://www.scopus.com/inward/record.url?scp=84910024002&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84910024002&partnerID=8YFLogxK

M3 - Conference article

SP - 422

EP - 426

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -