Improving body transmitted unvoiced speech with statistical voice conversion towards non-speech telephone

old_uid1189
titleImproving body transmitted unvoiced speech with statistical voice conversion towards non-speech telephone
start_date2006/05/11
schedule13h30
onlineno
location_infoBuilding C, 1st floor, Amphitheater C
summaryCellular phones let us communicate by speech anytime and anywhere. However, this has created a problem: in some situations, such as a meeting, the people around a speaker perceive speech as noise. To address this problem, we aim to realize a "non-speech telephone" that allows speech communication without annoying anyone in any situation. Non-Audible Murmur (NAM) enables us to talk while remaining silent. However, NAM is hard to use directly as a medium for human communication because of its low intelligibility and unfamiliar sound. We therefore propose a method for converting NAM to ordinary speech (NAM-to-Speech). In advance, we train GMMs that represent the correlations between the spectral and source features of NAM and those of speech, using around 50 utterance pairs of NAM and speech. Once these GMMs are trained, we can convert any sample of NAM to speech by maximum likelihood estimation (MLE) of the speech features given the NAM features. Although NAM-to-Speech converts NAM to intelligible voices of a quality similar to speech, a major problem remains: F0 is difficult to estimate from unvoiced speech. To avoid this problem, we propose another method that converts NAM to whisper, a familiar and intelligible form of unvoiced speech (NAM-to-Whisper). Moreover, we extend NAM-to-Whisper so that it accepts multiple types of body-transmitted unvoiced speech, such as NAM and Body Transmitted Whisper (BTW), as input. We evaluate the performance of the proposed conversion methods. Experimental results demonstrate that 1) NAM-to-Whisper significantly improves the intelligibility and naturalness of NAM, 2) NAM-to-Whisper outperforms NAM-to-Speech, and 3) a single conversion model can be trained to convert both NAM and BTW to the target voice.
responsiblesLécuyer
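
The GMM-based conversion step mentioned in the summary can be illustrated with a minimal sketch. It assumes scalar source (NAM) and target (speech or whisper) features and an already-trained joint GMM. The mapping shown is the simpler conditional-mean (minimum mean-square error) estimate, a close relative of the MLE-based conversion named in the abstract and not the speaker's actual method. The function names, the dictionary layout, and the parameter values are all hypothetical and chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert_frame(x, components):
    """Map one source feature x to a target feature with a joint GMM.

    Each component is a dict holding the parameters of a joint Gaussian
    over (source, target): weight, mean_x, mean_y, var_x, cov_xy.
    Returns the posterior-weighted sum of the per-component conditional
    means E[y | x, component] (conditional-mean mapping, not full MLE).
    """
    # Weighted likelihood of x under each component's source marginal.
    likelihoods = [
        c["weight"] * gaussian_pdf(x, c["mean_x"], c["var_x"]) for c in components
    ]
    total = sum(likelihoods)
    if total == 0.0:
        # x is far from every component: fall back to the prior weights.
        weight_sum = sum(c["weight"] for c in components)
        posteriors = [c["weight"] / weight_sum for c in components]
    else:
        posteriors = [l / total for l in likelihoods]

    y = 0.0
    for p, c in zip(posteriors, components):
        # Conditional mean of the target given the source for this component.
        cond_mean = c["mean_y"] + (c["cov_xy"] / c["var_x"]) * (x - c["mean_x"])
        y += p * cond_mean
    return y


# Hypothetical, hand-picked parameters (assumption: not taken from the talk).
TOY_GMM = [
    {"weight": 0.5, "mean_x": -1.0, "mean_y": 10.0, "var_x": 1.0, "cov_xy": 0.0},
    {"weight": 0.5, "mean_x": 1.0, "mean_y": 20.0, "var_x": 1.0, "cov_xy": 0.0},
]

if __name__ == "__main__":
    for x in (-1.0, 0.0, 1.0):
        print(x, convert_frame(x, TOY_GMM))
```

In a real system the features would be vectors (spectral and source parameters per frame), and the GMM parameters would be estimated from the parallel utterance pairs rather than written by hand.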