End-to-end speech recognition from the raw waveform

old_uid15963
titleEnd-to-end speech recognition from the raw waveform
start_date2018/05/30
schedule14h-15h30
onlineno
location_infosalle C005
detailsTALC seminar
summaryState-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. We study end-to-end systems trained directly from the raw waveform, introducing a trainable replacement of mel-filterbanks that uses a convolutional architecture, based on the scattering transform. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of melfilterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We then improve this model and another frontend previously proposed and based on gammatones. We perform open vocabulary experiments on Wall Street Journal and show a consistent and significant improvement in Word Error Rate of our trainable frontends over mel-filterbanks, even with random initialization.
responsiblesDutech