| old_uid | 3285 |
|---|
| title | Cognitively Inspired Approaches to Automatic Speech Recognition |
|---|
| start_date | 2007/10/18 |
|---|
| schedule | 13h30 |
|---|
| online | no |
|---|
| summary | For more than a decade there seems to be consensus in the scientific community that the dominant Hidden Markov Model approach to automatic speech recognition is not able to reach human-like performance. Yet, there have been few attempts to develop fundamentally new approaches, and perhaps even fewer that do not ignore the extremely valuable lessons that we have learned from HMM-based technology and its deployment in real-world applications.
In this presentation I will outline and explain why we became interested in the memory-prediction theory of cognitive processing and intelligence, and why we believe that this theory can be the basis for a fundamentally new approach to modeling automatic and human language acquisition and (spoken) language processing.
The first inspiration came from our work in multi-modal human-system interaction, in which it became evident that an approach based on the conventional pipeline of automatic speech & gesture recognition → language understanding & fusion → reasoning → response generation & fission is so different from what humans do that it cannot support transparent interaction. Instead, human-human interaction relies on multiple hierarchically organized perception-action loops with different response latencies. At the lowest levels of the hierarchy the latencies are so short that we must assume direct links between perception and action that do not require any kind of explicit reasoning.
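As a purely illustrative sketch (not part of the work described in this talk), the following Python fragment contrasts the two styles of coupling perception to action: a single sequential pipeline versus a hierarchy of loops with different response latencies. All class names, latencies and toy policies are hypothetical assumptions.

```python
import time

# Illustrative sketch only: contrasts a fast, reflex-like perception-action
# loop with a slower, deliberative loop; latencies and policies are made up.

class Loop:
    """One level of the hierarchy: maps a percept directly to an action,
    with a characteristic response latency."""
    def __init__(self, name, latency_s, policy):
        self.name = name
        self.latency_s = latency_s
        self.policy = policy

    def react(self, percept):
        time.sleep(self.latency_s)            # stand-in for processing time
        return self.policy(percept)

# Lowest level: latency so short that perception couples to action
# without any explicit reasoning (e.g. a back-channel cue).
reflex = Loop("reflex", 0.05, lambda percept: "nod / back-channel")

# Higher level: slower, runs a full recognize-understand-reason-respond chain.
def deliberative_policy(percept):
    words = percept.split()                               # toy 'recognition'
    meaning = {"topic": words[0] if words else None}      # toy 'understanding'
    return f"formulate a reply about {meaning['topic']}"  # toy 'generation'

deliberative = Loop("deliberative", 1.0, deliberative_policy)

if __name__ == "__main__":
    for loop in (reflex, deliberative):
        start = time.time()
        action = loop.react("weather looks nice today")
        print(f"{loop.name:12s} -> {action!r} after {time.time() - start:.2f}s")
```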
A second inspiration came from research in embodied artificial intelligence and especially from the awareness of the problem of mixing levels of description. All existing theories of human and machine speech and language processing model a meta-level description of the processes, rather than the processes themselves. One obvious example is the representation of speech as a sequence of phones or phonemes, i.e., symbols drawn from a closed set pre-defined by linguists. Yet, in a phoneme detection task people can ‘reconstruct’ phonemes that are not physically present in the signal if they are able to recognize the words. At the same time, many children have difficulty learning to read and write because they lack the meta-linguistic awareness that words can be analyzed as sequences of speech sounds.
The memory-prediction theory is based on recent knowledge about the hierarchical architecture of the neo-cortex, with bi-directional links between the levels, and lateral connections between parts of the brain that handle inputs from different sensors and control motor (and other physiological) actions. Specifically, this theory offers a basis for developing embodied simulations of language acquisition and language processing. In the very first stages, when babies do not need to distinguish between many different concepts, objects and actions, integral representations of short phrases or words can be formed on the lower levels of the hierarchy, linked with appropriate actions. As the need to distinguish between ever more numerous, complex and subtle differences increases, more abstract representations are formed on the higher levels of the hierarchy, also linked with appropriate actions.
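To convey the flavour of this layered memory (and nothing more), here is a hypothetical Python sketch of a two-level store: a lower level memorizes integral chunks and what tends to follow them, while a higher level memorizes more abstract sub-units that only become necessary once finer distinctions are required. The data structures, examples and split into levels are illustrative assumptions, not the ACORNS models.

```python
from collections import defaultdict

# Illustrative sketch only: a toy two-level memory in the spirit of the
# memory-prediction idea; all structures and examples are hypothetical.

class Level:
    def __init__(self, name):
        self.name = name
        self.successors = defaultdict(set)   # memory of observed sequences

    def learn(self, sequence):
        # Memorize which element follows which at this level's granularity.
        for current, following in zip(sequence, sequence[1:]):
            self.successors[current].add(following)

    def predict(self, element):
        # Top-down expectation: what this level predicts to come next.
        return self.successors.get(element, set())

# Lower level: integral chunks (whole words or short phrases).
chunks = Level("chunks")
chunks.learn(["want-juice", "give-juice"])

# Higher level: more abstract sub-units, formed only when the learner
# needs to tell apart an increasing number of similar chunks.
sub_units = Level("sub-units")
sub_units.learn("want-juice".split("-"))

print(chunks.predict("want-juice"))    # {'give-juice'}
print(sub_units.predict("want"))       # {'juice'}
```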
In the FP6 Future and Emerging Technologies project ACORNS we are developing computational simulations of (partially) embodied language acquisition based on the memory-prediction theory. We believe that the internal representations formed by the learning agent will differ from the conventional linguistic descriptions of speech, and that these will eventually result in more robust speech processing and more natural human-system interaction. |
|---|
| responsibles | Loevenbruck, Welby |
|---|