The 'Effective Complexity' of language: All grammars leak

old_uid9299
titleThe 'Effective Complexity' of language: All grammars leak
start_date2010/11/23
schedule12h30
onlineno
summaryThe concept of linguistic complexity is central to many theories of language structure and processing. However, the definition of linguistic complexity remains, in the best of cases, rather vague, leading to heated disagreement among linguists and psycholinguists as to what should be considered more or less complex. Interestingly, the debate on what is meant by 'complex' is not exclusive to linguistics or psycholinguistics. Rather, it is a subject of utmost importance in the physical sciences, where a long debate is taking place concerning how one should define complexity. Within this physical debate, Gell-Mann (1995) introduced the concept of 'Effective Complexity' (EC) as a general measure (and definition) of the complexity of an object or phenomenon. In essence, this notion of complexity is very simple: the complexity of an object is the length of the shortest possible description of the object's regularities. Notice that, in contrast with Kolmogorov's (1965) 'Algorithmic Complexity' (AC), Gell-Mann's notion refers to the description of the regularities, not of the object itself. In this way, both fully regular and fully random objects have very low or null complexity, whereas high complexity is attained only at the 'border between order and chaos'. I will show that the EC measure can be accurately estimated for human languages by combining two different information-theoretical AC measures of the language derived from corpora. Generalizing the corpus-based measures to their infinite-corpus-size limit, one can then obtain a strict measure of the complexity of a language independently of the corpora themselves. Using two English corpora, I demonstrate how these measures can be computed.
This approach uncovers a fundamental property of human languages: no grammar (in whichever grammatical paradigm one chooses) can ever account for English without under- or over-generating; that is, either it is unable to describe the whole of the language, or it generates sentences that cannot possibly occur. This confirms E. Sapir's famous adage "All grammars leak". This finding implies that the flowing nature and intrinsic incompleteness of human languages require speakers to rely on statistical methods for minimizing the 'leakage' in their internal grammars. Finally, applying the operationalization of EC described above to a corpus of Tok Pisin (an English-based creole spoken in Papua New Guinea), I will demonstrate that the common claim that the grammars of pidgins and creoles are 'simpler' than those of other languages is not founded in reality: objectively speaking, creole grammars are as complex as English grammars.
responsiblesBoussidan