Lexicalisation and lemmatization in the annotation of signed language corpora

old_uid5096
titleLexicalisation and lemmatization in the annotation of signed language corpora
start_date2008/06/16
schedule14h30-16h
onlineno
summaryIn this paper we discuss ongoing work on the creation of corpora for signed language (SL) research, especially the Auslan (Australian Sign Language) corpus. In particular, we address the issue of machine-readability and the use of lemmatised glosses (or ID-glosses). Plans for the creation of new SL corpora in Europe and North America will be seriously flawed if they do not take into account the issue of lemmatisation. SL corpora are needed to empirically ground generalisations on signed language vocabularies and grammars, and to facilitate peer review of these descriptions and the theories which draw on them. Of course, corpora are important for the testing of language hypotheses in all language research. However, this is especially true for deaf signing communities, which (like communities of creole spoken languages) are inevitably young minority language communities. Introspection and observation can help develop hypotheses, but because SLs lack written forms or standardised varieties, and have interrupted transmission and few native users, intuitions and researcher observations may fail due to the absence of clear native signer consensus. The past reliance on the intuitions of very few informants and on isolated textual examples (essentially inaccessible for peer review) has been problematic in the field. It should also go without saying that SL corpora, as with all modern linguistic corpora, should be representative, well-documented (i.e., with relevant metadata) and machine-readable (i.e., able to be annotated and tagged consistently and systematically). This requires dedicated technology (e.g., ELAN), standards and protocols (e.g., IMDI metadata descriptors), and transparent and agreed grammatical tags (e.g., grammatical class labels). However, we argue that it also requires the identification of lemmata.
Lemmatisation, that is, the classification or identification of related forms under a single label or lemma (the equivalent of headwords or headsigns in a dictionary), is fundamental to the process of corpus creation. A successful corpus project team should already have available a reference dictionary or lexical database to facilitate lemma identification and consistency in lemmatisation (using ID-glosses). A robust understanding of the processes of lexicalisation in SLs is thus essential. Reflecting this, annotation conventions need to be articulated and adhered to that distinguish between the different types of signs found within any SL text and treat each type consistently.
responsiblesAroui