Named Entity Recognition, Transliteration and Use in MT (Machine Translation), TDT (Topic Detection and Tracking) and MDS (Multi-Document Summarization)

old_uid3281
titleNamed Entity Recognition, Transliteration and Use in MT (Machine Translation), TDT (Topic Detection and Tracking) and MDS (Multi-Document Summarization)
start_date2007/10/16
schedule14h
onlineno
location_infoRoom B311
detailsSivaji Bandyopadhyay is invited as part of the STIC-Asie exchange programme coordinated by Patrick Saint-Dizier (IRIT, Toulouse)
summaryThe current trend in Named Entity Recognition (NER) is to use machine learning, which is attractive because such systems are trainable and adaptable, and much cheaper to maintain than rule-based ones. We have developed NER systems for Bengali using several techniques: pattern-directed shallow parsing with and without linguistic knowledge, a statistical Hidden Markov Model (HMM), a Maximum Entropy (ME) model, Conditional Random Fields (CRF) and a Support Vector Machine (SVM). NER in Indian languages (ILs), and in Bengali in particular, is challenging because ILs, unlike English, have no concept of capitalization. A web-based tagged Bengali news corpus of approximately 34 million wordforms, encoded in UTF-8, has been developed from the web archive of a leading Bengali newspaper, and part of this corpus has been used in the NER tasks. All the systems have been evaluated, and the SVM-based model outperformed the others with an overall F-score of 91.8%. We have used a modified joint source-channel model for named entity transliteration between English and Bengali. We are using named entity tags for English named entities in an English-Bengali machine translation system. We have recently started work on Story Link Detection, in which each news story is represented as a collection of four vectors: locations, proper names, temporal expressions and general terms. This 4-vector representation of each news document will be used to measure the similarity between two news documents. The representation and the similarity measure can then be used towards multi-document summarization of news stories.
responsibles<not specified>
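
The 4-vector story representation mentioned in the summary can be illustrated with a minimal sketch. The abstract does not say how the similarity is computed, so everything below is an assumption made for illustration: term-frequency vectors, cosine similarity per vector, equal weights of 0.25, and the hypothetical names `FIELDS`, `story_vectors` and `story_similarity`. It is not the speaker's actual method.

```python
import math
from collections import Counter

# The four vectors named in the summary; the key names are hypothetical.
FIELDS = ("locations", "proper_names", "temporal", "general")


def story_vectors(tagged_terms):
    """Build one term-frequency vector per field from (term, field) pairs.

    The tagging itself (e.g. by an NER system) is assumed to be done already.
    """
    vectors = {field: Counter() for field in FIELDS}
    for term, field in tagged_terms:
        vectors[field][term] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector contributes no similarity
    return dot / (norm_a * norm_b)


def story_similarity(s1, s2, weights=None):
    """Weighted sum of per-field cosine similarities.

    Equal weights are an assumption; in practice they would be tuned.
    """
    if weights is None:
        weights = {field: 1.0 / len(FIELDS) for field in FIELDS}
    return sum(weights[f] * cosine(s1[f], s2[f]) for f in FIELDS)


story_a = story_vectors([
    ("Kolkata", "locations"), ("Bandyopadhyay", "proper_names"),
    ("2007", "temporal"), ("seminar", "general"),
])
story_b = story_vectors([
    ("Toulouse", "locations"), ("Saint-Dizier", "proper_names"),
    ("2006", "temporal"), ("workshop", "general"),
])
print(story_similarity(story_a, story_a))
print(story_similarity(story_a, story_b))
```

Keeping the four vectors separate, rather than merging them into one bag of words, lets named entities and dates be weighted differently from general vocabulary when deciding whether two stories are linked.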