Perspectives for the structuring and automatic processing of textual information from scientific and technical databases

old_uid: 10506
title: Perspectives for the structuring and automatic processing of textual information from scientific and technical databases
start_date: 2011/12/05
schedule: 10:30
online: no
location_info: seminar room 4B08R
summary: How efficiently can we model and automatically process multi-source scientific and technical information mediated by a large set of documents? Scientific and technical text analysis has been receiving growing attention within the social sciences, driven by the increasing amount of text available in electronic format and the explosion of digital databases and libraries. This textual data comes mainly from articles and patents, but also from specialized sources such as financial and scientific project databases, economic news, surveys, bibliographic websites, and the blogosphere. To allow efficient access to and use of this information, several challenges must be overcome. At the organizational level, it is necessary to set up work teams, policies, and agreements, and to facilitate access to the information collected and produced. At the technical level, the question of how to process heterogeneous textual data must be addressed, along with other aspects such as handling large-scale corpora, reducing noise, dealing with possible duplication and multilingualism, and supporting further processing tasks for both machines and users. But where should we start? The automatic processing of multi-source scientific and technical information involves several computer science disciplines: data and knowledge engineering, text mining, natural language processing, information retrieval and visualization, and software ergonomics. To begin with, it is necessary to propose a kind of 'meeting point': a framework that brings these disciplines together in a focused way. Unfortunately, the heterogeneous and dynamic nature of the information does not make that task any easier. In this talk, we will present an approach to gathering, modeling, and preserving large-scale textual information by linking pieces of information, normalizing them, and enriching the resulting data. We will discuss an open-source modular framework (in pre-alpha development) called Scilmarin, designed to allow the automatic processing of large-scale multi-source textual information derived from scientific and technical databases. We will also present a draft XML specification for modeling scientific and technical data. Finally, we will explore the possibility of having Scilmarin take on natural language processing tasks by relying on other software tools such as Unitex.
responsibles: Sigogne, Rakho
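
To give a concrete flavor of the kind of processing the abstract describes (linking records from several sources, normalizing them, and serializing them to an XML model), here is a minimal sketch in Python. It is not Scilmarin and does not follow the draft XML specification mentioned in the talk; the field names, the dedup key, and the functions normalize_title, deduplicate, and to_xml are all illustrative assumptions.

    # Illustrative sketch only: normalize heterogeneous bibliographic records,
    # merge duplicates on a normalized title key, and emit a simple XML model.
    # All field names and the XML layout are assumptions, not Scilmarin's.
    import re
    import unicodedata
    import xml.etree.ElementTree as ET

    def normalize_title(title: str) -> str:
        """Lowercase, strip accents and punctuation to build a dedup key."""
        text = unicodedata.normalize("NFKD", title)
        text = "".join(c for c in text if not unicodedata.combining(c))
        text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
        return re.sub(r"\s+", " ", text).strip()

    def deduplicate(records):
        """Merge records sharing the same normalized title, keeping all sources."""
        merged = {}
        for rec in records:
            key = normalize_title(rec["title"])
            if key in merged:
                merged[key]["sources"].update(rec["sources"])
            else:
                merged[key] = {"title": rec["title"], "year": rec["year"],
                               "sources": set(rec["sources"])}
        return list(merged.values())

    def to_xml(records) -> str:
        """Serialize merged records into a flat, illustrative XML structure."""
        root = ET.Element("corpus")
        for rec in records:
            doc = ET.SubElement(root, "document", year=str(rec["year"]))
            ET.SubElement(doc, "title").text = rec["title"]
            for src in sorted(rec["sources"]):
                ET.SubElement(doc, "source").text = src
        return ET.tostring(root, encoding="unicode")

    if __name__ == "__main__":
        raw = [
            {"title": "Analyse de données textuelles", "year": 2011, "sources": ["articles"]},
            {"title": "Analyse de donnees textuelles!", "year": 2011, "sources": ["patents"]},
        ]
        # The two records differ only in accents/punctuation, so they merge
        # into one <document> element carrying both sources.
        print(to_xml(deduplicate(raw)))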