| old_uid | 10506 |
|---|
| title | Perspectives pour la structuration et le traitement automatique des informations textuelles issues de bases de données scientifiques et techniques |
|---|
| start_date | 2011/12/05 |
|---|
| schedule | 10:30 |
|---|
| online | no |
|---|
| location_info | seminar room 4B08R |
|---|
| summary | How efficiently can we model and automatically process multi-source scientific and
technical information mediated by a large set of documents? Scientific and technical
text analysis has been receiving growing attention within the social sciences, driven by
the increasing amount of text available in electronic format and the explosion of digital
databases and libraries. This textual data comes mainly from articles and patents,
but also from specialized databases such as financial and scientific project databases,
economic news, surveys, and many more sources such as bibliographic websites or the blogosphere.
In order to allow efficient access to and use of this information, several challenges must
be overcome: at an organizational level, it is necessary to set up work teams,
policies, and agreements, and to facilitate access to the information collected and
produced. At a technical level, the approach to processing heterogeneous textual
data must be discussed, along with other aspects such as the treatment of large-scale
corpora, reduction of the noise they contain, possible duplication, multilingualism, and several
further computer/user processing tasks. But where should we start? The automatic
processing of multi-source scientific and technical information involves various computer
science disciplines: data and knowledge engineering, text mining, natural language
processing, information retrieval and visualization, and software ergonomics. To begin
with, it is necessary to propose a sort of ‘meeting point’: a framework where these
disciplines can be brought together in a focused way. Unfortunately, the heterogeneous and
dynamic nature of the information does not make that task any easier. In this talk, we will
present an approach to gathering, modeling, and preserving large-scale textual information
by linking pieces of information, normalizing them, and enriching the resulting data. We will
discuss an open-source modular framework (in pre-alpha development), called Scilmarin,
designed to allow the automatic processing of large-scale multi-source textual information
derived from scientific and technical databases. We will also present a draft XML
specification for modeling scientific and technical data. Finally, we will explore the
possibility of having Scilmarin take on tasks involving automatic language processing, using
other software tools such as Unitex. |
|---|
| responsibles | Sigogne, Rakho |
|---|