Distributional Paraphrasing with Distributional and Hybrid Semantic Distance Measures

old_uid: 15846
title: Distributional Paraphrasing with Distributional and Hybrid Semantic Distance Measures
start_date: 2015/06/19
schedule: 11h
online: no
location_info: salle 165
summary: Semantic distance measures estimate how close in meaning two words or phrases (or larger text units) are. These measures are useful in paraphrase generation, which, in turn, is useful in NLP tasks such as statistical machine translation (SMT), information retrieval (IR), syntactic parsing, summarization and language generation.  I will start by presenting semantic measures: Lexicon-based semantic measures rely on a dictionary, thesaurus, or taxonomy (e.g., WordNet), while distributional measures rely instead only on word distributions in a large corpus of non-annotated text (word2vec being a recent example). Lexicon-based measures tend to have higher correlation with human judgments, but lower coverage than distributional measures, especially for multi-word terms, specialized domains, resource-poor ("low density") languages, or non-classical semantic relations. We are therefore motivated to explore hybrid lexicon/corpus-based models that would benefit from both worlds. Previous work used shallow thesaurus-based “concepts” (lists of related words) to define a coarse-grained aggregated distributional representation. I will show that finer-grained hybrid models can benefit from concept information while retaining a high-coverage word-based distributional representation.  Next, I will present a largely language-independent distributional paraphrase generation method employing some of these semantic measures.  Time permitting, I will conclude by describing the integration and evaluation of paraphrasing in state-of-the-art SMT and in the IR task of event discovery and annotation.
responsibles: Candito
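
The following is a minimal illustrative sketch, in Python, of the distributional vs. hybrid distinction described in the summary; it is not the measures or the paraphrasing method presented in the talk. The toy vectors, the lexicon_sim table, the alpha interpolation weight and all function names are assumptions made for illustration. In practice the word vectors would come from a model such as word2vec and the lexicon scores from a resource such as WordNet.

# Sketch only: cosine similarity over toy distributional vectors, plus a
# hypothetical hybrid score that interpolates a lexicon-based score when the
# lexicon covers the word pair and backs off to the distributional score
# otherwise. All names and values here are illustrative assumptions.
import numpy as np

# Toy distributional vectors (in practice: word2vec trained on a large corpus).
vectors = {
    "car":   np.array([0.9, 0.1, 0.3]),
    "auto":  np.array([0.8, 0.2, 0.4]),
    "river": np.array([0.1, 0.9, 0.2]),
}

# Hypothetical lexicon-based scores (e.g., derived from WordNet); coverage is
# partial, which is the gap a hybrid model is meant to fill.
lexicon_sim = {("car", "auto"): 0.95}

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def distributional_sim(w1, w2):
    return cosine(vectors[w1], vectors[w2])

def hybrid_sim(w1, w2, alpha=0.5):
    # Interpolate lexicon and distributional scores when the lexicon covers
    # the pair; otherwise fall back to the distributional score alone.
    dist = distributional_sim(w1, w2)
    lex = lexicon_sim.get((w1, w2), lexicon_sim.get((w2, w1)))
    if lex is None:
        return dist
    return alpha * lex + (1 - alpha) * dist

print(hybrid_sim("car", "auto"))   # pair covered by the lexicon
print(hybrid_sim("car", "river"))  # distributional score only

A linear interpolation is only one possible combination scheme; the point of the sketch is simply that the hybrid score can exploit lexicon information where it exists while keeping the high coverage of the word-based distributional representation.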