Typologically fair NLP

title: Typologically fair NLP
start_date: 2022/12/02
schedule: 14h-15h
online: yes
location_info: in person and by videoconference via the Teams link
summary: The field of NLP has historically been strongly biased towards work that uses English as its primary language of investigation. The situation is changing, and multilingual NLP is booming. This talk starts with a description of the state of multilingual NLP, highlighting both its successes and its limitations. In particular, large multilingual pretrained language models (PLMs) such as mBERT or XLM-R have shown surprising cross-lingual capabilities, but they cover only a small fraction of the world's languages, with large inequalities in performance across them. These inequalities stem from at least two sources: 1) NLP datasets are highly imbalanced with regard to typological diversity, and 2) NLP models tend to be developed for English first and then adapted to other languages, which leads to biases in the model assumptions. I describe attempts at overcoming both of these limitations. To overcome data imbalance, I describe a method from algorithmic fairness that samples data from different sources in a way that is more robust to underrepresented languages than alternative sampling methods. To overcome model assumption biases, I describe a PLM that uses pixel-based representations of language instead of the commonly used subword representations. I conclude with some directions for working towards typologically fairer NLP.
responsibles: Rolin