Expanding Linguistic and Cultural Coverage in Common Crawl

title: Expanding Linguistic and Cultural Coverage in Common Crawl
start_date: 2026/06/12
schedule: 11h
online: no
location_info: videoconference (Big Blue Button)
summary: The Common Crawl Foundation (CCF) provides the largest open corpus of web data, enabling a wide range of scientific and technical applications, including large language model (LLM) development. However, our current data processing pipeline struggles with multilingual data, which reduces language representation in the corpus and degrades downstream model performance. In this talk, we will discuss CCF’s initiatives to improve the multilingual coverage and language identification of our web corpus. These efforts include soliciting crowd-sourced web seeds for under-served languages and creating CommonLID, a community-driven, human-annotated language identification benchmark for the web domain. Throughout, we emphasize the collaborative nature of this work, carried out in partnership with members of the NLP community to improve the content available in their languages.
responsibles: Bawden