state: published

Expanding Linguistic and Cultural Coverage in Common Crawl

title	Expanding Linguistic and Cultural Coverage in Common Crawl
start_date	2026/06/12
schedule	11:00
online	no
location_info	BigBlueButton videoconference
summary	The Common Crawl Foundation (CCF) provides the largest open corpus of web data, enabling a wide range of scientific and technical applications including large language model (LLM) development. However, our current data processing pipeline faces challenges when processing multilingual data, decreasing language representation and impacting downstream model performance. In this talk, we will discuss CCF’s initiatives to improve multilingual coverage and language identification of our web corpus. These efforts include soliciting crowd-sourced web seeds for under-served languages and creating CommonLID, a community-driven, human-annotated language identification benchmark for the web domain. Throughout, we emphasize the collaborative nature of our efforts, working in partnership with members of the NLP community to improve content available in their languages.
responsibles	Bawden

Workflow history

from state (1)	to state	comment	date
submitted	published		2026/06/10 07:01 UTC

hosted_by

Institut national de recherche en informatique et en automatique - Inria

speakers

event_of

Automatic language modelling and analysis & computational humanities (ALMAnaCH team seminar, Inria, Paris) (2025)

Event #6021711 - latest update on 2026/06/10, created on 2026/06/10