| field | value |
|---|---|
| title | Probing large language models for syntactic structure: Creating nonce treebanks in order to separate syntax from semantics |
| start_date | 2023/05/17 |
| schedule | 17h-18h |
| online | no |
| location_info | Online |
| summary | The question we address in this work is to what extent large language models (LMs) learn syntactic structure. One difficulty when probing LMs for syntax is determining whether the findings are purely due to syntax or are the result of semantic knowledge learned by the model. To avoid this potential confound, we create syntactically well-formed but semantically nonce treebanks: starting from an existing dependency treebank, we replace a certain ratio of words with words that can appear in the same syntactic contexts. In Arps et al. (2022), we use the Penn Treebank and semantically perturbed versions of it to train a linear probe that constructs entire constituency parse trees based on a sequence labeling scheme. We find that even on semantically perturbed data, the probe can reconstruct the constituency tree with an F1 score of 72.8 in the lowest nonce setting. In more recent work, we extend this line of research to other languages, creating nonce treebanks for Arabic, Chinese, French, German, and Russian. We then apply a structural probing method for labeled dependency trees (Müller-Eberstein et al., 2022) to these nonce treebanks. The results show that probes trained on the original treebanks are able to predict the syntactic structure of nonce test sets with good performance, albeit at an increased error rate. Probes trained on nonce treebanks perform on par with standard probes when evaluated on the original treebanks, in both monolingual and crosslingual settings. Together with the results of Arps et al. (2022), this indicates that the syntactic structure of nonce sentences is encoded in the language model. |
| responsibles | Bernard |
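
The summary describes the perturbation step only at a high level. A minimal illustration of the general idea, not the authors' actual pipeline, might look like the following: a fraction of tokens is swapped for other vocabulary items carrying the same POS tag, used here as a rough stand-in for "words that can appear in the same syntactic contexts" (the function names, the toy data, and the POS-based matching criterion are all illustrative assumptions).

```python
import random
from collections import defaultdict


def build_substitution_pools(sentences):
    """Group the treebank vocabulary by POS tag.

    POS identity is used here as a crude proxy for "words that can
    appear in the same syntactic contexts"; the matching criterion in
    the actual work may be stricter.
    """
    pools = defaultdict(set)
    for sent in sentences:
        for form, pos in sent:
            pools[pos].add(form)
    return {pos: sorted(forms) for pos, forms in pools.items()}


def make_nonce(sentence, pools, ratio, rng):
    """Replace roughly `ratio` of the tokens with a different word of
    the same POS, leaving the tag sequence (and hence the syntactic
    annotation) intact."""
    nonce = []
    for form, pos in sentence:
        candidates = [w for w in pools[pos] if w != form]
        if candidates and rng.random() < ratio:
            form = rng.choice(candidates)
        nonce.append((form, pos))
    return nonce


# Toy two-sentence "treebank" of (form, POS) pairs.
treebank = [
    [("the", "DET"), ("cat", "NOUN"), ("chased", "VERB"),
     ("a", "DET"), ("mouse", "NOUN")],
    [("a", "DET"), ("dog", "NOUN"), ("ate", "VERB"),
     ("the", "DET"), ("bone", "NOUN")],
]

rng = random.Random(0)
pools = build_substitution_pools(treebank)
print(make_nonce(treebank[0], pools, ratio=0.5, rng=rng))
```

Because every substitute shares the original word's tag, the gold tree of the source sentence can be carried over unchanged, which is what makes it possible to evaluate probes on semantically nonce but fully syntactically annotated data.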
Workflow history

| from state | to state | comment | date |
|---|---|---|---|
| submitted | published | | 2023/06/08 14:39 UTC |