| field | value |
|---|---|
| title | Probing large language models for syntactic structure: Creating nonce treebanks in order to separate syntax from semantics |
| start_date | 2023/05/17 |
| schedule | 17h-18h |
| online | no |
| location_info | Online |
| summary | The question we address in this work is to what extent large language models (LMs) learn syntactic structure. One difficulty when probing LMs for syntax is determining whether the findings are purely due to syntax or are the result of semantic knowledge learned by the model. To avoid this potential confound, we create syntactically well-formed but semantically nonce treebanks: starting from an existing dependency treebank, we replace a certain ratio of words with words that can appear in the same syntactic contexts. In Arps et al. (2022), we use the Penn Treebank and semantically perturbed versions of it to train a linear probe that constructs entire constituency parse trees based on a sequence labeling scheme. We find that even on semantically perturbed data, the probe can reconstruct the constituency tree with an F1 score of 72.8 in the lowest nonce setting. In more recent work, we extend this line of research to other languages, creating nonce treebanks for Arabic, Chinese, French, German, and Russian. We then apply a structural probing method for labeled dependency trees (Müller-Eberstein et al., 2022) to these nonce treebanks. The results show that probes trained on the original treebanks are able to predict the syntactic structure of nonce test sets with good performance, albeit at an increased error rate. Probes trained on nonce treebanks perform on par with standard probes when evaluated on the original treebanks, in both monolingual and crosslingual settings. Together with the results of Arps et al. (2022), this indicates that the syntactic structure of nonce sentences is encoded in the language model. |
| responsibles | Bernard |
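
The summary describes the perturbation step only at a high level. A minimal illustration of the general idea, not the authors' actual pipeline, might look like the following: a fraction of tokens is swapped for other vocabulary items carrying the same POS tag, used here as a rough stand-in for "words that can appear in the same syntactic contexts" (the function names, the toy data, and the POS-based matching criterion are all illustrative assumptions).

```python
import random
from collections import defaultdict


def build_substitution_pools(sentences):
    """Group the treebank vocabulary by POS tag.

    POS identity is used here as a crude proxy for "words that can
    appear in the same syntactic contexts"; the matching criterion in
    the actual work may be stricter.
    """
    pools = defaultdict(set)
    for sent in sentences:
        for form, pos in sent:
            pools[pos].add(form)
    return {pos: sorted(forms) for pos, forms in pools.items()}


def make_nonce(sentence, pools, ratio, rng):
    """Replace roughly `ratio` of the tokens with a different word of
    the same POS, leaving the tag sequence (and hence the syntactic
    annotation) intact."""
    nonce = []
    for form, pos in sentence:
        candidates = [w for w in pools[pos] if w != form]
        if candidates and rng.random() < ratio:
            form = rng.choice(candidates)
        nonce.append((form, pos))
    return nonce


# Toy two-sentence "treebank" of (form, POS) pairs.
treebank = [
    [("the", "DET"), ("cat", "NOUN"), ("chased", "VERB"),
     ("a", "DET"), ("mouse", "NOUN")],
    [("a", "DET"), ("dog", "NOUN"), ("ate", "VERB"),
     ("the", "DET"), ("bone", "NOUN")],
]

rng = random.Random(0)
pools = build_substitution_pools(treebank)
print(make_nonce(treebank[0], pools, ratio=0.5, rng=rng))
```

Because every substitute shares the original word's tag, the gold tree of the source sentence can be carried over unchanged, which is what makes it possible to evaluate probes on semantically nonce but fully syntactically annotated data.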
Workflow history

| from state | to state | comment | date |
|---|---|---|---|
| submitted | published | | 2023/06/08 14:39 UTC |