Feature Selection for Semantic Clustering of Hungarian Verbs

The presentation will focus on current issues in semantic verb classification, with particular emphasis on the choice of the feature set that represents the syntactic distribution of verbs. Experiments on supervised semantic verb classification generally aim at obtaining verb classes equivalent to Beth Levin's English verb classification or WordNet synsets. For languages that lack such hand-made resources, verb classes can be obtained by unsupervised clustering. In either case, the underlying hypothesis is that syntactically similar verbs share one or more meaning components, and an adequate representation of the verbs' distributional context is crucial to the success of the experiment. The first experiments on clustering Hungarian verbs were carried out using data from a manually annotated treebank. The 150 most frequent verbs were categorized according to the complementation patterns they exhibit in the treebank. There was no limit on the length of the patterns, and adjuncts were also included. As Hungarian is a highly inflective language (with 19 different case suffixes), this yielded a huge number of different patterns, i.e., a large feature set. Although such a large number of features scatters the frequency data, the results were promising. However, in order to extend the clustering to less frequent verbs, one needs a bigger and, consequently, automatically parsed corpus. As the parser introduces more noise, which is an even more sensitive issue when dealing with medium- or low-frequency verbs, the question arises of how the feature set can be tuned to achieve a more precise description of the syntactic distribution of verbs. An obvious solution would be to filter complementation patterns according to their frequency, but this could yield misleading results for low-frequency verbs. A more sophisticated method would be to use a manually built verbal valency dictionary to filter out longer and/or less relevant distributional contexts.
On the other hand, this method would mean losing the information carried by adjuncts (e.g., temporal adjuncts reveal aspectual properties of verbs). After presenting the results of the first clustering experiment, I will outline future research directions with respect to feature set reduction and discuss the advantages and disadvantages of the particular methods.
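
To make the two feature-reduction strategies concrete, the following is a minimal sketch and not the setup used in the experiments. The observations, the case labels (NOM, ACC, DAT, SUB for sublative, INE for inessive), the valency entries, and all function names are invented for illustration. Cosine similarity stands in for whatever similarity measure or clustering algorithm is actually applied, which the abstract does not specify.

```python
from collections import Counter
from math import sqrt

# Invented toy observations: (verb, complementation pattern).
# A pattern is a "+"-joined list of case labels; INE marks a locative adjunct.
OBSERVATIONS = [
    ("ad", "NOM+ACC+DAT"), ("ad", "NOM+ACC+DAT"), ("ad", "NOM+ACC"),
    ("ad", "NOM+ACC+DAT+INE"),
    ("mutat", "NOM+ACC+DAT"), ("mutat", "NOM+ACC"), ("mutat", "NOM+ACC"),
    ("megy", "NOM+SUB"), ("megy", "NOM+SUB"), ("megy", "NOM"),
    ("fut", "NOM"), ("fut", "NOM"), ("fut", "NOM+SUB"),
]

# Hypothetical valency dictionary: the cases each verb licenses as complements.
VALENCY = {
    "ad": {"NOM", "ACC", "DAT"},
    "mutat": {"NOM", "ACC", "DAT"},
    "megy": {"NOM", "SUB"},
    "fut": {"NOM", "SUB"},
}


def pattern_counts(observations):
    """Build one pattern-frequency vector (a Counter) per verb."""
    counts = {}
    for verb, pattern in observations:
        counts.setdefault(verb, Counter())[pattern] += 1
    return counts


def filter_by_frequency(counts, min_total):
    """Keep only patterns seen at least min_total times across all verbs."""
    totals = Counter()
    for vector in counts.values():
        totals.update(vector)
    kept = {p for p, n in totals.items() if n >= min_total}
    return {
        verb: Counter({p: n for p, n in vector.items() if p in kept})
        for verb, vector in counts.items()
    }


def filter_by_valency(counts, valency):
    """Strip cases the dictionary does not license, merging reduced patterns.

    Verbs missing from the dictionary end up with an empty vector.
    """
    reduced = {}
    for verb, vector in counts.items():
        allowed = valency.get(verb, set())
        merged = Counter()
        for pattern, n in vector.items():
            cases = [c for c in pattern.split("+") if c in allowed]
            if cases:
                merged["+".join(cases)] += n
        reduced[verb] = merged
    return reduced


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[p] * b[p] for p in a.keys() & b.keys())
    norm_a = sqrt(sum(n * n for n in a.values()))
    norm_b = sqrt(sum(n * n for n in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


raw = pattern_counts(OBSERVATIONS)
by_freq = filter_by_frequency(raw, min_total=2)
by_val = filter_by_valency(raw, VALENCY)

for name, vectors in [("raw", raw), ("frequency", by_freq), ("valency", by_val)]:
    print(name, round(cosine(vectors["ad"], vectors["mutat"]), 3))
```

In this toy data the frequency filter simply discards the rare pattern containing the adjunct, whereas the valency filter folds it into the shorter complement pattern. Either way the adjunct information is no longer visible in the feature vector, which is the trade-off raised above.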