In this talk I deal with the automated acquisition of linguistic knowledge as a means of enhancing the robustness of lexicalised grammars for real-life applications.
I focus on Multiword Expressions (henceforward MWEs). Specifically, in the first part of the talk I take a closer look at the linguistic properties of MWEs, in particular their lexical, syntactic and semantic characteristics.
With these observations about the linguistic properties of MWEs at hand, I turn in the second part of the talk to methods for the automated acquisition of these properties for robust grammar engineering and parsing. To this effect, I first investigate the hypothesis that MWEs can be detected by the distinct statistical properties of their component words, regardless of their type, comparing various statistical measures, a procedure which leads to interesting conclusions. I then investigate the influence of the size and quality of different corpora, using the BNC and the web search engines Google and Yahoo. I conclude that, in terms of language usage, web-generated corpora are fairly similar to more carefully built corpora like the BNC, which indicates that their lack of control and balance is probably compensated for by their size.
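Purely as an illustration of the kind of lexical association measure compared in this part of the talk (the abstract does not say which measures are actually used), the following sketch scores bigram candidates with pointwise mutual information over a toy corpus; real counts would come from the BNC or from web search hit counts.

    import math
    from collections import Counter

    def pmi(pair_count, w1_count, w2_count, n):
        """Pointwise mutual information of a word pair, from raw counts
        normalised by corpus size (small-sample corrections omitted)."""
        return math.log2((pair_count / n) / ((w1_count / n) * (w2_count / n)))

    # Toy corpus; in practice the counts would come from the BNC or from
    # web search engine hit counts, as discussed in the talk.
    tokens = "he kicked the bucket and then he kicked the ball".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    for (w1, w2), c in sorted(bigrams.items()):
        print(w1, w2, round(pmi(c, unigrams[w1], unigrams[w2], n), 2))

In practice several such measures (e.g. t-score, log-likelihood) would be computed and compared on the same candidate lists.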
Then, I present a qualitative evaluation of the results of automatically adding extracted MWEs to existing linguistic resources. To this effect, I first discuss the two main approaches commonly employed in NLP for treating MWEs: the words-with-spaces approach, which models an MWE as a single lexical entry and can adequately capture fixed MWEs like "by and large", and compositional approaches, which treat MWEs with general and compositional methods of linguistic analysis. On this basis, I argue that the automatic addition of extracted MWEs to existing linguistic resources improves qualitatively if a more compositional approach to automated grammar/lexicon extension is adopted.
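To make the contrast concrete, here is a toy sketch of a words-with-spaces lexicon in use (an illustration only, not the representation of any particular grammar): fixed MWEs listed in the lexicon are merged into single tokens, while everything else is left to compositional analysis.

    # Toy illustration only: fixed MWEs listed as single entries
    # ("words with spaces") are merged into one token by the tokenizer.
    WORDS_WITH_SPACES = {"by and large", "ad hoc"}

    def tokenize(sentence, lexicon=WORDS_WITH_SPACES):
        words = sentence.lower().split()
        out, i = [], 0
        while i < len(words):
            # Greedily try to match a multiword entry starting at position i.
            for j in range(len(words), i + 1, -1):
                if " ".join(words[i:j]) in lexicon:
                    out.append(" ".join(words[i:j]))  # one lexical entry
                    i = j
                    break
            else:
                out.append(words[i])  # left to compositional analysis
                i += 1
        return out

    print(tokenize("By and large the plan worked"))
    # ['by and large', 'the', 'plan', 'worked']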
Finally, I propose that the methods developed for the acquisition of linguistic knowledge in the case of English MWEs can be tuned to enhance the robustness of parsing with lexicalised grammars for languages with richer morphology and freer word order, such as German.
Valia Kordoni is at Humboldt University, Berlin and at Saarland University
Studies of language in Alzheimer's disease have concluded that, along with a general cognitive decline, linguistic features are also negatively affected. Studies of the language of healthy elders also observe a linguistic decline, but one which, in contrast, is markedly less severe than that induced by dementia. We examine whether the disease can be detected from the diachronic changes in written texts and, more importantly, whether it can be clearly distinguished from normal aging.
Lexical and syntactic analyses were conducted on 51 novels by three prolific literary authors: Iris Murdoch, P.D. James, and Agatha Christie. Murdoch was diagnosed with Alzheimer's disease shortly after finishing her last novel; James, at 89 years of age, continues to publish critically acclaimed works; Christie, whose last few novels are deemed strikingly subpar compared to her previous works, presents an interesting case study of possible dementia.
The lexical analysis reveals significant patterns of decline in Murdoch's and Christie's later novels, while James's rates remain relatively consistent throughout her career. The syntactic measures, though revealing fewer significant linear trends, show a cubic pattern of change in Murdoch's novels, with a steep decline around her 50s. Our findings support the hypothesis that dementia, which manifests clearly in lexical features, can be detected in writing.
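The abstract does not specify which lexical measures or model-fitting procedures were used; the following sketch is meant only to illustrate the general kind of analysis, computing a simple vocabulary-richness measure (type-token ratio over fixed-size windows) per novel and fitting linear and cubic trends against the author's age, with made-up placeholder numbers.

    import numpy as np

    def windowed_ttr(text, window=5000):
        """Mean type-token ratio over fixed-size windows; one simple
        vocabulary-richness measure among many possible choices."""
        words = text.lower().split()
        ratios = [len(set(words[i:i + window])) / window
                  for i in range(0, len(words) - window + 1, window)]
        return float(np.mean(ratios))

    # Placeholder numbers only (author's age at publication, measured TTR);
    # the real study measures these on the digitised novels.
    ages = np.array([35, 40, 45, 50, 55, 60, 65, 70, 75])
    ttr = np.array([0.44, 0.45, 0.46, 0.44, 0.42, 0.41, 0.40, 0.38, 0.35])

    linear_fit = np.polyfit(ages, ttr, 1)  # linear trend
    cubic_fit = np.polyfit(ages, ttr, 3)   # cubic model of change
    print("linear slope:", linear_fit[0])
    print("cubic coefficients:", cubic_fit)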
(Joint work with Xuan Le, Ian Lancashire, and Regina Jokel)
Dry run of talks for Nodalida 2013 in Oslo
Yvonne Adesam & Gerlof Bouma: Experiments on sentence segmentation in Old Swedish editions
Malin Ahlberg & Peter Andersson: Towards automatic tracking of lexical change: linking historical lexical resources
Lärka is an ICALL* platform aimed primarily at learners of Swedish as a second/foreign language. It is divided into several modules: an exercise generator with activities for university students of linguistics and for second/foreign language learners, including multiple-choice and spelling exercises; and modules facilitating different aspects of development and research, which at the moment consist of an experimental sentence readability module for the level-wise selection of appropriate dictionary examples and exercise items, and an editor for learner-oriented corpora.
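As a rough illustration of what such an exercise-generator module produces (the word list and selection strategy below are invented for the example and are not Lärka's actual algorithm), here is a sketch of a gap-fill multiple-choice item:

    import random

    def multiple_choice_item(sentence, target, distractor_pool, n_choices=4):
        """Blank out the target word and offer it among randomly chosen
        distractors (a toy selection strategy, for illustration only)."""
        prompt = sentence.replace(target, "_____", 1)
        distractors = random.sample(
            [w for w in distractor_pool if w != target], n_choices - 1)
        choices = distractors + [target]
        random.shuffle(choices)
        return {"prompt": prompt, "choices": choices, "answer": target}

    item = multiple_choice_item("Hon läser en bok", "läser",
                                ["skriver", "springer", "äter", "sjunger"])
    print(item)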
The platform is under active development, and in this talk we will describe its current state, including the exercise generators and the principles behind them, as well as the two projects relevant for Lärka's development: the collection of a corpus of coursebook texts used in CEFR**-based language teaching and the sentence readability project.
We expect our talk to be interesting for computational linguists, language teachers, lexicographers and linguists in general.
*ICALL = Intelligent Computer-Assisted Language Learning
**CEFR = Common European Framework of Reference for Languages
Readability formulas are methods used to match texts to readers' reading levels. Several methodological paradigms have been investigated in the field. The most popular paradigm dates back several decades and gave rise to well-known readability formulas such as the Flesch formula (among several others).
In this talk, I will present the results of a study carried out in collaboration with Thomas François from UCLouvain. We compare traditional readability formula approaches (henceforth "classic") with an emerging paradigm which uses sophisticated NLP-enabled features and machine learning techniques.
Our experiments, carried out on a corpus of texts for French as a foreign language, yield four main results: (1) the new readability formula performed better than the "classic" formula; (2) "non-classic" features were slightly more informative than "classic" features; (3) modern machine learning algorithms did not improve the explanatory power of our readability model, but made it possible to better classify new observations; and (4) combining "classic" and "non-classic" features resulted in a significant gain in performance.
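For reference, the "classic" paradigm mentioned above is exemplified by the Flesch reading-ease formula; the sketch below computes it for English text with a crude syllable heuristic (the study itself concerns French and uses a different, NLP-enabled feature set).

    import re

    def count_syllables(word):
        """Crude vowel-group heuristic standing in for a real syllable counter."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text):
        """Classic formula: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return (206.835
                - 1.015 * (len(words) / len(sentences))
                - 84.6 * (syllables / len(words)))

    print(flesch_reading_ease("The cat sat on the mat. It was happy."))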
Four short talks in preparation for the FrameNet workshop in Berkeley.
Rule-based machine translation systems need explicit linguistic resources, which are usually coded by human experts in a highly time-consuming task. When experts are not available for a given language pair, automatic or semi-automatic acquisition of these resources can be carried out.
In this talk, a method to build monolingual dictionaries from the contributions of non-expert users is presented. A strategy to learn shallow-transfer rules from small parallel corpora is also described. Finally, the integration of shallow-transfer rules in statistical machine translation is addressed.
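To give a feel for what a shallow-transfer rule does (the format below is a toy invention, not the rule formalism of the system presented in the talk), here is a sketch of a rule that reorders an adjective-noun pair over part-of-speech-tagged, already-translated words:

    # Toy shallow-transfer rule: match a short part-of-speech pattern over
    # tagged target-language words (still in source order) and reorder them.
    RULES = [
        {"pattern": ["adj", "noun"], "order": [1, 0]},  # "red house" -> "casa roja"
    ]

    def apply_rules(tagged_words):
        """tagged_words: list of (surface_form, pos_tag) pairs."""
        out, i = [], 0
        while i < len(tagged_words):
            for rule in RULES:
                n = len(rule["pattern"])
                window = tagged_words[i:i + n]
                if [pos for _, pos in window] == rule["pattern"]:
                    out.extend(window[j] for j in rule["order"])
                    i += n
                    break
            else:
                out.append(tagged_words[i])
                i += 1
        return out

    print(apply_rules([("roja", "adj"), ("casa", "noun")]))
    # [('casa', 'noun'), ('roja', 'adj')]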
Victor Sánchez-Cartagena (Alacant)
A wordnet is a kind of lexical semantic network that describes word meanings in terms of the lexical semantic relations into which they enter. This is a limited kind of description, but it has made it possible to build large wordnets (e.g. the seminal Princeton WordNet, or plWordNet) which are useful in various Natural Language Processing applications. A wordnet must be large enough to provide practical support for such applications, yet its construction is laborious, which is a serious limitation. However, the indispensable manual work can be effectively facilitated by language knowledge extracted from large corpora: for example, pairs of words linked by various lexico-semantic relations are processed by the WordnetWeaver system to produce suggestions for wordnet expansion.
Pattern-based methods exploit occurrences of word pairs in search of lexico-syntactic constructions that can be markers of particular lexico-semantic relations. Distributional Semantics methods are based on the analysis of statistically significant similarities among different word uses in order to identify words that are semantically related. The results produced by the two types of methods are complementary to some extent: pattern-based methods extract pairs of words that appear to be linked by particular lexico-semantic relations, while Distributional Semantics produces measures of semantic relatedness between words. The advantages, disadvantages and limitations of both paradigms will be discussed on the basis of rich practical experience in their use.
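For concreteness, the pattern-based idea can be illustrated with a single Hearst-style lexico-syntactic pattern ("X such as Y, Z") matched over raw text to propose candidate hypernym-hyponym pairs; the pattern and example below are illustrative only, and the actual pipeline uses far richer patterns and corpus processing.

    import re

    # One Hearst-style lexico-syntactic pattern, used only to illustrate
    # the pattern-based extraction paradigm.
    PATTERN = re.compile(r"(\w+) such as ((?:\w+(?:, )?)+)")

    def extract_hypernym_pairs(text):
        pairs = []
        for hypernym, hyponyms in PATTERN.findall(text):
            for hyponym in re.split(r",\s*", hyponyms):
                pairs.append((hyponym, hypernym))
        return pairs

    print(extract_hypernym_pairs(
        "The shop sells instruments such as guitars, violins and harps."))
    # [('guitars', 'instruments'), ('violins', 'instruments')]

The distributional side would instead compare context vectors of words and score their relatedness, for instance with cosine similarity.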
The WordnetWeaver system can utilise the results of a number of different extraction methods and suggest the likely location of a new word in the wordnet. Each suggestion defines a potential sense of a new word. The suggestions are presented visually on the relation network graph. Linguists can browse the suggestions, and can modify and freely edit the wordnet structure.
The complete process of data processing and relation extraction will be discussed from the perspective of our experience in wordnet building. A corpus-based lexicographic process supported by the WordnetWeaver system will be presented, and the possibilities and limitations of semi-automated wordnet expansion will be discussed on the basis of examples collected during the expansion of plWordNet.
The work was co-funded by the European Union Innovative Economy Programme (Project POIG.01.01.02-14-013/09) and the Polish Ministry of Science and Higher Education (Project N N516 068637).
Maciej Piasecki (Wroclaw)