Diabase

The need for a basic research infrastructure for language technology is increasingly recognized by the language technology research community and research funding agencies alike. At the core of such an infrastructure we find the so-called BLARK -- Basic LAnguage Resource Kit, a set of language resources and language technology tools deemed essential both to fundamental research in language technology and to the development of useful language technology applications for a language. The BLARK, as normally presented in the literature, reflects a modern standard language variety, which is topic- and genre-neutral, thus abstracting away from all kinds of language variation. However, modern linguistics increasingly recognizes variation as a fundamental and essential characteristic of human language. We thus argue that a BLARK could fruitfully be extended along any of the three axes implicit in this characterization (the social, the topical and the temporal). In our case, it would be extended along the temporal axis, towards a diachronic BLARK for Swedish, which can be used to develop e-science tools in support of historical studies.

We are currently extending and merging two lexical resources, SALDO and Dalin. Additionally, we have three major dictionaries of Old Swedish (1225--1526): Söderwall (23,000 entries), Söderwall supplement (21,000 entries), and Schlyter (10,000 entries). Due to overlap, the three resources together contain just under 25,000 different entries/lemmas/headwords. We have started work on a morphological component for Old Swedish that covers the regular paradigms, and have created a smaller lexicon with a couple of thousand entries.
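As a rough illustration of why overlapping dictionaries yield far fewer distinct headwords than the sum of their entry counts, the following minimal sketch merges three toy headword lists. The data are invented placeholders, not entries from Söderwall or Schlyter, and the function name `distinct_headwords` is hypothetical. Matching by lowercased string is a simplifying assumption; real Old Swedish headwords would need spelling normalization.

```python
from typing import Iterable


def distinct_headwords(*resources: Iterable[str]) -> set[str]:
    """Return the union of headwords across resources.

    Assumption: two entries count as the same headword if they are
    identical after trimming whitespace and lowercasing.
    """
    merged: set[str] = set()
    for resource in resources:
        merged.update(headword.strip().lower() for headword in resource)
    return merged


# Invented placeholder data standing in for three overlapping dictionaries.
dict_a = ["lemma1", "lemma2", "lemma3"]
dict_b = ["lemma2", "lemma3", "lemma4"]
dict_c = ["Lemma1", "lemma5"]

merged = distinct_headwords(dict_a, dict_b, dict_c)
total = sum(len(d) for d in (dict_a, dict_b, dict_c))
print(total, len(merged))  # 8 entries in total, 5 distinct headwords
```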

The natural next step after linking up SALDO and Dalin would be to add the Old Swedish lexicon to this growing diachronic Swedish lexical and morphological resource. Including the Old Swedish lexicon in the same way as we are doing for Dalin's dictionary will probably be more difficult, however, since the distance between Old Swedish and the other two forms of the language is considerable, something like that between modern English and Anglo-Saxon (Old English). This certainly holds for the grammar -- morphology and syntax -- of the language, and even more so for the semantic information encoded in the SALDO lexical resource. Working with the lexical semantics of Old Swedish will be a difficult but hopefully rewarding endeavor.

Additionally, we are working on a Swedish FrameNet, building in part on the SALDO work and in part on our long experience in corpus linguistics. In this way, we should be able to forge a bridge from the lexical databases which we have already developed, to syntactic analysis systems. The hypothesis is that substantial parts of the frame semantic specifications in the modern Swedish FrameNet will carry over to the lexical items in Dalin's dictionary, using the (semantic) links independently established between SALDO and Dalin, and possibly further to the Old Swedish lexical resources.
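The hypothesis above can be pictured as a simple projection over links. The sketch below is only an assumption about how such a carry-over might be modelled. The identifiers, the frame names and the function `project_frames` are all hypothetical and do not reflect the actual SALDO, Dalin or Swedish FrameNet data formats.

```python
def project_frames(
    frames_by_sense: dict[str, set[str]],
    links: dict[str, str],
) -> dict[str, set[str]]:
    """Copy frame assignments from linked modern senses to older entries.

    frames_by_sense: modern sense identifier -> set of frame names.
    links: older dictionary entry -> linked modern sense identifier.
    Entries whose linked sense has no frames are left out of the result.
    """
    projected: dict[str, set[str]] = {}
    for entry, sense in links.items():
        frames = frames_by_sense.get(sense)
        if frames:
            projected[entry] = set(frames)  # copy, so the source is not shared
    return projected


# Invented placeholder identifiers; not real lexicon or FrameNet content.
modern_frames = {
    "sense_1": {"Frame_A"},
    "sense_2": {"Frame_B", "Frame_C"},
}
older_to_modern = {
    "entry_1": "sense_1",
    "entry_2": "sense_2",
    "entry_3": "sense_9",  # linked sense without any frame assignment
}

projected = project_frames(modern_frames, older_to_modern)
print(sorted(projected))  # entries that received at least one frame
```

In practice each projected frame would be a candidate to be checked by a lexicographer rather than an accepted analysis, since meanings may have shifted between the language stages.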

Project contact: Lars Borin (sb-info@svenska.gu.se). For more information, see the project web page: http://spraakbanken.gu.se/eng/forskning/diabase .

Page updated: 2011-12-19 17:25
