Swedish multiword-entity extraction

Goal

Finding long (>2 words) word/lemma n-grams.

Background

Multiword entities have received much attention lately, both in computational linguistics and in general linguistics. There is a long tradition in computational and corpus linguistics of mining multiword entities from text by applying (a wide range of) collocation measures to pairs of entities (text words, lemmas, syntactic dependencies), contiguous or non-contiguous, in order to find two-word lexical units or terms. Attempts to discover longer units are much rarer in the literature, in part because good collocation measures seem to be lacking for this problem.

Problem description

The aim of this work is to refine a purely frequency-based way of finding contiguous word n-grams in annotated text, for instance by applying methods from work on automatic word segmentation. The preferred target language is Swedish. English is also acceptable, but in that case Språkbanken can provide only limited support with respect to annotation tools and linguistic expertise.
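As a point of reference, the purely frequency-based baseline mentioned above could look roughly like the following sketch. It is only an illustration: the function name `frequent_ngrams`, the length limits and the count threshold are assumptions made here, not part of the project description, and the input is a plain token list rather than annotated text.

```python
from collections import Counter


def frequent_ngrams(tokens, min_n=3, max_n=5, min_count=2):
    """Count contiguous n-grams of length min_n..max_n and keep the frequent ones.

    All parameter defaults are illustrative assumptions.
    """
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy input; a real run would use word forms or lemmas from an annotated corpus.
tokens = "till och med i dag och till och med i går".split()
result = frequent_ngrams(tokens)
```

Note that the sketch returns every frequent sub-sequence of a longer unit as well (here both "till och med" and "och med i"), which is one of the weaknesses a refined method would have to address.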

Recommended skills

  • Good knowledge of linguistic analysis
  • General familiarity with POS tagging and parsing
  • Familiarity with machine learning
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Extending IDS

Goal

Enriching IDS using Wiktionary and other multilingual resources.

Background

The IDS list is a kind of “universal base vocabulary” containing about 1,500 word senses. See <http://lingweb.eva.mpg.de/ids/>, <http://spraakbanken.gu.se/eng/research/digital-areal-linguistics/word-lists> and <http://spraakbanken.gu.se/swe/sblex/resources#lwt>. There is a general wish on the part of the main editor of the IDS effort to collect IDS lists for as many languages as possible.

Problem description

This project should address the problem of using freely available multilingual resources, such as Wiktionary, in order to add new full or partial IDS lists to the collection. The work should include implementing a way of generating candidate IDS lists from, e.g., Wiktionary, as well as an evaluation of the method by using it to generate lists for languages that are already in the IDS collection.
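The evaluation step described above could, for example, compare a generated candidate list with an existing IDS list concept by concept. The sketch below assumes that both lists are available as mappings from a concept identifier to a set of word forms; this data layout, the function name `evaluate` and the identifiers in the toy data are assumptions for illustration only.

```python
def evaluate(candidate, gold):
    """Return (precision, recall) of candidate word forms against a gold list.

    Both arguments map a concept identifier to a set of word forms.
    """
    proposed = sum(len(forms) for forms in candidate.values())
    expected = sum(len(forms) for forms in gold.values())
    # A proposed form counts as correct only if the gold list has it
    # under the same concept.
    correct = sum(
        len(forms & gold.get(concept, set()))
        for concept, forms in candidate.items()
    )
    precision = correct / proposed if proposed else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


# Toy data with made-up concept identifiers.
gold = {"c1": {"värld"}, "c2": {"jord", "mark"}}
candidate = {"c1": {"värld"}, "c2": {"jord", "land"}}
precision, recall = evaluate(candidate, gold)
```

Exact string matching is a deliberately strict choice; a real evaluation would probably need to normalize orthography or transcription before comparing forms.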

Recommended skills

  • Fair knowledge of lexicography
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

A Swedish diachronic lexicon

Goal

Automatic diachronic linking of Swedish lexical resources.

Background

Språkbanken <http://språkbanken.gu.se> possesses a number of digitized lexical resources in various stages of preparation, representing among other things various historical forms of Swedish. One of the resources – SALDO – is singled out as the pivot resource to which all the others should be linked in some way. The hope is that the resulting interlinking of the lexicons will enable many kinds of linguistic information to be transferred among them. However, the interlinking of the lexical resources has only begun and there is much scope for innovation.

Problem description

This problem is an open one and should be suitably narrowed down to be solvable in the framework of a master’s thesis, e.g., by focusing on one pair of lexical resources but of course with a view to the general applicability of the proposed solution. On the one hand, there are the lexicons themselves with the associated, partly overlapping linguistic information. On the other hand, there are various external resources, such as text corpora representing different historical language stages, and possibly freely available external lexicons. The problem more narrowly construed consists in proposing and implementing a set of tools for interlinking the lexicons, using all and any relevant information available, as well as some kind of evaluation procedure. The interlinking should be semi-automatic, and the extent of the manual component should be explicitly indicated as part of the result (e.g., as the number and percentage of ambiguous links).
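To make the reporting requirement concrete, here is a minimal sketch of a naive linker that matches entries on lemma and part of speech and then reports how many links are ambiguous. The entry format `(id, lemma, pos)`, the function names and the toy identifiers are assumptions for illustration; they do not reflect the actual format of SALDO or any other Språkbanken lexicon.

```python
from collections import defaultdict


def link_to_pivot(source_entries, pivot_entries):
    """Link each source entry to all pivot entries with the same lemma and POS."""
    index = defaultdict(list)
    for pivot_id, lemma, pos in pivot_entries:
        index[(lemma, pos)].append(pivot_id)
    # .get() avoids creating empty index entries for unmatched keys.
    return {
        source_id: list(index.get((lemma, pos), []))
        for source_id, lemma, pos in source_entries
    }


def summarize(links):
    """Report the number and percentage of ambiguous and unlinked entries."""
    total = len(links)
    ambiguous = sum(1 for targets in links.values() if len(targets) > 1)
    unlinked = sum(1 for targets in links.values() if not targets)
    return {
        "total": total,
        "ambiguous": ambiguous,
        "unlinked": unlinked,
        "ambiguous_pct": 100.0 * ambiguous / total if total else 0.0,
    }


# Toy entries with made-up identifiers.
pivot = [("p1", "fara", "nn"), ("p2", "fara", "nn"), ("p3", "häst", "nn")]
source = [("s1", "fara", "nn"), ("s2", "häst", "nn"),
          ("s3", "hors", "nn"), ("s4", "fara", "vb")]
links = link_to_pivot(source, pivot)
report = summarize(links)
```

The ambiguous and unlinked entries are exactly the cases that would be handed over to manual inspection, so the summary gives a rough measure of the manual component.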

Recommended skills

  • Fair knowledge of Swedish lexicography, lexical semantics and grammatical analysis
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Spelling variation in Swedish text

Goal

Dealing with spelling variation in Swedish text in order to improve lemmatization, part-of-speech tagging and parsing.

Background

Språkbanken <http://språkbanken.gu.se> uses an in-house large lexical resource cum morphological analyzer, plus an off-the-shelf part-of-speech tagger and dependency parser to annotate its online corpora. These tools expect standardized spellings in the texts to be analyzed (although the data-driven tools – the POS tagger and parser – will handle out-of-vocabulary items which are not recognized by the morphological analyzer).

Problem description

Many of the texts in Språkbanken also sport non-standard spellings, either because they represent a pre-standardization language stage (medieval and 17th-century texts) or because they are full of spelling errors and variants, which is often the case with modern blog texts. The problem consists in developing and implementing a (partial) solution for discovering and dealing with the spelling variation in modern texts (for which we already have sufficiently large-scale language analysis tools). Preferably the solution should be general and extensible to other text types. The work thus includes a good deal of linguistic analysis of lemmatizer, POS tagger and parser output.
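One simple starting point for this kind of normalization is to map an unknown token to a lexicon word within a small edit distance, and to leave it alone when there is no unique candidate. The sketch below assumes this approach purely for illustration; the function names, the toy lexicon and the distance threshold are hypothetical and are not taken from Språkbanken's tools.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalize(token, lexicon, max_distance=1):
    """Return the standard spelling for token, or None if it cannot be decided."""
    if token in lexicon:
        return token
    candidates = [w for w in lexicon if edit_distance(token, w) <= max_distance]
    # Only accept the mapping when exactly one lexicon word is close enough.
    return candidates[0] if len(candidates) == 1 else None


# Toy lexicon; a real one would come from the morphological analyzer.
lexicon = {"mycket", "intressant", "också"}
fixed = [normalize(t, lexicon) for t in ["mycke", "intressannt", "oxå"]]
```

The last token, "oxå", is a common variant of "också" but lies three edits away, so the sketch leaves it undecided. Cases like this suggest why plain edit distance is unlikely to be sufficient on its own.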

Recommended skills

  • Good knowledge of Swedish grammatical analysis
  • General familiarity with POS tagging and parsing
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Dialogue Technology for Second Language Learning

In his thesis entitled "The Virtual Language Teacher: Models and
applications for language learning using embodied conversational agents"
[1], Preben Wik presents a framework for computer assisted language
learning using a virtual language teacher. At least one implementation
is also available at [2].

The purpose of the project is to investigate the feasibility of
implementing some of Wik's applications on a platform such as Voxeo or
Tropo.

Supervisor: Torbjörn Lager

[1] <http://www.speech.kth.se/~preben/thesis/thesisPrebenWik.pdf>
[2] <http://www.speech.kth.se/ville/swell.html>

 

Building an Alternate Reality Game on a platform for unified communication

According to Wikipedia, an Alternate Reality Game (ARG) is an
interactive narrative that uses the real world as a platform, often
involving multiple media and game elements, to tell a story that may be
affected by participants' ideas or actions. ARGs generally use
multimedia, such as telephones, email and mail but rely on the Internet
as the central binding medium. [1]

The purpose of the project is to design and prototype an ARG using an
existing platform for unified communication [2] such as Voxeo Prophecy
or Tropo and standards such as VoiceXML, SRGS, SSML, CCXML and SCXML. It
is suggested that the game be played in the house in which FLOV
resides.

Supervisor: Torbjörn Lager (and possibly Johan Roxendal)

[1] <http://en.wikipedia.org/wiki/Alternate_reality_game>
[2] <http://en.wikipedia.org/wiki/Unified_communications>
