master_thesis_proposal

Temporal information in Swedish - identification, resolution, normalization and standardization

Background

Identification and resolution of temporal (and numerical) information in natural language text is important in many tasks in artificial intelligence (temporal reasoning) and natural language processing (information extraction and retrieval, Q&A).

A temporal expression in a text is a sequence of tokens (words, numbers and characters) that denotes time, that is, expresses a point in time, a duration or a frequency.

Purpose

The purpose of this work is to study temporal information processing and to develop algorithms for the identification, resolution, normalization and standardization of temporal information in Swedish text material, using TIMEX3/TimeML (or an equivalent).

For instance, the examples below illustrate how the TIMEX3 format is used:

  • "June 7, 2003": <TIMEX3 tid="t1" type="DATE" value="2003-06-07">June 7, 2003</TIMEX3>
  • "the dawn of 2000": <TIMEX3 tid="t2" type="DATE" value="2000" mod="START">the dawn of 2000</TIMEX3>

A more complex example can look like this:

  • "two weeks from June 7, 2003": <TIMEX3 tid="t6" type="DURATION" value="P2W" beginPoint="t61" endPoint="t62">two weeks</TIMEX3> from <TIMEX3 tid="t61" type="DATE" value="2003-06-07">June 7, 2003</TIMEX3><TIMEX3 tid="t62" type="DATE" value="2003-06-21" temporalFunction="true" anchorTimeID="t6"/>
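
As a rough illustration of how such annotations could be produced programmatically, the Python sketch below renders the three tags of the complex example and derives the end point by adding two weeks to the anchor date. The helper name `timex3` is hypothetical and does not belong to any TimeML toolkit; only the tag and attribute names are taken from the examples above.

```python
from datetime import date, timedelta
from xml.sax.saxutils import escape, quoteattr


def timex3(tid, ttype, value, text=None, **attrs):
    """Render one TIMEX3 element (hypothetical helper).

    If `text` is None, an empty tag is produced, as for the derived
    end point that has no surface string in the text.
    """
    pairs = [("tid", tid), ("type", ttype), ("value", value), *attrs.items()]
    head = "<TIMEX3 " + " ".join(f"{k}={quoteattr(v)}" for k, v in pairs)
    if text is None:
        return head + "/>"
    return f"{head}>{escape(text)}</TIMEX3>"


# "two weeks from June 7, 2003"
begin = date(2003, 6, 7)
end = begin + timedelta(weeks=2)  # resolved end point of the duration

duration = timex3("t6", "DURATION", "P2W", "two weeks",
                  beginPoint="t61", endPoint="t62")
anchor = timex3("t61", "DATE", begin.isoformat(), "June 7, 2003")
derived = timex3("t62", "DATE", end.isoformat(),
                 temporalFunction="true", anchorTimeID="t6")

print(duration, "from", anchor + derived)
```

Only the resolution step (adding the duration to the anchor) is shown here; identifying the expressions in running Swedish text is the actual subject of the thesis and is not attempted in this sketch.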

Depending on the background and interests of the student, the work can be given a different focus and scope, e.g. implementing a temporal information processing system from scratch, adapting available software to Swedish, or comparing the effect of different resources and module combinations for temporal processing.

Application

As a practical application, the resulting software will be used as a supporting technology for de-identifying temporal information in patient data. Normalized and standardized temporal occurrences in authentic text (patient history) will be used to "mask" the temporal information in the text. For instance, a text occurrence of the date "2011-12-15" will be converted to e.g. "start date + 4 months 7 days" (under the assumption that 'start date' is a relevant point in time from which a patient history started to be recorded). Note: the development of this application will be carried out on non-authentic texts, but the intention is to use the developed software on real data.
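
The masking step could be sketched roughly as follows. The start date 2011-08-08 is an assumption chosen so that the example above comes out as "4 months 7 days"; the proposal does not specify it, and the function names are hypothetical.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift `d` by whole calendar months, clamping the day to the month length."""
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def relative_offset(start, target):
    """Return (months, days) such that start + months + days == target."""
    if target < start:
        raise ValueError("target precedes the start date")
    months = (target.year - start.year) * 12 + (target.month - start.month)
    if add_months(start, months) > target:
        months -= 1  # the last month is not yet complete
    days = (target - add_months(start, months)).days
    return months, days


def mask(start, target):
    """Replace an absolute date by its offset from the start date."""
    months, days = relative_offset(start, target)
    return f"start date + {months} months {days} days"


# Assumed start of the patient history (not given in the proposal).
start_date = date(2011, 8, 8)
print(mask(start_date, date(2011, 12, 15)))
```

A real system would first have to normalize every temporal expression to a calendar value (the TIMEX3 `value` attribute) before an offset like this can be computed.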

Supervisors

Dimitrios Kokkinakis, PhD, Department of Swedish

Staffan Svensson, PhD, MD, specialist in clinical pharmacology

Prerequisites:

Native Swedish or good Swedish language skills.

Good programming skills.

Relevant Links

TempEval Temporal Relation Identification <http://timeml.org/tempeval/>

TempEval2: Evaluating Events, Time Expressions, and Temporal Relations <http://www.timeml.org/tempeval2/>

TempEval3: Temporal Annotation <http://www.cs.york.ac.uk/semeval-2013/task1/>

TimeML: Markup Language for Temporal and Event Expressions <http://www.timeml.org/site/index.html>

TIMEX at MUC-6 <http://www.timexportal.info/timexmuc6>

Guidelines for Temporal Expression Annotation for English for TempEval 2010. <http://www.timeml.org/tempeval2/tempeval2-trial/guidelines/timex3guidelines-072009.pdf>

A multilingual corpus database for typological and genetic linguistics

Goal

Building a multilingual corpus database and interface for typological and genetic linguistics research

Background

Over the last few years, linguists and computational linguists have started looking into the possibilities of using multilingual corpora (mainly parallel corpora) for typological and genetic linguistic research.

Problem description

The aims of this work are (1) to collect and link at the verse level as many digitized Bible texts as possible; (2) to apply linguistic annotation tools for those languages where such tools are available (at least English and Swedish); (3) to correlate linguistic units of varying granularity among the languages using the linguistic annotations and freely available word alignment tools; (4) to design the first version of a user interface for conducting research with the database; (5) to conduct a small typological or genetic linguistic study as a showcase of the utility of the database and user interface.
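
Aim (1), linking texts at the verse level, could start from a structure like the one sketched below. The tab-separated input format and the reference scheme ("GEN 1:1") are assumptions made for illustration; actual digitized Bible texts come in many formats.

```python
from collections import defaultdict


def parse_verses(lines):
    """Parse lines of the assumed form 'REF<TAB>text' into {ref: text}."""
    verses = {}
    for line in lines:
        ref, text = line.rstrip("\n").split("\t", 1)
        verses[ref] = text
    return verses


def link_verses(editions):
    """Merge {language: {ref: text}} into {ref: {language: text}}."""
    linked = defaultdict(dict)
    for lang, verses in editions.items():
        for ref, text in verses.items():
            linked[ref][lang] = text
    return dict(linked)


editions = {
    "eng": parse_verses([
        "GEN 1:1\tIn the beginning God created the heaven and the earth.",
        "GEN 1:3\tAnd God said, Let there be light: and there was light.",
    ]),
    "swe": parse_verses([
        "GEN 1:1\tI begynnelsen skapade Gud himmel och jord.",
    ]),
}
linked = link_verses(editions)

# Verses missing in some language simply have fewer entries.
for ref, by_lang in sorted(linked.items()):
    print(ref, sorted(by_lang))
```

Keeping the verse reference as the key means that new languages can be added without touching existing entries, which matters when the goal is to collect as many texts as possible.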

Recommended skills

  • Good knowledge of typological and possibly genetic linguistics
  • Very good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Linking a pronunciation lexicon to SALDO

Goal

Linking the NST Swedish pronunciation lexicon to SALDO

Background

The NST Swedish pronunciation lexicon is a large (almost 1M entries) fullform lexicon for Swedish linking text words to their (standard) pronunciations. SALDO is a large semantic and morphological lexicon for Swedish (see <http://spraakbanken.gu.se/eng/saldo/>).

Problem description

The aim of this work is to link lexical entries in SALDO to the corresponding entries in the NST lexicon, as well as to explore the feasibility of providing the SALDO-FM morphological component with a pronunciation module, i.e., generate pronunciations for word forms not present in the NST lexicon from parts of word forms that are in this lexicon.

Recommended skills

  • Good knowledge of Swedish morphological analysis, and some knowledge of Swedish phonology and phonetics
  • General familiarity with morphological analysis systems
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Swedish multiword-entity extraction

Goal

Finding long (>2 words) word/lemma n-grams

Background

Multiword entities have received much attention lately both in computational linguistics and in general linguistics. There is a long tradition in computational and corpus linguistics of mining multiword entities from text by applying (a wide range of) collocation measures to pairs of entities (text words, lemmas, syntactic dependencies), contiguous or non-contiguous, in order to find two-word lexical units or terms. Attempts to discover longer units are much rarer in the literature, in part because good collocation measures seem to be lacking for this problem.

Problem description

The aim of this work is to refine a purely frequency-based way of finding contiguous word n-grams in annotated text, for instance by applying methods from work on automatic word segmentation. The preferred target language is Swedish. English is also acceptable, but in that case Språkbanken can provide only limited support with respect to annotation tools and linguistic expertise.
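
A purely frequency-based baseline of the kind referred to above might look like the following sketch. The thresholds, the toy sentences and the subsumption filter are illustrative assumptions, not the method Språkbanken uses; the filter is only one simple example of a possible refinement.

```python
from collections import Counter


def frequent_ngrams(sentences, min_n=3, max_n=4, min_count=2):
    """Count contiguous n-grams (min_n..max_n tokens) and keep the frequent ones."""
    counts = Counter()
    for tokens in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def drop_subsumed(grams):
    """Drop an n-gram if a one-token-longer n-gram containing it is equally frequent."""
    kept = dict(grams)
    for gram, count in grams.items():
        for shorter in (gram[:-1], gram[1:]):
            if kept.get(shorter) == count:
                del kept[shorter]
    return kept


sentences = [
    "i och med att det regnar".split(),
    "i och med att hon kom".split(),
    "det är i och med det klart".split(),
]
grams = frequent_ngrams(sentences)
for gram, count in sorted(drop_subsumed(grams).items()):
    print(count, " ".join(gram))
```

In this toy data "och med att" is removed because it never occurs outside "i och med att", while "i och med" survives because it is more frequent than its extension.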

Recommended skills

  • Good knowledge of linguistic analysis
  • General familiarity with POS tagging and parsing
  • Familiarity with machine learning
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Extending IDS

Goal

Enriching IDS using Wiktionary and other multilingual resources

Background

The IDS list is a kind of “universal base vocabulary” containing about 1,500 word senses. See <http://lingweb.eva.mpg.de/ids/>, <http://spraakbanken.gu.se/eng/research/digital-areal-linguistics/word-lists> and <http://spraakbanken.gu.se/swe/sblex/resources#lwt>. There is a general wish on the part of the main editor of the IDS effort to collect IDS lists for as many languages as possible.

Problem description

This project should address the problem of using freely available multilingual resources, such as Wiktionary, in order to add new full or partial IDS lists to the collection. The work should include implementing a way of generating candidate IDS lists from, e.g., Wiktionary, as well as an evaluation of the method by using it to generate lists for languages that are already in the IDS collection.
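
The evaluation against languages already in the collection could, for example, be expressed as precision and recall over (sense, word form) pairs. The sketch below assumes that both the generated and the existing list are available as mappings from a sense identifier to a set of word forms; the identifiers "S1" to "S4" are placeholders, not real IDS entry numbers.

```python
def evaluate(candidate, gold):
    """Return (precision, recall) of candidate (sense, form) pairs against gold."""
    cand_pairs = {(s, w) for s, forms in candidate.items() for w in forms}
    gold_pairs = {(s, w) for s, forms in gold.items() for w in forms}
    correct = cand_pairs & gold_pairs
    precision = len(correct) / len(cand_pairs) if cand_pairs else 0.0
    recall = len(correct) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Existing list for a language already in the collection (toy data).
gold = {"S1": {"sol"}, "S2": {"måne"}, "S3": {"vatten"}, "S4": {"eld"}}
# Automatically generated candidate list, e.g. from Wiktionary (toy data).
candidate = {"S1": {"sol"}, "S2": {"måne", "månad"}, "S3": {"vatten"}}

precision, recall = evaluate(candidate, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact string match is a strict criterion; a real evaluation would probably need to account for orthographic and transcription differences between sources.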

Recommended skills

  • Fair knowledge of lexicography
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

A Swedish diachronic lexicon

Goal

Automatic diachronic linking of Swedish lexical resources

Background

Språkbanken <http://språkbanken.gu.se> possesses a number of digitized lexical resources in various stages of preparation and representing i.a. various historical forms of Swedish. One of the resources – SALDO – is singled out as the pivot resource to which all the others should be linked in some way. The hope is that the resulting interlinking of the lexicons will enable many kinds of linguistic information to be transferred among them. However, the interlinking of the lexical resources has only begun and there is much scope for innovation.

Problem description

This problem is an open one and should be suitably narrowed down to be solvable in the framework of a master’s thesis, e.g., by focusing on one pair of lexical resources but of course with a view to the general applicability of the proposed solution. On the one hand, there are the lexicons themselves with the associated, partly overlapping linguistic information. On the other hand, there are various external resources, such as text corpora representing different historical language stages, and possibly freely available external lexicons. The problem more narrowly construed consists in proposing and implementing a set of tools for interlinking the lexicons, using all and any relevant information available, as well as some kind of evaluation procedure. The interlinking should be semi-automatic, and the extent of the manual component should be explicitly indicated as part of the result (e.g., as the number and percentage of ambiguous links).
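
To make the reporting requirement concrete, the sketch below links entries of a source lexicon to a pivot lexicon by a shared (lemma, part of speech) key and reports the number and percentage of ambiguous links. The entry identifiers, the data and the matching key are all hypothetical; matching historical forms to modern ones will need considerably more than exact key equality.

```python
from collections import defaultdict


def link_lexicons(source, pivot):
    """Map each source entry id to the pivot ids sharing its (lemma, pos) key."""
    index = defaultdict(list)
    for pivot_id, key in pivot.items():
        index[key].append(pivot_id)
    return {src_id: sorted(index.get(key, [])) for src_id, key in source.items()}


def ambiguity_report(links):
    """Return (number, percentage) of linked entries with more than one target."""
    linked = [targets for targets in links.values() if targets]
    ambiguous = sum(1 for targets in linked if len(targets) > 1)
    percentage = 100.0 * ambiguous / len(linked) if linked else 0.0
    return ambiguous, percentage


# Toy pivot lexicon with two homonymous noun entries for "fil".
pivot = {"p1": ("fil", "nn"), "p2": ("fil", "nn"),
         "p3": ("bok", "nn"), "p4": ("läsa", "vb")}
# Toy source lexicon; "s4" has no counterpart in the pivot.
source = {"s1": ("fil", "nn"), "s2": ("bok", "nn"), "s3": ("läsa", "vb"),
          "s4": ("thet", "pn"), "s5": ("bok", "nn")}

links = link_lexicons(source, pivot)
count, pct = ambiguity_report(links)
print(f"{count} ambiguous link(s), {pct:.1f}% of linked entries")
```

The ambiguous cases are exactly those that would be handed over to manual inspection in a semi-automatic workflow.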

Recommended skills

  • Fair knowledge of Swedish lexicography, lexical semantics and grammatical analysis
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Spelling variation in Swedish text

Goal

Dealing with spelling variation in Swedish text in order to improve lemmatization, part-of-speech tagging and parsing.

Background

Språkbanken <http://språkbanken.gu.se> uses an in-house large lexical resource cum morphological analyzer, plus an off-the-shelf part-of-speech tagger and dependency parser to annotate its online corpora. These tools expect standardized spellings in the texts to be analyzed (although the data-driven tools – the POS tagger and parser – will handle out-of-vocabulary items which are not recognized by the morphological analyzer).

Problem description

Many of the texts in Språkbanken also sport non-standard spellings, either because they represent a pre-standardization language stage – medieval and 17th-century texts – or because they are full of spelling errors and variants, which is often the case with modern blog texts. The problem consists in developing and implementing a (partial) solution for discovering and dealing with the spelling variation in modern texts (for which we already have sufficiently large-scale language analysis tools). Preferably the solution should be general and extensible to other text types. The work thus includes a good deal of linguistic analysis of lemmatizer, POS tagger and parser output.
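
One very simple starting point, shown here only as a sketch, is to map each out-of-vocabulary token to lexicon entries within a small edit distance. The lexicon, the tokens and the distance threshold are toy assumptions; this is not the approach used in Språkbanken's pipeline.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def candidates(token, lexicon, max_dist=1):
    """Return the token itself if known, else lexicon forms within max_dist edits."""
    if token in lexicon:
        return [token]
    scored = sorted((edit_distance(token, word), word) for word in lexicon)
    return [word for dist, word in scored if dist <= max_dist]


lexicon = {"mycket", "också", "någon", "nog"}
for token in ["mycke", "ochså", "nån", "också"]:
    print(token, "->", candidates(token, lexicon))
```

The token "nån" gets no candidate at this threshold, since it is two edits away from both "någon" and "nog"; cases like this show why plain edit distance is only a baseline and why context from the POS tagger and parser would be needed.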

Recommended skills

  • Good knowledge of Swedish grammatical analysis
  • General familiarity with POS tagging and parsing
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Dialogue Technology for Second Language Learning

In his thesis entitled "The Virtual Language Teacher: Models and
applications for language learning using embodied conversational agents"
[1], Preben Wik presents a framework for computer assisted language
learning using a virtual language teacher. At least one implementation
is also available at [2].

The purpose of the project is to investigate the feasibility of
implementing some of Wik's applications on a platform such as Voxeo or
Tropo.

Supervisor: Torbjörn Lager

[1] <http://www.speech.kth.se/~preben/thesis/thesisPrebenWik.pdf>
[2] <http://www.speech.kth.se/ville/swell.html>

 

Building an Alternate Reality Game on a platform for unified communication

According to Wikipedia, an Alternate Reality Game (ARG) is an
interactive narrative that uses the real world as a platform, often
involving multiple media and game elements, to tell a story that may be
affected by participants' ideas or actions. ARGs generally use
multimedia, such as telephones, email and mail but rely on the Internet
as the central binding medium. [1]

The purpose of the project is to design and prototype an ARG using an
existing platform for unified communication [2] such as Voxeo Prophecy
or Tropo and standards such as VoiceXML, SRGS, SSML, CCXML and SCXML. It
is suggested that the game is to be played in the house in which FLOV
resides.

Supervisor: Torbjörn Lager (and possibly Johan Roxendal)

[1] <http://en.wikipedia.org/wiki/Alternate_reality_game>
[2] <http://en.wikipedia.org/wiki/Unified_communications>

Developing and evaluating a TDM app (2016)

Goal

To develop a voice interface to an Android smartphone application.

Background

Voice interfaces give users the possibility to interact with a device without using their eyes or hands. This can be particularly useful in situations where interaction with a device would otherwise interfere with other tasks. A typical example is driving, in which the simultaneous use of a screen-based device such as a smartphone can constitute a direct danger. This project focuses on the potential benefits of voice-enabled smartphone apps in driving and similar scenarios.

Problem description

The problem consists of developing a spoken dialog interface to an Android application. The user should be able to switch freely between touch-screen and voice interaction. To facilitate the development of this multimodal interface, Talkamatic Dialog Manager (TDM) will be used.

The problem mainly consists of the following tasks:

  • Develop a Use Case for the application; this may involve market surveys and interviews
  • Write a TDM device resource, which communicates with the app functionality (Python and Android SDK)
  • Write a TDM domain and ontology specification, describing the structure of the app (XML)
  • Write a TDM grammar, describing the verbal phrases used by the user and the app (XML, Grammatical Framework)
  • Evaluate the app

Recommended skills:

  • Python
  • Grammar formalisms
  • XML

Experience of Android app development is valuable but not required.

Supervisors

Staffan Larsson, FLoV, together with Talkamatic AB. Talkamatic is a university research spin-off company based in Göteborg.
