
Master thesis proposals

GF Resource Grammar for an as yet unimplemented language

Description

See http://www.grammaticalframework.org/lib/doc/status.html for the current status. This is an ambitious project and hard work, requiring good knowledge of GF and the target language. But the result will be a permanent contribution to language resources, and almost certainly publishable. For instance, the Japanese grammar implemented within this Master's programme was published at JapTAL.

Supervisor

Contact and possible supervisor: Aarne Ranta.

Free Robust Parsing

Background

Open speech recognition

Talkamatic builds dialogue systems and is currently using a GF-based grammar tool for parsing and generation. A unified language description is compiled into a speech recognition grammar (for Nuance VoCon ASR, PocketSphinx and others), a parser and a generator.

The problem with this is that the parser can only handle the utterances which the ASR can recognize from the ASR grammar. The parser is thus not robust, and if an open dictation grammar is used (such as Dragon Dictate, used in Apple's Siri), the parser is mostly useless.

Ontology

Currently TDM (the Talkamatic Dialogue Manager) requires all concepts used in the dialogue to be known in advance. Hence, for a dialogue-controlled music player, all artists, songs, genres etc. need to be known and explicitly declared beforehand.

There are disadvantages with this approach. For example, it requires access to an extensive music database just to be able to build a dialogue interface for a music player.

Problem description

To simplify the building of dialogue interfaces for this kind of application, it would be useful to have a more robust parser, which can identify sequences of dialogue moves in arbitrary user input strings.

Example utterances and their corresponding dialogue moves:

  • "Play Like a Prayer with Madonna" → request(play_song), answer("Like a Prayer":song_title), answer("Madonna":artist_name)
  • "Play Sisters of Mercy" → request(play_song), answer("Sisters of Mercy":song_title)
  • "Play Sisters of Mercy" → request(play_artist), answer("Sisters of Mercy":artist_name)
  • "I would like to listen to Jazz" → request(play_genre), answer("Jazz":genre_name)

Note that one and the same utterance ("Play Sisters of Mercy") can correspond to different move sequences, since "Sisters of Mercy" is both a song title and an artist name.

Method

Several different methods can be used: named entity recognizers, regular expressions, databases etc., or combinations of several of these. A strong requirement is that the parser should be built automatically or semi-automatically from a small corpus or database. Computational efficiency is also desirable, but less important. The parser must have a Python interface and run on Linux.
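
As a starting point, the sketch below shows the database-driven variant in Python: a phrase-spotting "parser" that looks for trigger patterns and known entity phrases anywhere in the input. The entity table, the trigger patterns and the function name are invented for illustration; a real system would draw them from the music database.

    import re

    # Toy entity table mapping surface phrases to semantic sorts (assumed data).
    ENTITIES = {
        "like a prayer": "song_title",
        "madonna": "artist_name",
        "jazz": "genre_name",
    }

    # Illustrative trigger patterns for dialogue moves.
    TRIGGERS = [
        (re.compile(r"\bplay\b"), "request(play_song)"),
        (re.compile(r"\blisten to\b"), "request(play_genre)"),
    ]

    def parse_moves(utterance):
        """Spot trigger phrases and known entities anywhere in the input."""
        text = utterance.lower()
        moves = [move for pattern, move in TRIGGERS if pattern.search(text)]
        for phrase, sort in ENTITIES.items():
            if phrase in text:
                moves.append('answer("%s":%s)' % (phrase.title(), sort))
        return moves

    print(parse_moves("Play Like a Prayer with Madonna"))
    # ['request(play_song)', 'answer("Like A Prayer":song_title)',
    #  'answer("Madonna":artist_name)']

Ambiguous phrases such as "Sisters of Mercy" (both song and artist) would make the lookup return several competing analyses; handling that ambiguity is part of the thesis problem.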

Supervision

Peter Ljunglöf (Computer Science and Engineering, Chalmers) or Staffan Larsson (FLoV), together with Talkamatic AB. Talkamatic is a university research spin-off company based in Göteborg.

Payment

A small compensation may be paid by Talkamatic AB when the thesis is completed.


Spelling games

Goal

Developing an algorithm (web service(s)) for the automatic generation of exercises for training spelling (primarily for Swedish)

Background

Lärka, an application currently under development, provides web services for computer-assisted language learning. Lärka generates a number of exercises based on corpora (and their annotation) available through Korp. Vocabulary knowledge covers a broad spectrum of word knowledge; spelling and recognition in speech are two aspects of it.

Problem description

The aims of this work are:

  1. to implement web service(s) for adaptive spelling exercise generation using a text-to-speech module for Swedish, where the target words/phrases will be pronounced and the student will have to type what he/she hears. If the user seems to be comfortable with the word spellings, target words get longer, get inflected, or are pronounced in phrases, sentences etc. (see the sketch after this list)
  2. analyze possible approaches to providing reasonable feedback
  3. implement a user interface for the exercise, to be used in Lärka
  4. create a database for storing all possible misspellings associated with each individual graphical word, for future analysis (for better feedback or a better adaptivity path)
  5. (potentially) evaluate the results.
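
The adaptivity path in point 1 could start from something as simple as accuracy-based level stepping. The sketch below is a minimal illustration of that idea; the thresholds, word lists and function names are all invented, and a real implementation would select targets from corpus data instead.

    import random

    # Illustrative difficulty levels (a real system would use corpus data).
    LEVELS = [
        ["hus", "bil", "sol"],           # short base forms
        ["flicka", "vinter", "cykel"],   # longer words
        ["flickorna", "vintrarna"],      # inflected forms
    ]

    def next_target(level, history, window=5):
        """Pick the next target word; step the level by recent accuracy."""
        recent = history[-window:]
        if len(recent) == window:
            accuracy = sum(recent) / window
            if accuracy > 0.8:
                level = min(level + 1, len(LEVELS) - 1)
            elif accuracy < 0.4:
                level = max(level - 1, 0)
        return level, random.choice(LEVELS[level])

    history = [1, 1, 1, 1, 1]        # 1 = correct spelling, 0 = misspelling
    print(next_target(0, history))   # moves up to level 1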

Recommended skills:

  • Python
  • jQuery

Supervisor(s)

  • Elena Volodina/Torbjörn Lager
  • possibly others from Språkbanken/FLoV

Exercise generator for English or any other language available through NLTK

Goal

Developing Python-based programs (web service(s)) for the automatic generation of exercises (e.g. of the same type as in Lärka) for languages other than Swedish, using corpora and the necessary language resources/tools available through NLTK

Background

Lärka, an application currently under development, provides web services for computer-assisted language learning. Lärka generates a number of exercises based on corpora (and their annotation) available through Korp. To target languages other than Swedish, we need access to annotated corpora and relevant resources/tools for those languages, which are potentially available through NLTK.

Problem description

The aims of this work are therefore:

  1. to implement web service(s) for exercise generation for English or any other language using NLTK, primarily of the same types as offered by Lärka (to avoid the hassle of implementing a user interface), or possibly other exercise types (see the sketch after this list).
  2. depending on the type of exercises you choose to implement, a number of questions may arise, e.g. how to adapt exercises to relevant learner levels, how to assign texts to appropriate language proficiency levels, how to select distractors, etc.
  3. (potentially) if other exercise types are chosen, the necessary user interface modules will need to be implemented.
  4. (potentially) evaluate the results.
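
To give an idea of what point 1 involves, here is a minimal sketch of one Lärka-style exercise type (a multiple-choice gap fill) built from an NLTK corpus. The choice of the Brown corpus, the POS tag and the function name are illustrative assumptions.

    import random
    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)

    def gap_fill(pos="NN", n_distractors=3):
        """Blank out one word of the given POS; offer same-POS distractors."""
        sents = [s for s in brown.tagged_sents() if 8 < len(s) < 16]
        while True:
            sent = random.choice(sents)
            targets = [i for i, (_, tag) in enumerate(sent) if tag == pos]
            if targets:
                break
        i = random.choice(targets)
        answer = sent[i][0]
        vocab = {w for s in sents for w, t in s if t == pos and w != answer}
        options = random.sample(sorted(vocab), n_distractors) + [answer]
        random.shuffle(options)
        words = [w for w, _ in sent]
        words[i] = "___"
        return " ".join(words), options, answer

    question, options, answer = gap_fill()
    print(question)
    print(options, "->", answer)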

Recommended skills:

  • Python
  • (potentially) jQuery

Supervisor(s)

  • Elena Volodina/Markus Forsberg
  • possibly others from Språkbanken

Automatic classification of texts by readability

Goal

Developing an algorithm for automatically assigning texts to relevant language learner levels (to be used in Lärka and possibly Korp)

Background

Text readability measures assign readability scores to texts according to certain features, such as sentence and word length. These are not enough to fully estimate a text's appropriateness for language learners, or for other user groups with limited abilities in a language. Recent PhD research at Språkbanken (Katarina Heimann Mühlenbock) has concentrated on studying different aspects of text with regard to readability. However, no implementation has been released.
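
As a concrete baseline, the classic Swedish readability index LIX uses exactly the two surface features mentioned above: words per sentence plus the percentage of long words. A minimal sketch (the tokenization is deliberately crude):

    import re

    def lix(text):
        """LIX = words per sentence + percentage of words longer than 6 letters."""
        sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
        words = re.findall(r"\w+", text)
        long_words = [w for w in words if len(w) > 6]
        return len(words) / len(sentences) + 100 * len(long_words) / len(words)

    print(lix("Det här är en enkel mening. Läsbarhetsindex beräknas så här."))

The thesis would go beyond such surface measures, combining them with the richer features studied in the PhD work.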

Problem description

The aim of this work is thus:

  1. to study the above-mentioned PhD thesis as well as a number of other research papers, and find a feasible implementation approach
  2. implement a program in Python for automatic categorization of texts into CEFR levels
  3. implement a user interface for working with different text parameters (e.g. for switching them on/off)
  4. evaluate the results by comparing the classification results on a number of texts of known CEFR levels

Recommended skills:

  • Python
  • jQuery

Supervisor(s)

  • Elena Volodina/Katarina Heimann Mühlenbock
  • possibly others from Språkbanken

Automatic selection of (thematic) texts from the web

Goal

Developing an algorithm for the automatic collection of (Swedish) texts on a specific topic from the internet (as a part of Korp and/or Lärka)

Background

Lärka, an application currently under development, is used for computer-assisted language learning. Lärka generates a number of exercises based on corpora (and their annotation) available through Korp. The topic of the source texts is, however, not known. To be able to select authentic contexts on a relevant theme (as described in the CEFR, the Common European Framework of Reference), we need an automated approach to the selection of texts on a given theme, with all the subsequent annotations.

Problem description

The aims of this work include the following:

  1. to implement a Python-based program (possibly web service(s)) for the automatic selection of texts from the web, e.g. using so-called “seed words” (a web-crawling approach; see the sketch after this list), and to face the possible problems with language identification, duplicates, noise, etc.
  2. to test/evaluate the program's performance by creating a domain corpus for Swedish, taking the CEFR themes as a basis for the sub-corpora.
  3. (potentially) to compare the performance of this program with WebBootCaT/Corpus Factory (via Sketch Engine).
  4. (potentially) to deploy the web service in Lärka, i.e. implement the necessary user interface “module”.
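
For point 1, a BootCaT-style crawler typically starts by turning a theme's seed words into random query tuples. Below is a minimal sketch of that first step; the seed words are invented, and the actual search, download, language identification and deduplication steps are left out.

    import itertools
    import random

    # Illustrative seeds for a CEFR theme such as "travel".
    seeds = ["resa", "biljett", "tåg", "flyg", "hotell", "semester"]

    def seed_tuples(seeds, size=3, n=10):
        """Sample n distinct combinations of seed words to use as queries."""
        combos = list(itertools.combinations(seeds, size))
        return random.sample(combos, min(n, len(combos)))

    for tup in seed_tuples(seeds):
        print(" ".join(tup))   # each line becomes one search-engine query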

Recommended skills:

  • Python
  • (potentially) jQuery

Supervisor(s)

  • Elena Volodina/Sofie Johansson Kokkinakis
  • possibly others from Språkbanken

Medication extraction from "dirty" data

Aim

Dealing with spelling variation in Swedish medical texts with respect to names of drugs and related information, in order to improve indexing and aggregation. The extraction of medication-related information is an important task in the biomedical area, and the updating of drug vocabularies cannot keep pace with drug development. Several methods can be used, e.g. combining internal and contextual clues.

The application will primarily be based on "dirty" data (blogs, Twitter, logs), and if necessary on scientific "clean" data for comparison.
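
One simple internal-clue technique is approximate string matching of noisy mentions against a drug lexicon. A minimal sketch using edit-distance-based similarity from the Python standard library; the lexicon entries, the cutoff and the function name are illustrative.

    import difflib

    # Toy lexicon; a real one would come from a drug vocabulary such as FASS.
    DRUG_LEXICON = ["Alvedon", "Ipren", "Treo", "Citodon", "Panodil"]

    def normalize_mention(mention, cutoff=0.75):
        """Map a (possibly misspelled) blog/Twitter mention to a known drug."""
        matches = difflib.get_close_matches(
            mention.capitalize(), DRUG_LEXICON, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    print(normalize_mention("alvedone"))   # -> 'Alvedon'
    print(normalize_mention("xyz"))        # -> None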

Recommended skills

  • You don't have to be a native speaker of Swedish, but some basic knowledge of Swedish would be good to have.
  • Good programming skills

Supervisor(s)

Dimitrios Kokkinakis and possibly others from Språkbanken

References

Chen E, Hripcsak G, Xu H, Markatou M, and Friedman C. Automated acquisition of disease drug knowledge from biomedical and clinical documents: an initial study. J Am Med Inform Assoc 2008;15(1):87–98.

Chieng D, Day T, Gordon G, and Hicks J. Use of natural language programming to extract medication from unstructured electronic medical records. In: AMIA, 2007:908–8.

Segura-Bedmar I, Martinez P, and Segura-Bedmar M. Drug name recognition and classification in biomedical texts. Drug Safety 2008;13(17-18):816–23.

Sibanda T and Uzuner O. Role of local context in de-identification of ungrammatical, fragmented text. Proceedings of the North American Chapter of the Association for Computational Linguistics/Human Language Technology (NAACL-HLT 2006), New York, USA. 2006.

Topic models in Swedish literature and other collections


Topic modeling is a simple way to analyze large volumes of unlabeled text. A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. (Wikipedia <http://en.wikipedia.org/wiki/Topic_model>). Thus, a "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with "similar" meanings and distinguish between uses of words with multiple meanings. For a general introduction to topic modeling, see for example: Steyvers and Griffiths (2007).

Material and applications

The textual material to which the topic modeling resources will be applied is (i) Swedish literature collections and (ii) Swedish biomedical texts. The purpose is to identify, for example, topics that rose or fell in popularity; to classify text passages (cf. Jockers, 2011); to visualize topics by author (cf. Meeks, 2011); and to identify potential issues of interest for historians, literary scholars and others (cf. Yang et al., 2011).

Available software to be used:

  • MALLET <http://mallet.cs.umass.edu/topics.php>
  • Gensim – Topic Modelling for Humans (Python) <http://radimrehurek.com/gensim/>
  • the topicmodels package in R <http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf>
  • Comprehensive list of topic modeling software <http://www.cs.princeton.edu/~blei/topicmodeling.html>
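
As a flavour of what these tools do, here is a minimal LDA sketch with Gensim (listed above); the four toy "documents" stand in for tokenized literary or biomedical texts, so the resulting clusters are only suggestive.

    from gensim import corpora, models

    docs = [
        ["patient", "dos", "läkemedel", "behandling"],
        ["roman", "författare", "berättelse", "kapitel"],
        ["läkemedel", "biverkning", "patient", "studie"],
        ["dikt", "författare", "språk", "berättelse"],
    ]

    dictionary = corpora.Dictionary(docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                          passes=10, random_state=1)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)   # two word clusters: medical vs. literary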

Requirements

  • Good programming skills
  • It is not necessary to have Swedish as your mother tongue!

Supervisors

  • Dimitrios Kokkinakis
  • Richard Johansson
  • Mats Malm

References

Blei DM. 2012. Probabilistic topic models. Communications of the ACM, vol. 55, no. 4. <http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf>

Jockers M. 2011. Who's Your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling. Blog post by Matthew L. Jockers, posted 19 March 2010.

Meeks E. 2011. Comprehending the Digital Humanities. Blog post, Digital Humanities Specialist, posted 19 February 2011.

Steyvers M. and Griffiths T. (2007). Probabilistic Topic Models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum. <http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf>.

Yang T., Torget A. and Mihalcea R. (2011) Topic Modeling on Historical Newspapers. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. The Association for Computational Linguistics, Madison, WI. pages 96–104.

Extensive Topic Modeling bibliography: <http://www.cs.princeton.edu/~mimno/topics.html>

Information extraction for dialogue interaction


The goal of the project is to equip a robotic companion/dialogue manager with topic modelling and information extraction from corpora (for example, Wikipedia articles and topic-oriented dialogue corpora) in order to guide the conversation with a user. Rather than concentrating on a task, a companion engages in free conversation with a user, and must therefore supplement traditional rule-based dialogue management with data-driven models. The project thus attempts to examine ways in which text-driven semantic extraction techniques can be integrated with rule-based dialogue management.

Possible directions of this project are:

A. Topic modelling

The system must robustly recognise the topics of the user's utterances in order to respond appropriately. This method can be used in addition to a rule-based technique. Given a suitable corpus of topic-oriented conversations:

  • what is the most likely topic of the user's dialogue move;
  • ... and, given the sequence of topics discussed so far, what is the next most likely topic? (A minimal sketch of this second question follows.)
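
The second question can be approached as a simple Markov (bigram) model over topic sequences. A minimal sketch, with invented topic-annotated dialogues standing in for the corpus:

    from collections import Counter, defaultdict

    dialogues = [
        ["greeting", "travel", "food", "travel", "farewell"],
        ["greeting", "music", "travel", "farewell"],
    ]

    transitions = defaultdict(Counter)
    for seq in dialogues:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1

    def most_likely_next(topic):
        """Most frequent follow-up topic observed in the corpus."""
        follow = transitions[topic]
        return follow.most_common(1)[0][0] if follow else None

    print(most_likely_next("greeting"))   # e.g. 'travel'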

B. Named entity recognition and information extraction for question generation

The system could take the initiative and guide the conversation. It could start with some (Wikipedia) article and identify named entities. If any of the entities match the domain of questions that it can handle, it should generate questions about them, as in the exchange below (a sketch of the entity-recognition step follows the example).

User: I've been to Paris for holiday.
DM: Paris... I see. Have you been to the Eiffel tower?
...
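
A minimal sketch of the named-entity step using NLTK's bundled chunker; the downloaded resource names may vary with the NLTK version, and a real system would go on to match the entities against its question templates.

    import nltk

    for pkg in ["punkt", "averaged_perceptron_tagger",
                "maxent_ne_chunker", "words"]:
        nltk.download(pkg, quiet=True)

    def named_entities(sentence):
        """Return (entity string, entity label) pairs found in the sentence."""
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
        return [(" ".join(w for w, _ in st.leaves()), st.label())
                for st in tree if hasattr(st, "label")]

    print(named_entities("I've been to Paris for holiday."))
    # e.g. [('Paris', 'GPE')]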

C. Question answering

Supervisors: Simon Dobnik and possibly others from the Dialogue Technology Lab

Learning language and perception with a robot


The task of the project is to learn, using machine learning, a mapping between natural language descriptions on the one hand and sensory observations and commands issued to a simple mobile robot (Lego NXT) on the other. The project would involve building a corpus of descriptions paired with actions: one person guides the robot while another person describes what happens. Multimodal ML models would then be built from this corpus to predict a description, an action or a perceptual observation. Finally, the models should be integrated with a simple dialogue manager with which humans can interact, in order to test the success of the learning in context.

The system should be implemented in ROS (Robot Operating System), which provides access to the sensors and actuators of the robot and allows new components to be written in a simplified (well-organised) manner in Python.

Contributions/possible research directions of this thesis:

  • to examine to what extent and in which situations the Lego NXT robot can be used for learning multimodal semantics;
  • to examine whether a bag-of-features approach (the features being both linguistic and perceptual/action features) can be used to learn multimodal semantic representations (see the sketch after this list);
  • to examine how such models can be integrated with dialogue;
  • to examine ML techniques that would actively learn/update the models through interaction with a user (clarification, correction).
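
A minimal sketch of the bag-of-features idea from the second point, assuming scikit-learn: words and a discretized sensor reading go into one feature dictionary, and a standard classifier is trained on it. The data, the sonar feature, the 50 cm threshold and all names are invented for illustration.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def features(description, sonar_cm):
        """Combine word features with a discretized sonar reading."""
        feats = {"word=" + w: 1 for w in description.lower().split()}
        feats["sonar=near" if sonar_cm < 50 else "sonar=far"] = 1
        return feats

    # Toy corpus of (description, sonar reading in cm, label) triples.
    data = [
        ("the robot is close to the wall", 20, "near"),
        ("the robot is near the box", 35, "near"),
        ("the wall is far away", 150, "far"),
        ("the robot is far from the box", 200, "far"),
    ]

    vec = DictVectorizer()
    X = vec.fit_transform(features(d, s) for d, s, _ in data)
    y = [label for _, _, label in data]
    clf = LogisticRegression().fit(X, y)

    test = vec.transform([features("the box is close", 30)])
    print(clf.predict(test))   # -> ['near']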


Supervisors: Simon Dobnik and possibly others from the Dialogue Technology Lab
