Exercise generator for English or any other language available through NLTK

Goal

Developing Python-based programs (web services) for automatic generation of exercises (e.g. of the same type as in Lärka) for languages other than Swedish, using corpora and the necessary language resources/tools available through NLTK

Background

The application Lärka, currently under development together with its web services, is used for computer-assisted language learning. Lärka generates a number of exercises based on corpora (and their annotation) available through Korp. To target languages other than Swedish we need access to annotated corpora and relevant resources/tools for those languages, which are potentially available through NLTK.

Problem description

The aims of this work are therefore:

  1. to implement web service(s) for exercise generation for English or any other language using NLTK, primarily of the same types as those offered by Lärka (to avoid the hassle of implementing a user interface), or possibly other exercise types; a minimal sketch of such a generator follows this list.
  2. depending on the type of exercises you choose to implement, a number of questions may arise, e.g. how to adapt exercises to relevant learner levels, how to assign texts to appropriate language proficiency levels, how to select distractors, etc.
  3. (potentially) if other exercise types are chosen, the necessary user interface modules will need to be implemented.
  4. (potentially) evaluate the results.
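As a starting point, the sketch below shows how an NLTK pipeline could generate a simple multiple-choice cloze exercise with part-of-speech-matched distractors. The corpus choice (Brown) and the sampling strategy are illustrative assumptions, not requirements of the project:

  # Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
  #           nltk.download('brown'), nltk.download('universal_tagset')
  import random
  import nltk
  from nltk.corpus import brown

  # Distractor pool: words grouped by universal POS tag.
  pool = {}
  for word, tag in brown.tagged_words(categories='news', tagset='universal'):
      pool.setdefault(tag, set()).add(word.lower())

  def make_cloze(sentence, target_index, n_distractors=3):
      """Blank out one token and propose same-POS distractors."""
      tokens = nltk.word_tokenize(sentence)
      tagged = nltk.pos_tag(tokens, tagset='universal')
      target, tag = tagged[target_index]
      gapped = tokens[:]
      gapped[target_index] = '____'
      candidates = [w for w in pool[tag] if w != target.lower()]
      distractors = random.sample(candidates, n_distractors)
      return ' '.join(gapped), target, distractors

  stem, key, distractors = make_cloze('The cat sat on the mat .', 2)
  print(stem, '| key:', key, '| distractors:', distractors)

A real generator would of course filter the distractor pool by frequency and learner level, which is exactly the kind of question raised in point 2 above.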

Recommended skills:

  • Python
  • (potentially) jQuery

Supervisor(s)

  • Elena Volodina/Markus Forsberg
  • possibly others from Språkbanken

Automatic text classification by readability

Goal

Developing an algorithm for automatically assigning texts to relevant language learner levels (to be used in Lärka and possibly Korp)

Background

Text readability measures assign readability scores to texts according to certain surface features, such as sentence and word length. These features are not enough to fully estimate how appropriate a text is for language learners, or for other user groups with limited abilities in a language. Recent PhD research at Språkbanken (Katarina Heimann Mühlenbock) has concentrated on studying different aspects of text with regard to readability; however, no implementation has been released.

Problem description

The aims of this work are thus:

  1. to study the above-mentioned PhD thesis as well as a number of other research papers, and find a feasible implementation approach
  2. to implement a program in Python for automatic categorization of texts into CEFR levels (a minimal sketch follows this list)
  3. to implement a user interface for working with different text parameters (e.g. for switching them on/off)
  4. to evaluate the results by comparing the classifier's output against a number of texts of known CEFR levels
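The sketch below illustrates the overall shape of point 2: surface features such as LIX (a classic Swedish readability measure) and mean word length feed a supervised classifier. This feature set is precisely the kind of baseline the thesis research is meant to go beyond, and the two training examples are hypothetical placeholders:

  import re
  from sklearn.tree import DecisionTreeClassifier

  def features(text):
      """Surface readability features: LIX and mean word length."""
      sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
      words = re.findall(r'\w+', text)
      long_words = [w for w in words if len(w) > 6]
      lix = len(words) / len(sentences) + 100 * len(long_words) / len(words)
      mean_wlen = sum(len(w) for w in words) / len(words)
      return [lix, mean_wlen]

  # Hypothetical labelled examples: (text, CEFR level).
  train = [('En kort och enkel mening.', 'A1'),
           ('En betydligt mera komplicerad konstruktion exemplifieras här.', 'B2')]
  clf = DecisionTreeClassifier().fit([features(t) for t, _ in train],
                                     [level for _, level in train])
  print(clf.predict([features('Ny text att klassificera.')]))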

Recommended skills:

  • Python
  • jQuery

Supervisor(s)

  • Elena Volodina/Katarina Heimann Mühlenbock
  • possibly others from Språkbanken

Automatic selection of (thematic) texts from the web

Goal

Developing an algorithm for the automatic collection of (Swedish) texts on a specific topic from the internet (as a part of Korp and/or Lärka)

Background

The application Lärka, currently under development, is used for computer-assisted language learning. Lärka generates a number of exercises based on corpora (and their annotation) available through Korp. The topic of the source texts is, however, not known. To be able to select authentic contexts on a relevant theme (as described in the CEFR, the Common European Framework of Reference), we need an automated approach to selecting texts on a given theme, with all the subsequent annotations.

Problem description

The aims of this work include the following:

  1. to implement a Python-based program (possibly web service(s)) for automatic selection of texts from the web, e.g. using so-called “seed words” (a web-crawling approach; a minimal sketch follows this list), addressing the likely problems with language identification, duplicates, noise, etc.
  2. to test/evaluate the program's performance by creating a domain corpus for Swedish, taking CEFR themes as the basis for its sub-corpora.
  3. (potentially) to compare the performance of this program with WebBootCat/Corpus Factory (via SketchEngine).
  4. (potentially) to deploy the web service in Lärka, i.e. implement the necessary user interface “module”.
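A minimal sketch of the BootCaT-style pipeline mentioned in point 1: random tuples of seed words become search queries, and fetched pages are filtered for language and exact duplicates. The search step itself is left abstract, since it depends on whichever search API is available; langdetect is just one off-the-shelf option for language identification, and real crawling would also need boilerplate removal before the language check:

  import hashlib
  import itertools
  import random
  import requests
  from langdetect import detect

  def seed_queries(seeds, tuple_size=3, n_queries=10):
      """Random seed-word tuples to be sent to a search engine."""
      combos = list(itertools.combinations(seeds, tuple_size))
      return [' '.join(c) for c in random.sample(combos, min(n_queries, len(combos)))]

  def collect(urls, lang='sv'):
      """Download pages, drop exact duplicates and non-target-language pages."""
      seen, texts = set(), []
      for url in urls:
          html = requests.get(url, timeout=10).text
          fingerprint = hashlib.sha1(html.encode('utf-8')).hexdigest()
          if fingerprint in seen:   # exact-duplicate filter
              continue
          seen.add(fingerprint)
          if detect(html) == lang:  # crude language identification
              texts.append((url, html))
      return texts

  print(seed_queries(['miljö', 'klimat', 'utsläpp', 'energi', 'hållbar']))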

Recommended skills:

  • Python
  • (potentially) jQuery

Supervisor(s)

  • Elena Volodina/Sofie Johansson Kokkinakis
  • possibly others from Språkbanken

Medication extraction from "Dirty data"

Aim

Dealing with spelling variation in Swedish medical texts with respect to names of drugs and related information, in order to improve indexing and aggregation. Extraction of medication-related information is an important task within the biomedical area, and drug vocabularies cannot be updated fast enough to keep pace with drug development. Several methods can be used, e.g. combining internal and contextual clues (see the sketch below).

The application will primarily be based on “dirty” data (blogs, Twitter, logs) and, if necessary, on scientific “clean” data for comparison.
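The sketch below shows the two kinds of clues in their simplest form: internal clues (typical drug-name suffixes) and contextual clues (a word followed by a dosage expression). The suffix list and the dosage pattern are illustrative toys, not a real resource:

  import re

  # Internal clues: common drug-name endings (illustrative, far from complete).
  INTERNAL = re.compile(r'\b\w+(?:cillin|mycin|azol|pril|statin)\b', re.I)
  # Contextual clues: a word directly followed by a dosage expression.
  CONTEXT = re.compile(r'\b(\w+)\s+\d+\s*(?:mg|ml|g)\b', re.I)

  def candidate_drugs(text):
      hits = {m.group(0).lower() for m in INTERNAL.finditer(text)}
      hits |= {m.group(1).lower() for m in CONTEXT.finditer(text)}
      return hits

  print(candidate_drugs('tar amoxicillin 500 mg och nåt som heter zocord 10mg'))

Internal clues catch standard spellings, while contextual clues catch the misspelled or informal names that make the data “dirty” in the first place.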

Recommended skills

  • You do not have to be a native speaker of Swedish, but some basic knowledge of Swedish would be good to have.
  • Good programming skills

Supervisor(s)

Dimitrios Kokkinakis and possibly others from Språkbanken

References

Chen E, Hripcsak G, Xu H, Markatou M, and Friedman C. Automated acquisition of disease drug knowledge from biomedical and clinical documents: an initial study. J Am Med Inform Assoc 2008;15(1):87–98.

Chieng D, Day T, Gordon G, and Hicks J. Use of natural language programming to extract medication from unstructured electronic medical records. In: AMIA Annu Symp Proc, 2007:908.

Segura-Bedmar I, Martinez P, and Segura-Bedmar M. Drug name recognition and classification in biomedical texts. Drug Discovery Today 2008;13(17-18):816–23.

Sibanda T and Uzuner O. Role of local context in deidentification of ungrammatical, fragmented text. Proceedings of the North American Chapter of the Association for Computational Linguistics/Human Language Technology (NAACL-HLT 2006), New York, USA. 2006.

Topic models in Swedish literature and other collections

Topic modeling is a simple way to analyze large volumes of unlabeled text. A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. (Wikipedia <http://en.wikipedia.org/wiki/Topic_model>). Thus, a "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with "similar" meanings and distinguish between uses of words with multiple meanings. For a general introduction to topic modeling, see for example: Steyvers and Griffiths (2007).

Material and applications

The textual material to which the topic modeling resources will be applied is (i) Swedish literary collections and (ii) Swedish biomedical texts. The purpose is, for example, to identify topics that rose or fell in popularity, classify text passages (cf. Jockers, 2010), visualize topics by author (cf. Meeks, 2011), and identify potential issues of interest for historians, literary scholars and others (cf. Yang et al., 2011).

Available software to be used:

  • MALLET <http://mallet.cs.umass.edu/topics.php>
  • Gensim – Topic Modelling for Humans (Python) <http://radimrehurek.com/gensim/> (see the sketch after this list)
  • topicmodels in R <http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf>
  • Comprehensive list of topic modeling software <http://www.cs.princeton.edu/~blei/topicmodeling.html>
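As a flavour of what these packages provide, here is a minimal Gensim sketch that trains an LDA model on toy tokenized documents; real input would be documents from the literary or biomedical collections:

  from gensim import corpora, models

  texts = [['roman', 'kärlek', 'brev'],            # toy tokenized documents
           ['patient', 'diagnos', 'behandling'],
           ['roman', 'brev', 'resa']]
  dictionary = corpora.Dictionary(texts)
  bow_corpus = [dictionary.doc2bow(t) for t in texts]
  lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
  for topic_id, words in lda.print_topics():
      print(topic_id, words)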

Requirements

  • Good programming skills
  • It is not necessary to have Swedish as your mother tongue

Supervisors

Dimitrios Kokkinakis
Richard Johansson
Mats Malm

References

Blei DM. 2012. Probabilistic topic models. Communications of the ACM, 55(4). <http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf>

Jockers M. 2010. Who's Your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling. Blog post, 19 March 2010.

Meeks E. 2011. Comprehending the Digital Humanities. Digital Humanities Specialist (blog), 19 February 2011.

Steyvers M. and Griffiths T. (2007). Probabilistic Topic Models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum. <http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf>

Yang T., Torget A. and Mihalcea R. (2011). Topic Modeling on Historical Newspapers. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 96–104. Association for Computational Linguistics.

Extensive Topic Modeling bibliography: <http://www.cs.princeton.edu/~mimno/topics.html>

Information extraction for dialogue interaction

The goal of the project is to equip a robotic companion/dialogue manager with topic modelling and information extraction from corpora, for example Wikipedia articles and topic-oriented dialogue corpora, to guide its conversation with a user. Rather than concentrating on a task, a companion engages in free conversation with a user and must therefore supplement traditional rule-based dialogue management with data-driven models. The project thus attempts to examine ways in which text-driven semantic extraction techniques can be integrated with rule-based dialogue management.

Possible directions of this project are:

A. Topic modelling

The system must robustly recognise the topics of the user's utterances in order to respond appropriately. This method can be used in addition to a rule-based technique. Given a suitable corpus of topic-oriented conversations:

  • what is the most likely topic of the user's dialogue move;
  • ... and given the sequence of topics discussed so far, what is the most likely next topic? (a minimal sketch of this second question follows this list)
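For the second question, a minimal sketch is a bigram model over topic sequences, estimated from a (hypothetical) corpus of topic-labelled dialogues:

  from collections import Counter, defaultdict

  def train_transitions(dialogues):
      """Count topic-to-topic transitions in topic-labelled dialogues."""
      counts = defaultdict(Counter)
      for topics in dialogues:
          for prev, nxt in zip(topics, topics[1:]):
              counts[prev][nxt] += 1
      return counts

  def next_topic(counts, current):
      """Most likely next topic given the current one, or None if unseen."""
      return counts[current].most_common(1)[0][0] if counts[current] else None

  counts = train_transitions([['travel', 'food', 'travel'],
                              ['travel', 'food', 'music']])
  print(next_topic(counts, 'travel'))   # -> 'food'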

B. Named entity recognition and information extraction for question generation

The system could take the initiative and guide the conversation. It could start with some (Wikipedia) article and identify named entities. If any of the entities match the domain of questions it can handle, it should generate questions about them, as in the exchange below and the sketch that follows it.

User: I've been to Paris for holiday.
DM: Paris... I see. Have you been to the Eiffel tower?
...
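A minimal sketch of this direction using NLTK's off-the-shelf named-entity chunker; the question templates and the entity-type-to-question mapping are illustrative assumptions:

  # Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
  #           nltk.download('maxent_ne_chunker'), nltk.download('words')
  import nltk

  TEMPLATES = {'GPE': 'Have you ever been to {}?',      # illustrative templates
               'PERSON': 'What do you think of {}?'}

  def questions(sentence):
      """Yield template questions for entities recognised in the sentence."""
      tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
      for subtree in tree.subtrees():
          if subtree.label() in TEMPLATES:
              entity = ' '.join(word for word, _ in subtree.leaves())
              yield TEMPLATES[subtree.label()].format(entity)

  print(list(questions("I've been to Paris for holiday.")))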

C. Question answering

Supervisors: Simon Dobnik and possibly others from the Dialogue Technology Lab

Learning language and perception with a robot

The task of the project is to use machine learning to learn a mapping between natural language descriptions on the one hand and sensory observations and commands issued to a simple mobile robot (Lego NXT) on the other. The project would involve building a corpus of descriptions paired with actions: one person guides the robot while another person describes what is happening. Multimodal ML models would then be built from this corpus to predict descriptions, actions or perceptual observations. Finally, the models should be integrated with a simple dialogue manager with which humans can interact, to test the success of the learning in context.

The system should be implemented in ROS (Robot Operating System), which provides access to the robot's sensors and actuators and allows new models to be written in a simple, well-organised manner in Python (a minimal node sketch follows).
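For concreteness, here is a minimal rospy node that could log the robot's odometry alongside incoming natural-language descriptions for corpus building; the topic names are assumptions that depend on the actual robot setup:

  import rospy
  from nav_msgs.msg import Odometry
  from std_msgs.msg import String

  def on_odom(msg):
      rospy.loginfo('pose: %s', msg.pose.pose)       # robot's current pose

  def on_description(msg):
      rospy.loginfo('description: %s', msg.data)     # human's verbal description

  rospy.init_node('corpus_collector')
  rospy.Subscriber('/odom', Odometry, on_odom)              # assumed topic name
  rospy.Subscriber('/description', String, on_description)  # assumed topic name
  rospy.spin()   # run until shutdown, logging paired observations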

Contributions/possible research directions of this thesis:

  • to examine to what extent and in what situations the Lego NXT robot can be used for learning multimodal semantics;
  • to examine whether a bag-of-features approach (the features being both linguistic and perceptual/action features) can be used to learn multimodal semantic representations (see the sketch after this list);
  • to examine how such models can be integrated with dialogue;
  • to examine ML techniques that would actively learn/update the models through interaction with a user (clarification, correction).
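The bag-of-features idea from the second point can be sketched very compactly: linguistic and perceptual/action features go into one feature dictionary, and a standard classifier is trained on top. The feature names and values here are hypothetical:

  from sklearn.feature_extraction import DictVectorizer
  from sklearn.linear_model import LogisticRegression

  # Each sample mixes word features with sensor/action readings.
  samples = [({'word:left': 1, 'turn_rate': -0.4, 'speed': 0.1}, 'turn_left'),
             ({'word:forward': 1, 'turn_rate': 0.0, 'speed': 0.5}, 'go_forward')]
  vec = DictVectorizer()
  X = vec.fit_transform([f for f, _ in samples])
  clf = LogisticRegression().fit(X, [a for _, a in samples])
  print(clf.predict(vec.transform({'word:left': 1, 'turn_rate': -0.3, 'speed': 0.2})))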


Supervisors: Simon Dobnik and possibly others from the Dialogue Technology Lab

Automatic alignment between expert and non-expert language

Goal

Create automatic alignment between professional medical vocabulary and non-expert vocabulary in Swedish in order to enhance an information retrieval system.

Background

Health care professionals and lay persons express themselves in different ways when discussing medical issues. When searching for documents on a medical topic, they are most likely interested in finding documents at different reading levels and with different vocabulary. It could also be the case that the user expresses the search query in terms typical of one group or the other, while being interested in finding documents from both categories.

Språkbanken has a Swedish medical test collection with documents marked for target group (doctors or patients), which could be used both for categorization of terms and for testing.

Problem description

The task is one of automatic alignment between expert and non-expert terminology. The objective is to enrich an information retrieval system with links between corresponding concepts in the two sublanguages. The alignment can be done with different machine-learning techniques, such as k-nearest-neighbour classifiers or support vector machines; a minimal sketch of a similarity-based alignment is given below.

Automatic alignment of the vocabulary of the two groups could help the user either to find documents written for a certain target group or to find documents for either group even if the query only contains terms from one.
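One simple alignment strategy, sketched below under toy assumptions: represent each term by the contexts it occurs in within the target-group-labelled collection, and link expert and lay terms whose context vectors are most similar (in effect a 1-nearest-neighbour view of the task):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  # Toy context profiles; in practice these would be aggregated from the
  # documents (marked doctors vs. patients) in which each term occurs.
  expert = {'myokardinfarkt': 'akut bröstsmärta ocklusion nekros hjärtmuskel'}
  lay = {'hjärtattack': 'plötslig bröstsmärta ambulans sjukhus hjärtat',
         'magsår': 'ont i magen efter maten syra'}

  vec = TfidfVectorizer()
  matrix = vec.fit_transform(list(expert.values()) + list(lay.values()))
  sims = cosine_similarity(matrix[:len(expert)], matrix[len(expert):])
  for i, term in enumerate(expert):
      print(term, '->', list(lay)[sims[i].argmax()])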

Recommended skills

General knowledge of Swedish.

Some knowledge of information retrieval.

Some knowledge of machine learning.

Programming skills, for example in Python.

Supervisors

Karin Friberg Heppin and possibly others from Språkbanken.

References

Dioşan, Rogozan and Pécuchet. 2009. Automatic alignment of medical terminologies with general dictionaries for an efficient information retrieval. In Information Retrieval in Biomedicine: Natural Language Processing for Knowledge Integration.

Friberg Heppin. 2010. Resolving Power of Search Keys in MedEval – A Swedish Medical Test Collection with User Groups: Doctors and Patients.

Using medical domain language models for expert and non-expert language in user-oriented information retrieval

Goal

Create language models for medical experts and non-experts from Swedish medical documents, and use them to enhance an information retrieval system so that it retrieves documents at a level of expertise suitable for the user.

Background

When searching for documents on a medical topic, health care professionals and lay persons are most likely interested in finding documents at different levels of expertise. Most information retrieval systems do not adjust the returned ranked list of documents to the user's background.

Språkbanken has a Swedish medical test collection with documents marked for target group (doctors or patients). This could be used to build language models for the two user groups, which could then be used to adjust the results to the user's needs.

Problem description

The approach is to build language models for medical expert language and for lay language. The objective is to describe differences between the sublanguages and to use the models to retrieve documents suited to the user; a minimal sketch of language-model-based retrieval follows.
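A minimal sketch of the retrieval side, in the spirit of Hiemstra (2000): each document gets a smoothed unigram language model, and documents are ranked by the probability of generating the query. Separate doctor and patient models would be estimated from the target-group-labelled collection; the documents below are toys:

  import math
  from collections import Counter

  def rank(query, docs, lam=0.5):
      """Query-likelihood ranking with Jelinek-Mercer smoothing."""
      tokenized = [d.lower().split() for d in docs]
      coll = Counter(w for d in tokenized for w in d)   # collection model
      coll_total = sum(coll.values())
      scores = []
      for i, d in enumerate(tokenized):
          tf, dlen = Counter(d), len(d)
          score = sum(math.log(lam * tf[w] / dlen +
                               (1 - lam) * coll[w] / coll_total)
                      for w in query.lower().split())
          scores.append((score, i))
      return sorted(scores, reverse=True)

  docs = ['patienten fick penicillin mot infektionen',
          'antibiotika hjälper mot bakteriella infektioner']
  print(rank('penicillin mot', docs))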

Recommended skills

General knowledge of Swedish.

Some knowledge of information retrieval.

Some knowledge of machine learning.

Programming skills, for example in Python.

Supervisors

Karin Friberg Heppin and possibly others from Språkbanken.

References

Hiemstra, D. 2000. Using language models for information retrieval. <http://wwwhome.cs.utwente.nl/~hiemstra/papers/thesis.pdf>

Friberg Heppin. 2010. Resolving Power of Search Keys in MedEval – A Swedish Medical Test Collection with User Groups: Doctors and Patients. <https://gupea.ub.gu.se/handle/2077/22709>

Improving Accuracy and Efficiency of Automatic Software Localization

This project involves working closely with industry to improve the accuracy of automatic translation of software interfaces and documentation by exploiting context specificity.


For instance, software source code can be mined to glean the appropriate context for individual messages (e.g., to distinguish a button label from an error message); the sketch below illustrates the idea. The student(s) will incorporate GF (Grammatical Framework, www.molto-project.eu) grammars into a hybrid translation system that improves on CA Labs' current statistical and Translation Memory-based methods. User interfaces will enjoy more accurate automatic translations, and error/feedback messages will no longer be generic but adapted to the user's specific interaction scenario. The first goals are to deliver high-quality translations for the most commonly used languages/dialects and to develop an infrastructure for quickly producing acceptable-quality results for new languages. Follow-on work will optimize the translation engine for performance (thereby enabling fast, off-line translation of very large corpora of documents/artifacts).
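The context-specificity idea can be illustrated with a toy translation memory keyed by (message, UI context), so that the same source string translates differently as a button label and as an error message; all entries are hypothetical examples:

  # Hypothetical translation memory keyed by (message, context).
  TM = {('Save', 'button'): 'Spara',
        ('Save', 'error'): 'Det gick inte att spara',
        ('Save', None): 'Spara'}                       # context-free fallback

  def translate(message, context=None):
      """Prefer a context-specific entry, fall back to the generic one."""
      return TM.get((message, context)) or TM.get((message, None), message)

  print(translate('Save', 'button'))   # context-specific entry
  print(translate('Save', 'dialog'))   # unseen context: generic fallback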

This project not only involves working closely with researchers and linguists/language experts at CA Labs, but also includes a collaboration with faculty and students at the Universitat Politècnica de Catalunya. Opportunities for either short research visits or longer internships at CA Labs are very good.

S.A. McKee, A. Ranta (Chalmers/GU), V. Montés, P. Paladini (CA Labs Barcelona)
