Master thesis proposals


Automatic alignment between expert and non-expert language


Create automatic alignment between professional medical vocabulary and non-expert vocabulary in Swedish in order to enhance an information retrieval system.


Health care professionals and lay persons express themselves in different ways when discussing medical issues. When searching for documents on a medical topic, they are most likely interested in finding documents at different reading levels and with different vocabulary. It could also be the case that the user expresses the search query in terms typical of one group or the other, while being interested in finding documents from both categories.

Språkbanken has a Swedish medical test collection with documents marked for target group: Doctors or Patients, which could be used both for categorization of terms and for testing.

Problem description

The task is one of automatic alignment between expert and non-expert terminology. The objective is to enrich an information retrieval system with links between corresponding concepts in the two sublanguages. The alignment can be performed with different machine learning techniques, such as k-nearest neighbor classifiers or support vector machines.

Automatic alignment of the vocabulary of the two groups could help the user either to find documents written for a certain target group or to find documents for either group even if the query only contains terms from one.
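As an illustration of the nearest-neighbor idea, the sketch below aligns a lay term to its closest expert term by cosine similarity over context-count vectors. All terms and counts here are invented for the example; in the actual project, the vectors would be derived from the MedEval sub-corpora.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy context-count vectors, invented for the example; in practice they
# would be extracted from the Doctors and Patients document collections.
expert_terms = {
    "hypertoni": {"blodtryck": 5, "behandling": 3, "risk": 2},
    "cefalgi":   {"smärta": 4, "huvud": 5, "migrän": 2},
}
lay_term = {"huvudvärk": {"smärta": 3, "huvud": 6, "tablett": 1}}

def nearest_expert_term(lay_vec, expert_vecs):
    # 1-nearest-neighbor alignment by cosine similarity
    return max(expert_vecs, key=lambda t: cosine(lay_vec, expert_vecs[t]))

print(nearest_expert_term(lay_term["huvudvärk"], expert_terms))
```

With richer context vectors the same loop scales to whole terminologies; support vector machines would instead frame each candidate pair as a classification instance.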

Recommended skills

General knowledge of Swedish.

Some knowledge of information retrieval.

Some knowledge of machine learning.

Programming skills, for example in Python.


Karin Friberg Heppin and possibly others from Språkbanken.


Diosan, Rogozan and Pècuchet. 2009. Automatic alignment of medical terminologies with general dictionaries for an efficient information retrieval. Information retrieval in biomedicine: Natural language processing for knowledge integration.

Friberg Heppin. 2010. Resolving power of search keys in MedEval – A Swedish medical test collection with user groups: Doctors and Patients.

Using medical domain language models for expert and non-expert language for user oriented information retrieval


Create language models for medical experts and for non-experts from Swedish medical documents, and use these to enhance an information retrieval system so that it retrieves documents at a level of expertise suitable for the user.


When searching for documents on a medical topic, health care professionals and lay persons are most likely interested in finding documents at different levels of expertise. Most information retrieval systems, however, do not adjust the returned ranked list of documents to the user's background.

Språkbanken has a Swedish medical test collection with documents marked for target group: Doctors or Patients. This collection could be used to build language models for the two user groups, which in turn could be used to adjust the results to the user's needs.

Problem description

The approach is to build language models for medical expert language and for lay person language. The objective is to describe differences between the sublanguages and to use these models to retrieve documents suited to the user.
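A minimal sketch of the idea, using add-one-smoothed unigram language models trained on two invented token samples (standing in for the Doctors and Patients sub-collections) to decide which group a document is closer to:

```python
from collections import Counter
from math import log

def train_lm(tokens):
    """A unigram model is just token counts plus the total count."""
    counts = Counter(tokens)
    return counts, sum(counts.values())

def logprob(tokens, lm, vocab_size):
    counts, total = lm
    # unigram log-probability with add-one (Laplace) smoothing
    return sum(log((counts[t] + 1) / (total + vocab_size)) for t in tokens)

# Tiny invented samples standing in for the two MedEval target groups.
doctor_tokens = "patienten uppvisar hypertoni och cefalgi".split()
patient_tokens = "jag har högt blodtryck och huvudvärk".split()
vocab_size = len(set(doctor_tokens) | set(patient_tokens))

doctor_lm = train_lm(doctor_tokens)
patient_lm = train_lm(patient_tokens)

doc = "hypertoni och".split()
label = ("Doctors"
         if logprob(doc, doctor_lm, vocab_size) > logprob(doc, patient_lm, vocab_size)
         else "Patients")
print(label)
```

In the retrieval setting, the same scores could re-rank the result list toward the sublanguage that best matches the user, rather than hard-classify documents.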

Recommended skills

General knowledge of Swedish.

Some knowledge of information retrieval.

Some knowledge of machine learning.

Programming skills, for example in Python.


Karin Friberg Heppin and possibly others from Språkbanken.


Hiemstra, D. 2000. Using language models for information retrieval. <http://wwwhome.cs.utwente.nl/~hiemstra/papers/thesis.pdf>

Friberg Heppin. 2010. Resolving power of search keys in MedEval – A Swedish medical test collection with user groups: Doctors and Patients. <https://gupea.ub.gu.se/handle/2077/22709>

Improving Accuracy and Efficiency of Automatic Software Localization

This project involves working closely with industry to improve accuracy of automatic translation of software interfaces and documentation by exploiting context specificity.

For instance, software source code can be mined to glean the appropriate context for individual messages (e.g., to distinguish a button from an error message). The student(s) will incorporate GF (Grammatical Framework, www.molto-project.eu) grammars into a hybrid translation system that improves on CA Labs' current statistical- and Translation Memory-based methods. User interfaces will enjoy more accurate automatic translations, and error/feedback messages will no longer be generic, but will be adapted to the user's specific interaction scenario. The first goals are to deliver high-quality translations for the most commonly used languages/dialects and to develop an infrastructure to quickly produce acceptable-quality results for new languages. Follow-on work will optimize the translation engine for performance (thereby enabling fast, off-line translation of very large corpora of documents/artifacts).

This project not only involves working closely with researchers and linguists/language experts at CA Labs, but also includes a collaboration with faculty and students at the Universitat Politecnica de Catalunya. Opportunities for either short research visits or longer internships at CA Labs are very good.

S.A. McKee, A. Ranta (Chalmers/GU), V. Montés, P. Paladini (CA Labs Barcelona)

Name linking and visualization in (Swedish) digital collections


Given a number of Swedish novels taken from the Swedish Literature Bank (<http://litteraturbanken.se/#!om/inenglish>), pre-annotated with named entities (i.e. person names with their gender [male, female or unknown]), the purpose of this work is to: 

i) find pronominal and other references associated with these person entities and link them to each other, and ii) apply different visualization techniques for analyzing the entities in these novels with respect to the characters involved, e.g. using a network representation, so that it would be easy to identify possible clusters, such as "communities" of people.


Part (i) aims at developing (simple) coreference resolution software for Swedish, whether rule-based, machine learning-based or hybrid. According to Wikipedia: "co-reference occurs when multiple expressions in a sentence or document refer to the same thing; or in linguistic jargon, they have the same "referent." For example, in the sentence "Mary said she would help me", "she" and "Mary" are most likely referring to the same person or group, in which case they are coreferent. Similarly, in "I saw Scott yesterday. He was fishing by the lake," Scott and he are most likely coreferent." With respect to (ii), any available visualization software can be used, and there are a number available, such as Visone, Touchgraph or Gephi.


As a practical application the resulting software will be used as a supporting technology for literature scholars that want to get a bird's eye view on analyzing literature; for social network analysis etc.
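For part (ii), a character network can be reduced to a weighted edge list, which tools like Gephi, Visone or Touchgraph can load directly. The sketch below counts pairwise co-occurrences of (invented) person names within paragraphs; in the project the names would come from the pre-annotated novels, with coreference chains resolved first:

```python
from collections import Counter
from itertools import combinations

# Invented example: person names occurring in each paragraph of a novel.
paragraphs = [
    ["Elin", "Karl"],
    ["Elin", "Karl", "Anna"],
    ["Anna", "Karl"],
]

# Count pairwise co-occurrence within a paragraph, giving an undirected
# weighted edge list for a character network.
edges = Counter()
for persons in paragraphs:
    for a, b in combinations(sorted(set(persons)), 2):
        edges[(a, b)] += 1

# Print in a simple CSV-like format that graph tools can import.
for (a, b), w in sorted(edges.items()):
    print(f"{a};{b};{w}")
```

Edge weights then drive the visual analysis, e.g. thicker edges for characters that interact often, and community detection over the resulting graph.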



This project deals with "name linking and visualization" in digital collections (e.g. novels). Theoretically, the focus of the project will be framed around the term "distant reading" (Moretti, 2005), or "macro analysis". Distant reading means that "the reality of the text undergoes a process of deliberate reduction and abstraction". According to this view, understanding literature is not accomplished by studying individual texts, but by aggregating and analyzing massive amounts of data. This way it becomes possible to detect hidden aspects of plots, and the structure and interactions of characters become easier to follow, enabling experimentation with and exploration of new uses and developments that would otherwise be impossible. Moretti advocated the use of visual representations such as graphs, maps and trees for literature analysis.


Some Swedish language skills; you probably do not need to be a native speaker.

Very good programming skills.


Dimitrios Kokkinakis, PhD, Department of Swedish

Richard Johansson, PhD, Department of Swedish

Mats Malm, Prof., Department of Language and Literature

Some Relevant Links

Matthew L. Jockers website <http://www.matthewjockers.net/>

Franco Moretti. 2005. Graphs, maps, trees: abstract models for a literary history. R. R. Donnelley & Sons.

Daniela Oelke, Dimitrios Kokkinakis, Mats Malm. (2012). Advanced Visual Analytics Methods for Literature Analysis. Proceedings of the Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH). An EACL 2012 workshop. Avignon, France. <http://demo.spraakdata.gu.se/svedk/pbl/FINAL_eacl2012-1.pdf>


Multilingual FraCaS test suite


Develop a version of the FraCaS test suite in your native language


The FraCaS test suite was created as part of the FraCaS project back in the nineties. A few years ago Bill MacCartney (Stanford) made a machine-readable XML version of it, and it has been used in connection with textual entailment. This project involves developing the test suite further as a multilingual, web-accessible resource for computational semantics.

Project description

  1. Learn about the test suite in English, reading the original literature and some recent literature about its current use in computational semantics. Write a description of the work.
  2. Focus on one of the sections of the test suite, learn about the semantic problems which it illustrates, and write a description of the semantic issues involved.
  3. Translate at least the part of the test suite you focussed on in (2) into your native language and make it machine readable.
  4. Discuss the semantic issues you raised in (2) with respect to your own language and your translations. In particular focus on difficulties in translation or differences between the original English and your translation.
  5. (optional) Implement a parser for (some of) your translations and write documentation of it.
  6. (optional) Extend your parser so that it provides semantic representations which will support the inferences. Document this.
  7. (optional) Run an experiment (perhaps involving a web form) where subjects (native speakers of your language) can express their judgements about the inferences in your translation. Document the results you obtain.


Robin Cooper, Department of Philosophy, Linguistics and Theory of Science. The project will be carried out in connection with Dialogue Technology Lab associated with the Centre for Language Technology.

Fact extraction from (bio)medical article titles


Article titles are brief, descriptive and to the point, using well-chosen, specific terminology intended to attract the reader's attention. Extracting factual information from such article titles and constructing structured fact data banks has great potential to facilitate computational analysis in many areas of biomedicine and in the open domain.


Given a Swedish collection of published article titles the purpose of this proposal is twofold:

a) to automatically classify titles into factual and non-factual. For this you will need to:

  • write some simple guidelines that will help you differentiate between factual and non-factual instances/examples
  • annotate titles as factual or not
  • decide and extract suitable attributes (features) such as verbs, n-grams etc.
  • experiment with one (or more) machine learning algorithms
  • evaluate and report results

b) to extract sets of triples from the factual titles and represent them in a graphical way using available software such as "visone" or "touchgraph".

A factual title in biomedicine according to Eales et al. (2011) is: "a direct (the title does not merely imply a result but actually states the result) sentential report about the outcome of a biomedical investigation". In this proposal, we take a little more general approach since our data is not strictly biomedical, but medical in general. Such results can be both a positive or negative outcome. For instance the first example below is positive and the second negative (the annotations provided below are simplified for readability):

"Antioxidanter motverkar fria radikalers nyttiga effekter" (LT nr 28–29 2009 volym 106, pp 1808)

<substance>Antioxidanter</substance> motverkar <substance>fria radikalers</substance> nyttiga <qualifier value>effekter</qualifier value>

"B12 och folat skyddar inte mot hjärt-kärlsjukdom" (LT nr 38 2010 volym 107, pp 2228)

<substance>B12</substance> och <substance>folat</substance> skyddar inte mot <disorder>hjärt-kärlsjukdom</disorder>

A non-factual title is one that does not state all the contextual information necessary to fully understand whether the results or implications of the finding have a factual (direct) outcome. A non-factual title can also be one that uses speculative language, such as:

"Hyperemesis gravidarum kan vara ärftlig" (LT nr 22 2010 volym 107, pp 1462)

<disorder>Hyperemesis gravidarum</disorder> kan vara <qualifier value>ärftlig</qualifier value>

"Influensavaccinering av friska unga vuxna" (LT nr 14 2002 volym 107, pp 1600)

<procedure>Influensavaccinering</procedure> av <person>friska unga vuxna</person>

For training and evaluation, the article title corpus needs to be suitably divided, e.g. 75%-25%, into training sentences and test sentences. All will be manually annotated as "factual" or "non-factual", but the test portion will be kept for evaluation only and not used during training (e.g. for feature generation).
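As a minimal baseline illustrating the evaluation loop (not the full feature-based classifier described above), the sketch below labels a title as non-factual if it contains a speculative cue word. The cue list and the mini test set are invented for illustration; a machine-learned classifier would use such cues as one feature among many:

```python
# Hypothetical speculative cue words (an invented baseline feature).
SPECULATIVE_CUES = {"kan", "kanske", "tros", "möjligen"}

def predict(title):
    """Rule baseline: 'non-factual' if the title contains a speculative cue."""
    tokens = title.lower().split()
    return "non-factual" if SPECULATIVE_CUES & set(tokens) else "factual"

# Invented mini test set with gold labels; in the real project this would
# be the manually annotated 25% held-out portion.
test_set = [
    ("Antioxidanter motverkar fria radikalers nyttiga effekter", "factual"),
    ("B12 och folat skyddar inte mot hjärt-kärlsjukdom", "factual"),
    ("Hyperemesis gravidarum kan vara ärftlig", "non-factual"),
]

correct = sum(predict(title) == gold for title, gold in test_set)
print(f"accuracy: {correct}/{len(test_set)}")
```

Reporting would normally also include precision and recall per class, since the factual/non-factual distribution is unlikely to be balanced.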


A Swedish collection of published article titles (about 1,000) will be provided in two formats: a raw (unannotated) format and a version annotated with labels from a medical ontology. Note that a few titles are composed of several sentences, which can be a mix of factual and non-factual statements. A number of other annotations can be provided if necessary, such as part-of-speech tags.


Native Swedish or good Swedish language skills - all data is Swedish.

Good programming skills, interest (experience) in Machine Learning is a plus!


Dimitrios Kokkinakis

Richard Johansson


  1. Eales J., Demetriou G. and Stevens R. 2011. Creating a focused corpus of factual outcomes from biomedical experiments. Proceedings of the Mining Complex Entities from Network and Biomedical Data. Athens, Greece.
  2. Kastner I. and Monz C. 2009. Automatic Single-Document Key Fact Extraction from Newswire Articles. Proceedings of the 12th Conference of the European Chapter of the ACL (EACL). Athens, Greece.
  3. Kilicoglu H. and Bergler S. 2008. Recognizing speculative language in biomedical research articles: a linguistically motivated perspective. BMC Bioinformatics 2008, 9(Suppl 11):S10.


Robust Parsing using Minimum Edit Distance

Assume that you have a grammar for parsing utterances into some kind of semantic interpretation, such as voice commands or dialogue acts. Can you use this grammar to handle non-grammatical utterances as well?

A possible example application is if you want to implement a dialogue system for an Android device. Android has built-in speech recognition, but the ASR engine is not grammar driven, so it returns a sentence which does not have to be grammatical.

The idea of this project is to use a Minimum Edit Distance metric, such as the Levenshtein Distance, and try to find the closest sentence that is accepted by the grammar.

Suggested workflow:

1. Start with a grammar G and a test set T of sentences with associated semantics.

2. Generate all possible sentences from the grammar G (up to a given maximum length), with associated semantics. This will give a corpus C, a semantic treebank.

3. For each sentence S in the test set T, find the closest sentence S' that is in C.

4. Evaluate each sentence S by comparing its intended semantics T(S), with the semantics C(S') of the grammatical sentence S'. This will give a Semantic Error Rate (SER).

5. Do this for different possible distance metrics.

The different distance metrics can be
- Levenshtein distance on word level
- Levenshtein distance on character level
- possibly the distance on phoneme or morpheme level
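Steps 3-4 of the workflow can be sketched with a word-level Levenshtein distance and a toy generated corpus C (the command sentences and their semantics are invented for the example). Note that the ASR hypothesis "turn of the light" is equidistant from two grammatical sentences, which is exactly the kind of case the semantic evaluation in step 4 must account for:

```python
def levenshtein(a, b):
    """Edit distance between two token sequences, by dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# Toy corpus C: sentences generated from a grammar, with their semantics.
corpus = {
    "turn on the light": ("on", "light"),
    "turn off the light": ("off", "light"),
    "turn on the radio": ("on", "radio"),
}

def closest(sentence, corpus):
    """Step 3: find the grammatical sentence closest to the ASR output."""
    words = sentence.split()
    return min(corpus, key=lambda s: levenshtein(words, s.split()))

asr_output = "turn of the light"   # ungrammatical ASR hypothesis
best = closest(asr_output, corpus)
print(best, corpus[best])
```

Character-level distance would break the tie here ("of" is closer to "off" than to "on"), which is one motivation for comparing the metrics in step 5.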

Supervisor: Peter Ljunglöf

Free Dyslexia Software

There are many (more or less successful) attempts of using software to help people with dyslexia.

This project will give an overview of existing software and then develop and evaluate new software.

The supervisor will be Bengt Nordström, Dept of Computer Science and Engineering at Chalmers. He has good contacts with teachers specializing in assisting dyslexic students. The software developed in this work will be licensed under GPL or some similar scheme.

Either a solid background in Language Technology with good programming skills, or a solid background in Computer Science with a demonstrated interest in Language Technology.

Clustering corpus paragraphs for lexical differentiation


Developing and evaluating a system for clustering corpus paragraphs in order to differentiate word usages in the corpus.


Determining the range of usages for a particular word in a corpus is a great challenge. Particular aspects of this problem are investigated under the headings word sense disambiguation and word sense induction. In Språkbanken <http://språkbanken.gu.se>, the focus is on developing language-aware tools to aid us in building lexical resources, such as the Swedish FrameNet and Swesaurus (a Swedish wordnet).

Problem description

The paragraph is the smallest content unit of a text. The project aims at classifying/clustering paragraphs in a corpus in a way which makes it likely that the same lemma occurring in paragraphs of the same class (in the same cluster) will reflect the same sense of the word. This will allow us to design a corpus search interface where such hits are collapsed by default and potentially different senses can be highlighted.

The work should preferably be carried out on the Swedish SUC corpus, but some suitable English corpus could be used instead, e.g., in the framework of NLTK. Many relevant tools are available in Java; hence, a familiarity with Java is necessary.
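As a minimal illustration of the clustering idea, the sketch below assigns bag-of-words paragraph vectors to the most similar of two seed paragraphs, a single-pass stand-in for a proper clustering algorithm such as k-means. The English mini-paragraphs, built around the ambiguous lemma "bank", are invented for the example:

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector for a paragraph."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# Invented mini-paragraphs; "bank" occurs in two senses.
paragraphs = [
    "the bank raised its interest rate on loans",
    "we walked along the river bank at sunset",
    "the bank approved a loan with low interest",
]

# Single-pass assignment to the most similar seed paragraph.
seeds = [bow(paragraphs[0]), bow(paragraphs[1])]
clusters = [max(range(len(seeds)), key=lambda i: cosine(seeds[i], bow(p)))
            for p in paragraphs]
print(clusters)
```

Paragraphs 0 and 2 (financial sense) land in one cluster, paragraph 1 (river sense) in the other; in the search interface, hits within a cluster would then be collapsed as one presumed sense.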

Recommended skills

  • Fair linguistic analysis skills in the target language
  • Good programming skills, including familiarity with Java


Lars Borin and possibly others, Språkbanken

Number Sense Disambiguation for Swedish - "Assign each Number a Sense"


Word Sense Disambiguation is a well-studied field in the natural language processing community, and has resulted in a full range of successful methods and software. The identification and disambiguation of numerical information in natural language text, however, is not as well studied, and to the best of our knowledge there has not yet been any research in Sweden on empirical evidence of the linguistic variation of numerical expressions. This work is therefore a good opportunity to investigate the topic, since understanding of, e.g., quantities is important in many natural language processing tasks, such as information extraction or question answering.

A numerical expression in a text is a sequence or combination of digits with possible operators, identifiers or mathematical symbols. Numerals in text can be used to express a variety of different senses, in a similar manner to the way words are used in different senses. For instance, "11" can denote:

  • the age of a person "11 years of age"
  • a reference of time "11 hours"
  • a reference to a published article "see [11]"
  • a quantity "11 women"
  • a part of a phone number "011-726 11 28"
  • a frequency "11 Hz"
  • a latitude "11 degrees"
  • an area "11 km2"
  • a dose "11 mg/ml"
  • ...
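A minimal sketch of the identification-plus-disambiguation idea: numerals are found with a regular expression and assigned a coarse sense from cue patterns over the right context. The cue patterns and sense labels below are hypothetical; a real system would learn such features from annotated data rather than hard-code them:

```python
import re

# Hypothetical right-context cue patterns mapped to coarse sense labels.
SENSE_CUES = [
    (re.compile(r"^(år|years?)\b"), "AGE_OR_YEAR"),
    (re.compile(r"^(mg/ml|mg|ml)\b"), "DOSE"),
    (re.compile(r"^Hz\b"), "FREQUENCY"),
    (re.compile(r"^(km2|km|m)\b"), "MEASURE"),
]

# Identification: integers and simple decimals.
NUMERAL = re.compile(r"\d+(?:[.,]\d+)?")

def number_senses(text):
    """Return each numeral in the text with a coarse sense label."""
    senses = []
    for m in NUMERAL.finditer(text):
        right = text[m.end():].lstrip()
        label = "OTHER"
        for pattern, sense in SENSE_CUES:
            if pattern.match(right):
                label = sense
                break
        senses.append((m.group(), label))
    return senses

print(number_senses("Dosen ökades till 11 mg/ml vid 11 Hz"))
```

The same "11" receives two different sense labels depending on context, which is precisely the disambiguation problem; left context, sentence position and ontology lookups (e.g. LOINC) would be natural additional features.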


The purpose of this work is numerical information processing: the development of new, or adaptation of existing, algorithms for identifying and disambiguating numerical information in Swedish text material. Depending on the background and interests of the student, the work can be given different focus and scope, e.g. an own implementation of numerical information processing, adaptation of available software to Swedish, or a comparison of the effect of different resources and module combinations for numerical processing.


As a practical application the resulting software will be used as a supporting technology for number sense disambiguation of medical data perhaps using the LOINC ontology.


Dimitrios Kokkinakis, PhD, Department of Swedish, and possibly others.


Native Swedish or good Swedish language skills.

Good programming skills.

Relevant Links and References

NUMEX: SPECIFIC GUIDELINES - Message Understanding Conferences MUC-6 <http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_17.html#HEADING44>

LOINC: Logical Observation Identifiers Names and Codes (LOINC®) Users' Guide. Clem McDonald, Stan Huff, Kathy Mercer, Jo Anna Hernandez, Daniel J. Vreeman

Definition of Sekine’s Extended Named Entity, Version 6.1.0 (English). 2003. <http://qallme.fbk.eu/SekineENE_Definition_v6.pdf>

Stuart Moore, Anna Korhonen and Sabine Buchholz. 2009. Number Sense Disambiguation. In Proceedings of the 12th Conference of the Pacific Association for Computational Linguistics. Sapporo, Japan.