Name linking and visualization in (Swedish) digital collections

Aim

Given a number of Swedish novels taken from the Swedish Literature Bank (<http://litteraturbanken.se/#!om/inenglish>), pre-annotated with named entities (i.e. person names with their gender [male, female or unknown]), the purpose of this work is to: 

i) to find pronominal and other references associated with these person entities and link them to each other, and ii) to apply different visualization techniques for analyzing the entities in these novels with respect to the characters involved, e.g. using a network representation that makes it easy to identify possible clusters, such as "communities" of people.

 

Goal (i) amounts to developing a (simple) coreference resolution system for Swedish, whether rule-based, machine learning-based or hybrid. According to Wikipedia: "co-reference occurs when multiple expressions in a sentence or document refer to the same thing; or in linguistic jargon, they have the same 'referent'. For example, in the sentence 'Mary said she would help me', 'she' and 'Mary' are most likely referring to the same person or group, in which case they are coreferent. Similarly, in 'I saw Scott yesterday. He was fishing by the lake,' Scott and he are most likely coreferent." With respect to (ii), any available visualization software can be used, and there are a number of options, such as Visone, Touchgraph or Gephi.
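A rule-based starting point for (i) can be very simple: link each pronoun to the nearest preceding person name with a matching gender annotation. The sketch below illustrates this naive baseline; the function name and the input format (token list plus a name-index-to-gender map) are illustrative assumptions, not a prescribed design.

```python
# Naive pronoun-to-name linking: nearest preceding name of matching gender.
PRONOUN_GENDER = {"han": "male", "hon": "female"}  # Swedish "he", "she"

def link_pronouns(tokens, names):
    """tokens: list of word tokens; names: {token_index: gender} for
    pre-annotated person names. Returns {pronoun_index: name_index}."""
    links = {}
    for i, tok in enumerate(tokens):
        gender = PRONOUN_GENDER.get(tok.lower())
        if gender is None:
            continue
        # Scan backwards for the closest name with the same gender.
        for j in range(i - 1, -1, -1):
            if names.get(j) == gender:
                links[i] = j
                break
    return links

tokens = "Maria sade att hon skulle hjälpa Erik , och han tackade".split()
names = {0: "female", 6: "male"}  # "Maria" and "Erik"
print(link_pronouns(tokens, names))  # {3: 0, 9: 6}
```

A real system would of course also need to handle plural pronouns, nested quotations and gender-unknown names, which is where the rule-based/machine-learning design choice comes in.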

 

As a practical application, the resulting software will be used as a supporting technology for literature scholars who want a bird's-eye view when analyzing literature, for social network analysis, etc.
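For (ii), the input to tools such as Gephi or Visone is typically a weighted edge list. A minimal sketch of how one might derive such a network from the linked entities, counting how often two characters co-occur in the same text window (the windowed-set input format is an assumption for illustration):

```python
# Build a character co-occurrence network as a weighted edge list.
from collections import Counter
from itertools import combinations

def cooccurrence_edges(windows):
    """windows: list of sets of character names that appear together
    (e.g. within one paragraph or chapter).
    Returns a Counter mapping undirected (name, name) pairs to weights."""
    edges = Counter()
    for chars in windows:
        for a, b in combinations(sorted(chars), 2):  # sorted: undirected
            edges[(a, b)] += 1
    return edges

windows = [{"Maria", "Erik"}, {"Maria", "Erik", "Anna"}, {"Anna", "Erik"}]
edges = cooccurrence_edges(windows)
print(edges[("Erik", "Maria")])  # 2
```

The resulting edge list can be written out as CSV and imported directly into any of the mentioned network visualization tools.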

 

Background

This project deals with "name linking and visualization" in digital collections (e.g. novels). Theoretically, the project will be framed around the term "distant reading" (Moretti, 2005), also known as "macro analysis". Distant reading means that "the reality of the text undergoes a process of deliberate reduction and abstraction". According to this view, understanding literature is accomplished not by studying individual texts, but by aggregating and analyzing massive amounts of data. This makes it possible to detect otherwise hidden aspects of plots, the structure and interactions of characters become easier to follow, and it enables experimentation and exploration of new uses and developments that would otherwise be impossible. Moretti advocated the use of visual representations such as graphs, maps and trees for literature analysis.

Prerequisites:

Some Swedish language skills - one probably does not need to be a native speaker.

Very good programming skills.

Supervisors

Dimitrios Kokkinakis, PhD, Department of Swedish

Richard Johansson, PhD, Department of Swedish

Mats Malm, Prof., Department of Language and Literature

Some Relevant Links

Matthew L. Jockers website <http://www.matthewjockers.net/>

Franco Moretti. 2005. Graphs, maps, trees: abstract models for a literary history. R. R. Donnelley & Sons.

Daniela Oelke, Dimitrios Kokkinakis, Mats Malm. (2012). Advanced Visual Analytics Methods for Literature Analysis. Proceedings of the Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH). An EACL 2012 workshop. Avignon, France. <http://demo.spraakdata.gu.se/svedk/pbl/FINAL_eacl2012-1.pdf>

 

Multilingual FraCaS test suite

Goal

Develop a version of the FraCaS test suite in your native language

Background

The FraCaS test suite was created as part of the FraCaS project in the nineties. A few years ago Bill MacCartney (Stanford) made a machine-readable XML version of it, and it has since been used in connection with textual entailment. This project involves developing the test suite further as a multilingual, web-accessible resource for computational semantics.

Project description

  1. Learn about the test suite in English, reading the original literature and some recent literature about its current use in computational semantics. Write a description of the work.
  2. Focus on one of the sections of the test suite, learn about the semantic problems which it illustrates, and write a description of the semantic issues involved.
  3. Translate at least the part of the test suite you focussed on in (2) into your native language and make it machine readable.
  4. Discuss the semantic issues you raised in (2) with respect to your own language and your translations. In particular focus on difficulties in translation or differences between the original English and your translation.
  5. (optional) Implement a parser for (some of) your translations and write documentation of it.
  6. (optional) Extend your parser so that it provides semantic representations which will support the inferences. Document this.
  7. (optional) Run an experiment (perhaps involving a web form) where subjects (native speakers of your language) can express their judgements about the inferences in your translation. Document the results you obtain.

Supervisors

Robin Cooper, Department of Philosophy, Linguistics and Theory of Science. The project will be carried out in connection with the Dialogue Technology Lab, associated with the Centre for Language Technology.

Fact extraction from (bio)medical article titles

Introduction

Article titles are brief, descriptive and to the point, using well-chosen, specific terminology intended to attract the reader's attention. Factual information extraction from such article titles, and the construction of structured fact data banks, has great potential to facilitate computational analysis in many areas of biomedicine and in the open domain.

Purpose

Given a Swedish collection of published article titles the purpose of this proposal is twofold:

a) to automatically classify titles into factual and non-factual. For this you will need to:

  • write some simple guidelines that will help you differentiate between factual/non factual instances/examples
  • annotate titles as factual or not
  • decide and extract suitable attributes (features) such as verbs, n-grams etc.
  • experiment with one (or more) machine learning algorithms
  • evaluate and report results

b) to extract sets of triples from the factual titles and represent them in a graphical way using available software such as Visone or Touchgraph.

A factual title in biomedicine, according to Eales et al. (2011), is: "a direct (the title does not merely imply a result but actually states the result) sentential report about the outcome of a biomedical investigation". In this proposal we take a slightly more general approach, since our data is not strictly biomedical but medical in general. Such a result can be either a positive or a negative outcome. For instance, the first example below is positive and the second negative (the annotations provided below are simplified for readability):

"Antioxidanter motverkar fria radikalers nyttiga effekter" (LT nr 28–29 2009 volym 106, pp 1808)

<substance>Antioxidanter</substance> motverkar <substance>fria radikalers</substance> nyttiga <qualifier value>effekter</qualifier value>

"B12 och folat skyddar inte mot hjärt-kärlsjukdom" (LT nr 38 2010 volym 107, pp, 2228)

<substance>B12</substance> och <substance>folat</substance> skyddar inte mot <disorder>hjärt-kärlsjukdom</disorder>

A non-factual title can be one that does not state all the contextual information necessary to fully understand whether the results or implications of the finding have a factual (direct) outcome. A non-factual title can also be one that uses speculative language, such as:

"Hyperemesis gravidarum kan vara ärftlig" (LT nr 22 2010 volym 107, pp 1462)

<disorder>Hyperemesis gravidarum</disorder> kan vara <qualifier value>ärftlig</qualifier value>

"Influensavaccinering av friska unga vuxna" (LT nr 14 2002 volym 107, pp 1600)

<procedure>Influensavaccinering</procedure> av <person>friska unga vuxna</person>

For training and evaluation, the article title corpus needs to be suitably divided, e.g. 75%-25%, into training sentences and test sentences. All will be manually annotated as "factual" or "non-factual", but the test portion will be kept only for evaluation and not used during training (e.g. for feature generation).
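Before applying machine learning, it can be useful to establish a hand-written baseline from simple features such as speculative cue words and negation. The sketch below is only an illustration; the cue word list is an invented assumption, not an authoritative lexicon, and the rule (speculation implies non-factual, while negated results still count as factual outcomes) mirrors the annotated examples above.

```python
# A hand-written rule baseline for the factual/non-factual split.
# The speculative cue list is illustrative, not exhaustive.
SPECULATIVE = {"kan", "kanske", "möjligen", "tycks", "verkar"}

def features(title):
    """Extract a few simple features from a Swedish article title."""
    toks = title.lower().split()
    return {
        "has_speculation": any(t in SPECULATIVE for t in toks),
        "has_negation": "inte" in toks,
        "n_tokens": len(toks),
    }

def rule_baseline(title):
    """Label a title non-factual if it contains a speculative cue;
    a negated result ("skyddar inte") is still a stated outcome."""
    return "non-factual" if features(title)["has_speculation"] else "factual"

print(rule_baseline("Hyperemesis gravidarum kan vara ärftlig"))           # non-factual
print(rule_baseline("B12 och folat skyddar inte mot hjärt-kärlsjukdom"))  # factual
```

The same feature dictionaries can then be fed to any standard classifier (e.g. Naive Bayes or an SVM) and evaluated against the held-out test portion.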

Material

A Swedish collection of published article titles (about 1,000) will be provided in two formats: a raw (unannotated) format and a version annotated with labels from a medical ontology. Note that a few titles are composed of several sentences, which can be a mix of factual and non-factual statements. A number of other annotations can be provided if necessary, such as part-of-speech tags.

Prerequisites

Native Swedish or good Swedish language skills - all data is Swedish.

Good programming skills, interest (experience) in Machine Learning is a plus!

Supervisors

Dimitrios Kokkinakis

Richard Johansson

References

  1. Eales J., Demetriou G. and Stevens R. 2011. Creating a focused corpus of factual outcomes from biomedical experiments. Proceedings of the Mining Complex Entities from Network and Biomedical Data. Athens, Greece.
  2. Kastner I. and Monz C. 2009. Automatic Single-Document Key Fact Extraction from Newswire Articles. Proceedings of the 12th Conference of the European Chapter of the ACL (EACL). Athens, Greece.
  3. Kilicoglu H. and Bergler S. 2008. Recognizing speculative language in biomedical research articles: a linguistically motivated perspective. BMC Bioinformatics 2008, 9(Suppl 11):S10.

 

Robust Parsing using Minimum Edit Distance

Assume that you have a grammar for parsing utterances into some kind of semantic interpretation, such as voice commands or dialogue acts. Can you use this grammar also for handling non-grammatical utterances?

A possible example application is if you want to implement a dialogue system for an Android device. Android has built-in speech recognition, but the ASR engine is not grammar driven, so it returns a sentence which does not have to be grammatical.

The idea of this project is to use a Minimum Edit Distance metric, such as the Levenshtein Distance, and try to find the closest sentence that is accepted by the grammar.

Suggested workflow:

1. Start with a grammar G and a test set T of sentences with associated semantics.

2. Generate all possible sentences from the grammar G (up to a given maximum length), with associated semantics. This will give a corpus C, a semantic treebank.

3. For each sentence S in the test set T, find the closest sentence S' that is in C.

4. Evaluate each sentence S by comparing its intended semantics T(S), with the semantics C(S') of the grammatical sentence S'. This will give a Semantic Error Rate (SER).

5. Do this for different possible distance metrics.

The different distance metrics can be:

  • Levenshtein distance on the word level
  • Levenshtein distance on the character level
  • possibly a distance on the phoneme or morpheme level
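The core of step 3 can be sketched in a few lines: a standard dynamic-programming Levenshtein distance over token sequences, plus a nearest-sentence lookup in the generated corpus C (here a toy list of command sentences, invented for illustration).

```python
# Word-level Levenshtein distance and nearest-sentence lookup.
def levenshtein(a, b):
    """Minimum edit distance between two token sequences (DP, O(len(a)*len(b)))."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def closest(sentence, corpus):
    """Return the corpus sentence with minimum word-level distance."""
    return min(corpus, key=lambda s: levenshtein(sentence.split(), s.split()))

C = ["turn on the light", "turn off the light", "open the door"]
print(closest("please turn on light", C))  # turn on the light
```

For a realistic grammar the generated corpus can be large, so in practice one would index it (e.g. with a BK-tree or by sentence length) rather than scan it linearly as here.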
 

Supervisor: Peter Ljunglöf

Free Dyslexia Software

There are many (more or less successful) attempts of using software to help people with dyslexia.

This project will give an overview of existing software and then develop and evaluate new software.

The supervisor will be Bengt Nordström, Dept of Computer Science and Engineering at Chalmers. He has good contacts with teachers specializing in assisting dyslexic students. The software developed in this work will be licensed under GPL or some similar scheme.

Requirement:
Either a solid background in Language Technology with good programming skills or a solid background in Computer Science with shown interests in Language Technology

Clustering corpus paragraphs for lexical differentiation

Goal

Developing and evaluating a system for clustering corpus paragraphs in order to differentiate word usages in the corpus

Background

Determining the range of usages for a particular word in a corpus is a great challenge. Particular aspects of this problem are investigated under the headings word sense disambiguation and word sense induction. In Språkbanken <http://språkbanken.gu.se>, the focus is on developing language-aware tools to aid us in building lexical resources, such as the Swedish FrameNet and Swesaurus (a Swedish wordnet).

Problem description

The paragraph is the smallest content unit of a text. The project aims at classifying/clustering paragraphs in a corpus in a way which makes it likely that the same lemma occurring in paragraphs of the same class (in the same cluster) will reflect the same sense of the word. This will allow us to design a corpus search interface where such hits are collapsed by default and potentially different senses can be highlighted.

The work should preferably be carried out on the Swedish SUC corpus, but some suitable English corpus could be used instead, e.g., in the framework of NLTK. Many relevant tools are available in Java; hence, a familiarity with Java is necessary.
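As a minimal illustration of the idea (in Python here, though the project itself would likely use Java tools), paragraphs can be represented as bag-of-words vectors and grouped by cosine similarity. The greedy single-pass clusterer and the 0.5 threshold below are illustrative stand-ins for whatever algorithm the project finally adopts.

```python
# Bag-of-words cosine similarity and a greedy threshold clusterer.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two Counter term vectors."""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def cluster(paragraphs, threshold=0.5):
    """Assign each paragraph to the first cluster whose seed it resembles."""
    vecs = [Counter(p.lower().split()) for p in paragraphs]
    clusters = []  # each cluster is a list of paragraph indices
    for i, v in enumerate(vecs):
        for c in clusters:
            if cosine(v, vecs[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

paras = ["the bank raised interest rates",
         "the bank lowered interest rates",
         "we walked along the river bank"]
print(cluster(paras))  # [[0, 1], [2]]
```

Here the two financial uses of "bank" end up together while the riverside use forms its own cluster, which is exactly the behaviour the search interface would exploit.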

Recommended skills

  • Fair linguistic analysis skills in the target language
  • Good programming skills, including familiarity with Java

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Number Sense Disambiguation for Swedish - "Assign each Number a Sense"

Background

Word Sense Disambiguation is a well-studied field in the natural language processing community and has resulted in a full range of successful methods and software. However, the identification and disambiguation of numerical information in natural language text is not as well studied, and to the best of our knowledge there has not yet been any research in Sweden on empirical evidence of the linguistic variation of numerical expressions. This work is therefore a good opportunity to investigate the topic, since an understanding of e.g. quantities is important in many natural language processing tasks (e.g. information extraction or question answering).

A numerical expression in a text is a sequence or combination of digits with possible operators, identifiers or mathematical symbols. Numerals in text can be used to express a variety of different senses, in a similar manner to how words are used in different senses. For instance, "11" can denote:

  • the age of a person "11 years of age"
  • a reference of time "11 hours"
  • a reference to a published article "see [11]"
  • a quantity "11 women"
  • a part of a phone number "011-726 11 28"
  • a frequency "11 Hz"
  • a latitude "11 degrees"
  • an area "11 km2"
  • a dose "11 mg/ml"
  • ...
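A simple baseline for disambiguating such numerals is to classify each number by the unit or keyword immediately following it. The pattern table below is an illustrative assumption covering a few of the senses listed above, not a full Swedish grammar of numerical expressions.

```python
# Context-window baseline for number sense disambiguation.
import re

# (regex over the following context, sense label) - illustrative rules only.
RULES = [
    (r"års? ålder|år gammal", "AGE"),
    (r"hz", "FREQUENCY"),
    (r"mg/ml", "DOSE"),
    (r"km2|km²", "AREA"),
    (r"kvinnor|personer", "QUANTITY"),
]

def number_sense(text, num):
    """Label numeral `num` by the one or two tokens following it in `text`."""
    m = re.search(re.escape(num) + r"\s*(\S+(?:\s+\S+)?)", text)
    if not m:
        return "UNKNOWN"
    ctx = m.group(1).lower()
    for pat, label in RULES:
        if re.search(pat, ctx):
            return label
    return "UNKNOWN"

print(number_sense("en dos på 11 mg/ml gavs", "11"))  # DOSE
print(number_sense("uppmätt vid 11 Hz", "11"))        # FREQUENCY
```

Cases such as phone numbers or citation brackets ("see [11]") need context on both sides of the numeral, which is where a machine-learning approach would improve on this right-context-only baseline.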

Purpose

The focus of this work is numerical information processing and the development of new, or adaptation of existing, algorithms for the identification and disambiguation of numerical information in Swedish text material. Depending on the background and interests of the student, the work can be given different focus and scope; e.g. an own implementation of a numerical information processing system, or adapting available software to Swedish; comparing the effect of different resources and module combinations for numerical processing, etc.

Application

As a practical application the resulting software will be used as a supporting technology for number sense disambiguation of medical data perhaps using the LOINC ontology.

Supervisors

Dimitrios Kokkinakis, PhD, Department of Swedish, and possibly others.

Prerequisites:

Native Swedish or good Swedish language skills.

Good programming skills.

Relevant Links and References

NUMEX: SPECIFIC GUIDELINES - Message Understanding Conferences MUC-6 <http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_17.html#HEADING44>

LOINC: Logical Observation Identifiers Names and Codes (LOINC®) Users' Guide. Clem McDonald, Stan Huff, Kathy Mercer, Jo Anna Hernandez, Daniel J. Vreeman

Definition of Sekine’s Extended Named Entity, Version 6.1.0 (English). 2003. <http://qallme.fbk.eu/SekineENE_Definition_v6.pdf>

Stuart Moore, Anna Korhonen and Sabine Buchholz. 2009. Number Sense Disambiguation. In Proceedings of the 12th Conference of the Pacific Association for Computational Linguistics. Sapporo, Japan.

 

Temporal information in Swedish - identification, resolution, normalization and standardization

Background

Identification and resolution of temporal (and numerical) information in natural language text is important in many tasks in artificial intelligence (temporal reasoning) and natural language processing (information extraction and retrieval, Q&A).

A temporal expression in a text is a sequence of tokens (words, numbers and characters) that denotes time, i.e. expresses a point in time, a duration or a frequency.

Purpose

The purpose of this work is temporal information processing and the development of algorithms for temporal information identification, resolution, normalization and standardization using TIMEX3/TimeML (or equivalent) on Swedish text material.

For instance, the examples below illustrate how the TIMEX3 format is used:

  • "June 7, 2003": <TIMEX3 tid="t1" type="DATE" value="2003-06-07">June 7, 2003</TIMEX3>
  • "the dawn of 2000": <TIMEX3 tid="t2" type="DATE" value="2000" mod="START">the dawn of 2000</TIMEX3>

A more complex example can look like this:

  • "two weeks from June 7, 2003": <TIMEX3 tid="t6" type="DURATION" value="P2W" beginPoint="t61" endPoint="t62">two weeks</TIMEX3> from <TIMEX3 tid="t61" type="DATE" value="2003-06-07">June 7, 2003</TIMEX3><TIMEX3 tid="t62" type="DATE" value="2003-06-21" temporalFunction="true" anchorTimeID="t6"/>

Depending on the background and interests of the student, the work can be given different focus and scope; e.g. an own implementation of a temporal information processing system, or adapting available software to Swedish; comparing the effect of different resources and module combinations for temporal processing, etc.

Application

As a practical application, the resulting software will be used as a supporting technology for de-identifying temporal information in patient data. Normalized and standardized temporal occurrences in authentic text (patient histories) will be used to "mask" the temporal information in the text. For instance, a text occurrence of the date "2011-12-15" will be converted to e.g. "start date + 4 months 7 days" (under the assumption that 'start date' is a relevant point in time from which a patient history started to be recorded). Note! The development of this application will be carried out on non-authentic texts, but the intention is to use the developed software on real data.
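The masking step can be sketched with the standard library alone: once a date has been normalized, express it as a month-and-day offset from a (hypothetical) start of the patient record. The arithmetic below is calendar-naive (it assumes the start day exists in the anchor month), which is enough to illustrate the idea.

```python
# Rewrite an absolute date as an offset from the record's start date.
from datetime import date

def mask_date(d, start):
    """Express date d as 'start date + M months D days' relative to start."""
    months = (d.year - start.year) * 12 + (d.month - start.month)
    def anchor(m):
        # The date `m` whole months after `start` (calendar-naive).
        return date(start.year + (start.month - 1 + m) // 12,
                    (start.month - 1 + m) % 12 + 1, start.day)
    if anchor(months) > d:   # overshot: back off one month
        months -= 1
    days = (d - anchor(months)).days
    return f"start date + {months} months {days} days"

# With a hypothetical start date of 2011-08-08, the example from the text:
print(mask_date(date(2011, 12, 15), date(2011, 8, 8)))
# start date + 4 months 7 days
```

A production de-identifier would also have to handle relative expressions ("förra veckan") and month-end edge cases, but the offset representation itself stays this simple.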

Supervisors

Dimitrios Kokkinakis, PhD, Department of Swedish

Staffan Svensson, PhD, MD, specialist in clinical pharmacology

Prerequisites:

Native Swedish or good Swedish language skills.

Good programming skills.

Relevant Links

TempEval Temporal Relation Identification <http://timeml.org/tempeval/>

TempEval2: Evaluating Events, Time Expressions, and Temporal Relations <http://www.timeml.org/tempeval2/>

TempEval3: Temporal Annotation <http://www.cs.york.ac.uk/semeval-2013/task1/>

TimeML: Markup Language for Temporal and Event Expressions <http://www.timeml.org/site/index.html>

TIMEX at MUC-6 <http://www.timexportal.info/timexmuc6>

Guidelines for Temporal Expression Annotation for English for TempEval 2010. <http://www.timeml.org/tempeval2/tempeval2-trial/guidelines/timex3guidelines-072009.pdf>

A multilingual corpus database for typological and genetic linguistics

Goal

Building a multilingual corpus database and interface for typological and genetic linguistics research

Background

Over the last few years, linguists and computational linguists have started looking into the possibilities of using multilingual corpora (mainly parallel corpora) for typological and genetic linguistic research.

Problem description

The aims of this work are (1) to collect and link at the verse level as many digitized Bible texts as possible; (2) to apply linguistic annotation tools for those languages where such tools are available (at least English and Swedish); (3) to correlate linguistic units of varying granularity among the languages using the linguistic annotations and freely available word alignment tools; (4) to design the first version of a user interface for conducting research with the database; (5) to conduct a small typological or genetic linguistic study as a showcase of the utility of the database and user interface.
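Step (1) hinges on the fact that Bible texts share a canonical addressing scheme, so verses can be linked by keying them on (book, chapter, verse). A minimal sketch, with an invented two-language fragment as data:

```python
# Link Bible translations at the verse level via (book, chapter, verse) keys.
def link_verses(corpora):
    """corpora: {lang: {(book, chapter, verse): text}}.
    Returns {(book, chapter, verse): {lang: text}} for verses
    present in every translation."""
    shared = set.intersection(*(set(c) for c in corpora.values()))
    return {k: {lang: c[k] for lang, c in corpora.items()} for k in shared}

corpora = {
    "eng": {("John", 11, 35): "Jesus wept."},
    "swe": {("John", 11, 35): "Och Jesus grät."},
}
linked = link_verses(corpora)
print(linked[("John", 11, 35)]["swe"])  # Och Jesus grät.
```

In practice verse numbering differs slightly between traditions, so a real implementation would need a versification mapping on top of this naive intersection.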

Recommended skills

  • Good knowledge of typological and possibly genetic linguistics
  • Very good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Linking a pronunciation lexicon to SALDO

Goal

Linking the NST Swedish pronunciation lexicon to SALDO

Background

The NST Swedish pronunciation lexicon is a large (almost 1M entries) fullform lexicon for Swedish linking text words to their (standard) pronunciations. SALDO is a large semantic and morphological lexicon for Swedish (see <http://spraakbanken.gu.se/eng/saldo/>).

Problem description

The aim of this work is to link lexical entries in SALDO to the corresponding entries in the NST lexicon, as well as to explore the feasibility of providing the SALDO-FM morphological component with a pronunciation module, i.e., generate pronunciations for word forms not present in the NST lexicon from parts of word forms that are in this lexicon.
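The first linking pass can be as simple as a lookup on (word form, part of speech), which also reveals which SALDO word forms the NST lexicon does not cover and therefore need the generated-pronunciation module. The entry formats and the SAMPA-like pronunciation string below are invented stand-ins for the real lexicon formats.

```python
# First-pass linking of SALDO entries to NST pronunciations.
def link_lexicons(saldo, nst):
    """saldo: list of (lemma, wordform, pos) entries;
    nst: {(wordform, pos): pronunciation}.
    Returns (links, missing): links maps SALDO entries to pronunciations,
    missing lists word forms the NST lexicon does not cover."""
    links, missing = {}, []
    for lemma, form, pos in saldo:
        pron = nst.get((form, pos))
        if pron is None:
            missing.append(form)
        else:
            links[(lemma, form, pos)] = pron
    return links, missing

saldo = [("hus", "huset", "nn"), ("hus", "husens", "nn")]
nst = {("huset", "nn"): '"h}:s@t'}  # invented transcription
links, missing = link_lexicons(saldo, nst)
print(missing)  # ['husens']
```

The `missing` list is exactly the input for the second part of the project: generating pronunciations compositionally from word-form parts that do occur in NST.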

Recommended skills

  • Good knowledge of Swedish morphological analysis, and some knowledge of Swedish phonology and phonetics
  • General familiarity with morphological analysis systems
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken
