SEMINAR

This talk will introduce SVALex, a lexical resource primarily aimed at learners and teachers of Swedish as a foreign and second language. It describes the distribution of 15,681 words and expressions across the levels of the Common European Framework of Reference (CEFR), based on a corpus of coursebook texts. The talk will center on COCTAILL, the corpus from which the list has been generated, the methodology applied to create the list, and a comparison of SVALex to other lexical resources. There will also be an opportunity to browse the list via a specially set-up user interface.

Apart from describing the work done, I would like to invite discussion on how to continue our work, especially when it comes to profiling the existing word list into central versus peripheral vocabulary, as well as on what grounds to assign vocabulary to each level.

Joint work by Elena Volodina, Ildikó Pilán, and Thomas François.

Date: 2015-09-17 10:30 - 12:00

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

Three talks in preparation for the RANLP conference in Hissar, Bulgaria: http://lml.bas.bg/ranlp2015/.

(1) Mehdi Ghanimifard (FLoV): "Enriching word-sense embeddings with translational context".

Vector-space models derived from corpora are an effective way to learn a representation of word meaning directly from data, and these models have many uses in practical applications. A number of unsupervised approaches have been proposed to learn representations of word senses directly from corpora, but since these methods use no information but the words themselves, they sometimes miss distinctions that would be possible to make if more information were available. In this paper, we present a general framework called context enrichment that incorporates external information during the training of multi-sense vector-space models. Our approach is agnostic as to which external signal is used to enrich the context; here, we use translations as the source of enrichment. We evaluated the models trained using the translation-enriched context on several similarity benchmarks and a word analogy test set. In all our evaluations, the enriched model soundly outperformed the purely word-based baseline.

(2) Luis Nieto Piña (Språkbanken): "A simple and efficient method to generate word sense representations".

Distributed representations of words have boosted the performance of many Natural Language Processing tasks. However, usually only one representation per word is obtained, not acknowledging the fact that some words have multiple meanings. This has a negative effect on the individual word representations and the language model as a whole. In this paper we present a simple model that enables recent techniques for building word vectors to represent distinct senses of polysemic words. In our assessment of this model we show that it is able to effectively discriminate between words' senses and to do so in a computationally efficient manner.

(3) Olof Mogren (CSE, Chalmers): "Extractive summarization by aggregating multiple similarities".

News reports, social media streams, blogs, digitized archives and books are part of a plethora of reading sources that people face every day. This raises the question of how to best generate automatic summaries. Many existing methods for extracting summaries rely on comparing the similarity of two sentences in some way. We present new ways of measuring this similarity, based on sentiment analysis and continuous vector space representations, and show that combining these with similarity measures from existing methods helps to create better summaries. The finding is demonstrated with MULTSUM, a novel summarization method that uses ideas from kernel methods to combine sentence similarity measures. Submodular optimization is then used to produce summaries that take several different similarity measures into account. Our method improves over the state of the art on standard benchmark datasets; it is also fast and scales to large document collections, and the results are statistically significant.
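The aggregation idea can be illustrated with a small sketch. It assumes, as is common with kernel methods, that similarity matrices are combined by elementwise multiplication, and it uses a simple greedy coverage objective; MULTSUM's actual objective and optimizer may differ.

```python
import numpy as np

def combine_similarities(matrices):
    """Aggregate several sentence-similarity matrices by elementwise
    multiplication (a product of kernels is again a kernel)."""
    combined = np.ones_like(matrices[0])
    for m in matrices:
        combined = combined * m
    return combined

def greedy_summary(sim, k):
    """Greedily select k sentences maximizing a submodular coverage
    objective: every sentence is credited with its best similarity
    to any sentence already in the summary."""
    n = sim.shape[0]
    selected = []
    for _ in range(k):
        best_i, best_gain = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            coverage = sim[:, selected + [i]].max(axis=1).sum()
            if coverage > best_gain:
                best_i, best_gain = i, coverage
        selected.append(best_i)
    return selected

# Toy example: sentences 0/1 are near-duplicates, as are 2/3.
tfidf_sim = np.array([[1.0, 0.9, 0.1, 0.2],
                      [0.9, 1.0, 0.2, 0.1],
                      [0.1, 0.2, 1.0, 0.8],
                      [0.2, 0.1, 0.8, 1.0]])
sentiment_sim = np.array([[1.0, 0.8, 0.3, 0.2],
                          [0.8, 1.0, 0.2, 0.3],
                          [0.3, 0.2, 1.0, 0.9],
                          [0.2, 0.3, 0.9, 1.0]])
combined = combine_similarities([tfidf_sim, sentiment_sim])
summary = greedy_summary(combined, 2)
```

Because coverage gains diminish as the summary grows, the greedy strategy picks one sentence from each near-duplicate pair rather than two redundant ones.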

Date: 2015-08-31 10:15 - 12:00

Location: L307, Lennart Torstenssonsgatan 8

SEMINAR

This thesis will investigate cognitive workload in relation to the multitasking activity of driving while interacting with a dialogue system, and look at dialogue strategies for preventing high cognitive workload or shortening its duration. We will do this by analysing a corpus of in-vehicle driver-passenger dialogue collected in the DICO project. The aim of the thesis is to learn more about in-vehicle dialogue under varying cognitive workload, and the long-term goal is to improve in-vehicle dialogue system interaction.

Supervisor: Staffan Larsson.
Opponent: Simon Dobnik.

Thesis draft: https://linux.dobnik.net/oblacek/index.php/s/GQDmzN9ltUdtNCZ

Date: 2015-08-12 13:15 - 15:00

Location: T116, Olof Wijksgatan 6

SEMINAR

Date: 2015-09-03 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

In this project a large vocabulary continuous speech recognition system was built on the basis of freely available Swedish speech data. One acoustic model and several bigram and trigram language models were trained with the open-source software packages HTK and CMU Statistical Language Modeling toolkit. Using different HTK tools the system was then evaluated with these models in order to test what results can be achieved with the given data and how the language model size affects the recognition results.

The lowest word error rate achieved was 47.45% and the lowest sentence error rate was 87.45%. Recognition results showed that raising language model complexity—with regard to both n-gram order and vocabulary size—lowers the error rates. The error rates were rather high in comparison to the ones yielded in similar projects but the results can easily be improved by building larger n-gram models and using a decoder that is better suited for the recognition of continuous speech.
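The word error rate reported above is the standard evaluation metric for speech recognition: the word-level Levenshtein distance (substitutions + insertions + deletions) between hypothesis and reference, normalized by reference length. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance
    (substitutions + insertions + deletions) divided by the
    number of words in the reference."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(r)][len(h)] / len(r)

# One deleted word out of four reference words -> WER 0.25
wer = word_error_rate("det var en gång", "det var gång")
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is one reason sentence error rates look so much worse than word error rates.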

Examiner: Richard Johansson
Opponent: Joel Hinz
Supervisor: Chris Koniaris

Date: 2015-05-28 17:00 - 19:00

Location: T116, Olof Wijksgatan 6

SEMINAR

Computational analysis of historical and typological data has made great progress in the last fifteen years. In this thesis, we work with vocabulary lists for addressing some classical problems in historical linguistics such as discriminating related languages from unrelated languages, assigning possible dates to splits in a language family, employing structural similarity for language classification, and providing an internal structure to a language family.

In this thesis, we compare the internal structure inferred from vocabulary lists with the family trees given in Ethnologue. We also explore the ranking of lexical items in the widely used Swadesh word list and compare our ranking to another quantitative reranking method and to short word lists composed for discovering long-distance genetic relationships. We show that the choice of string similarity measure is important both for internal classification and for discriminating related from unrelated languages. The dating system presented in this thesis can be used for assigning age estimates to any new language group and overcomes the criticism of the constant rate of lexical change assumed by glottochronology. We also train and test a linear classifier based on gap-weighted subsequence features for the purpose of cognate identification. An important conclusion from these results is that n-gram approaches can be used for different historical linguistic purposes.
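A minimal sketch of order-2 gap-weighted subsequence features: each, possibly non-contiguous, character pair in a word is weighted by λ raised to the span it covers, so pairs with long gaps contribute less. The exact weighting scheme and feature order used in the thesis may differ.

```python
from collections import defaultdict
from itertools import combinations

def gap_weighted_bigrams(word, lam=0.5):
    """Order-2 gap-weighted subsequence features: each (possibly
    non-contiguous) character pair, weighted by lam ** span, where
    span is the number of positions the pair covers."""
    feats = defaultdict(float)
    for i, j in combinations(range(len(word)), 2):
        feats[word[i] + word[j]] += lam ** (j - i + 1)
    return dict(feats)

# Putative cognates share heavily weighted subsequences even when
# a vowel differs:
english = gap_weighted_bigrams("hand")
german = gap_weighted_bigrams("hund")
shared = set(english) & set(german)  # {'hn', 'hd', 'nd'}
```

A linear classifier over such feature vectors can then score word pairs for cognacy, since related forms overlap in many subsequence dimensions while unrelated forms rarely do.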

Examiner: Jörg Tiedemann, Department of Linguistics and Philology, Uppsala University

Link to thesis: http://spraakdata.gu.se/taraka/slut_seminar_thesis.pdf

Date: 2015-05-28 13:15 - 16:00

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

In this talk, I will describe an unsupervised topic modelling approach to word sense induction, i.e. automatically learning the meanings of words from a text collection. I will then talk about two applications of the methodology: (1) automatic detection of new senses, i.e. senses that are used in one corpus but unseen in another; and (2) automatic adaptation of induced senses to existing inventory-defined senses. For the latter, we will show that adapting the senses allows us to automatically learn the predominant sense, compute sense distributions, and identify senses that are not recorded in the sense inventory ("novel senses") as well as senses that are not used in the corpus ("unattested senses").

Date: 2015-04-21 15:15 - 17:00

Location: K333, Lennart Torstenssonsgatan 8

SEMINAR

In this talk, I discuss the extension of two Construction Grammar models to the computational domain: Berkeley Construction Grammar (Fillmore 2013) and Cognitive Construction Grammar (Goldberg 2006). Specifically, I present the two approaches for constructional inheritance used by those models to account for the relationships among constructions (full inheritance and normal inheritance, respectively), and then discuss how those proposals have been implemented computationally in the FrameNet Brasil Constructicon.

Tiago Timponi Torrent
Federal University of Juiz de Fora, Brasil, FrameNet Brasil

Date: 2015-02-10 13:15 - 15:00

Location: T340, Olof Wijksgatan 6

SEMINAR

Word sense induction (WSI) is the task of discovering word senses automatically, given a corpus. We propose a vector space model for WSI that leverages neural word embeddings, and the correlation statistics they capture, to compute high-quality word instance embeddings. The instance embeddings are subsequently clustered to find the word senses present in the text. The model achieves state-of-the-art results on a well-known dataset.
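The pipeline — embed each occurrence of an ambiguous word via its context, then cluster the instance embeddings into senses — can be sketched as follows. Toy hand-made word vectors and a minimal 2-means loop stand in for the neural embeddings and clustering algorithm actually used in the work.

```python
import numpy as np

# Toy pretrained word vectors (in practice: neural embeddings
# trained on a large corpus).
vecs = {
    "money": np.array([1.0, 0.0]),   "loan":  np.array([0.9, 0.1]),
    "deposit": np.array([0.95, 0.05]),
    "river": np.array([0.0, 1.0]),   "water": np.array([0.1, 0.9]),
    "shore": np.array([0.05, 0.95]),
}

def instance_embedding(context):
    """Embed one occurrence of the target word as the mean of the
    vectors of its context words."""
    return np.mean([vecs[w] for w in context if w in vecs], axis=0)

# Four occurrences of the ambiguous word "bank", two per sense.
contexts = [["money", "loan"], ["deposit", "money"],
            ["river", "water"], ["shore", "river"]]
X = np.stack([instance_embedding(c) for c in contexts])

def two_means(X, iters=10):
    """Minimal 2-means: each resulting cluster is one induced sense."""
    centers = X[[0, 2]].copy()
    for _ in range(iters):
        labels = np.array([np.argmin([np.linalg.norm(x - c) for c in centers])
                           for x in X])
        for k in range(2):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

labels = two_means(X)  # instances 0,1 share a sense; so do 2,3
```

The financial and river contexts land in clearly separated regions of the vector space, so the clustering recovers the two senses without any labeled data.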

We expand on the idea of using neural embeddings for linguistic analysis by modelling temporal evolution and performing joint, comparative embedding of multiple corpora.

Date: 2015-02-26 10:30 - 12:00

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

This talk describes the creation of and initial experiments with a massively parallel corpus of diachronic Germanic.

The majority of parallel and quasi-parallel data on older Germanic languages consists of texts directly or indirectly based on the Bible. This includes actual translations, but also loose paraphrases, in prose or in verse, either as independent works (psalters, gospel harmonies, but also free adaptations in medieval romance) or as part of derived works (such as exegetic commentaries, sermons or chronicles). For several historical languages, most notably Old Saxon and Old High German, Biblical text represents the majority of all parallel data available; gospel harmonies even represent the majority of data currently known.

Still today, the Bible is the single most translated book in the world, available not only in the vast majority of world languages but also in numerous dialects. The Lord's Prayer and the Tale of the Prodigal Son have been the basis for early studies in dialectology, and with the rise of the internet, home-grown dialectal translations of Bible excerpts, individual books or the full Old and New Testament have been developed and are circulating in digital form.

This amount of parallel data is of crucial interest to philologists and comparative linguists. In this context, aligned Bible corpora with morphosyntactic annotation for Old Saxon and Old High German have been developed at Goethe University Frankfurt within the project "Old German Reference Corpus" (2010-2014) and the LOEWE cluster "Digital Humanities" (2011-2014); they complement the series of annotated Bibles currently available for Gothic, Middle English, and Middle Icelandic.

A massively parallel diachronic corpus of the Germanic languages is, however, not only a valuable resource for historical linguistics, but also relevant to current research in Natural Language Processing: The Germanic languages, with their great body of diachronic material, and their well-understood grammatical, morphological and phonological development provide us with a test bed to study the impact of diachronic relatedness on algorithms for historical-to-modern normalization, annotation projection or model transfer between related language stages. With this data, we can investigate, for example, the correlation between diachronic relatedness and the preferred method to derive NLP tools for less-resourced languages from tools for better-resourced languages.
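Annotation projection, one of the techniques mentioned above, can be illustrated minimally: given a word-aligned sentence pair, tags from the annotated language are copied along the alignment links to the unannotated language. The tags and alignment below are toy illustrations; real projection must handle unaligned and multiply-aligned words more carefully.

```python
def project_annotations(src_tags, alignment, tgt_len, default="UNK"):
    """Copy tags from an annotated source sentence to an unannotated
    target sentence along word-alignment links.  Unaligned target
    words keep the default tag."""
    tgt_tags = [default] * tgt_len
    for s, t in alignment:
        tgt_tags[t] = src_tags[s]
    return tgt_tags

# Toy example: project POS tags across a reordering alignment
# between two related language stages.
src_tags = ["PRON", "VERB", "NOUN"]
alignment = [(0, 0), (1, 2), (2, 1)]   # verb-final target order
tgt_tags = project_annotations(src_tags, alignment, 3)
```

The closer two language stages are diachronically, the denser and more reliable such alignments tend to be, which is exactly the correlation the corpus allows one to measure.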

Accordingly, the annotated Bibles mentioned above have been aligned with each other by the Applied Computational Linguistics Lab of the University Frankfurt, and augmented with a massive corpus of unannotated Bibles (Fig.1).

In this talk, I will present the parallel corpus as a resource, discuss issues with respect to coverage, availability, data quality and legal aspects, and report initial results on
(i) usability for philological research,
(ii) alignment and annotation projection, and
(iii) normalization and hyperlemmatization.

Christian Chiarcos
Frankfurt am Main, Germany

Date: 2015-02-11 13:15 - 15:00

Location: L308, Lennart Torstenssonsgatan 8
