SEMINAR

In this talk I will present some ongoing activities aimed at publishing language resources in the Linked Open Data (LOD) framework, with the goal of efficiently linking language data not only to other language data but also to knowledge objects that are already published in the LOD cloud. One approach is that followed by the W3C community group "Ontolex", which describes the senses of lexical entries by their reference to knowledge objects in the LOD cloud. While most work on publishing language resources in the LOD has been concerned with lexical resources, increased attention has recently been given to the encoding of corpora in LOD-compliant representation languages. We will sketch the current state of this endeavor, which is the basis for the development of the NIF platform (http://aksw.org/Projects/NIF.html). More information on these topics can be found on the Ontolex page (http://www.w3.org/community/ontolex/), as well as at the European project LIDER (http://www.lider-project.eu/), which is supporting the adoption of Ontolex and NIF for a large set of language resources.

Date: 2014-10-16 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

Conversation can be described as a joint activity between two or more participants, and the ease of conversation relies on a close coordination of actions between them. First, since it is difficult to speak and listen at the same time, interlocutors have to take turns speaking, and this turn-taking has to be coordinated somehow. Second, while speaking, humans continually evaluate how the listener perceives and reacts to what they say and adjust their future behaviour to accommodate this feedback. Third, speakers also have to coordinate their joint focus of attention. Joint attention is fundamental to efficient communication: it allows people to interpret and predict each other's actions and prepare reactions to them.

In this talk, I will present an ongoing research effort at KTH in which we aim to model these phenomena for improving spoken interaction between humans and machines. I will start with a dyadic human-computer dialogue setting and show how we can use data-driven methods for detecting feedback-inviting cues in the user's speech. I will then move on to situated interaction where the human interacts face-to-face with a robot, making references to physical objects in the surroundings, and explore how the system can invite feedback from the user and how the user and system can achieve joint attention. Finally, we will look at multi-party interaction, where the robot interacts with several humans at the same time, and explore how gaze is used to regulate turn-taking in such settings.

Date: 2014-05-27 13:15 - 15:00

Location: T340, Olof Wijksgatan 6

SEMINAR

Automatic Readability Assessment deals with evaluating the reading difficulty of a text for a given target audience. In general, a "text" here means a document-length piece of text. However, in certain scenarios it becomes important to understand readability at a finer-grained level, such as that of a single sentence.

In my talk, I will discuss our past and on-going research on building and evaluating sentence level readability models for English. I will focus on questions like: (a) Can document level models be directly used for sentences? (b) Is it possible to accurately compare sentences in terms of their readability level? (c) Where will this kind of analysis be useful?

My talk is in the context of automatic text simplification. However, the approach may also be useful in other scenarios where sentence-level readability estimates are needed.
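As a toy illustration of what sentence-level features might look like (this is not the feature set from the work presented here; the function name and the length threshold are invented for the sketch), one could start from simple surface properties of a single sentence:

```python
def surface_features(sentence):
    """Extract a few simple surface features from one sentence.
    Illustrative only; real readability models combine many more
    lexical, syntactic, and psycholinguistic features."""
    tokens = sentence.split()
    n = len(tokens)
    return {
        "n_tokens": n,
        "avg_word_length": sum(len(t) for t in tokens) / n if n else 0.0,
        "type_token_ratio": len({t.lower() for t in tokens}) / n if n else 0.0,
        "long_word_ratio": sum(len(t) > 6 for t in tokens) / n if n else 0.0,
    }
```

Even crude surface features like these already separate sentences of clearly different difficulty, which is one reason they serve as a common baseline in readability work.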

The initial part of this talk was presented at EACL 2014 in April.

Sowmya Vajjala and Detmar Meurers: "On assessing the reading level of individual sentences for text simplification", Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014, pages 288-297. (http://www.aclweb.org/anthology/E/E14/E14-1031.pdf)

Date: 2014-11-20 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

In this talk, I’ll present my recently completed PhD work, in which I developed a new, hybrid approach to dialogue management based on the notion of "probabilistic rules". The key idea is to represent the internal models of a dialogue domain in a structured manner, through high-level rules that combine aspects of logical and probabilistic approaches to dialogue modelling in a single framework. The rules may include parameters that can be estimated from dialogue data (via e.g. supervised or reinforcement learning).

The approach offers two main benefits: 1) due to the expressivity of the rules, the models can be encoded in a compact form, making it possible to estimate the domain parameters efficiently from limited amounts of data; 2) the rules also allow expert knowledge and domain constraints to be directly incorporated into the domain models, in a human-readable form.
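As a rough sketch of the idea (the rule shape and all names below are invented for illustration and do not reflect OpenDial's actual rule syntax), a probabilistic rule can be viewed as a condition over state variables that maps to a distribution over effects, with a learnable parameter:

```python
def clarification_rule(state, theta=0.8):
    """A hypothetical probabilistic rule: when the last user act is a
    request recognized with low confidence, ask for clarification with
    probability theta; theta is a parameter that could be estimated
    from dialogue data. Returns a distribution over system actions."""
    if (state.get("last_user_act") == "Request"
            and state.get("confidence", 1.0) < 0.5):
        return {"AskClarification": theta, "ExecuteRequest": 1.0 - theta}
    return {"ExecuteRequest": 1.0}
```

The human-readable condition encodes the expert knowledge, while the numeric parameter is the part left to be estimated from data.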

The above approach has been implemented in a new, domain-independent dialogue toolkit called "OpenDial", which is available under an open source license: http://opendial.googlecode.com

In this talk, I’ll present the framework as well as some practical experiments that we recently conducted to evaluate the approach in a human-robot interaction domain (using the Nao robot as experimental platform).

Date: 2014-05-20 13:15 - 15:00

Location: T116, Olof Wijksgatan 6

SEMINAR

In this talk, I argue that modern corpus-driven lexicography must focus on discovering patterns of usage (valencies and collocations), rather than asking direct questions about word meanings. In fact, I argue, words do not have meanings as such; they have only meaning potentials. Different aspects of a word's meaning potential are realized in different contexts. Corpus-driven analyses of patterns of usage are necessarily selective and probabilistic.

The modern lexicographer, confronted with a vast array of evidence for different uses of a word, needs guidance on appropriate levels of generalization and on what can be safely left out, rather than the more traditional search for new words and new senses to be added to the dictionary. At the same time, computational linguists need to abandon the deluded search for certainties about "all possible" uses (if they have not already done so) and accept the more achievable goal of matching actual usage to patterns. A newly emerging theory of language in use (The Theory of Norms and Exploitations -- Hanks 2013) can help here, as it provides a basis for distinguishing conventional uses and meanings of words from creative ones.

Date: 2014-09-04 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

Learner corpora, collections of language produced by language learners, have been systematically compiled since the 1990s, and with increasing numbers and types of learner corpora becoming available, there is in principle a growing empirical basis on which theories of second language acquisition can be informed and applications can be trained and tested. While most research on learner corpora has analyzed the (co)occurrence of (sequences of) words or relied on manual error annotation, tools for automatically analyzing large corpora in terms of linguistic abstractions such as parts of speech, syntactic constituency, or dependency are in principle available, though they also raise fundamental conceptual questions related to the linguistic annotation of learner language.

The situation also raises questions reminiscent of the discussion on the role of exemplars vs. prototypes in language: when are surface forms as such sufficient in a corpus-based analysis, and when are linguistic categories abstracting and generalizing over surface forms useful?

In this talk, I want to illustrate some of the underlying conceptual issues and then exemplify the trade-off between surface-based and deeper linguistic modeling based on our experiments in Native Language Identification, the task of automatically determining the native language of a non-native writer.
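To make the surface end of this trade-off concrete, here is a deliberately minimal sketch of n-gram-based native language identification (a toy illustration, not the systems from the papers below): each native-language class is represented by the aggregated word-n-gram counts of its training texts, and a new text is assigned to the class whose profile it overlaps most.

```python
from collections import Counter

def word_ngrams(text, n=2):
    """Count word n-grams in a whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def train_profiles(labelled_texts, n=2):
    """Aggregate n-gram counts per native-language class."""
    profiles = {}
    for label, text in labelled_texts:
        profiles.setdefault(label, Counter()).update(word_ngrams(text, n))
    return profiles

def classify(text, profiles, n=2):
    """Assign the class whose profile shares the most n-gram mass."""
    feats = word_ngrams(text, n)
    def overlap(prof):
        return sum(min(c, prof[g]) for g, c in feats.items())
    return max(profiles, key=lambda lab: overlap(profiles[lab]))
```

Purely surface-based features of this kind are known to perform surprisingly well on the task, which is precisely what makes the comparison with linguistically abstracted features interesting.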

This talk is based on joint work with Serhiy Bykh and Julia Krivanek:

  • Detmar Meurers, Julia Krivanek and Serhiy Bykh (2014): On the Automatic Analysis of Learner Corpora: Native Language Identification as Experimental Testbed of Language Modeling between Surface Features and Linguistic Abstraction. In: Alexandro Alcaraz Sintes and Salvador Valera (eds.), Diachrony and Synchrony in English Corpus Studies. Frankfurt am Main: Peter Lang, pages 285-314.

Depending on interests/time, I may also include aspects of:

  • Serhiy Bykh and Detmar Meurers (2014): Exploring Syntactic Features for Native Language Identification: A Variationist Perspective on Feature Encoding and Ensemble Optimization. Proceedings of the 25th International Conference on Computational Linguistics (COLING), Dublin, Ireland.

  • Serhiy Bykh, Sowmya Vajjala, Julia Krivanek, and Detmar Meurers (2013): Combining Shallow and Linguistically Motivated Features in Native Language Identification. Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications (BEA), Atlanta, GA, USA.

  • Serhiy Bykh and Detmar Meurers (2012): Native Language Identification Using Recurring N-grams - Investigating Abstraction and Domain Dependence. Proceedings of the 24th International Conference on Computational Linguistics (COLING), Mumbai, India.

http://www.sfs.uni-tuebingen.de/~dm/

Date: 2014-09-25 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

Recorded Future organizes open source information for analysis. Whether you’re conducting intelligence research, competing in business, or monitoring the horizon for situational awareness, the web is loaded with valuable predictive signals. Our goal is to help analysts across many industries make sense of the overwhelming information available on the web.

This talk will describe the ideas behind Recorded Future's web intelligence machine. We will describe the overall purpose of the system and its high level architecture, but focus on key challenges in natural language processing and analytics, and how we address them.

For more information see www.recordedfuture.com

BIO

Staffan Truvé is Chief Technology Officer and co-founder of Recorded Future (www.recordedfuture.com). Staffan has spent the last 20 years working in the borderland between research and industry, creating new companies based on cutting-edge research results. He helped launch more than 15 companies, including Spotfire, Appgate, Axiomatics, Peerialism, Makewave, Enmesh, and Recorded Future.

From 2005 to 2009, he was CEO of the Swedish Institute of Computer Science (SICS) and Interactive Institute.

Staffan holds a PhD in Computer Science from Chalmers University of Technology and an MBA from Gothenburg University. He has been a Fulbright Visiting Scholar at MIT and is a member of the Royal Swedish Academy of Engineering Sciences.

Date: 2014-09-26 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

We present a framework for using continuous-space vector representations of word meaning to derive representations of the meaning of word senses listed in a semantic network. The idea is based on two assumptions: 1) word vectors for polysemous words are a mix of underlying sense representations; 2) the representation of a sense should be similar to those of its neighbors in the network. This leads to a constrained optimization problem, and we present an approximate iterative algorithm that can be used when the similarity between senses is defined in terms of the squared Euclidean distance.

We apply the algorithm to a Swedish semantic network, and we evaluate the quality of the resulting sense representations intrinsically using the vector offset method, and extrinsically by showing that they give large improvements when used in classifiers that map word senses to FrameNet frames.
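A heavily simplified sketch of such an iterative scheme (the update rule and all names here are illustrative assumptions, not the algorithm from the talk): each sense vector is pulled toward the centroid of its network neighbours, while a residual term moves the mix of the sense vectors toward the observed word vector.

```python
def induce_sense_vectors(word_vec, neighbours_per_sense, n_iter=20, lam=0.5):
    """word_vec: embedding of an ambiguous word (list of floats).
    neighbours_per_sense: for each sense, the embeddings of its network
    neighbours (assumed unambiguous here, for simplicity)."""
    k, dim = len(neighbours_per_sense), len(word_vec)

    def centroid(vecs):
        return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

    cents = [centroid(vs) for vs in neighbours_per_sense]
    senses = [c[:] for c in cents]   # start at the neighbour centroids
    w = [1.0 / k] * k                # mix weights, kept uniform in this sketch
    for _ in range(n_iter):
        # how far the current mix of sense vectors is from the word vector
        mix = [sum(w[i] * senses[i][d] for i in range(k)) for d in range(dim)]
        resid = [word_vec[d] - mix[d] for d in range(dim)]
        for i in range(k):
            for d in range(dim):
                # pull toward the neighbour centroid (squared-distance term)
                # and distribute the residual so the mix tracks the word vector
                senses[i][d] += lam * (cents[i][d] - senses[i][d]) + resid[d]
    return senses
```

In the actual constrained formulation the mix weights are optimized as well and the mixture constraint is enforced exactly; this sketch only conveys the shape of the alternating updates.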

Date: 2014-06-05 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

The typological diversity of the world's languages poses important challenges for the techniques used in machine translation, syntactic parsing and other areas of natural language processing. Statistical models developed and tuned for English do not necessarily perform well for richly inflected languages, where larger morphological paradigms and more flexible word order give rise to data sparseness. Since paradigms can easily be captured in rule-based descriptions, this suggests that hybrid approaches combining statistical modeling with linguistic descriptions might be more effective. However, in order to gain more insight into the benefits of different techniques from a typological perspective, we also need linguistic resources that are comparable across languages, something that is currently lacking to a large extent.

In this talk, I will report on two ongoing projects that tackle these issues in different ways. In the first part, I will describe techniques for joint morphological and syntactic parsing that combine statistical dependency parsing and rule-based morphological analysis, specifically targeting the challenges posed by richly inflected languages. In the second part, I will present the Universal Dependency Treebank Project, a recent initiative seeking to create multilingual corpora with morphosyntactic annotation that is consistent across languages.

Joakim Nivre, Uppsala University http://stp.lingfil.uu.se/~nivre/

Date: 2014-05-22 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

This talk will focus on some recent techniques that provide useful theory-neutral mechanisms for analyzing and evaluating phonological and morphological generalizations in various contexts.

These techniques draw on finite state technology which is customarily used for modeling phonological and morphological phenomena computationally and also finds many applications in speech technology. The popularity of finite state machines – automata and transducers – rests on a few main attributes: they provide a theory-neutral platform for encoding linguistic generalizations, they are inherently bidirectional (a model defined in the direction of generation can also perform parsing), they can accommodate gradience effects and probabilistic generalizations, and they enjoy substantial practical support in the form of software and development tools. For our purposes, the most important feature is the set of computational methods available for formal verification and investigation of finite state models.

One technique particularly useful to the linguist is equivalence testing of grammars. While testing the equivalence of e.g. finite transducers is computationally undecidable in the general case, there exist efficient algorithms for doing so in many linguistically interesting cases. When combined with techniques to model richer phonological structures such as autosegments or constraint-driven formalisms like Optimality Theory, such equivalence testing permits the automation of various difficult tasks in phonological modeling. Among other things, it enables one to formally ascertain the correctness of generalizations expressed in a particular formalism, investigate competing theories of historical-comparative reconstruction, and perform more general comparisons of phonological and morphological models.
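As a small, self-contained illustration of this kind of decision procedure, consider equivalence testing for plain deterministic finite automata (a much easier setting than transducers, since DFA equivalence is decidable in general): two DFAs accept the same language exactly when a synchronized search over pairs of states never reaches a pair in which one state is accepting and the other is not. The DFA encoding below is an ad hoc assumption for the sketch.

```python
from collections import deque

def dfa_equivalent(dfa1, dfa2, alphabet):
    """Check whether two complete DFAs accept the same language.
    Each DFA is (start_state, accepting_set, transitions), where
    transitions maps (state, symbol) -> state for every symbol."""
    s1, acc1, d1 = dfa1
    s2, acc2, d2 = dfa2
    seen = {(s1, s2)}
    queue = deque([(s1, s2)])
    while queue:
        p, q = queue.popleft()
        if (p in acc1) != (q in acc2):
            return False  # one machine accepts here, the other does not
        for a in alphabet:
            pair = (d1[(p, a)], d2[(q, a)])
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True
```

For instance, a two-state automaton accepting strings with an even number of a's and a redundant four-state automaton that additionally tracks the parity of b's are recognized as equivalent, while changing the accepting states breaks the equivalence.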

Date: 2014-05-15 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8
