Most machine learning methods we use in NLP are methods for learning from unbiased labeled data. However, in NLP we always learn from biased data. When we train a parser on a treebank, for example, and apply it to emails, legal text, or university websites, our training data is biased in terms of genre, style, recency, possibly dialect, etc. In this talk we present learning algorithms for automatically correcting bias - or algorithms for learning under sample bias.
The first part of the talk focuses on large-margin perceptron learning algorithms for learning from weighted data. We discuss sampling vs. weighting and different weight functions. In the second part of the talk we consider the more challenging scenario where the target data cannot be assumed to form a single, coherent distribution, but where instead we need to adapt our model to every new data point on the fly.
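The first part of the abstract can be pictured with a small sketch: a perceptron that learns from weighted data simply scales each mistake-driven update by the example's importance weight. This is a minimal, hypothetical illustration (the interface is invented; in the bias-correction setting the weights might, for example, be estimated density ratios between target and source domains), not the algorithms presented in the talk:

```python
import random

def weighted_perceptron(data, weights, epochs=10, seed=0):
    """Train a binary perceptron on weighted examples.
    data: list of (features, label) pairs, label in {-1, +1},
    features: dict mapping feature name -> value.
    weights: per-example importance weights (hypothetical source)."""
    rng = random.Random(seed)
    w = {}
    idx = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            x, y = data[i]
            score = sum(w.get(f, 0.0) * v for f, v in x.items())
            if y * score <= 0:  # mistake: scale the update by the example weight
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + weights[i] * y * v
    return w
```

Sampling, by contrast, would draw a new training set with probabilities proportional to the same weights and train an unweighted learner on it; the talk discusses the trade-offs between the two.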
Anders Søgaard at Center For Sprogteknologi, Copenhagen: http://cst.dk/anders/main.html
Traditional morphological tools seek to treat the morphological variation of a language comprehensively. While the results tend to be good, at least linguistically, the downside is the complexity of construction, maintenance, and use of such tools. During the past few years, several statistical methods for morphological processing have been proposed for use in Information Retrieval (IR). These methods may be characterized as unsupervised, semi-supervised, lightweight, and/or (partially) language-independent. Generally, they are easy to set up for a new language or collection and provide competitive results for treating morphological variation in IR. The talk introduces some generative and reductive lightweight statistical methods for use in IR and discusses their effectiveness and limitations.
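As a toy illustration of the "reductive" flavour of such methods, one can learn frequent suffixes from a raw word list in an unsupervised way and strip them at indexing or retrieval time. The sketch below is an invented stand-in under naive assumptions (frequency alone, no statistical significance testing), not any of the specific methods discussed in the talk:

```python
from collections import Counter

def learn_suffixes(words, max_len=4, top_k=10):
    """Collect the most frequent word-final strings from a raw word
    list: a toy, unsupervised stand-in for reductive statistical
    stemmers used in IR."""
    counts = Counter()
    for w in words:
        for n in range(1, min(max_len, len(w) - 2) + 1):
            counts[w[-n:]] += 1
    return [s for s, _ in counts.most_common(top_k)]

def stem(word, suffixes, min_stem=3):
    """Strip the longest learned suffix that still leaves a stem of
    at least min_stem characters; otherwise return the word unchanged."""
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            return word[: -len(s)]
    return word
```

The appeal for IR is that nothing here is language-specific: only a word list from the target collection is needed.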
Kalervo Järvelin (Tampere)
Frédérique Segond, Research and Development Manager, Viseo, France
The focus of this talk is to show how natural language technologies can support the process of understanding the electronic information at your fingertips. We will present two projects using NLP technologies for this purpose in two completely different domains: ALADIN in the healthcare domain and GALATEAS in the business domain of content providers.
ALADIN was supported by the French National Research Agency (Agence nationale de recherche, ANR) as part of the TecSan 2008 Framework Programme (Technologies pour la Santé et l'Autonomie). GALATEAS is supported by the European Union's ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme (CIP ICT-PSP), under grant agreement no. 250430.
Probabilistic and stochastic methods have been fruitfully applied to a wide variety of problems in grammar induction, natural language processing, and cognitive modeling. In this paper we explore the possibility of developing a class of combinatorial semantic representations for natural languages that compute the semantic value of a (declarative) sentence as a probability value which expresses the likelihood of competent speakers of the language accepting the sentence as true in a given model, relative to a specification of the world. Such an approach to semantic representation treats the pervasive gradience of semantic properties as intrinsic to speakers' linguistic knowledge, rather than the result of the interference of performance factors in processing and interpretation. In order for this research program to succeed, it must solve three central problems. First, it needs to formulate a type system that computes the probability value of a sentence from the semantic values of its syntactic constituents. Second, it must incorporate a viable probabilistic logic into the representation of semantic knowledge in order to model meaning entailment. Finally, it must show how the specified class of semantic representations can be efficiently learned. We construct a probabilistic semantic fragment and consider how the approach that the fragment instantiates addresses each of these three issues.
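To make the compositional idea concrete, here is a small, invented fragment in which graded predicates map entities to acceptance probabilities and logical connectives combine those values (under a naive independence assumption). All names and numbers are made up for illustration; this is not the type system or probabilistic logic developed in the paper:

```python
def tall(entity):
    # Probability that a competent speaker accepts "<entity> is tall",
    # as a simple function of height (invented toy model).
    heights = {"ann": 185, "bob": 170, "cec": 178}
    return max(0.0, min(1.0, (heights[entity] - 160) / 30))

def p_not(p):
    return 1.0 - p

def p_and(p, q):
    return p * q  # assumes the conjuncts are probabilistically independent

def p_some(pred, domain):
    """P(at least one entity in the domain satisfies pred),
    again under the independence assumption."""
    p_none = 1.0
    for x in domain:
        p_none *= 1.0 - pred(x)
    return 1.0 - p_none
```

Here the "semantic value" of a sentence like "someone is tall" is a probability computed from the values of its constituents, and gradience (Bob is somewhat tall) is built into the predicate itself rather than attributed to performance factors.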
(Joint work with Jan van Eijck, CWI, Amsterdam)
Location: L308, Lennart Torstenssonsgatan 8
Morphological lexica are often implemented on top of morphological paradigms, corresponding to different ways of building the full inflection table of a word. Computationally precise lexica may use hundreds of paradigms, and it can be hard for a lexicographer to choose among them.
To automate this task, this paper introduces the notion of a smart paradigm. It is a meta-paradigm, which inspects the base form and tries to infer which low-level paradigm applies. If the result is uncertain, more forms can be given for discrimination. The number of forms needed on average is a measure of the predictability of an inflection system. The overall complexity of the system also has to take into account the code size of the paradigm definitions themselves.
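The idea can be sketched in miniature for English nouns: a "smart" function inspects the base form, guesses the plural, and accepts an extra form when the guess would be wrong. This is a toy analogue of the notion, not the GF library's actual code:

```python
def english_noun(base, plural=None):
    """A toy 'smart paradigm' for English nouns: inspect the base form
    to choose a low-level paradigm; an explicitly given plural overrides
    the guess (analogous to supplying extra forms for discrimination)."""
    if plural is None:
        if base.endswith(("s", "sh", "ch", "x", "z")):
            plural = base + "es"       # e.g. box -> boxes
        elif base.endswith("y") and base[-2:-1] not in "aeiou":
            plural = base[:-1] + "ies"  # e.g. fly -> flies
        else:
            plural = base + "s"        # default paradigm
    return {"Sg": base, "Pl": plural}
```

Most nouns need only the base form; irregulars like "mouse" need one extra form. The average number of forms required is exactly the predictability measure described above.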
This paper investigates smart paradigms as implemented in the open-source GF Resource Grammar Library. Predictability and complexity are estimated for four different languages: English, French, Swedish, and Finnish. The main result is that predictability does not decrease when the complexity of the morphology grows, which means that smart paradigms provide an efficient tool for the manual construction and/or automatic bootstrapping of lexica.
(This is joint work with Aarne Ranta.)
Location: EDIT room, EDIT building, Chalmers
Semantic role classification accuracy for most languages other than English is constrained by the small amount of annotated data. In this talk, we describe how the frame-to-frame relations described in the FrameNet ontology can be used to improve the performance of a FrameNet-based semantic role classifier for Swedish, a low-resource language. In order to make use of the FrameNet relations, we cast the semantic role classification task as a non-atomic label prediction task. The experiments show that the cross-frame generalization methods lead to a significant reduction in the number of errors made by the classifier. This reduction is especially notable when the classifier is evaluated on previously unseen frames.
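One way to picture the non-atomic label idea: decompose a frame-specific role into parts shared with related frames, so that training statistics generalize across frames, including frames unseen in training. The sketch below is a hypothetical simplification, with a single "inherits" relation standing in for the FrameNet frame-to-frame relations and role names assumed to carry over unchanged:

```python
def decompose(frame, role, inherits):
    """Map a (frame, role) pair to a non-atomic label: the
    frame-specific part plus the parts inherited via a toy
    frame-to-frame relation, so a classifier can share parameters
    across related frames. 'inherits' maps a frame to its parent."""
    parts = [f"{frame}:{role}"]
    f = frame
    while f in inherits:
        f = inherits[f]
        parts.append(f"{f}:{role}")
    return parts
```

A linear classifier scoring each part separately can then assign sensible scores to a role in a previously unseen frame, as long as an ancestor frame was observed in training.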
Whereas nominals in Swedish running text correspond to a (virtually) small group of wh-counterparts in questions, it is clear from the outset that adverbials as a group are far more diverse.
From a computational perspective, the task of determining question counterparts presupposes identification of full adverbials, their syntactic heads, and often the heads of their complements. Information about modifiers, the full sentence (its main verb), or even the placement of the adverbial can also be essential for correct question generation. This is due to phenomena that could perhaps be explained more clearly through the theory of information structure.
In my talk I will briefly describe what CLARIN wants to achieve and what needs to be done in the participating countries to transform the present fragmented European landscape into a Schengen area for access to language resources and technology for the humanities and social sciences research community.
For more information, see the following publication: http://www.linguistics.ucsb.edu/faculty/stgries/research/2010_STG_BehavProf_TheMentalLexicon.pdf
In this talk we will tell you more about Findwise as a company and how we design search solutions. We will focus on the collection and processing of data as well as its presentation to the end user. Our aim is to bring the most value to customers given their structured and unstructured data.