SEMINAR

Ioanna Papadopoulou in the MLT programme will defend her master's thesis "GF Modern Greek Resource Grammar".

The thesis describes the implementation of the Modern Greek grammar as part of the Grammatical Framework Resource Grammar Library (RGL).

Grammatical Framework (GF) is a special-purpose language for multilingual grammar applications. The RGL is a reusable library for dealing with the morphology and syntax of a growing number of natural languages. It uses an abstract syntax, which is common to all languages, and separate concrete syntaxes for the individual languages, implemented in GF. Both GF itself and the GF Resource Grammar Library are open-source.

The Modern Greek grammar covers all morphological variations of the language and contains definitions and rules for all the categories and functions provided in the GF abstract syntax. It thereby fulfils the multilingual purpose of GF and relates Modern Greek to the other languages in the RGL.

The implementation followed a morphology-driven, bottom-up approach, starting from the smallest units of the language (the words) before moving to the larger units (the sentences). We discuss a number of challenges encountered during development. They originate primarily from the complexity of Modern Greek, at the syntactic and above all at the morphological level, and from the difficulty of assigning forms and structures that are determined solely at the semantic level and whose choice, in actual speech situations, depends exclusively on the speaker.

Supervisor: Aarne Ranta
Opponent: Sevasti Louizou
Examiner: Torbjörn Lager

Date: 2013-05-30 13:15 - 14:15

Location: T219, Olof Wijksgatan 6

SEMINAR

Writer-based and reader-based views of text-meaning are reflected by the respective questions "What is the author trying to tell me?" and "What does this text mean to me personally?" Contemporary computational linguistics, however, generally takes neither view; applications do not attempt to answer either question. Instead, a text is regarded as an object that is independent of, or detached from, its author or provenance, and as an object that has the same meaning for all readers. This is not adequate, however, for the further development of sophisticated NLP applications for intelligence gathering and question answering.

I will discuss different views of text-meaning from the perspective of the needs of computational text analysis, and then extend the analysis to include discourse as well – in particular, the collaborative construction of meaning and the collaborative repair of misunderstanding.

Graeme Hirst (Toronto)

Date: 2013-05-29 10:15 - 12:00

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

Karin Cavallin (PhD student in GSLT) will present her forthcoming PhD thesis:

Detecting Lexical change via Semantic Distribution - Investigating meaning change in and through Lexical Sets

Opponent: Richard Johansson, Språkbanken

Date: 2013-05-28 13:15 - 15:00

Location: T340, Olof Wijksgatan 6

SEMINAR

Research in spoken language technology (SLT) has made tremendous strides during the last 50 years. From offline isolated word recognition in the late 1960s and early 1970s to today's real-time multimodal (and multipurpose) spoken dialogue systems, the initially limited scientific area has expanded, and its research outcomes have become broadly available. As a result, new fields of application have been developed, with an increasing demand for high-quality products. Still, the available technology has not yet reached the performance of human communication.

A recent trend draws on biologically inspired hypotheses and perceptually relevant assumptions in order to find a path to a new breakthrough in human-machine interaction, bringing different scientific areas together in a joint, multidisciplinary research effort. In this talk, we examine the importance of human perception in SLT and give examples of relevant applications.

Web page at KTH

Date: 2013-04-05 13:15 - 15:00

Location: T346, Olof Wijksgatan 6

SEMINAR

Wordnets are mostly constructed either on the basis of the transfer method applied to Princeton WordNet or on the basis of knowledge extraction from monolingual dictionaries. Neither method could be applied in the construction of plWordNet: there were neither publicly available bilingual Polish-English dictionaries nor monolingual Polish lexical resources. Moreover, we wanted plWordNet to be a faithful description of the Polish lexical system.

Thus, from the very beginning, the plWordNet development process was based on the exploration of a huge Polish corpus. Language tools were employed in plWordNet development at every possible step: from data gathering through data analysis to data presentation. A set of language tools for advanced corpus browsing, as well as for the extraction of lexical semantic knowledge, was developed and applied. The extracted knowledge was the input to the WordnetWeaver system, which suggests nodes in the wordnet structure as potential attachment places for new synsets. The suggestions are presented visually on the relation network graph. Linguists can browse the suggestions and modify and edit the wordnet structure. Automatically discovered senses are also illustrated with automatically identified usage examples.

During the seminar, we will discuss the complete plWordNet development cycle: corpus gathering and preprocessing, lemma and lexico-semantic relation extraction, visual wordnet editing supported by the extracted knowledge, and coordination supported by a system for monitoring the work of a team of linguists. Expansion of derivationally motivated lexico-semantic relations is facilitated by tools for example-based relation learning and corpus-based discovery of new relation instances. Next, we will present the process of mapping plWordNet to Princeton WordNet and the tools facilitating it.

In its most recent version, the corpus contains 1.8 billion words. The complete process of data processing and relation extraction will be discussed from the perspective of our experience of wordnet building. A corpus-based lexicographic process supported by the WordnetWeaver system will be presented. Possibilities and limitations of semi-automated wordnet expansion will be discussed on the basis of examples collected during plWordNet expansion.

The work was co-funded by the European Union Innovative Economy Programme (Project POIG.01.01.02-14-013/09) and the Polish Ministry of Science and Higher Education (Project N N516 068637).

www.nlp.pwr.wroc.pl

www.plwordnet.pwr.wroc.pl

Date: 2013-03-22 10:15 - 12:00

Location: L307, Lennart Torstenssonsgatan 8

SEMINAR

Wordnets are built of synsets, not of words. A synset consists of words. Synonymy is a relation between words. Words go into a synset because they are synonyms. Later, a wordnet treats words as synonymous because they belong in the same synset... Such circularity, which is a well-known problem, poses a practical difficulty in wordnet construction, notably when it comes to maintaining consistency.

plWordNet – a very large Polish wordnet – is a net of lexical units. We will discuss our assumptions and present their implementation in a steadily growing Polish wordnet. A small set of constitutive relations allows us to construct synsets automatically out of groups of lexical units of the same connectivity.
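As a rough illustration of grouping lexical units "of the same connectivity", the minimal Python sketch below puts units into one group when they share exactly the same set of outgoing relation links. It is not plWordNet's actual procedure: the function name `group_by_connectivity`, the relation names, and the English example data are all assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical example data: (source unit, relation name, target unit).
# The relation names are illustrative, not plWordNet's relation inventory.
EXAMPLE_RELATIONS = [
    ("car", "hyponym_of", "vehicle"),
    ("car", "has_part", "engine"),
    ("automobile", "hyponym_of", "vehicle"),
    ("automobile", "has_part", "engine"),
    ("bicycle", "hyponym_of", "vehicle"),
    ("bicycle", "has_part", "pedal"),
]


def group_by_connectivity(relations):
    """Group lexical units whose outgoing relation links are identical."""
    outgoing = defaultdict(set)
    units = set()
    for source, relation, target in relations:
        units.add(source)
        units.add(target)
        outgoing[source].add((relation, target))

    groups = defaultdict(set)
    for unit in units:
        links = outgoing.get(unit)
        # Units without any outgoing link stay in a group of their own,
        # so that unrelated unlinked units are not merged by accident.
        key = frozenset(links) if links else ("unlinked", unit)
        groups[key].add(unit)

    # Sort for a deterministic result.
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    for group in group_by_connectivity(EXAMPLE_RELATIONS):
        print(group)
```

In this toy data, "car" and "automobile" share all their links and end up in one group, while "bicycle" differs in one link and stays separate. A real system would restrict the comparison to the chosen constitutive relations and would also have to consider incoming links.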

The plWordNet system of relations will be presented and compared to the systems of relations in several influential wordnets. Additional synset-forming mechanisms such as stylistic registers and verb aspect will also be discussed. The rich morphology of Polish gives an important role to lexico-semantic relations that are derivationally motivated.

The work was co-funded by the European Union Innovative Economy Programme (Project POIG.01.01.02-14-013/09) and the Polish Ministry of Science and Higher Education (Project N N516 068637).

www.nlp.pwr.wroc.pl

www.plwordnet.pwr.wroc.pl

Date: 2013-03-20 15:15 - 17:00

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

The LekBot project, part 1, was a collaboration between DART, Talkamatic and GU in 2010-2011. The project developed a talking and playing robot for children with communicative disabilities, with the aim of providing a toy that is easy and fun to use, and that provides opportunities for genuine play in the sense of play that is spontaneous, independent, on equal terms, etc. Three test groups participated in the project, with each test group consisting of a child with cerebral palsy, a peer and pre-school staff. All groups were recorded in interactions with various versions of the system.

The LekBot project, part 2, started in 2012, and is a collaboration between GU and DART. The focus now is on the analysis of the recorded interactions. In the talk we discuss the notions of play on equal terms and enjoyable play, as they manifest themselves in different ways in the LekBot recordings. We also discuss implications for further development of the LekBot system.

Date: 2013-09-05 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

In this talk I deal with the automated acquisition of linguistic knowledge as a means of enhancing the robustness of lexicalised grammars for real-life applications.

I focus on Multiword Expressions (henceforth MWEs). Specifically, in the first part of the talk I take a closer look at the linguistic properties of MWEs, in particular their lexical, syntactic, and semantic characteristics.

With the observations about the linguistic properties of MWEs at hand, I turn in the second part of the talk to methods for the automated acquisition of these properties for robust grammar engineering and parsing. To this effect, I first investigate the hypothesis that MWEs can be detected by the distinct statistical properties of their component words, regardless of their type, comparing various statistical measures, a procedure which leads to extremely interesting conclusions. I then investigate the influence of the size and quality of different corpora, using the BNC and the Web search engines Google and Yahoo. I conclude that, in terms of language usage, web-generated corpora are fairly similar to more carefully built corpora, like the BNC, indicating that the lack of control and balance in these corpora is probably compensated for by their size.
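The abstract does not name the statistical measures that are compared. As one commonly used example of such a measure, the sketch below computes pointwise mutual information (PMI) for adjacent word pairs; the choice of PMI, the function name `bigram_pmi`, and the toy sentence are assumptions made for illustration, not the speaker's method.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return PMI (log base 2) for every adjacent word pair in `tokens`.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated by relative frequency in the token list.
    """
    if len(tokens) < 2:
        return {}

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = len(tokens) - 1

    scores = {}
    for (first, second), count in bigram_counts.items():
        p_pair = count / total_bigrams
        p_first = unigram_counts[first] / total_unigrams
        p_second = unigram_counts[second] / total_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


if __name__ == "__main__":
    # Toy input; a real experiment would use a large corpus such as the BNC.
    text = "by and large the plan worked and by and large the team agreed"
    scores = bigram_pmi(text.split())
    for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
        print(pair, round(score, 2))
```

Word pairs that co-occur more often than their individual frequencies would predict receive higher scores. On very small samples PMI is known to favour rare words, which is one reason corpus size matters for this kind of detection.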

Then, I show a qualitative evaluation of the results of automatically adding extracted MWEs to existing linguistic resources. To this effect, I first discuss two main approaches commonly employed in NLP for treating MWEs: the words-with-spaces approach, which models an MWE as a single lexical entry and can adequately capture fixed MWEs like "by and large", and compositional approaches, which treat MWEs by general and compositional methods of linguistic analysis. On this basis, I argue that the automatic addition of extracted MWEs to existing linguistic resources improves qualitatively if a more compositional approach to automated grammar/lexicon extension is adopted.

Finally, I propose that the methods developed for the acquisition of linguistic knowledge in the case of English MWEs can be tuned to enhance the robustness of parsing with lexicalised grammars for languages with richer morphology and freer word order, such as German.

Valia Kordoni is at Humboldt University, Berlin and at Saarland University

Date: 2013-03-05 13:30 - 15:00

Location: TBA

SEMINAR

In the project Text+Berg we have built a large multilingual heritage corpus of alpine texts. In this presentation I will share our experiences and lessons learned.

We have digitized and annotated all yearbooks of the Swiss Alpine Club (SAC) from 1864 until 2011. Texts include mountaineering reports and articles about the geology, biology and history of mountainous regions around the globe. Our annotations comprise linguistic information (e.g. part-of-speech tags and lemmas) but also person names and toponym classes (e.g. mountains, glaciers and cabins). The Text+Berg corpus currently consists of around 22 million tokens in both German and French, out of which about 5 million tokens are translations (i.e. parallel texts). In addition, the corpus contains a few Italian, English and Romansh texts.

The 150-year sequence of the digitised books provides new opportunities for linguistic research: it enables the quantitative analysis of diachronic language change as well as the study of typical language structures and figures of speech in the specific domain. Moreover, the digital books allow for quick access as the basis for the analysis of texts and pictures and for information retrieval purposes. We will argue that the Text+Berg project is a prototypical case of digital humanities, with a large collection of heritage documents being structured and annotated for multi-purpose access and long-term storage.

Martin Volk, Universität Zürich

Date: 2013-08-29 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8

SEMINAR

Studies of language in Alzheimer's disease have concluded that, along with a general cognitive decline, linguistic features are also negatively affected. Studies of the language of healthy elders also observe a linguistic decline, but one which, in contrast, is markedly less severe than that induced by dementia. We examine whether the disease can be detected from the diachronic changes in written texts and, more importantly, whether it can be clearly distinguished from normal aging.

Lexical and syntactic analyses were conducted on 51 novels by three prolific literary authors: Iris Murdoch, P.D. James, and Agatha Christie. Murdoch was diagnosed with Alzheimer's disease shortly after finishing her last novel; James, at 89 years of age, continues to publish critically-acclaimed works; Christie, whose last few novels are deemed strikingly subpar compared to her previous works, presents an interesting case study of possible dementia.

The lexical analysis reveals significant patterns of decline in Murdoch's and Christie's later novels, while James's rates remain relatively consistent throughout her career. The syntactic measures, though revealing fewer significant linear trends, show a cubic model of change in Murdoch's novels, with a deep decline in her 50s. Our findings provide support for the hypothesis that dementia, which manifests clearly in lexical features, can be detected in writing.

Graeme Hirst (Toronto)

(Joint work with Xuan Le, Ian Lancashire, and Regina Jokel)

Date: 2013-05-30 10:15 - 12:00

Location: L308, Lennart Torstenssonsgatan 8
