SEMINAR

Readability research aims to provide reproducible, automatic methods for assessing the difficulty of texts for a given population. Such methods are based on various linguistic characteristics of the texts to be assessed. They have mostly been developed for English (Flesch, 1948; Dale and Chall, 1948; Heilman et al., 2007), whereas very little work has been carried out for French (Henry, 1975; François and Fairon, 2012). In this talk, we first summarize a set of experiments that have been conducted on the readability of French as a foreign language (FFL). These experiments led to the design of a readability model for FFL that predicts the difficulty level of texts according to the Common European Framework of Reference for Languages (CEFR). To achieve this goal, the model relies on techniques from natural language processing to extract the linguistic features, and on machine learning to combine these features within a statistical model.
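As a rough illustration of what "combining linguistic features within a statistical model" can look like in practice, the sketch below trains a classifier over a handful of crude surface features; the features, the toy training texts, and the choice of classifier are illustrative assumptions, not the model presented in the talk.

```python
# Minimal sketch: predicting a CEFR level from a few surface features.
# The features and classifier here are illustrative assumptions, not the
# model presented in the talk.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(text):
    """Very crude linguistic features for one text."""
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    avg_sent_len = len(words) / max(len(sentences), 1)   # words per sentence
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    ttr = len(set(w.lower() for w in words)) / max(len(words), 1)  # type-token ratio
    return [avg_sent_len, avg_word_len, ttr]

# Toy training data: texts paired with CEFR levels (invented examples).
texts = ["Je mange une pomme. Elle est rouge.",
         "La politique monétaire européenne suscite des débats complexes."]
levels = ["A1", "C1"]

X = np.array([extract_features(t) for t in texts])
clf = LogisticRegression(max_iter=1000).fit(X, levels)
print(clf.predict([extract_features("Le chat dort. Il fait beau.")]))
```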

In the second part of the talk, we will focus on a specific issue, namely the difficulty of the lexicon for learners of FFL. The lexicon is acknowledged to be an essential linguistic component for adequate reading comprehension. In the context of L2 education, the progression of vocabulary teaching is generally guided by vocabulary lists, such as that of Gougenheim (1958). These lists rely heavily on word frequencies computed from large corpora of L1 texts, so their use for L2 applications is questionable. With the advent of the CEFR, this issue could be alleviated thanks to the reference supplements that are supposed to mark out the process of lexical acquisition. However, these references lack precision about word usage, and their design is also open to question. This has led to various attempts to evaluate their validity (e.g. KELLY, VALILEX). We will introduce an alternative to simple frequency lists: FLELex, a freely available resource for FFL that describes the frequency distribution of words across the six levels of the CEFR. The methodology and corpus used to estimate the frequencies will be detailed and illustrated through a website that allows users to query FLELex online.
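To make the idea of a CEFR-graded lexical resource concrete, here is a minimal sketch of a FLELex-style lookup; the words, frequencies, and the "first attested level" heuristic are invented for illustration and are not actual FLELex entries.

```python
# Sketch of a FLELex-style lookup: each word is mapped to a frequency
# distribution over the six CEFR levels. The entries below are invented
# for illustration; the real resource is estimated from a graded corpus.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

flelex = {
    "manger":    [310.2, 250.1, 180.4, 120.3, 90.7, 60.0],   # frequent from A1 on
    "suspendre": [0.0,   0.0,   4.2,   10.5,  15.1, 18.3],   # appears around B1
}

def first_level(word):
    """Return the first CEFR level at which the word is attested, if any."""
    dist = flelex.get(word)
    if dist is None:
        return None
    for level, freq in zip(CEFR_LEVELS, dist):
        if freq > 0:
            return level
    return None

print(first_level("suspendre"))  # -> "B1"
```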

Keywords: lexicon difficulty, readability of FFL, CALL, CEFR

Thomas François,
CENTAL (Université catholique de Louvain)
http://cental.fltr.ucl.ac.be/team/tfrancois/

Date: 2015-02-05 10:30 - 12:00

Location: L308, Lennart Torstenssonsgatan 8


SEMINAR

We turn the Eisner algorithm for parsing to projective dependency trees into a cubic-time algorithm for parsing to a restricted class of directed graphs. To extend the algorithm into a data-driven parser, we combine it with an edge-factored feature model and online learning. We report and discuss results on the SemEval-2014 Task 8 data sets (Oepen et al., 2014).
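For reference, the sketch below shows the classic Eisner O(n^3) recursion that the work starts from, computing the score of the best projective dependency tree under an edge-factored model; the extension to the restricted class of directed graphs discussed in the talk is not shown.

```python
# Minimal sketch of the classic Eisner O(n^3) recursion that computes the
# score of the best projective dependency tree under an edge-factored model.
# This is the starting point the abstract refers to, not the graph extension.
import numpy as np

def eisner_best_score(scores):
    """scores[h, m] = score of the arc from head h to modifier m;
    index 0 is the artificial root. Returns the best tree score."""
    n = scores.shape[0]
    NEG = float("-inf")
    # complete[i, j, d] / incomplete[i, j, d]: best score of a span i..j
    # headed at its left end (d=1) or its right end (d=0).
    complete = np.full((n, n, 2), NEG)
    incomplete = np.full((n, n, 2), NEG)
    for i in range(n):
        complete[i, i, 0] = complete[i, i, 1] = 0.0

    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # attach an arc between i and j
            best = max(complete[i, r, 1] + complete[r + 1, j, 0]
                       for r in range(i, j))
            incomplete[i, j, 0] = best + scores[j, i]  # arc j -> i
            incomplete[i, j, 1] = best + scores[i, j]  # arc i -> j
            # extend incomplete spans into complete ones
            complete[i, j, 0] = max(complete[i, r, 0] + incomplete[r, j, 0]
                                    for r in range(i, j))
            complete[i, j, 1] = max(incomplete[i, r, 1] + complete[r, j, 1]
                                    for r in range(i + 1, j + 1))
    return complete[0, n - 1, 1]

# Toy example: root plus two words, with random arc scores.
print(eisner_best_score(np.random.rand(3, 3)))
```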

Date: 2014-10-21 13:15 - 15:00

Location: EDIT room (3364), Chalmers Johanneberg


SEMINAR

Dialogue as a means of information presentation goes back at least as far as the ancient Greeks - Plato conveyed his philosophy through fictitious conversations between Socrates and his contemporaries. In this kind of expository dialogue, the main purpose is to inform the reader or audience; the information the characters convey to one another is subservient to this purpose. The popularity of expository dialogue in today's world is evidenced by the widespread use of dialogue in news bulletins (between presenters), commercials, educational entertainment, and games. In this talk, I will describe the CODA project. As part of CODA, algorithms were developed for automatically generating expository dialogue. Given a text in monologue form, the CODA system (together with the HILDA discourse parser) produces a dialogue script. The dialogue script expresses the same information as the monologue, but now as a conversation between an expert and a lay person. In contrast with previous efforts to generate expository dialogue with hand-crafted rules, CODA is corpus-based. I will describe the approach, its evaluation and an application.

More information at http://computing.open.ac.uk/coda/

Date: 2014-11-19 13:15 - 15:00

Location: L308, Lennart Torstenssonsgatan 8


SEMINAR

The analysis of readability has traditionally relied on surface properties of language, such as average sentence and word lengths and specific word lists. At the same time, there is a long tradition of analyzing the Complexity, Accuracy, and Fluency (CAF) of language produced by language learners in second language acquisition (SLA) research. Reusing SLA measures of learner language complexity to analyze readability, Sowmya Vajjala and I explored which aspects of linguistic modeling can successfully be employed to predict the readability of a native-language text.
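A minimal sketch of the kind of SLA-inspired complexity measures meant here (the real feature set is far richer, and these simplified implementations are assumptions for illustration):

```python
# A few crude proxies for learner-language complexity measures of the kind
# reused for readability analysis; simplified assumptions, not the talk's
# actual feature implementations.
def complexity_features(text):
    sentences = [s.split() for s in text.split(".") if s.strip()]
    words = [w for sent in sentences for w in sent]
    types = {w.lower() for w in words}
    return {
        "mean_sentence_length": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(types) / max(len(words), 1),
        "long_word_ratio": sum(len(w) > 6 for w in words) / max(len(words), 1),
    }

print(complexity_features("The committee deliberated extensively. It then adjourned."))
```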

Using various machine learning setups and corpora, we show that a broad range of linguistic properties are highly indicative of the readability of documents, from graded readers to web pages and TV programs targeting different age groups. The readability model using our full linguistic feature set is currently the best non-commercial readability model available for English (and second overall, with the commercial ETS model coming in first), based on performance on the Common Core State Standards data set.

This talk focuses on our document-level readability models and links them with our proficiency classification work, i.e., the task of determining the language proficiency of a writer based on a text they wrote in the second language. Some publications available at http://purl.org/dm/papers provide more detail:

Sowmya Vajjala and Detmar Meurers (in press) "Readability Assessment for Text Simplification: From Analyzing Documents to Identifying Sentential Simplifications". International Journal of Applied Linguistics, Special Issue on Current Research in Readability and Text Simplification edited by Thomas François & Delphine Bernhard.

Sowmya Vajjala and Detmar Meurers (2014) "Assessing the relative reading level of sentence pairs for text simplification". Proceedings of EACL. Gothenburg, Sweden.

Sowmya Vajjala and Detmar Meurers (2014) "Exploring Measures of 'Readability' for Spoken Language: Analyzing linguistic features of subtitles to identify age-specific TV programs. Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), EACL. Gothenburg, Sweden

Sowmya Vajjala and Detmar Meurers (2013) "On The Applicability of Readability Models to Web Texts." Proceedings of the Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), ACL. Sofia, Bulgaria.

Julia Hancke, Sowmya Vajjala and Detmar Meurers (2012) "Readability Classification for German using lexical, syntactic, and morphological features". Proceedings of COLING, Mumbai, India.

Sowmya Vajjala and Detmar Meurers (2012) "On Improving the Accuracy of Readability Classification using Insights from Second Language Acquisition". Proceedings of BEA7, ACL. Montreal, Canada.

Date: 2014-09-25 15:15 - 16:30

Location: L308, Lennart Torstenssonsgatan 8


SEMINAR

I will present recent work done in the Dialogue Systems Group at Bielefeld University on modelling situated dialogue. In our understanding of the term, situated dialogue is dialogue where the participants at least share a common timeline, that is, directly perceive the utterances of their partners and are expected to react immediately. This covers, for example, telephone conversations and excludes email or text-chat exchanges, and it implies that processing must proceed incrementally and continuously. In a narrower understanding of the term, it is also taken to entail physical co-location, where the participants immediately perceive all actions of their interlocutors, not just linguistic ones, and also perceive their shared surroundings. Our goal is to provide a computational, implemented model of the skills required to take part in situated dialogue in both the wider and the narrower sense.

I will briefly introduce the "incremental units" model of dialogue processing (Schlangen & Skantze; EACL 2009, Dialogue & Discourse 2011) which we use as the basis for our work on situated dialogue. I will show how we used it to realise fast turn-taking in an implemented dialogue system (Skantze & Schlangen, EACL 2009). I will then discuss our work on statistical incremental language understanding, which brings in additional problems to tackle: reference to objects in a shared space (Kennington & Schlangen; SIGdial 2012, Computer Speech & Language 2014), non-linguistic information about the speaker such as gaze and gesture (Kennington, Kousidis & Schlangen; SIGdial 2013, Coling 2014), and, more recently, real-time computer-vision processing.
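As a toy illustration of the incremental style of processing this implies (not the IU model's actual interface), the sketch below keeps a growing hypothesis that can be extended or partially revoked as new input arrives:

```python
# Toy sketch of the add/revoke bookkeeping behind incremental processing:
# a module receives word hypotheses one at a time and may revoke earlier
# ones when the recogniser changes its mind. This illustrates the general
# idea only; it is not the IU model's actual interface.
class IncrementalBuffer:
    def __init__(self):
        self.units = []           # current hypothesis, left to right

    def add(self, unit):
        self.units.append(unit)
        return list(self.units)

    def revoke(self, unit):
        if unit in self.units:
            self.units.remove(unit)
        return list(self.units)

buf = IncrementalBuffer()
print(buf.add("take"))            # ['take']
print(buf.add("the"))             # ['take', 'the']
print(buf.revoke("the"))          # recogniser revises: ['take']
print(buf.add("this"))            # ['take', 'this']
print(buf.add("red"))             # ['take', 'this', 'red']
```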

This work takes us a few small steps towards a more comprehensive model of situated dialogue. I will conclude by discussing some of the many extensions that are still required.

Date: 2014-09-26 13:15 - 15:00

Location: T116, Olof Wijksgatan 6


SEMINAR

I present recent experimental work on unsupervised language models trained on large corpora. We apply scoring functions to the probability distributions that the models generate for a corpus of test sentences. The functions discount the role of sentence length and word frequency, while highlighting other properties, in determining a grammaticality score for a sentence. The test sentences are annotated through Amazon Mechanical Turk crowdsourcing. We then apply support vector regression to the set of model scores for the test sentences. Some of the models and scoring functions produce encouraging Pearson correlations with the mean human judgements. I will also briefly describe current work on other corpus domains, cross-domain training and testing, and grammaticality prediction in other languages. Our results provide experimental support for the view that syntactic knowledge is represented as a probabilistic system, rather than as a classical formal grammar.
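One scoring function often used in this line of work is SLOR (the syntactic log-odds ratio), which discounts sentence length and word frequency by subtracting a unigram baseline and normalising by length; the sketch below uses invented stand-in models, not the trained models from the experiments.

```python
import math

# SLOR subtracts a unigram log-probability from the model log-probability
# and normalises by sentence length, discounting word frequency and length.
# The two probability functions are stand-ins for real models.
def slor(sentence, logprob_lm, logprob_unigram):
    words = sentence.split()
    return (logprob_lm(sentence) - logprob_unigram(sentence)) / len(words)

# Toy stand-in models (invented numbers, for illustration only).
def toy_lm(sentence):
    return -2.0 * len(sentence.split())

def toy_unigram(sentence):
    return sum(-3.5 for _ in sentence.split())

print(slor("the cat sat on the mat", toy_lm, toy_unigram))  # 1.5
```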

Shalom Lappin, King's College London and the University of Gothenburg (joint work with Jey Han Lau and Alexander Clark)

Date: 2014-10-23 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8


SEMINAR

The advantages of using speech technology to control in-vehicle devices are intuitively obvious: the driver can keep their hands on the steering wheel and their eyes on the road, so that their attention stays on the driving task. By analysing human-human in-vehicle conversations, we believe that we can get clues on how to further develop in-vehicle dialogue systems and make them aware of the driver's cognitive workload. By implementing certain dialogue strategies, we believe that it is possible to decrease the driver's cognitive workload and thereby increase safety on the roads.

In this working seminar, I will present the thesis work I have been doing so far. It includes a classification of different types of workload, and a method for distinguishing between workload types using workload measurement tools in combination with an analysis of in-car conversations.

Date: 2014-10-30 10:30 - 12:00

Location: L308, Lennart Torstenssonsgatan 8


SEMINAR

When, at the beginning of the new millennium, Gray and Atkinson (2003) used lexicostatistical data along with sophisticated statistical methods to date the age of the Indo-European language family, they caused a great stir in the linguistic world. Their method was part of a general quantitative turn in historical linguistics that started around that time. This quantitative turn is reflected in a large body of literature on topics as diverse as phonetic alignment (Kondrak 2000, Prokić et al. 2009), automatic cognate detection (Steiner et al. 2011), and phylogenetic reconstruction (Brown et al. 2008, Nelson-Sathi et al. 2011).

Unfortunately, the quantitative turn created a gap between the "new and innovative" quantitative methods and the traditional approaches which linguists have been developing since the beginning of the 19th century. Traditional historical linguists are often very skeptical of the new approaches, partly because the results do not always agree with those achieved by traditional methods, and partly because many of the new approaches are based on large datasets which often exhibit numerous errors. Quantitative historical linguists, on the other hand, complain about traditional historical linguists' lack of interest in the many opportunities that quantitative and digital approaches have to offer.

In our research project on "Quantitative Historical Linguistics" (http://quanthistling.info), which aims to uncover and clarify phylogenetic relationships between native South American languages using quantitative methods, we have been developing a set of tools intended to help bridge the gap between traditional and quantitative approaches in historical linguistics. Our goal is to resolve the conflict between traditional and quantitative historical linguistics by establishing a new framework of "computer-aided historical linguistics". This framework employs interactive web-based applications to compensate for both the lack of structure in traditional and the lack of quality in quantitative historical linguistics, as well as various reference data sets that can be used to train and evaluate new computational methods. In the talk, some of these tools will be introduced in detail, and the challenges and opportunities of quantitative, qualitative, and computer-assisted methods will be discussed.
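As a very small illustration of the kind of computational ingredient such tools build on, the sketch below flags potential cognates by normalised edit distance over simplified transcriptions; real pipelines use sound-class models and proper alignment algorithms, and the threshold and examples here are invented.

```python
# Crude illustration: judging potential cognacy by normalised edit distance
# over (simplified) transcriptions. Real pipelines use sound-class models
# and proper alignment; the threshold and examples are invented.
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a)][len(b)]

def maybe_cognate(w1, w2, threshold=0.5):
    dist = edit_distance(w1, w2) / max(len(w1), len(w2))
    return dist <= threshold

print(maybe_cognate("hand", "hant"))    # True  (English hand ~ German Hand [hant])
print(maybe_cognate("water", "agua"))   # False (English water vs. Spanish agua)
```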

References

  • Brown, C. H., E. W. Holman, S. Wichmann, V. Velupillai, and M. Cysouw (2008). "Automated classification of the world's languages. A description of the method and preliminary results". Sprachtypologie und Universalienforschung 61.4, 285-308.
  • Gray, R. D. and Q. D. Atkinson (2003). "Language-tree divergence times support the Anatolian theory of Indo-European origin". Nature 426.6965, 435-439.
  • Kondrak, G. (2000). "A new algorithm for the alignment of phonetic sequences". In: Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference (Seattle, 04/29–05/03/2000), 288-295.
  • Nelson-Sathi, S., J.-M. List, H. Geisler, H. Fangerau, R. D. Gray, W. Martin, and T. Dagan (2011). "Networks uncover hidden lexical borrowing in Indo-European language evolution". Proceedings of the Royal Society B 278.1713, 1794-1803.
  • Prokić, J., M. Wieling, and J. Nerbonne (2009). "Multiple sequence alignments in linguistics". In: Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education. "LaTeCH-SHELT&R 2009" (Athens, 03/30/2009), 18-25.
  • Steiner, L., P. F. Stadler, and M. Cysouw (2011). "A pipeline for computational historical linguistics". Language Dynamics and Change 1.1, 89-127.

Date: 2014-10-09 10:30 - 11:30

Location: L308, Lennart Torstenssonsgatan 8


SEMINAR

A fast-growing area in Natural Language Processing is the use of automated tools for identifying and correcting grammatical errors made by language learners. This growth has been fueled in part by the needs of the large number of people in the world who are learning and using a second or foreign language. For example, it is estimated that there are currently over one billion people who are non-native speakers of English. These numbers drive the demand for accurate tools that can help learners to write and speak proficiently in another language. Such demand also makes this an exciting time for those in the NLP community who are developing automated methods for grammatical error correction (GEC).

In the last five years alone, the field has grown tremendously from a few conference and workshop papers to four shared tasks (two of which were co-located with CoNLL), papers at conferences such as ACL and EMNLP, and two Morgan & Claypool Synthesis Series books. While there have been many exciting developments in GEC over the last few years, there is still considerable room for improvement: state-of-the-art performance in detecting and correcting several important error types is still inadequate for many real-world applications. In this talk, I will provide an overview of the field of automated grammatical error correction, including its history, leading methodologies, and its particular set of challenges. Although applications of GEC are often geared toward the classroom, its methods are more generally applicable to a wide variety of NLP problems, especially where systems must contend with noisy data, such as MT evaluation and correction, analysis of microblogs and other user-generated content, and disfluency detection in speech.
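As a minimal illustration of one classic GEC strategy for a single error type, the sketch below corrects articles using a confusion set and a scorer; the toy scoring heuristic and the whole setup are illustrative assumptions, not a system discussed in the talk.

```python
# Minimal sketch of confusion-set based article correction. The scorer is a
# toy stand-in for a real classifier or language model.
CONFUSION_SET = ["a", "an", "the", ""]

def toy_score(left_context, candidate, right_context):
    """Invented heuristic scorer: prefers 'an' before vowels, 'a' otherwise."""
    nxt = right_context[0] if right_context else ""
    if candidate == "an":
        return 2.0 if nxt[:1].lower() in "aeiou" else 0.0
    if candidate == "a":
        return 2.0 if nxt[:1].lower() not in "aeiou" else 0.0
    return 1.0  # neutral score for "the" and for omitting the article

def correct_articles(tokens):
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() in ("a", "an", "the"):
            best = max(CONFUSION_SET,
                       key=lambda c: toy_score(tokens[:i], c, tokens[i + 1:]))
            if best:
                out.append(best)
        else:
            out.append(tok)
    return out

print(correct_articles("I ate a apple".split()))  # ['I', 'ate', 'an', 'apple']
```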

Bio:

Joel Tetreault is a Senior Research Scientist at Yahoo Labs in New York City. His research focus is Natural Language Processing, with specific interests in anaphora, dialogue and discourse processing, machine learning, and applying these techniques to the analysis of English language learning and automated essay scoring. Previously he was Principal Manager of the Core Natural Language group at Nuance Communications, Inc., where he worked on the research and development of NLP tools and components for the next generation of intelligent dialogue systems. Prior to Nuance, he worked at Educational Testing Service for six years as a Managing Senior Research Scientist, where he researched automated methods for detecting grammatical errors made by non-native speakers, for plagiarism detection, and for content scoring.

Tetreault received his B.A. in Computer Science from Harvard University (1998) and his M.S. and Ph.D. in Computer Science from the University of Rochester (2004). He was also a postdoctoral research scientist at the University of Pittsburgh's Learning Research and Development Center (2004-2007), where he worked on developing spoken dialogue tutoring systems. In addition, he has co-organized the Building Educational Applications workshop series for 7 years and the CoNLL 2013 Shared Task on Grammatical Error Correction, and he is currently NAACL Treasurer.

Date: 2014-09-05 10:30 - 11:30

Location: T307, Olof Wijksgatan 6


SEMINAR

Hate speech can be defined as any abusive language directed towards specific minority groups with the intention to demean. While several countries actually protect this type of language under the right to free speech, many internet providers prohibit the use of such language on their properties under their terms of service. The reason for this is that such language makes internet forums and comment sections unwelcoming and thus stunts discussion.

In this talk, we describe preliminary work on detecting hate speech and malicious language on the internet. Specifically, we discuss issues with defining hate speech and its impact on annotation and evaluation, and then describe a statistical classifier for identifying hate speech in the comment sections of proprietary news and finance web articles.
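As an illustration of the general kind of statistical classifier meant here (not the system from the talk), a bag-of-words model can be trained on labelled comments; the tiny training set below is invented and deliberately mild.

```python
# Illustrative sketch only: a bag-of-words classifier of the general kind
# described. This is not the system from the talk, and the tiny training
# set is invented and deliberately mild.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = ["you people are all worthless and should leave",
            "great article, thanks for sharing",
            "go back to where you came from",
            "I disagree with the author's conclusion"]
labels = [1, 0, 1, 0]   # 1 = abusive/hateful, 0 = acceptable

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(comments, labels)
print(clf.predict(["thanks, this was really helpful"]))  # expected: [0]
```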

Note: The talk has some very sensitive and offensive material in it, just by nature of the topic.

Date: 2014-09-04 15:15 - 16:30

Location: L308, Lennart Torstenssonsgatan 8

