The focus of the fundamental research in the text technology lab is on LT data resources and tools for working with large amounts of text (understood in the widest sense: transcribed speech is also to be considered text here).
Since 1975, the University of Gothenburg’s Språkbanken (Swedish Language Bank) has made Swedish and other language resources available online to both the research community and the general public. Språkbanken possesses a unique combination of competences in the areas of Swedish text corpora, parallel text corpora, Swedish computational lexicons and LT tools for the processing, annotation and presentation of text corpora. These competences are made even more effective by being coupled with the kind of stable organization required for sustained large-scale corpus processing and presentation. Over the years Språkbanken’s corpora and lexicons have been widely used for research, teaching and other related purposes. In particular, a good number of PhD theses in Sweden and Finland have used Språkbanken as a data source.
A prominent application area of the research in the lab is eScience, understood here as the use of LT-based methodologies and tools for working effectively with large amounts of digitized text as primary research data, as in many humanities and social sciences disciplines. Much of the LT research in the lab and in Gothenburg in general can potentially be applied to this kind of eScience.
The language sciences have long made use of LT as a research tool in the form of corpus linguistics, whereby large digital text collections (corpora) are made searchable via linguistically relevant search criteria. During the 1960s Gothenburg became a pioneer in Swedish corpus linguistics by compiling Press-65, one of the first large electronic text corpora in a language other than English. The approximately 200 MW corpora now available online in Språkbanken have since the mid-1960s been used in research in computer-based lexicography and lexicology. Many of the recent large published Swedish dictionaries have been compiled by lexicographers at Gothenburg. A notable recent focus of our research is corpus-based studies of learner (second-language) Swedish and LT-informed corpus-based applications in the area of computer-assisted language learning.
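As a rough illustration of what searching a corpus "via linguistically relevant search criteria" can mean, the following minimal sketch looks up tokens by lemma and part of speech rather than by surface form, and returns each hit with its immediate context. The toy sentences, the `Token` fields and the `search` function are assumptions made for this example only; they do not describe Språkbanken's actual data format or tools.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # surface form as it appears in the text
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag


# Tiny hand-annotated toy corpus (hypothetical data, for illustration only).
CORPUS = [
    [Token("Hunden", "hund", "NOUN"), Token("springer", "springa", "VERB"), Token("fort", "fort", "ADV")],
    [Token("Hon", "hon", "PRON"), Token("sprang", "springa", "VERB"), Token("hem", "hem", "ADV")],
    [Token("Två", "två", "NUM"), Token("hundar", "hund", "NOUN"), Token("sover", "sova", "VERB")],
]


def search(corpus, lemma=None, pos=None, context=1):
    """Return (left context, matched word, right context) for every token
    that matches the requested lemma and/or part-of-speech tag."""
    hits = []
    for sentence in corpus:
        for i, tok in enumerate(sentence):
            if lemma is not None and tok.lemma != lemma:
                continue
            if pos is not None and tok.pos != pos:
                continue
            left = " ".join(t.word for t in sentence[max(0, i - context):i])
            right = " ".join(t.word for t in sentence[i + 1:i + 1 + context])
            hits.append((left, tok.word, right))
    return hits


if __name__ == "__main__":
    # A lemma query finds both "springer" and "sprang".
    for left, word, right in search(CORPUS, lemma="springa"):
        print(f"{left:>10} [{word}] {right}")
```

Searching on the lemma `springa` retrieves both inflected forms, which a plain string search for either form would miss; this is the kind of gain that linguistic annotation is meant to provide.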
The Department of Philosophy, Linguistics and Theory of Science maintains an extensive set of spoken language corpora, which have served as the basis for numerous studies of spoken language phenomena. They have also been used for comparing the properties of written and spoken language. Approximately 60% of the corpora are video recorded, making them an interesting data source for the study of multimodal communication. Along these lines, several studies have been conducted on the relationship between gestures and spoken language that have strongly influenced current research on Embodied Communication at the Zentrum für Interdisziplinäre Forschung at Bielefeld University, Germany.
Also at the Department of Philosophy, Linguistics and Theory of Science is the SweDia 2000 dialect database. The database consists of recorded speech material from 107 Swedish dialects (1200 speakers, 750 hours of recording time). The recordings were made as a joint effort by the departments of linguistics at Umeå, Stockholm and Lund universities in 1998–2002. The database has long been used as an eScience resource, resulting in more than 70 publications. There are several fundamental scientific questions that may be addressed in a fruitful way by an analysis based on data of the kind contained in SweDia. The SweDia database is now hosted by the department, and work is in progress to develop the database further as an eScience resource.
Applying LT in the life sciences domain has been a focus of research at the Department of Swedish Language during recent years. The research is based on authentic text material, both clinical narratives and scientific medical content. In the medical setting, for instance, vast amounts of health-related data are constantly collected. These data constitute a valuable source of primary research material; however, to empower clinical researchers to locate and make highly efficient use of the knowledge that is encoded therein, the material must be better integrated and linked via effective automated processing. Such advanced capabilities would facilitate the construction of hypotheses based upon novel associations between extracted information, the undertaking of retrospective studies based upon patient narratives, drug discoveries, the early detection of adverse drug events, the improvement of searching and browsing capabilities, and the like. They would also provide a more focused and effective means of searching and collecting patient-related information from unstructured text – a process that is currently restricted by language-inherent complexity, variation and ambiguity.
Gothenburg has an established network of academics and professionals, both in Sweden and in Europe, who are actively interested in issues related to this field. Recently, the Swedish National Board of Health and Welfare (SoS) and Gothenburg’s LT researchers collaborated on an SoS-funded research project to transform a large archive of scientific medical text material into a format that is suitable for text mining and to use the text for quality assurance of terminology. This material was annotated with a systematically organized, computer-processable collection of medical terminology known as SNOMED CT (the Systematized Nomenclature of Medicine – Clinical Terms). Plans are also in the works to establish a competence center for clinical language technology that will supply clinical researchers in academia and the pharmaceutical industry with LT tools and services that are customised to meet different individual needs.
An important aspect of LT is the way in which it connects to other, more visible, technologies. In particular, we have reason to believe that LT will become increasingly important to web technology, making the web seem smarter, more interactive, more conversational, and more sensitive to the information needs of people whose first (and perhaps only) language is not so widely used.
The web is currently largely made up of linked documents, often text documents. Language technology may add value to the web by extracting some form of meaning from the human-readable text of the pages, based on rules or statistics – meaning that can make search more effective, or help build the semantic web, a web made up of linked data.
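The rule-based variant of this idea can be sketched in a few lines: hand-written patterns are matched against running text, and each match is turned into a subject–predicate–object triple of the kind that linked data is built from. The two patterns, the predicate names `capitalOf` and `locatedIn`, and the function `extract_triples` are hypothetical and chosen only for this illustration; a realistic system would need linguistic analysis well beyond regular expressions.

```python
import re

# Hypothetical extraction rules: each pairs a surface pattern with the
# predicate name to use in the resulting triple.
PATTERNS = [
    (re.compile(r"([A-ZÅÄÖ][\w-]*) is the capital of ([A-ZÅÄÖ][\w-]*)"), "capitalOf"),
    (re.compile(r"([A-ZÅÄÖ][\w-]*) is located in ([A-ZÅÄÖ][\w-]*)"), "locatedIn"),
]


def extract_triples(text):
    """Apply every rule to the text and collect (subject, predicate, object) triples."""
    triples = []
    for pattern, predicate in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group(1), predicate, match.group(2)))
    return triples


if __name__ == "__main__":
    page_text = "Stockholm is the capital of Sweden. Gothenburg is located in Sweden."
    for triple in extract_triples(page_text):
        print(triple)
```

Rules of this kind are precise but brittle: any rephrasing of the sentence escapes them, which is one reason statistical methods are named alongside rules above.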
In line with the general interest in LT-based eScience, the Department of Philosophy, Linguistics and Theory of Science has recently formed a group approaching the emerging field of Web Science from an LT perspective.
The planned fundamental LT research in the lab will have as its centerpiece the creation of a full-scale computational lexical resource for modern Swedish – and to some extent earlier forms of Swedish, primarily the 19th century language – with rich semantic, syntactic and morphological information (“Swedish FrameNet++”). We see this as a necessary core component for the development of methodology and tools for automatic semantic annotation of text. Further, this work will allow us to explore various corpus-based lexicon creation methodologies, mixing the manual – which we master almost to perfection, but which is prohibitively labor-intensive – with experimental resource-economical machine learning approaches. This research will be conducted in collaboration with the grammar technology lab.
The main application areas that we wish to initiate or develop further in the lab are the following: