A Swedish textual gold-standard for the medical domain (A Small CLT Project)

INTRODUCTION
The provision of application- and domain-dependent labelled language resources, such as annotated corpora, is crucial for advancing R&D in the field of human language technology (HLT). Such resources are indispensable for evaluation, software prototyping and design validation. Manually verified linguistic annotation of electronic text material (corpora) is a prerequisite for developing and evaluating standard language technology tools, such as taggers, and it is highly relevant to a number of applications, including information extraction, text mining and information retrieval.

METHOD AND DATA SAMPLE
The medical terminology will consist of XML annotations drawn from the Swedish MeSH thesaurus <http://mesh.kib.ki.se/swemesh/swemesh_en.cfm>, specifically categories A (Anatomy), B (Organisms), C (Diseases), D (Chemicals), E (Analytical, Diagnostic and Therapeutic Techniques and Equipment) and F (Psychiatry and Psychology). The named entities will consist of names of persons, locations and organizations, together with time expressions and measure expressions, which will likewise be annotated in a uniform XML format (cf. Kokkinakis, 2004). All texts will be valid XML, and simple annotation guidelines will be provided. To achieve the stated goals, existing methods and tools will be applied to automatically identify and annotate the medical terminology and named entities in an existing sample of Swedish medical texts. The textual sample will consist of 300 articles from the Journal of the Swedish Medical Association, L
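As a rough illustration of what automatic, lexicon-based XML annotation of MeSH terms could look like, the following Python sketch wraps known terms in inline elements and then checks that the result parses. Everything in it is an assumption made for illustration: the element name `mesh`, the attribute `cat`, the wrapper element `text`, the three-entry lexicon and the sample sentence are hypothetical and are not taken from the project's actual tools or annotation scheme. The check only confirms well-formedness; validating against a DTD or schema would need a separate step.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical mini-lexicon: term -> MeSH top-level category letter.
# A real system would load terms from the Swedish MeSH thesaurus instead.
LEXICON = {"hjärta": "A", "diabetes": "C", "insulin": "D"}


def annotate(text, lexicon=LEXICON):
    """Wrap each lexicon term found in `text` in a <mesh cat="..."> element."""
    # Try longer terms first so a longer term wins over a shorter one
    # starting at the same position.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    parts, last = [], 0
    for match in pattern.finditer(text):
        # Escape the untouched text so characters like & and < stay legal XML.
        parts.append(escape(text[last:match.start()]))
        category = lexicon[match.group(1).lower()]
        parts.append('<mesh cat="%s">%s</mesh>' % (category, escape(match.group(1))))
        last = match.end()
    parts.append(escape(text[last:]))
    return "<text>" + "".join(parts) + "</text>"


# Made-up sample sentence ("Diabetes is treated with insulin.").
sample = "Diabetes behandlas med insulin."
xml_out = annotate(sample)

# Parsing raises xml.etree.ElementTree.ParseError if the output is not well-formed.
root = ET.fromstring(xml_out)
cats = [el.get("cat") for el in root.iter("mesh")]
```

Inline elements keep the original running text recoverable by stripping the tags, which makes it straightforward to compare automatic output against manually corrected annotations.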

Page updated: 2009-09-14 11:05
