Marie-Claude L'Homme
Designing specialized dictionaries with natural language processing techniques: A state-of-the-art

Over the last few decades, terminology work has changed drastically, due mostly to the introduction of computer applications and the availability of corpora in electronic form. Although the main steps of the methodology have remained basically the same (compiling corpora, finding relevant terms in these corpora, locating data that can help process terms, inserting the information collected during the previous steps in a record, updating records, etc.), the way in which the data is handled is completely different.

In this talk, I will present a methodology for compiling an online specialized dictionary that incorporates natural language processing applications. The dictionary considered is representative of a new generation of specialized dictionaries that aim to give users access to rich linguistic information, based mostly on information collected from specialized corpora. These reference works differ from most specialized dictionaries, which aim to provide users with explanations of concepts similar to those given in encyclopaedias. The dictionary I will present includes terms related to computing and the Internet and provides for each of them: fine-grained semantic distinctions, argument structure, combinatorial possibilities of terms with other terms of the domain, lists of lexical relationships (e.g., synonyms, antonyms, hyperonyms, collocates), etc. The dictionary also provides syntactic and semantic annotations of the contexts in which terms appear.

First, the six basic steps of the methodology will be described: 1. compilation of corpora; 2. identification of relevant terminological units; 3. collection of data from corpora; 4. analysis of the data collected; 5. compilation of term records; 6. establishment of relationships between term records.
I will then show how some resources and tools can assist terminologists during some of these steps, and present some of the challenges that their introduction into terminology work has raised. I will focus on: a. management of corpora in electronic form for terminology purposes; b. annotation of corpora (part-of-speech tagging and lemmatization); c. term extraction; d. automatic or semi-automatic identification of information on terms in corpora, especially for finding semantic relationships (e.g., hyperonymic relationships, collocations, or predicate-argument relationships); e. formalisms for encoding terminological data. The point of view taken when presenting computer applications will be that of users rather than that of developers.

Next, I will illustrate how other resources and computer applications can assist terminologists carrying out bilingual terminology work. These applications include (in addition to those reviewed for monolingual specialized dictionary compilation): a. bilingual corpora; b. bilingual term extraction; c. comparison of term extraction results between languages. Specific challenges posed by these techniques will be discussed.
10/03/2009