Multilingual Information Retrieval and Cross-Language Retrieval
Martin Braschler(1) and Jacques Savoy(2)
(1) Institute of Applied Information Technology, Zurich University of Applied Sciences, Switzerland
(2) Computer Science Dept., University of Neuchâtel, Switzerland
Abstract
Much early work in Information Retrieval focused exclusively on the retrieval of English text documents. This limitation began to be addressed in earnest in the early 80s, and the advent of evaluation campaigns such as TREC in the 90s was a major force behind this development. In this lecture, we will show how to systematically extend basic monolingual indexing and matching, and adapt them to work with other languages. We will cover issues pertaining to the indexing process, such as language identification, tokenization/segmentation (including for Asian languages), word normalization, and stemming/decompounding. We will discuss the effect of these measures on retrieval effectiveness.
Understanding how to adapt IR systems successfully for many languages is a prerequisite for tackling the problem of Cross-Language Information Retrieval (CLIR), i.e. the retrieval of documents written in a language different from the language of the user's request. The lecture will cover the different translation strategies for addressing the CLIR problem. Issues of cultural differences, translation ambiguity, pivot languages, and extension to a large number of languages are also covered.
Course Material
Multilingual Information Retrieval and Cross-Language Retrieval
Martin Braschler, Zurich University of Applied Sciences, Switzerland, and
Jacques Savoy, University of Neuchâtel, Switzerland
Labs 1
Labs 2