About parallel corpora
Parallel corpora: what are they?
A parallel corpus is a type of linguistic corpus, one of the main tools used by linguists in the 21st century. Like most linguistic corpora, a parallel corpus usually comes with meta-information (details about each text: who created it and when, how long it is, etc.) and with linguistic annotation (each word is assigned its base form, grammatical information, etc.).
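As a rough illustration of what meta-information and annotation can look like in practice, here is a minimal Python sketch. The class names (`Token`, `CorpusText`), the field names, and the tag strings are hypothetical choices made for this example; real corpora use their own schemas and tag sets.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """One annotated word (hypothetical schema)."""
    form: str     # the word as it appears in the text
    lemma: str    # its base (dictionary) form
    grammar: str  # grammatical information, here a free-form tag string


@dataclass
class CorpusText:
    """A text with its meta-information and annotated tokens."""
    author: str
    year: int
    tokens: List[Token] = field(default_factory=list)

    @property
    def word_count(self) -> int:
        # The "volume" of the text, measured in tokens.
        return len(self.tokens)

    def forms_of(self, lemma: str) -> List[str]:
        # All word forms in the text that share the given base form.
        return [t.form for t in self.tokens if t.lemma == lemma]


# Invented sample data, for illustration only.
text = CorpusText(
    author="Example Author",
    year=1999,
    tokens=[
        Token("She", "she", "PRON"),
        Token("went", "go", "VERB,past"),
        Token("home", "home", "ADV"),
    ],
)

print(text.word_count)
print(text.forms_of("go"))
```

Storing the base form next to each word is what lets a search for "go" also find "went".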
A parallel corpus is a collection of texts and their translations into another language. An important element of parallel corpus annotation is alignment: each sentence (or paragraph) in language X is linked to the corresponding sentence (or paragraph) in language Y. Thanks to alignment, a parallel corpus becomes a useful tool for several categories of users:
- students and teachers of a foreign language (words and expressions can be looked up in context rather than in a dictionary, which is crucial for understanding how words collocate in a foreign language);
- translators (a parallel corpus is a large database of the solutions and techniques that earlier translators devised for particular expressions);
- specialists in statistical or neural NLP: over the last decade, most major companies have moved away from developing rule-based translators (i.e., those based on a pre-loaded dictionary and a set of specific translation rules). Statistical and neural systems instead need large amounts of data in two languages, where each sentence (or smaller segment) is paired with its translation. Of course, a parallel corpus built for programmers differs in design from one built for translators (annotation and meta-information are not always needed there);
- linguists and translation scholars (such databases support many conclusions in the comparative study of grammar, semantics and vocabulary).
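The idea of alignment, and the in-context lookup it makes possible, can be sketched in a few lines of Python. This is a simplified illustration, not how any particular corpus is built: it assumes the two texts are already split into sentences and correspond one to one in order, whereas real alignment tools also have to handle sentences that are merged, split, or omitted in translation. The names `AlignedPair`, `align_one_to_one`, and `concordance`, as well as the sample sentences, are invented for this example.

```python
import re
from typing import List, NamedTuple


class AlignedPair(NamedTuple):
    source: str  # sentence in language X
    target: str  # its translation in language Y


def align_one_to_one(source_sents: List[str],
                     target_sents: List[str]) -> List[AlignedPair]:
    """Pair sentences by position (assumes a strict 1:1 correspondence)."""
    if len(source_sents) != len(target_sents):
        raise ValueError("1:1 alignment needs equally long sentence lists")
    return [AlignedPair(s, t) for s, t in zip(source_sents, target_sents)]


def concordance(pairs: List[AlignedPair], word: str) -> List[AlignedPair]:
    """Return every aligned pair whose source sentence contains the word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [p for p in pairs if pattern.search(p.source)]


# Invented sample data, for illustration only.
english = [
    "The weather is fine today.",
    "I would like a cup of tea.",
    "Fine, let us go.",
]
french = [
    "Il fait beau aujourd'hui.",
    "Je voudrais une tasse de thé.",
    "D'accord, allons-y.",
]

pairs = align_one_to_one(english, french)
hits = concordance(pairs, "fine")
for hit in hits:
    print(hit.source, "=>", hit.target)
```

In this toy data the same English word is rendered differently in each translated sentence, which is the kind of contextual evidence a learner or translator looks for. For machine translation, the list of pairs on its own (without annotation) already has the shape of the training data described above.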
Here are some of the best-known examples of parallel corpora:
- Reverso Context: a user-friendly corpus covering a variety of language pairs; it is used by foreign language learners and translators;
- OPUS: a combined collection of parallel corpora that are often used for machine translation;
- Bible translations: one of the oldest parallel corpora, aligned by verse in the 13th to 16th centuries;
- EuroParl: a corpus of the proceedings of the European Parliament, an institution of the EU, which has 24 official languages;
- Here you can find many other parallel corpora.