Russian-Chinese Parallel Corpus of the Russian National Corpus

The Russian National Corpus (RNC) is one of the largest and highest-quality families of corpora for the Russian language. It contains many so-called subcorpora: smaller collections dedicated to a specific area of language research (syntax, stress, etc.). One of these subcorpora is the parallel corpus, which is itself divided into twenty Russian-foreign corpora.

You can find out what parallel corpora are here.

Brief History

Our Corpus was created within the RNC project in 2016. Since 2019, it has been available on two pages:

the Russian National Corpus page, with the common RNC interface but infrequent text updates;
the HSE corpus projects page, with regular text updates and an experimental interface.

In 2020 and 2021, we received support from HSE University within three projects: first, to enhance the corpus infrastructure; second, to add linguistic annotation to the Chinese texts; third, to develop corpus-assisted language learning programs based on the Corpus.

Current State of the Corpus

The Corpus contains over 3.5 million words in more than 1,000 texts. Most of the texts are Russian and Chinese fiction of the 19th to 21st centuries, news, and official texts.

Today, the Corpus has Russian, English, and Chinese interfaces (on the HSE corpus projects page).

To learn what the Corpus can do, follow the instructions on the search page: click the orange question icon at the top of the page.

Our Advantages

To date, our project is the only parallel corpus being developed in Russia that has four crucial features at once:

it covers the language pair of Russian and Chinese (both Putonghua and Guoyu);
it is freely available online;
it has a user-friendly search system;
it has grammatical annotation.

We know of only one comparable project, which is currently being developed in Beijing.

Our Team

Our project involves students, teachers, and researchers from the following institutions:

Dozens of people work on the Corpus, but we still have a large number of unresolved tasks and not enough active and courageous participants to take them on. If you are interested in our project, be sure to look at our vacancies!