Sub-corpus of the national corpus of the Kazakh language

A linguistic corpus is an electronic reference collection of written and spoken texts in a given language that makes it easier to search for linguistic units (especially words and phrases). Search results are displayed in their natural context.
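To illustrate what "displaying a result in its natural context" means, here is a minimal keyword-in-context sketch. It is not the search engine used by this website; the function name `concordance`, the `width` parameter, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re

def concordance(text, keyword, width=3):
    """Return (left context, match, right context) for each occurrence of keyword.

    Hypothetical helper for illustration only; `width` is the number of
    neighbouring words kept on each side of the match.
    """
    # \w+ matches Unicode word characters, so Cyrillic text is tokenized too.
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Made-up sample text, not taken from the corpus.
sample = "The corpus stores texts. A corpus shows each word in context."
for left, word, right in concordance(sample, "corpus", width=2):
    print(f"{left:>20} [{word}] {right}")
```

Each hit is shown with a few words on either side, which is the basic idea behind a corpus concordance view.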

Most world languages have built their own national corpora, which differ from one another mainly in the scholarly processing of the texts, the completeness of the database, and the variety of sub-corpora.

The significance of the national corpus:

The corpus can cover all types of texts in the language (literary, journalistic, educational, scientific, business, colloquial, dialect, etc.). These texts are entered into the corpus in stages. A corpus becomes well representative as the number of word usages grows significantly, into the hundreds of millions.

The corpus database contains additional information about the nature of the texts, in the form of annotations or descriptions, which can be used to obtain summary information. The database is continually improved and expanded.

Why do we need a corpus?

The national corpus is needed primarily for the scientific study of the vocabulary and grammar of the language and of the changes it has undergone over hundreds of years;

It speeds up the search for information by drawing on modern technology;

It simplifies the analysis and processing of large volumes of material and the gathering of statistical data;

It makes it possible to compile the necessary dictionaries from the corpus database;

National corpora are also important for teaching a language as a native or foreign one;

Textbooks and curricula are now corpus-based. Foreigners, schoolchildren, teachers, journalists, editors, and writers can use the corpus to check quickly and effectively how an unfamiliar word or grammatical form is used.

This website was prepared under a state order as part of the project "Development of the publicistic sub-corpus of the national corpus of the Kazakh language". The project is scheduled to run from 2021 to 2025. In 2021, the planned addition of 2 million words to the corpus database was completed.

The corpus is intended for a wide audience, from professional linguists to teachers, students, language learners, and anyone interested in the Kazakh language in general. It is freely available to the public.
