Welcome to the Corpus Search Engine

  1. Running the Corpus Search Engine
  2. Description of the Corpus

I Running the Corpus Search Engine

This web page is the first step in using the corpus. You can view statistics and word frequencies, and perform word searches, for any of the corpora.
For more details, see below.

II Description of the Corpus

II.1 What is a Corpus

A corpus is a large collection of texts in a language. The texts in a corpus are usually chosen from a diverse set of fields so that they are representative of the language. This site has corpora for the following 10 languages:
  1. Assamese
  2. Bengali
  3. Hindi
  4. Kannada
  5. Malayalam
  6. Marathi
  7. Oriya
  8. Punjabi
  9. Tamil
  10. Telugu
Each corpus contains about 3 million words. The texts in each corpus are categorized broadly under aesthetics, mass media, social science, natural science, commerce and translated materials, and these categories are further divided into sub-categories. The texts themselves typically consist of a few pages chosen at random from works published during 1980-1990 in each of these categories. The corpora were prepared by several organizations with funding from MoIT (Ministry of Information Technology, formerly Department of Electronics), Government of India.

II.2 Uses of Corpus


The major use of a corpus is in language analysis and research, which in turn supports many applications.
For example:
  • From the Hindi corpus we can find the list of frequently used words in Hindi. This can be useful in preparing children's books graded according to vocabulary.
  • In the preparation of a dictionary or a translation system, the different meanings of a word can be seen and studied in their contexts.

II.3 An Example Corpus

The Hindi corpus on this site has about 30 lakh (3 million) words.
Total texts : 1270
Total words : 2992778
The total counts every occurrence, so repeated words are counted each time.
Distinct words (words without repetition) : 127241


(a) Corpus Size

Gives the number of lines, words and bytes in the entire corpus and in each of the texts. For example,
number of lines   number of words   number of bytes
232418            2992778           15756016
means that the Hindi corpus has 232418 lines, 2992778 words and 15756016 bytes.
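The three counts above are the same kind of figures that the Unix `wc` tool reports. A minimal sketch of how they could be computed for one text follows. It assumes that words are whitespace-separated tokens and that bytes are counted in a given encoding; the function name `corpus_size` and the sample string are illustrative and not part of this site.

```python
def corpus_size(text, encoding="utf-8"):
    """Return (lines, words, bytes) for a text, in the spirit of `wc`."""
    lines = len(text.splitlines())
    # Assumption: a "word" is any whitespace-separated token.
    words = len(text.split())
    # The byte count depends on the encoding used to store the text.
    size = len(text.encode(encoding))
    return lines, words, size


# Tiny made-up sample, not taken from the real corpus.
sample = "ke hE meM\nkI ke\n"
print(corpus_size(sample))  # (2, 5, 16)
```

Summing these tuples over all texts would give the figures for the entire corpus.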

(b) Word frequencies

Each word is listed with its frequency and cumulative frequency. Frequency is the number of times the word occurs in the corpus. Cumulative frequency is the percentage of the corpus covered by that word together with all more frequent words. For example, the 4 most frequent words in Hindi are given below.
Cumulative Frequency   Frequency   Word
3.59%                  107341      ke
6.68%                  92170       hE
9.47%                  83359       meM
11.83%                 70325       kI
This means that the word 'ke' occurs 107341 times in the corpus and covers 3.59% of the corpus (that is, the number of occurrences of 'ke' divided by the total number of words in the corpus, multiplied by 100, is 3.59%). The word 'hE' occurs 92170 times in the corpus, and together with 'ke' it covers 6.68% of the corpus.
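The frequency and cumulative frequency columns can be sketched in a few lines of Python. This is only an illustration on a made-up word list, not the program used to build this site; the function name `frequency_table` is hypothetical.

```python
from collections import Counter


def frequency_table(words):
    """Return rows of (cumulative percent, frequency, word), most frequent first."""
    counts = Counter(words)
    total = sum(counts.values())
    running = 0
    rows = []
    for word, freq in counts.most_common():
        running += freq  # occurrences covered so far
        rows.append((100.0 * running / total, freq, word))
    return rows


# Toy data: 10 words in total, chosen only to keep the arithmetic easy.
toy = ["ke"] * 4 + ["hE"] * 3 + ["meM"] * 2 + ["kI"]
for cumulative, freq, word in frequency_table(toy):
    print(f"{cumulative:6.2f}% {freq:3d} {word}")
```

In the toy data 'ke' alone covers 40% of the words, and 'ke' and 'hE' together cover 70%, which mirrors how the cumulative column grows in the table above.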

(c) Coverage

Gives the number of words, taken in order of decreasing frequency, required to cover a specified percentage of the corpus. For example,
Corpus Coverage   number of words
10%               4
20%               10
means that only 4 words are required to cover 10% of the Hindi corpus and 10 words are required to cover 20% of it.
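Coverage can be read as the inverse of the cumulative frequency: walk down the words from most to least frequent and stop as soon as the target percentage is reached. The sketch below shows that idea on made-up data; the name `words_needed` is an assumption, and this is not necessarily how the site computes its table.

```python
from collections import Counter


def words_needed(words, percent):
    """Number of most frequent distinct words needed to cover `percent` of `words`."""
    counts = Counter(words)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        # Compare without dividing, to avoid rounding problems at the boundary.
        if 100.0 * covered >= percent * total:
            return rank
    return len(counts)


# Toy data: 'ke' covers 40%, 'ke' and 'hE' cover 70%, and so on.
toy = ["ke"] * 4 + ["hE"] * 3 + ["meM"] * 2 + ["kI"]
print(words_needed(toy, 10))   # 1
print(words_needed(toy, 50))   # 2
print(words_needed(toy, 100))  # 4
```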

(d) Headers of Texts

Gives the headers of the texts in the corpus. The header contains the category of the text, its author, its source, its publisher, etc.
<Aesthetics><Literature><Criticism><SNATAK VIJAYENDRA><LEKHAK PRAKASHAK AUR PATHAKYEAR><MADHUMATI (RAJ SAHITYA AKADAMI)><UDAYPUR><22-25><1840><SNATAK V. , LEKHAK AUR PATHAK, CR>
This header is taken from a text in the category Criticism (Criticism is a sub-category of Literature, and Literature is a sub-category of Aesthetics). The author of the text is Snatak Vijayendra. The text is taken from a book named 'Lekhak Prakashak aur Pathakyear'. The book is published by Madhumati (Raj Sahitya Akadami) in Udaypur. The text consists of pages 22-25 of this book.
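A header of this shape can be split into its fields by reading the text between each pair of angle brackets. The sketch below does this with a regular expression. The field names in `FIELDS` are hypothetical labels inferred from the description above, not an official specification, and the sample assumes the first field reads `<Aesthetics>`. The last two fields are not described on this page, so they are kept unnamed.

```python
import re

# Hypothetical field names, inferred from the description of the sample header.
FIELDS = ["category", "subcategory", "sub_subcategory", "author",
          "title", "publisher", "place", "pages"]


def parse_header(header):
    """Split a '<...><...>' header into named fields plus unnamed extras."""
    values = re.findall(r"<([^<>]*)>", header)
    parsed = dict(zip(FIELDS, values))
    # Fields beyond the ones described in the text are kept as-is.
    parsed["extra"] = values[len(FIELDS):]
    return parsed


header = ("<Aesthetics><Literature><Criticism><SNATAK VIJAYENDRA>"
          "<LEKHAK PRAKASHAK AUR PATHAKYEAR><MADHUMATI (RAJ SAHITYA AKADAMI)>"
          "<UDAYPUR><22-25><1840><SNATAK V. , LEKHAK AUR PATHAK, CR>")
info = parse_header(header)
print(info["author"], "|", info["place"], "|", info["pages"])
```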

(e) Subject-wise Classification

The broad categories are I. Aesthetics, II. Social Sciences, III. Natural Sciences, IV. Commerce, V. Mass Media and VI. Translated. Aesthetics has the sub-categories I.a Literature and I.b Fine Arts; Literature in turn has sub-categories I.a.1 Novel, etc.
For example, Aesthetics covers 33% of the Hindi corpus, with 5026819 bytes, 981158 words, 72862 lines and 440 texts.
Classification   Subject              Word %   Bytes     Words    Lines   Texts
I                Aesthetics           (33%)    5026819   981158   72862   440
I.a              Literature           (33%)    4934616   969546   70103   437
I.a.1            Novel                (6%)     786339    159989   12300   75
---              ---                  ---      ---       ---      ---     ---
I.b              Fine Arts            (4%)     523149    97621    8142    40
I.b.1            Music                (2%)     189455    35582    2941    16
---              ---                  ---      ---       ---      ---     ---
II               Social Sciences      (25%)    3971401   740838   58277   293
---              ---                  ---      ---       ---      ---     ---
III              Natural, Physical    (10%)    1503735   280899   23292   105
---              ---                  ---      ---       ---      ---     ---
IV               Commerce             (6%)     860084    158454   13683   60
---              ---                  ---      ---       ---      ---     ---
V                Official and Media   (25%)    3985610   748067   59263   300
---              ---                  ---      ---       ---      ---     ---
VI               Translated           (4%)     540464    106933   7022    44
---              ---                  ---      ---       ---      ---     ---

(f) Average Wordlength

Gives the average length of a word in the entire corpus.
Average = 4.6943651222362
That is, the average word length in the Hindi corpus is approximately 4.69.
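The average is the total length of all words divided by the number of words. A minimal sketch, assuming that length is measured in characters of the stored word forms and that every occurrence of a word is counted; the page does not state the unit, and the function name is illustrative.

```python
def average_word_length(words):
    """Mean length, in characters, over every occurrence in `words`."""
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


# Toy data: lengths 2, 2, 3 and 2 give an average of 2.25.
print(average_word_length(["ke", "hE", "meM", "kI"]))  # 2.25
```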


From this site you can:
  1. Get statistics about the different language corpora
  2. Search for the usages of words and get actual sentences

The corpora themselves (meaning the actual text files) are not available for download from this site.
For those, you must contact: TDIL Group,
Ministry of Information Technology (Formerly Dept. of Electronics),
Govt. of India,
6 CGO Complex,
New Delhi.

Email : omvikas@doe.ernet.in
kumar@doe.ernet.in

Web : http://tdil.mit.gov.in/

Corpus Manager