CORPUS SEARCH ENGINE

Running the Corpus Search Engine
Description of the Corpus

What is a Corpus
Uses of Corpus
Example of a Corpus

Corpus Size
Word Frequencies
Coverage
Headers of texts
Subject-wise Classification
Average Wordlength

Corpora Download

I Running the Corpus Search Engine

This web page is the first step in using the corpus. You can view statistics and word frequencies, and perform word searches, for any of the corpora.
For more details, see below.

II Description of the Corpus

II.1 What is a Corpus

A corpus is a large collection of texts in a language. The texts in the corpus of a language are usually chosen from a diverse set of fields so that they are representative of the language. This site has corpora of the following 10 languages:

Assamese
Bengali
Hindi
Kannada
Malayalam
Marathi
Oriya
Punjabi
Tamil
Telugu

Each corpus contains about 3 million words. The texts in each corpus are broadly categorized under aesthetics, mass media, social science, natural science, commerce and translated materials, and these categories are further divided into sub-categories. The texts themselves typically consist of a few pages chosen at random from publications of 1980-1990 in each of these categories. The corpora were prepared by several organizations with funding from MoIT (Ministry of Information Technology, formerly Department of Electronics), Government of India.

II.2 Uses of Corpus

The major use of a corpus is in language analysis and research, which in turn support many applications.
For example:

From the Hindi corpus we can find the list of frequently used words in Hindi. This can be useful in preparing children's books graded according to vocabulary.
In the preparation of a dictionary or a translation system, the different meanings of a word can be seen and studied in their contexts.

II.3 An Example Corpus

The Hindi corpus on this site has about 30 lakh (3 million) words.
Total texts: 1270
Total words: 2992778 (every occurrence is counted, so a repeated word is counted each time it appears)
Distinct words (words counted without repetition): 127241

(a) Corpus Size

Gives the number of lines, words and bytes in the entire corpus and in each of the texts.

number of lines	number of words	number of bytes
232418	2992778	15756016

This means that the Hindi corpus has 232418 lines, 2992778 words and 15756016 bytes.
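The three counts can be reproduced for any text with a few lines of code. The following is a minimal sketch, not the site's own counting program: the function name corpus_size is hypothetical, words are assumed to be whitespace-separated, and the byte count assumes UTF-8 encoding (the encoding of the corpus files is not stated on this page).

```python
def corpus_size(text: str) -> tuple[int, int, int]:
    """Return (lines, words, bytes) for one text, in the spirit of Unix `wc`."""
    n_lines = text.count("\n")            # number of newline characters
    n_words = len(text.split())           # whitespace-separated tokens (assumption)
    n_bytes = len(text.encode("utf-8"))   # byte size under an assumed UTF-8 encoding
    return n_lines, n_words, n_bytes


sample = "ke hE meM\nkI ke\n"             # tiny made-up text, not corpus data
print(corpus_size(sample))
```

Applying the same function to each text and summing the results would give the figures for the entire corpus.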

(b) Word frequencies

Each word is listed with its frequency and cumulative frequency. Frequency is the number of times the word occurs in the corpus. Cumulative frequency is the percentage of the corpus covered by that word together with all the words more frequent than it. For example, the 4 most frequent words in Hindi are given below.

Cumulative Frequency	Frequency	Word
3.59%	107341	ke
6.68%	92170	hE
9.47%	83359	meM
11.83%	70325	kI

This means that the word 'ke' occurs 107341 times in the corpus and covers 3.59% of the corpus (that is, the number of occurrences of 'ke' divided by the total number of words in the corpus, multiplied by 100, is 3.59%). The word 'hE' occurs 92170 times in the corpus and, together with 'ke', covers 6.68% of the corpus.
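A frequency table of this kind can be built as sketched below. This is an illustration under assumptions, not the program used for this site: the function name frequency_table and the toy word list are made up, and the percentages are rounded to two decimals.

```python
from collections import Counter


def frequency_table(words):
    """Return rows (cumulative %, frequency, word), most frequent word first."""
    counts = Counter(words)
    total = sum(counts.values())
    rows, running = [], 0
    for word, freq in counts.most_common():
        running += freq                                   # occurrences covered so far
        rows.append((round(100 * running / total, 2), freq, word))
    return rows


toy = "ke hE ke meM ke hE kI ke meM hE".split()           # 10 made-up tokens
for cumulative, freq, word in frequency_table(toy):
    print(f"{cumulative}%\t{freq}\t{word}")
```

In the toy list 'ke' occurs 4 times out of 10, so its row reads 40.0%, and 'hE' (3 times) brings the cumulative figure to 70.0%.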

(c) Coverage

Gives the number of distinct words, taken in order of decreasing frequency, required to cover a specified percentage of the corpus.

Corpus Coverage	number of words
10%	4
20%	10

This means that only 4 words are required to cover 10% of the Hindi corpus, and 10 words are required to cover 20% of it.
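Coverage follows directly from the cumulative frequencies in (b): the first 3 words cover 9.47% of the Hindi corpus and the first 4 cover 11.83%, so 4 words are needed to reach 10%. A minimal sketch of that calculation follows; the function name words_needed and the toy data are hypothetical.

```python
from collections import Counter


def words_needed(words, target_percent):
    """Smallest number of most-frequent distinct words covering target_percent of all tokens."""
    counts = Counter(words)
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if 100 * running >= target_percent * total:       # integer comparison, no rounding
            return rank
    return len(counts)                                    # target not reachable (e.g. above 100)


toy = "ke hE ke meM ke hE kI ke meM hE".split()           # 10 made-up tokens
print(words_needed(toy, 50))
```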

(d) Headers of Texts

Gives the headers of the texts in the corpus. The header contains the category of the text, its author, its source, its publisher, etc. For example:
<Aesthetics><Literature><Criticism><SNATAK VIJAYENDRA><LEKHAK PRAKASHAK AUR PATHAKYEAR><MADHUMATI (RAJ SAHITYA AKADAMI)><UDAYPUR><22-25><1840><SNATAK V. , LEKHAK AUR PATHAK, CR>
This header is taken from a text in the category Criticism (Criticism is a sub-category of Literature, and Literature is a sub-category of Aesthetics). The author of the text is Snatak Vijayendra. The text is taken from a book named 'Lekhak Prakashak aur Pathakyear'. The book is published by Madhumati (Raj Sahitya Akadami) in Udaypur. The text consists of pages 22-25 of this book.
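A header in this form can be split into its fields mechanically. The sketch below is illustrative only: the field names are assumptions based on the explanation above, the leading "<Aesthetics>" field is assumed from that explanation, and the last two fields, whose meaning is not described on this page, are kept unnamed under "other".

```python
import re

# Names for the first eight fields; assumed from the explanation of the example header.
FIELD_NAMES = ["category", "subcategory", "sub_subcategory", "author",
               "title", "publisher", "place", "pages"]


def parse_header(header: str) -> dict:
    """Split a '<field><field>...' header into a dictionary of named fields."""
    fields = re.findall(r"<([^<>]*)>", header)       # text between each pair of angle brackets
    record = dict(zip(FIELD_NAMES, fields))
    record["other"] = fields[len(FIELD_NAMES):]      # fields this page does not explain
    return record


header = ("<Aesthetics><Literature><Criticism><SNATAK VIJAYENDRA>"
          "<LEKHAK PRAKASHAK AUR PATHAKYEAR><MADHUMATI (RAJ SAHITYA AKADAMI)>"
          "<UDAYPUR><22-25><1840><SNATAK V. , LEKHAK AUR PATHAK, CR>")
record = parse_header(header)
print(record["author"], record["place"], record["pages"])
```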

(e) Subject-wise Classification

The broad categories are I. Aesthetics, II. Social Sciences, III. Natural Sciences, IV. Commerce, V. Mass Media and VI. Translated. Aesthetics has sub-categories I.a Literature and I.b Fine Arts; Literature in turn has sub-categories I.a.1 Novel, etc.
For example, Aesthetics covers 33% of the Hindi corpus, with 5026819 bytes, 981158 words, 72862 lines and 440 texts.

Classification	Subject	Word %	Bytes	Words	Lines	Texts
I	Aesthetics	(33%)	5026819	981158	72862	440
I.a	Literature	(33%)	4934616	969546	70103	437
I.a.1	Novel	(6%)	786339	159989	12300	75
---	---	---	---	---	---	---
I.b	Fine Arts	(4%)	523149	97621	8142	40
I.b.1	Music	(2%)	189455	35582	2941	16
---	---	---	---	---	---	---
II	Social Sciences	(25%)	3971401	740838	58277	293
---	---	---	---	---	---	---
III	Natural, Physical	(10%)	1503735	280899	23292	105
---	---	---	---	---	---	---
IV	Commerce	(6%)	860084	158454	13683	60
---	---	---	---	---	---	---
V	Official and Media	(25%)	3985610	748067	59263	300
---	---	---	---	---	---	---
VI	Translated	(4%)	540464	106933	7022	44
---	---	---	---	---	---	---
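This page does not state how the Word % column is derived. Dividing a category's word count by the total of 2992778 words and rounding up to a whole percent reproduces the percentages shown in the table, so the sketch below assumes that rule; the function name word_percent is hypothetical.

```python
TOTAL_WORDS = 2992778   # total words in the Hindi corpus, as given in II.3


def word_percent(words: int, total: int = TOTAL_WORDS) -> int:
    """Share of the corpus in whole percent, rounded up (assumed rule)."""
    return -(-100 * words // total)      # integer ceiling division


# Word counts copied from the table above.
rows = {"Aesthetics": 981158, "Social Sciences": 740838,
        "Commerce": 158454, "Translated": 106933}
for subject, words in rows.items():
    print(f"{subject}\t({word_percent(words)}%)")
```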

(f) Average Wordlength

Gives the average length of a word in the entire corpus.
Average = 4.6943651222362
That is, the average word length in the Hindi corpus is about 4.69.
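An average of this kind is the total length of all words divided by the number of words. The sketch below assumes that length is measured in characters and that every occurrence of a word is counted; neither detail is stated on this page, and the function name average_word_length is hypothetical.

```python
def average_word_length(words):
    """Mean number of characters per word, counting every occurrence."""
    if not words:
        return 0.0                       # avoid dividing by zero for an empty text
    return sum(len(word) for word in words) / len(words)


toy = "ke hE meM kI".split()             # made-up sample: lengths 2, 2, 3, 2
print(average_word_length(toy))
```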

II.4 Corpora Download

From this site you can:

View statistics about the different language corpora.
Search for usages of words and get the actual sentences in which they occur.

The corpora themselves (meaning the actual text files) are not available for download from this site.
To obtain them, contact: TDIL Group,
Ministry of Information Technology (Formerly Dept. of Electronics),
Govt. of India,
6 CGO Complex,
New Delhi.

Email : omvikas@doe.ernet.in
kumar@doe.ernet.in

Web : http://tdil.mit.gov.in/
