Sunday, June 17, 2007

Part 4 - Creating a Simple Search Engine

I would like to discuss how words and vocabulary are distributed inside a document. To describe the nature of this distribution, we can apply the theories below:

(1) Zipf's Law
(2) Skewed distribution
(3) Heaps' Law

Zipf's Law states that the frequency of any word is roughly inversely proportional to its rank in the frequency table. So, the most frequent word will occur approximately twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word, and so on. More generally, the i-th most frequent word appears 1/i to the power theta times as often as the most frequent word; the "twice as often" pattern is the case where theta is 1. The formula for a corpus with n words and a vocabulary of v distinct words is as follows.
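One common way to write this, assuming the per-rank counts are normalized so that they sum to n (the symbols f_i and H_v(theta) are names introduced here for illustration, not taken from the original post):

```latex
f_i = \frac{n}{i^{\theta} \, H_v(\theta)},
\qquad
H_v(\theta) = \sum_{j=1}^{v} \frac{1}{j^{\theta}}
```

Here f_i is the expected number of occurrences of the i-th most frequent word, and H_v(theta) is the normalizing constant. Dividing f_i by f_1 gives 1/i to the power theta, which matches the description above.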



For example, in the Brown Corpus "the" is the most frequently occurring word, and all by itself accounts for nearly 7% of all word occurrences (69971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36411 occurrences), followed by "and" (28852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
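As a rough check of the numbers above, the sketch below compares the three quoted Brown Corpus counts with what a pure 1/rank rule (theta = 1) would predict from the count of "the". The variable names are illustrative, and only the three counts quoted in the text are used.

```python
# Counts quoted above for the three most frequent Brown Corpus words.
brown_counts = [("the", 69971), ("of", 36411), ("and", 28852)]

top_count = brown_counts[0][1]
ratios = []  # observed ratio of the top word's count to each word's count

for rank, (word, count) in enumerate(brown_counts, start=1):
    predicted = top_count / rank   # Zipf prediction with theta = 1
    ratio = top_count / count      # an ideal Zipf fit would give ratio == rank
    ratios.append(ratio)
    print(f"rank {rank} {word!r}: observed {count}, "
          f"predicted {predicted:.0f}, ratio to top {ratio:.2f}")
```

The ratio for "of" comes out close to 2, as the law suggests, while the ratio for "and" falls short of 3, which is why the law is described as holding only roughly.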

See the image below for the distribution of words in a document. You will see that words like "a", "an", and "the" are at the highest point, whereas the word "Zaw", which is considered informative, is nearly at the zero point.



Example Zipf's Law calculation where vocabulary = 1000, theta = 2.0, and words = 500.
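A minimal sketch of such a calculation, assuming the counts are normalized so that they sum to the total number of words (the function name zipf_frequencies and its parameter names are hypothetical, chosen here for illustration):

```python
def zipf_frequencies(n_words, vocab_size, theta):
    """Return the expected occurrence count for each rank 1..vocab_size."""
    # Normalizing constant: sum of 1 / j**theta over every rank in the vocabulary.
    harmonic = sum(1.0 / (j ** theta) for j in range(1, vocab_size + 1))
    # Expected count of the i-th most frequent word.
    return [n_words / ((i ** theta) * harmonic)
            for i in range(1, vocab_size + 1)]


# Parameters from the example above: vocabulary = 1000, theta = 2.0, words = 500.
freqs = zipf_frequencies(n_words=500, vocab_size=1000, theta=2.0)

for rank in (1, 2, 3, 10, 100):
    print(f"rank {rank:>3}: {freqs[rank - 1]:.3f} expected occurrences")
```

With theta = 2.0 the curve drops off sharply: the top-ranked word is expected about four times as often as the second, and the counts for high ranks are close to zero.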




Below is a song for you



Stay tuned,
- Zaw Win Htike

1 comment:

Anonymous said...

Thanks..

one thing. your video is no longer available...