Thursday, June 7, 2007

Part 3 - Creating a Simple Search Engine

In part 2, I have discussed about the pre-processing tasks needed to perform on documents so that lexicon size can be reduced. Now, I want to discuss the informativeness of a document.

In reading an article in newspaper, magazine or journal, you may not sure how much level of informativeness the article you read carries. Depending on your judgement about the content of an article, you may want to skip to another article.

Same concept applies in creating a search engine. A search engine needs to decide the informativeness of a document depending on the contents contained in it so that a search engine can provide you with the most relevant and useful documents in ranking order. Then, how does a search engine determine a document carries much information compare to another document? It uses Information Theory (Entropy). Entropy is the amount of information related to symbol distribution in documents. Its formula is as follow:

Symbols may mean a character or a word in a document. Suppose that your document contains 001001011011 where it has six zeros and six 1s and we can derive its entropy value as:

E=− 6/12*log2 * 6/12− 6/12*log2 *6/12=1

(Where 12 is total no of characters in our document and the two 6: one for total no of zero and another for total no of 1.)

Getting 1 in entropy means that your document contain maximun informativeness and the symbols are randomly distributed in your document.

Or if your document contains 111111111111 where there is no variation except 1 and we can get its entropy value as:

E=− 0/12*log2*0/12−12/12*log2*12/12=0

If we have no different characters/words in our document, then we need to include zero case in our formula(eg. 0/12*log2*0/12). Resulting zero means that your document does not contain much information.

Note: if you don't understand mathematical symbols, you can check them at http://en.wikipedia.org/wiki/Table_of_mathematical_symbols.

Below is a present song from me for you.

Stay Tuned,
- Zaw Win Htike

No comments: