Monday, January 6, 2014

WEEK 2 READING NOTES

"Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing. An example of a stop list is shown in Figure 2.5. Using a stop list significantly reduces the number of postings that a system has to store; we will present some statistics on this in Chapter 5 (see Table 5.1, page 5.1). And a lot of the time not indexing stop words does little harm: keyword searches with terms like 'the' and 'by' don't seem very useful. However, this is not true for phrase searches. The phrase query 'President of the United States', which contains two stop words, is more precise than President AND 'United States'. The meaning of 'flights to London' is likely to be lost if the word 'to' is stopped out. A search for Vannevar Bush's article 'As we may think' will be difficult if the first three words are stopped out, and the system searches simply for documents containing the word 'think'. Some special query types are disproportionately affected. Some song titles and well known pieces of verse consist entirely of words that are commonly on stop lists ('To be or not to be', 'Let It Be', 'I don't want to be', ...)."
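
The strategy in the quoted passage (rank terms by collection frequency, take the top few as the stop list) can be sketched roughly as follows. This is only a minimal illustration, not the book's implementation: the function names `build_stop_list` and `remove_stop_words`, the toy documents, the whitespace tokenization, and the cutoff `k` are all my own assumptions, and the hand-filtering step the passage mentions is left out.

```python
from collections import Counter

def build_stop_list(docs, k):
    """Return the k terms with the highest collection frequency.

    Collection frequency = total number of occurrences of a term
    across all documents. Tokenization here is plain lowercase
    whitespace splitting (an assumption for this sketch).
    """
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(k)]

def remove_stop_words(text, stop_list):
    """Drop every token of text that appears in the stop list."""
    stops = set(stop_list)
    return [t for t in text.lower().split() if t not in stops]

# Toy collection, invented for illustration only.
docs = [
    "the president of the united states",
    "the flights to london",
    "the history of the city",
]

stop_list = build_stop_list(docs, 2)
print(stop_list)

# The phrase query loses the words that made it a phrase.
print(remove_stop_words("President of the United States", stop_list))

# With a larger, hand-written stop list, some queries vanish entirely.
print(remove_stop_words("To be or not to be", ["to", "be", "or", "not"]))
```

With this toy collection the two most frequent terms are "the" and "of", so the phrase query is reduced to three separate keywords, and a title made only of stop words leaves nothing to search for. That is the cost the passage describes.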



Better segmentation has played an important role in the evolution of IR systems, and a proper segmentation method matters: without it, the results may not be relevant to the keywords at all. If a user searches for "Pittsburgh Steelers" on a system with poorly developed segmentation, the results may be about Pittsburgh the city and steeler the job. A well-developed segmentation method can recognize that the phrase "Pittsburgh Steelers" means the football team. Recognizing the meaning of each word is not enough; the IR system must also understand the meaning of phrases.
However, for cross-language retrieval, segmentation needs much more development before it can handle many other languages. This is a big challenge because different languages may have different basic elements and different ways of expressing the same meaning.
Thus, work on segmentation for the most widely used languages cannot stop. It plays a key role in the retrieval area.
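
To make the "Pittsburgh Steelers" point concrete, here is a small sketch of phrase-aware segmentation using greedy longest match against a phrase list. It is only one simple way to do this, under my own assumptions: the function name `tokenize_with_phrases`, the `PHRASES` set, and the `max_len` limit are hypothetical, and a real system would need a much larger lexicon or a statistical method, especially for languages without spaces between words.

```python
def tokenize_with_phrases(text, phrases, max_len=3):
    """Split text into tokens, keeping known phrases together.

    At each position, try the longest window of words first and
    fall back to shorter ones; a single word is always accepted.
    """
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in phrases:
                tokens.append(candidate)
                i += n
                break
    return tokens

# Hypothetical phrase lexicon, for illustration only.
PHRASES = {"pittsburgh steelers", "united states"}

print(tokenize_with_phrases("Pittsburgh Steelers tickets", PHRASES))
print(tokenize_with_phrases("steelers in Pittsburgh", PHRASES))
```

In the first query the two words stay together as one token, so the index can match the football team. In the second they appear separately and are treated as ordinary words.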
