"Sometimes, some extremely common words which would appear to be of little
value in helping select documents matching a user need
are excluded from the vocabulary entirely. These words are called
stop words. The general strategy for
determining a stop list is to sort the terms by collection frequency
(the total number of times each term appears in the document collection),
and then to
take the most frequent terms, often hand-filtered for their semantic
content relative to the domain of the documents being indexed, as a
stop list, the members of which are
then discarded during indexing. An example of a
stop list is shown in Figure 2.5.
Using a stop list significantly reduces the number of postings that a
system has to store; we will present some statistics on this in
Chapter 5 (see Table 5.1, page 5.1).
And a lot of the time not indexing stop words does little harm: keyword
searches with terms like 'the'
and 'by' don't seem very useful.
However, this is not true for phrase searches. The phrase query
'President of the United States', which contains two stop words, is more
precise than 'President' AND
'United States'. The meaning of 'flights to London' is likely
to be lost if the word 'to' is stopped out. A search for Vannevar
Bush's article 'As we may think' will be difficult if the
first three words are stopped out, and the system searches simply for
documents containing the word 'think'.
Some special query
types are disproportionately affected. Some song titles and well known
pieces of verse consist entirely of words that are commonly on stop lists
(To be or not to be, Let It Be,
I don't want to be, ...)."
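The strategy in the quoted passage can be sketched in a few lines of Python. This is only a minimal illustration under simple assumptions: whitespace tokenization, lowercasing, and no hand-filtering step. The function names `build_stop_list` and `remove_stop_words` and the tiny sample collection are made up for this example and do not come from the book.

```python
from collections import Counter


def build_stop_list(docs, k):
    """Return the k terms with the highest collection frequency.

    Collection frequency is the total number of times a term occurs
    across all documents. Hand-filtering is left out of this sketch.
    """
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return {term for term, _ in counts.most_common(k)}


def remove_stop_words(text, stop_list):
    """Drop every token of `text` that appears in `stop_list`."""
    return [t for t in text.lower().split() if t not in stop_list]


# Hypothetical three-document collection, for illustration only.
docs = [
    "the president of the united states",
    "the flights to london and the flights to paris",
    "to be or not to be",
]

stop_list = build_stop_list(docs, 2)  # 'the' and 'to' each occur 4 times
query = remove_stop_words("flights to London", stop_list)
```

With this collection the two most frequent terms are 'the' and 'to', so the query 'flights to London' shrinks to the tokens 'flights' and 'london', and the direction of travel is lost. A title made up entirely of stop-listed words would be reduced to an empty query.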
Improvements
in segmentation have played an important role in the evolution of IR
systems. Without a proper segmentation method, the results may not be relevant
to the query keywords at all. If a user searches for "Pittsburgh Steelers" on
a system with poorly developed segmentation, the results may be about
Pittsburgh the city and steeler the job. A well-developed segmentation method
can recognize that the phrase "Pittsburgh Steelers" refers to the football
team. Recognizing the meaning of each word is not enough; the IR system must
also understand the meaning of phrases.
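One simple way to picture phrase-aware segmentation is a greedy longest-match against a list of known phrases. The sketch below assumes such a phrase list is available; the function `segment`, its `max_len` parameter, and the one-entry lexicon are hypothetical and only illustrate the idea, not how any particular IR system does it.

```python
def segment(text, phrases, max_len=3):
    """Greedy longest-match segmentation over whitespace tokens.

    At each position, try the longest candidate first; a multi-word
    candidate is kept only if it is in `phrases`, otherwise fall back
    to the single token.
    """
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in phrases:
                out.append(candidate)
                i += n
                break
    return out


# Hypothetical phrase lexicon with a single entry.
phrases = {"pittsburgh steelers"}

team = segment("Pittsburgh Steelers tickets", phrases)
city = segment("Pittsburgh steel workers", phrases)
```

Here the first query keeps 'pittsburgh steelers' as one unit, so it can be matched as the team name, while the second query falls back to separate words.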
However,
for cross-language retrieval, segmentation needs much further development
in order to cope with many other languages. This is a big challenge because
different languages may have different basic elements and different ways to
express the same meaning.
Thus,
the development of segmentation for the most widely used languages cannot stop.
It plays a key role in the retrieval area.