"Many users, particularly professionals, prefer Boolean query models.
Boolean queries are precise: a document either matches the query or it
does not. This offers the user greater control and transparency
over what is
retrieved. And some domains, such as legal materials, allow an effective
means of document ranking within a Boolean model: Westlaw returns
documents in reverse chronological order, which is in practice quite
effective. In 2007, the majority of law librarians still seem to
recommend terms and connectors for high recall searches, and the majority of
legal users think they are getting greater control by using them.
However, this does not mean that Boolean queries are more
effective for professional searchers. Indeed, experimenting on a Westlaw
subcollection, Turtle (1994)
found that free text queries produced better results than Boolean
queries prepared by Westlaw's own reference librarians for the majority
of the information needs in his experiments. A general problem with
Boolean search is that using AND operators tends to produce
high precision but low recall searches, while using OR
operators gives low precision but high recall searches, and it is
difficult or impossible to find a satisfactory middle ground."
Boolean query models let users express their queries simply and efficiently, but the model still has many limitations. So far, high precision and high recall cannot be achieved at the same time; a Boolean model or a vector space model can achieve only one of them. Users sometimes want high recall and sometimes want high precision, so they need to know how to choose a model appropriately. Professional searchers know the background of these models and can choose the proper one. Searchers who are not familiar with the models find the choice hard, and they may know how to use only one of them. Therefore, training users in retrieval models is very necessary.
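To make the AND/OR trade-off in the quoted passage concrete, here is a minimal sketch. The toy index, the document IDs, and the relevance judgments are all invented for illustration, and the function names are hypothetical; this is not Westlaw's system or data.

```python
# Toy inverted index: term -> set of document IDs (hypothetical data).
index = {
    "patent": {1, 2, 3, 4},
    "infringement": {2, 3, 5},
    "software": {3, 4, 6},
}
# Assumed relevance judgments for one information need.
relevant = {2, 3, 5}


def boolean_and(terms):
    """Return documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(terms):
    """Return documents that contain at least one query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.union(*postings) if postings else set()


def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a retrieved set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


query = ["patent", "infringement"]
print(precision_recall(boolean_and(query), relevant))  # AND: few results
print(precision_recall(boolean_or(query), relevant))   # OR: many results
```

On this toy data the AND query retrieves only documents 2 and 3 (every result is relevant, but document 5 is missed), while the OR query retrieves documents 1 to 5 (nothing relevant is missed, but two results are not relevant).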
Wednesday, January 29, 2014
Monday, January 27, 2014
WEEK 3 MUDDIEST POINT
Spelling correction might introduce errors when it encounters abbreviated terms. How can this kind of situation be avoided?
Monday, January 20, 2014
WEEK 3 READING NOTES
"Security is an important consideration for retrieval
systems in corporations.
A low-level employee should not be able to find
the salary roster of the
corporation, but authorized managers need to be able to
search for it.
Users' results lists must not contain documents they
are barred from opening; the very existence of a
document can be sensitive information.
User authorization is often mediated through access control lists or ACLs. ACLs can be dealt with in an information retrieval system by representing each document as the set of users that can access them (Figure 4.8 ) and then inverting the resulting user-document matrix. The inverted ACL index has, for each user, a ``postings list'' of documents they can access - the user's access list. Search results are then intersected with this list. However, such an index is difficult to maintain when access permissions change - we discussed these difficulties in the context of incremental indexing for regular postings lists in Section 4.5. It also requires the processing of very long postings lists for users with access to large document subsets. User membership is therefore often verified by retrieving access information directly from the file system at query time - even though this slows down retrieval. "
In any information-related system, security is always one of the most important parts, especially now that people care about information privacy much more than ever before. Protecting information while also improving retrieval efficiency has become a big challenge for today's information retrieval technology. In many management information systems, different staff members have different authorization to reach different information, so different information may be accessed with different frequency. It is difficult to define which information should be most accessible, and this makes it harder to develop a retrieval model that increases efficiency.
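The inverted ACL index described in the quoted passage can be sketched as follows. This is only an illustration under stated assumptions: the document names, user names, and helper functions are hypothetical, and a real system would have to cope with permission changes and very long access lists, as the passage notes.

```python
from collections import defaultdict

# Document -> set of users allowed to open it (hypothetical ACLs).
doc_acl = {
    "salary_roster": {"manager_a", "manager_b"},
    "handbook": {"manager_a", "manager_b", "employee_c"},
    "org_chart": {"manager_a", "employee_c"},
}


def invert_acl(doc_acl):
    """Invert the document-user mapping into user -> accessible documents."""
    access = defaultdict(set)
    for doc, users in doc_acl.items():
        for user in users:
            access[user].add(doc)
    return dict(access)


def filter_results(results, user, access):
    """Intersect ranked results with the user's access list, keeping rank order."""
    allowed = access.get(user, set())
    return [doc for doc in results if doc in allowed]


access = invert_acl(doc_acl)
ranked = ["salary_roster", "handbook", "org_chart"]
print(filter_results(ranked, "employee_c", access))  # ['handbook', 'org_chart']
print(filter_results(ranked, "manager_b", access))   # ['salary_roster', 'handbook']
```

Filtering before the results list is shown means the low-level employee never learns that the salary roster exists, which matches the requirement in the passage that even a document's existence can be sensitive.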
Monday, January 13, 2014
WEEK 2 MUDDIEST POINT
If the word "white" is the last word of web page 1 and the word "house" is the first word of the following web page 2, how can an IR system recognize "white house" as a single phrase rather than as two separate words?
Monday, January 6, 2014
WEEK 2 READING NOTES
"Sometimes, some extremely common words which would appear to be of little
value in helping select documents matching a user need
are excluded from the vocabulary entirely. These words are called
stop words . The general strategy for
determining a stop list is to sort the terms by collection frequency
(the total number of times each term appears in the document collection),
and then to
take the most frequent terms, often hand-filtered for their semantic
content relative to the domain of the documents being indexed, as a
stop list , the members of which are
then discarded during indexing. An example of a
stop list is shown in Figure 2.5 .
Using a stop list significantly reduces the number of postings that a
system has to store; we will present some statistics on this in
Chapter 5 (see Table 5.1 , page 5.1 ).
And a lot of the time not indexing stop words does little harm: keyword
searches with terms like the
and by don't seem very useful.
However, this is not true for phrase searches. The phrase query
``President of the United States'', which contains two stop words, is more
precise than President AND
``United States''. The meaning of flights to London is likely
to be lost if the word to is stopped out. A search for Vannevar
Bush's article As we may think will be difficult if the
first three words are stopped out, and the system searches simply for
documents containing the word think.
Some special query
types are disproportionately affected. Some song titles and well known
pieces of verse consist entirely of words that are commonly on stop lists
(To be or not to be, Let It Be,
I don't want to be, ...)."
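The stop-list strategy in the quoted passage (sort terms by collection frequency, then take the most frequent ones) can be sketched like this. The three-document collection, the cutoff `k`, and the function names are assumptions made for illustration; the hand-filtering step the passage mentions is left out.

```python
from collections import Counter

# Tiny hypothetical collection; real stop lists come from far larger corpora.
docs = [
    "the president of the united states",
    "flights to london and the fares to paris",
    "the article as we may think",
]


def build_stop_list(docs, k):
    """Return the k terms with the highest collection frequency."""
    cf = Counter(token for doc in docs for token in doc.split())
    return {term for term, _ in cf.most_common(k)}


def remove_stop_words(text, stop_list):
    """Drop stop words from a whitespace-tokenized text."""
    return [token for token in text.split() if token not in stop_list]


stop_list = build_stop_list(docs, 2)
print(stop_list)
print(remove_stop_words("flights to london", stop_list))  # ['flights', 'london']
```

In this toy collection the two most frequent terms are "the" and "to", so the query "flights to london" loses the word "to", which is exactly the loss of meaning the passage warns about for phrase searches.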
Improvements in segmentation have played an important role in the evolution of IR systems. Without a proper segmentation method, the results may not be relevant to the keywords at all. If a user searches for "Pittsburgh Steelers" on a system with poorly developed segmentation, the results may be about Pittsburgh the city and steeler the job. A system with well-developed segmentation can recognize that the phrase "Pittsburgh Steelers" means the football team. Recognizing the meaning of each word is not enough; the IR system must also understand the proper meanings of phrases.
For cross-language retrieval, however, segmentation needs much more development in order to handle many other languages. This is a big challenge because different languages may have different basic elements and different ways of expressing the same meaning.
Thus, the development of segmentation for the most widely used languages cannot stop. It plays one of the key roles in the retrieval area.
WEEK 1 MUDDIEST POINT
Because there are so many ways to express a single meaning, how can an IR system recognize those different terms as having the same meaning, so that it returns as many related results as possible?
WEEK 1 READING NOTES
"The first step is initiated by people who (anticipating our interest in building a search engine) we'll call "users," and their questions. We don't know a lot about these people, but we do know they are in a particular frame of mind, a special cognitive state - they may be aware of a specific gap in their knowledge (or they may be only vaguely puzzled), and they're motivated to fill it. They want to FOA ... some topic.
Supposing for a moment that we were there to ask, the users may not even be able to characterize the topic, i.e., their knowledge gap. More precisely, they may not be able to fully define characteristics of the "answer" they seek. A paradoxical feature of the FOA problem is that if users knew their question, they might not even need the search engine we are designing - forming a clearly posed question is often the hardest part of answering it. "
Search engines return a large number of related results, and users cannot look through all of them. Some users have had professional training and can pick out the useful information. Most have not, and they find it difficult to choose among so many results. In the end, a lack of professional training, or of the ability to judge information, becomes one of the barriers users face in getting information.
Proper training is therefore very necessary for users. Once they can express the information they really want as accurately as possible, the most useful results will surface through the IR system.
An IR system can also be a learning machine, as Google is, provided it is developed well enough. Acting as a learning machine, an IR system could come to identify many more terms in the future.