IS 2140 BLOG: January 2014

Wednesday, January 29, 2014

WEEK 4 READING NOTES

"Many users, particularly professionals, prefer Boolean query models. Boolean queries are precise: a document either matches the query or it does not. This offers the user greater control and transparency over what is retrieved. And some domains, such as legal materials, allow an effective means of document ranking within a Boolean model: Westlaw returns documents in reverse chronological order, which is in practice quite effective. In 2007, the majority of law librarians still seem to recommend terms and connectors for high recall searches, and the majority of legal users think they are getting greater control by using them. However, this does not mean that Boolean queries are more effective for professional searchers. Indeed, experimenting on a Westlaw subcollection, Turtle (1994) found that free text queries produced better results than Boolean queries prepared by Westlaw's own reference librarians for the majority of the information needs in his experiments. A general problem with Boolean search is that using AND operators tends to produce high precision but low recall searches, while using OR operators gives low precision but high recall searches, and it is difficult or impossible to find a satisfactory middle ground."

Boolean query models let users express their queries in a simple and efficient way, but the model still has many limitations. So far, high precision and high recall cannot be achieved at the same time. Both the Boolean model and the vector space model tend to favor one at the expense of the other. Sometimes users want high recall and sometimes they want high precision, so they need to know how to choose the right model. Professional searchers know the background of these models and can pick the appropriate one. Searchers who are not familiar with the models find it hard to choose, and they may only know how to use one of them. Therefore, training users in retrieval models is very necessary.
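
The AND/OR trade-off from the quoted passage can be illustrated with a small sketch. The collection, the relevance judgments, and the function names below are all made up for illustration; they do not come from the textbook or from Westlaw.

```python
# Toy collection: document ID -> set of terms (made-up data).
DOCS = {
    1: {"contract", "breach", "damages"},
    2: {"contract", "damages"},
    3: {"breach", "warranty"},
    4: {"contract", "tax"},
    5: {"tax", "audit"},
}
# Hypothetical relevance judgments for one information need.
RELEVANT = {1, 2, 3}


def boolean_and(terms):
    """Return IDs of documents that contain every query term."""
    return {d for d, words in DOCS.items() if all(t in words for t in terms)}


def boolean_or(terms):
    """Return IDs of documents that contain at least one query term."""
    return {d for d, words in DOCS.items() if any(t in words for t in terms)}


def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a retrieved set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


query = ["contract", "breach"]
for name, search in (("AND", boolean_and), ("OR", boolean_or)):
    retrieved = search(query)
    print(name, sorted(retrieved), precision_recall(retrieved, RELEVANT))
```

On this toy data the AND query returns only document 1 (every result is relevant, but two relevant documents are missed), while the OR query returns documents 1 to 4 (all relevant documents are found, along with one that is not relevant).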

Monday, January 27, 2014

WEEK 3 MUDDIEST POINT

Spelling correction might introduce errors when it comes to abbreviated terms. How can this kind of situation be avoided?

Monday, January 20, 2014

WEEK 3 READING NOTES

"Security is an important consideration for retrieval systems in corporations. A low-level employee should not be able to find the salary roster of the corporation, but authorized managers need to be able to search for it. Users' results lists must not contain documents they are barred from opening; the very existence of a document can be sensitive information.
User authorization is often mediated through access control lists or ACLs. ACLs can be dealt with in an information retrieval system by representing each document as the set of users that can access them (Figure 4.8) and then inverting the resulting user-document matrix. The inverted ACL index has, for each user, a 'postings list' of documents they can access - the user's access list. Search results are then intersected with this list. However, such an index is difficult to maintain when access permissions change - we discussed these difficulties in the context of incremental indexing for regular postings lists in Section 4.5. It also requires the processing of very long postings lists for users with access to large document subsets. User membership is therefore often verified by retrieving access information directly from the file system at query time - even though this slows down retrieval."

In any information system, security is always one of the most important parts, especially now that people care about information privacy much more than ever before. How to protect information while improving retrieval efficiency has become a big challenge for today's information retrieval technology. In many management information systems, different staff members have different authorization to access different information, so different pieces of information are accessed at different frequencies. It is difficult to determine which information will be accessed most often, which makes it harder to develop a retrieval model that improves efficiency.
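
The inverted ACL index described in the quoted passage can be sketched roughly as follows. The document names, user names, and function names are hypothetical, and a real system would also have to handle permission changes, which the passage notes is the hard part.

```python
from collections import defaultdict

# Hypothetical ACLs: document ID -> set of users allowed to open it.
DOC_ACL = {
    "salary_roster": {"alice"},
    "handbook": {"alice", "bob"},
    "lunch_menu": {"alice", "bob"},
}


def invert_acl(doc_acl):
    """Invert document -> users into user -> accessible documents."""
    access = defaultdict(set)
    for doc_id, users in doc_acl.items():
        for user in users:
            access[user].add(doc_id)
    return dict(access)


def filter_results(results, user, access_lists):
    """Intersect ranked results with the user's access list, keeping order."""
    allowed = access_lists.get(user, set())
    return [doc_id for doc_id in results if doc_id in allowed]


ACCESS = invert_acl(DOC_ACL)
ranked = ["salary_roster", "handbook"]  # made-up search results
print(filter_results(ranked, "bob", ACCESS))
print(filter_results(ranked, "alice", ACCESS))
```

Here the low-level user "bob" never sees "salary_roster" in the results list, not even its title, while "alice" sees both documents.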

Monday, January 13, 2014

WEEK 2 MUDDIEST POINT

If the word "white" is the last word of web page 1 and the word "house" is the first word of the following web page 2, how could an IR system identify "white house" as a single phrase rather than two separate words?
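
One way to think about this question is with a positional index, where each term stores its positions inside each page separately. The sketch below is only an illustration under that assumption (the pages and function names are made up). It shows that a phrase match is checked within one page at a time, so "white" ending page 1 and "house" starting page 2 would not be treated as the phrase.

```python
from collections import defaultdict

# Made-up pages: page ID -> text.
PAGES = {
    1: "the fence was painted white",
    2: "house prices rose",
    3: "the white house issued a statement",
}


def build_positional_index(pages):
    """Map term -> page ID -> list of word positions in that page."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for pos, term in enumerate(text.split()):
            index[term][page_id].append(pos)
    return index


def phrase_match(index, first, second):
    """Return pages where `second` occurs immediately after `first`."""
    matches = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Only pages that contain both words can contain the phrase.
    for page_id in first_postings.keys() & second_postings.keys():
        next_positions = set(second_postings[page_id])
        if any(pos + 1 in next_positions for pos in first_postings[page_id]):
            matches.add(page_id)
    return matches


INDEX = build_positional_index(PAGES)
print(phrase_match(INDEX, "white", "house"))
```

Only page 3 matches, because positions are compared inside a single page and never across two pages.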

Monday, January 6, 2014

WEEK 2 READING NOTES

"Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing. An example of a stop list is shown in Figure 2.5. Using a stop list significantly reduces the number of postings that a system has to store; we will present some statistics on this in Chapter 5 (see Table 5.1, page 5.1). And a lot of the time not indexing stop words does little harm: keyword searches with terms like the and by don't seem very useful. However, this is not true for phrase searches. The phrase query 'President of the United States', which contains two stop words, is more precise than President AND 'United States'. The meaning of flights to London is likely to be lost if the word to is stopped out. A search for Vannevar Bush's article As we may think will be difficult if the first three words are stopped out, and the system searches simply for documents containing the word think. Some special query types are disproportionately affected. Some song titles and well known pieces of verse consist entirely of words that are commonly on stop lists (To be or not to be, Let It Be, I don't want to be, ...)."

Improvements in segmentation have played an important role in the evolution of IR systems. A proper segmentation method is essential: without it, the results may not be relevant to the keywords at all. If a user searches for "Pittsburgh Steelers", a poorly developed segmentation method may return results about Pittsburgh the city and steeler the job, whereas a well-developed one recognizes that the phrase "Pittsburgh Steelers" means the football team. Recognizing the meaning of each word is not enough; the IR system must also understand the meaning of phrases.

However, for cross-language retrieval, segmentation needs much more development in order to work with many other languages. This is a big challenge because different languages may have different basic elements and different ways to express the same meaning.

Thus, the development of segmentation for the most widely used languages cannot stop. It plays one of the key roles in the retrieval area.
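
The stop list strategy from the passage quoted at the top of these notes (sort terms by collection frequency, then take the most frequent ones) can be sketched as below. This is a simplified illustration: the three documents and the function names are made up, and the hand-filtering step the passage mentions is skipped.

```python
from collections import Counter

# Made-up document collection.
DOCS = [
    "the president of the united states",
    "flights to london and flights to paris",
    "the history of the city of london",
]


def build_stop_list(docs, size):
    """Take the `size` most frequent terms by collection frequency."""
    counts = Counter(term for doc in docs for term in doc.split())
    return {term for term, _ in counts.most_common(size)}


def remove_stop_words(text, stop_list):
    """Drop stop words from a query or document, keeping word order."""
    return [term for term in text.split() if term not in stop_list]


STOP_LIST = build_stop_list(DOCS, 2)
print(STOP_LIST)
print(remove_stop_words("president of the united states", STOP_LIST))
```

With this collection the stop list is "the" and "of", so the phrase query "president of the united states" shrinks to "president", "united", "states", which is the loss of precision for phrase searches that the passage warns about.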

WEEK 1 MUDDIEST POINT

Because there are so many ways to express a single meaning, how could an IR system recognize those different terms as having the same meaning, so that it returns as many related results as possible?

WEEK 1 READING NOTES

"The first step is initiated by people who (anticipating our interest in building a search engine) we'll call "users," and their questions. We don't know a lot about these people, but we do know they are in a particular frame of mind, a special cognitive state - they may be aware of a specific gap in their knowledge (or they be only vaguely puzzled), and they're motivated to fill it. They want to FOA ... some topic.
Supposing for a moment that we were there to ask, the users may not even be able to characterize the topic, i.e., their knowledge gap. More precisely, they may not be able to fully define characteristics of the "answer" they seek. A paradoxical feature of the FOA problem is that if users knew their question, they might not even need the search engine we are designing - forming a clearly posed question is often the hardest part of answering it."

Search engines return lots of related results, and it is impossible for users to look through all of them. Some users have had professional training and can identify the useful information. But most have not, and they may find it difficult to choose among all this information. In the end, a lack of professional training, or of the ability to identify information, becomes one of the barriers users face in getting information.
Thus, proper training is very necessary for users. Once they can express what information they really want as accurately as possible, the most useful results will surface through the IR system.
Also, an IR system such as Google can be a learning machine, as long as it is developed well enough. Acting as a learning machine, an IR system could identify many more terms in the future.