Friday, April 4, 2014

WEEK 11 MUDDIEST POINT

Is there any way to solve the translation problem other than using the Google Translate or Bing Translator APIs?

Thursday, March 27, 2014

WEEK 11 READING NOTES

"If the search engine maintains a dynamic index that allows updates (e.g., document insertions/deletions), then it may even be possible to carry out the updates in a distributed fashion, in which each node takes care of the updates that pertain to its part of the overall index. This approach eliminates the need for a complicated centralized index construction/maintenance process that involves the whole index. However, it is applicable only if documents may be assumed to be independent of each other, not if inter-document information, such as hyperlinks and anchor text, is part of the index."

Indexing documents is more complicated than it first appears. There are many kinds of documents, and the index for each kind differs from the others. Indexing methods therefore need to be considered from many angles in order to meet the requirements of as many document types as possible. In addition, different document types call for different indexing techniques, for instance, how to avoid broken links, or how to extract foreign-language words.
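The per-node update idea from the quoted passage can be sketched in a few lines. This is a minimal toy, not the book's implementation; the class and method names are my own, and real systems use on-disk postings, merging, and compression.

```python
from collections import defaultdict

class NodeIndex:
    """One node's partial inverted index; the node handles insertions and
    deletions only for its own documents, so no centralized rebuild is needed
    (the distributed-update idea from the quoted passage)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids on this node
        self.docs = {}                    # doc id -> stored tokens

    def insert(self, doc_id, text):
        tokens = text.lower().split()
        self.docs[doc_id] = tokens
        for term in tokens:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        # Remove the document from every postings list it appears in.
        for term in self.docs.pop(doc_id, []):
            self.postings[term].discard(doc_id)

# Each node updates only its own shard of the overall index.
node = NodeIndex()
node.insert("d1", "broken link checker")
node.insert("d2", "foreign language extraction")
node.delete("d1")
print(node.postings["language"])  # -> {'d2'}
print(node.postings["link"])      # -> set()  (d1's terms were removed)
```

Note how this matches the caveat in the quote: the scheme works because each document's postings are independent; anchor text pointing from one node's document to another's would break the clean partition.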

WEEK 10 MUDDIEST POINT

Are there any filters applied during web crawling?

Thursday, March 13, 2014

WEEK 10 READING NOTES

"The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps a great deal."

In the past, people assumed that any page mentioning a keyword must be relevant to the user, so relevance was evaluated mainly by word frequency. In the first age of search engines, their working mechanism was therefore far from "artificial intelligence". Pre-Google search engines such as AltaVista and Excite ranked results in ways that could be influenced from many directions: if a page received many visits or repeated the keywords frequently, it could be ranked highly even when it was barely relevant to the user's actual need. Obviously, this kind of ranking mechanism could be cheated easily.
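The citation-based idea in the quote can be illustrated with a tiny power-iteration PageRank. This is only a sketch of the algorithm's core recurrence, with a toy three-page graph; real engines operate on hundreds of millions of links with sparse-matrix machinery.

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank on a tiny graph.
    `links` maps each page to its list of outlinks; d is the damping factor."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}  # teleportation share
        for p, outs in links.items():
            targets = outs if outs else pages  # a dangling page spreads evenly
            share = rank[p] / len(targets)
            for q in targets:
                new[q] += d * share
        rank = new
    return rank

# Page "b" is cited by both other pages, so it earns the highest rank,
# even though nothing about its text was considered.
links = {"a": ["b"], "b": ["c"], "c": ["b"]}
scores = pagerank(links)
print(max(scores, key=scores.get))  # -> b
```

This is exactly what defeats the frequency-based cheating described above: stuffing keywords into a page does not change who links to it.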

WEEK 8 MUDDIEST POINT

How are suggested query terms calculated?

Tuesday, February 25, 2014

WEEK 8 READING NOTES

"In user-centered design, decisions are made based on responses obtained from target users of the system. (This is in contrast with standard software practice in which the designers assume they know what users need, and so write the code first and assess it with users later.) In user-centered design, first a needs assessment is performed in which the designers investigate who the users are, what their goals are, and what tasks they have to complete in order to achieve those goals. The next stage is a task analysis in which the designers characterize which steps the users need to take to complete their tasks, decide which user goals they will attempt to support, and then create scenarios which exemplify these tasks being executed by the target user population (Kuniavsky, 2003, Mayhew, 1999). "

The first step in user-centered design is to define the target user group. Once the target users are understood, analysts can work out their requirements and then help developers find proper ways to design functions that meet them. The target users and the design goals are the fundamental factors of user-centered design, so it is necessary to concentrate on this part; without that concentration, the product might be useless or unsatisfying for users, which is not what we want to see. Analyzing the requirements is an important step too, and it requires analysts to be experienced and to work efficiently. By combining the target user group with requirement analysis, the design goal becomes clear enough for designers to carry out further work.

WEEK 7 MUDDIEST POINT

How can the usability of relevance feedback be evaluated?

Thursday, February 20, 2014

WEEK 7 READING NOTES

"This work inspires several future directions. First, we can study a more principled way to model multiple negative models and use these multiple negative models to conduct constrained query expansion, for example, avoiding terms which are in negative models. Second, we are interested in a learning framework which can utilize both a little positive information (original queries) and a certain amount of negative information to learn a ranking function to help difficult queries. Third, queries are difficult due to different reasons. Identifying these reasons and customizing negative feedback strategies would be much worth studying."

Collecting feedback is a very important step in evaluation. From feedback, developers get ideas directly from users, and those ideas help the development team focus on users' actual requirements rather than on areas developers thought important but users did not. Positive feedback tends to highlight the advantages of the system, but negative feedback tells analysts more about users' real opinions. That is why a model based on negative feedback can be more useful than one based only on positive feedback. By aggregating all the negative feedback through a negative model, developers get an overview of the system's weaknesses. Thus, further development of negative feedback models is needed.
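A classic, simpler way to use negative feedback (not the paper's specific negative models, but related in spirit) is Rocchio query modification: push the query vector toward relevant documents and away from non-relevant ones. A small sketch over term-weight dictionaries, with made-up example weights:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio feedback: new query = alpha*q + beta*centroid(relevant)
    - gamma*centroid(nonrelevant). Inputs are dicts of term -> weight."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_q[t] = max(w, 0.0)  # negative weights are conventionally clipped
    return new_q

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.8, "car": 0.6}]       # user marked a car page relevant
nonrel = [{"jaguar": 0.5, "cat": 0.9}]    # and an animal page non-relevant
updated = rocchio(q, rel, nonrel)
print(updated["car"], updated["cat"])  # -> 0.45 0.0
```

The `gamma` term is the negative-feedback piece: terms that appear mainly in non-relevant documents ("cat") are suppressed, which echoes the quoted paper's idea of avoiding terms that fall in negative models during expansion.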

WEEK 6 MUDDIEST POINT

Given that precision and recall cannot both be maximized at the same time, how do we decide which one to focus on for a specific retrieval task?
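One standard way to express that choice numerically is the F-beta measure: beta > 1 weights recall more heavily, beta < 1 weights precision. A small worked sketch with a toy result set:

```python
def precision_recall_fbeta(retrieved, relevant, beta=1.0):
    """Precision, recall, and F-beta over two sets of document ids.
    beta encodes the trade-off: raise it when missing a relevant document
    is costly (e.g. legal search), lower it when wading through junk is."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r)
    return p, r, f

# 2 of the 4 retrieved docs are relevant; 2 of the 5 relevant docs were found.
p, r, f1 = precision_recall_fbeta(["d1", "d2", "d3", "d4"],
                                  ["d1", "d2", "d5", "d6", "d7"])
print(p, r)  # -> 0.5 0.4
```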

Friday, February 14, 2014

WEEK 6 READING NOTES

"This sort of thing is extremely hard. But I do not believe that we should therefore not attempt to do it or argue, in a supposedly more principled manner, that setups within which modern retrieval systems have to operate are so diffuse, or so variegated, that it is a fundamental mistake to address anything but the immediate D * Q * R environment from which solid, transportable, general-purpose retrieval system knowhow can be acquired. In fact, indeed, TREC's newer tracks subvert both of these arguments: even if the lawyers' interpretation of "relevant" as referencing might be inferable from assessment data samples, one feels rather less confident about being able to infer, even with the best modern machine learning tools, that the name of the retrieval game is getting information that "appears reasonably calculated to lead to the discovery of admissible evidence"."

The development of TREC has faced several challenges so far. Some people may think it is not necessary to spend so much time and money on the related research, but it is unfair to think this way: what TREC brings us goes far beyond our imagination. The TREC tracks have accomplished several significant tasks in the information retrieval field. TREC helps developers handle huge amounts of data more efficiently and accurately, and its file standard makes heterogeneous data more standardized and more recognizable. One of the biggest challenges is that converting huge datasets into TREC format is hard; it can cost too much time and machine power to reach the goal. Thus, the further development of TREC is absolutely needed, and the research is definitely valuable.

Monday, February 10, 2014

WEEK 5 MUDDIEST POINT

If the probability of a query term in a document is very close to 1, the document is very likely a useless document. Is there a proper probability range that returns good results while avoiding this kind of extreme situation?

Wednesday, February 5, 2014

WEEK 5 READING NOTES

"The model decomposes into two parts: a document collection network and a query network. The document collection network is large, but can be precomputed: it maps from documents to terms to concepts. The concepts are a thesaurus-based expansion of the terms appearing in the document. The query network is relatively small but a new network needs to be built each time a query comes in, and then attached to the document network. The query network maps from query terms, to query subexpressions (built using probabilistic or ``noisy'' versions of AND and OR operators), to the user's information need. "

This kind of network is very useful in the IR field: it helps the system gather related information starting from the words. Once the network is constructed, the whole collection can be retrieved against it. However, attaching a newly built query network for every incoming query may create problems of storage and speed for the system. The network combines the Boolean model with probabilistic information retrieval, but its usage is not yet practical enough. Since the evaluation of retrieval models never stops, the construction of such networks could be based on newly developed models, not only on previous ones, and the potential problems also need to be noticed by developers.
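The "noisy" AND and OR operators mentioned in the quote have a simple probabilistic form, assuming independent inputs. A minimal sketch (illustrative only; the actual inference network propagates beliefs through many layers of nodes):

```python
def noisy_or(probs):
    """Probabilistic ('noisy') OR: the node fires unless every input fails.
    P(or) = 1 - prod(1 - p_i), assuming independence."""
    fail = 1.0
    for p in probs:
        fail *= (1.0 - p)
    return 1.0 - fail

def noisy_and(probs):
    """Probabilistic AND: all inputs must hold. P(and) = prod(p_i)."""
    hold = 1.0
    for p in probs:
        hold *= p
    return hold

# Two weak pieces of term evidence combine into stronger OR belief
# than either gives alone, while AND remains appropriately strict.
print(noisy_or([0.3, 0.4]))   # -> 0.58
print(noisy_and([0.3, 0.4]))  # -> 0.12
```

This is how the query network can soften strict Boolean semantics: a document partially matching an OR of query terms still receives graded, nonzero belief.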

Monday, February 3, 2014

WEEK 4 MUDDIEST POINT

While calculating the weights of terms in documents, how do we identify the intended meaning of words that have several senses, since the different senses may deserve different weights (for example, "apple")? Are there any standards for this kind of disambiguation?

Wednesday, January 29, 2014

WEEK 4 READING NOTES

"Many users, particularly professionals, prefer Boolean query models. Boolean queries are precise: a document either matches the query or it does not. This offers the user greater control and transparency over what is retrieved. And some domains, such as legal materials, allow an effective means of document ranking within a Boolean model: Westlaw returns documents in reverse chronological order, which is in practice quite effective. In 2007, the majority of law librarians still seem to recommend terms and connectors for high recall searches, and the majority of legal users think they are getting greater control by using them. However, this does not mean that Boolean queries are more effective for professional searchers. Indeed, experimenting on a Westlaw subcollection, Turtle (1994) found that free text queries produced better results than Boolean queries prepared by Westlaw's own reference librarians for the majority of the information needs in his experiments. A general problem with Boolean search is that using AND operators tends to produce high precision but low recall searches, while using OR operators gives low precision but high recall searches, and it is difficult or impossible to find a satisfactory middle ground."

Boolean query models allow users to state their queries in an efficient way, but the model still has many limitations. High precision and high recall cannot yet be achieved at the same time, and the Boolean model and the vector space model each tend to favor one of them. Users sometimes want high recall and sometimes high precision, so they need to know how to choose a model appropriately. Professional searchers know the background of these models and can choose the proper one, but searchers who are not familiar with them find it hard to make that choice and may only know how to use one of them. Therefore, training users in retrieval models is very necessary.
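The AND/OR trade-off the reading describes is easy to see concretely with set operations over a toy postings map (hypothetical documents and terms):

```python
# Toy postings: term -> set of documents containing it.
postings = {
    "boolean":   {"d1", "d2", "d3"},
    "retrieval": {"d2", "d3", "d4", "d5"},
}

def and_query(a, b):
    """Intersection: fewer, tighter results (high precision, low recall)."""
    return postings[a] & postings[b]

def or_query(a, b):
    """Union: many loose results (high recall, low precision)."""
    return postings[a] | postings[b]

print(sorted(and_query("boolean", "retrieval")))  # -> ['d2', 'd3']
print(sorted(or_query("boolean", "retrieval")))   # -> ['d1', 'd2', 'd3', 'd4', 'd5']
```

The "satisfactory middle ground" the quote says is hard to find is exactly what ranked models provide: instead of the all-or-nothing sets above, documents are scored by how well they match.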

Monday, January 27, 2014

WEEK 3 MUDDIEST POINT

Spelling check may cause errors when it encounters abbreviated terms; how can this kind of situation be avoided?

Monday, January 20, 2014

WEEK 3 READING NOTES

"Security is an important consideration for retrieval systems in corporations. A low-level employee should not be able to find the salary roster of the corporation, but authorized managers need to be able to search for it. Users' results lists must not contain documents they are barred from opening; the very existence of a document can be sensitive information.
User authorization is often mediated through access control lists or ACLs. ACLs can be dealt with in an information retrieval system by representing each document as the set of users that can access them (Figure 4.8 ) and then inverting the resulting user-document matrix. The inverted ACL index has, for each user, a ``postings list'' of documents they can access - the user's access list. Search results are then intersected with this list. However, such an index is difficult to maintain when access permissions change - we discussed these difficulties in the context of incremental indexing for regular postings lists in Section 4.5. It also requires the processing of very long postings lists for users with access to large document subsets. User membership is therefore often verified by retrieving access information directly from the file system at query time - even though this slows down retrieval. "

In any kind of information-related system, security is always one of the most important parts, especially now that people emphasize information privacy more than ever before. How to protect information and improve retrieval efficiency at the same time has become a big challenge for today's information retrieval technology. In many management information systems, different staff members have different authorization to access different information, so different pieces of information have different access frequencies. It is difficult to define which information should be most accessible, and that makes it harder to develop a proper retrieval model that increases efficiency.
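The result-filtering step from the quote, where search results are intersected with the user's access list from the inverted ACL index, can be sketched in a few lines. The users, documents, and function name here are hypothetical:

```python
# Hypothetical inverted ACL index: user -> set of documents they may access
# (the per-user "postings list" described in the quoted passage).
access_list = {
    "manager":  {"d1", "d2", "d3"},
    "employee": {"d3"},
}

def secure_search(user, results):
    """Intersect raw search results with the user's access list, preserving
    rank order, so barred documents never even appear in the results list
    (their very existence can be sensitive information)."""
    allowed = access_list.get(user, set())
    return [d for d in results if d in allowed]

raw = ["d1", "d2", "d3"]  # suppose d1 is the salary roster
print(secure_search("employee", raw))  # -> ['d3']
print(secure_search("manager", raw))   # -> ['d1', 'd2', 'd3']
```

As the reading notes, the hard part is not this intersection but keeping `access_list` current when permissions change, which is why real systems often fall back to checking the file system at query time.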

Monday, January 13, 2014

WEEK 2 MUDDIEST POINT

If the word "white" is the last word of web page 1 and the word "house" is the first word of the consecutive web page 2, how could an IR system identify the phrase "white house" as a whole rather than as separate words?

Monday, January 6, 2014

WEEK 2 READING NOTES

"Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words . The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list , the members of which are then discarded during indexing. An example of a stop list is shown in Figure 2.5 . Using a stop list significantly reduces the number of postings that a system has to store; we will present some statistics on this in Chapter 5 (see Table 5.1 , page 5.1 ). And a lot of the time not indexing stop words does little harm: keyword searches with terms like the and by don't seem very useful. However, this is not true for phrase searches. The phrase query ``President of the United States'', which contains two stop words, is more precise than President AND ``United States''. The meaning of flights to London is likely to be lost if the word to is stopped out. A search for Vannevar Bush's article As we may think will be difficult if the first three words are stopped out, and the system searches simply for documents containing the word think. Some special query types are disproportionately affected. Some song titles and well known pieces of verse consist entirely of words that are commonly on stop lists (To be or not to be, Let It Be, I don't want to be, ...)."



Improvements in segmentation have played an important role in the evolution of IR systems. A proper segmentation method is essential; without it, the results may be entirely irrelevant to the keywords. If a user searches for "Pittsburgh Steelers", a poorly developed segmenter might return results about Pittsburgh the city and "steeler" as an occupation, while a well-developed one recognizes that the phrase "Pittsburgh Steelers" names the football team. Recognizing the meaning of each word is not enough; the IR system must also understand the proper meanings of phrases.
However, for cross-language retrieval, segmentation needs considerable further development to cover many other languages. This is a big challenge because different languages have different basic units and different ways of expressing the same meaning.
Thus, the development of segmentation for the most widely used languages cannot stop; it plays a key role in the retrieval area.
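The stop-word pitfall from the quoted reading is easy to reproduce. A tiny illustrative stop list (not any system's real one):

```python
STOP_WORDS = {"the", "of", "to", "a", "in", "and"}  # tiny illustrative list

def remove_stop_words(tokens):
    """Drop stop words at indexing time: postings shrink substantially,
    but phrase queries can lose crucial words."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("flights to London".split()))
# -> ['flights', 'London']  (the crucial 'to' is lost, as the reading warns)
print(remove_stop_words("President of the United States".split()))
# -> ['President', 'United', 'States']
```

A phrase like "To be or not to be" would be stopped out entirely, which is exactly why modern systems tend to index everything and handle common words with compression and weighting instead.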

WEEK 1 MUDDIEST POINT

Because there are so many ways to express a single meaning, how can an IR system identify those different terms as carrying the same meaning, in order to return as many related results as possible?

WEEK 1 READING NOTES

"The first step is initiated by people who (anticipating our interest in building a search engine) we'll call "users," and their questions. We don't know a lot about these people, but we do know they are in a particular frame of mind, a special cognitive state - they may be aware of a specific gap in their knowledge (or they may be only vaguely puzzled), and they're motivated to fill it. They want to FOA ... some topic.
Supposing for a moment that we were there to ask, the users may not even be able to characterize the topic, i.e., their knowledge gap. More precisely, they may not be able to fully define characteristics of the "answer" they seek. A paradoxical feature of the FOA problem is that if users knew their question, they might not even need the search engine we are designing - forming a clearly posed question is often the hardest part of answering it."



Search engines return lots of related results, and it is impossible for users to look through all of them. Some users have had professional training and can identify useful information, but most have not, and they may find it difficult to choose among all this information. In the end, lacking professional training, or lacking the ability to evaluate information, becomes one of the barriers users face in getting information.
Thus, proper training is very necessary for users. Once they can express what information they really want as accurately as possible, the most useful results will surface through the IR system.
Also, an IR system can itself act as a learning machine, as Google does, if it is developed well enough. By learning, the IR system could identify many more terms in the future.