IS 2140 BLOG: March 2014

Thursday, March 27, 2014

WEEK 11 READING NOTES

"If the search engine maintains a dynamic index that allows updates (e.g., document insertions/deletions), then it may even be possible to carry out the updates in a distributed fashion, in which each node takes care of the updates that pertain to its part of the overall index. This approach eliminates the need for a complicated centralized index construction/maintenance process that involves the whole index. However, it is applicable only if documents may be assumed to be independent of each other, not if inter-document information, such as hyperlinks and anchor text, is part of the index."

Indexing documents is more complicated than we thought. There are many kinds of documents, and each kind may need to be indexed differently. Indexing methods therefore have to be considered from many angles in order to handle as many kinds of documents as possible, and different documents may call for different indexing methods. For instance, how to avoid broken links, how to extract words in foreign languages, etc.
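
To make the quoted idea of distributed updates more concrete, here is a minimal sketch of a document-partitioned index in which each node handles only the insertions and deletions for its own documents. The class names `IndexNode` and `PartitionedIndex`, the modulo rule for assigning documents to nodes, and the whitespace tokenizer are all assumptions for illustration, not something taken from the reading. It only works because each document is treated as independent, as the quote points out.

```python
from collections import defaultdict


class IndexNode:
    """One node holding the inverted index for its own share of documents."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> list of its terms

    def insert(self, doc_id, text):
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        terms = text.lower().split()
        self.docs[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        # Only this node's postings are touched; no other node is involved.
        for term in set(self.docs.pop(doc_id, [])):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        return set(self.postings.get(term.lower(), ()))


class PartitionedIndex:
    """Routes each update to the single node that owns the document."""

    def __init__(self, num_nodes):
        self.nodes = [IndexNode() for _ in range(num_nodes)]

    def _node_for(self, doc_id):
        # Hypothetical assignment rule: integer document id modulo node count.
        return self.nodes[doc_id % len(self.nodes)]

    def insert(self, doc_id, text):
        self._node_for(doc_id).insert(doc_id, text)

    def delete(self, doc_id):
        self._node_for(doc_id).delete(doc_id)

    def search(self, term):
        # A query is sent to every node and the partial results are merged.
        result = set()
        for node in self.nodes:
            result |= node.search(term)
        return result
```

For example, after `idx = PartitionedIndex(3)`, calling `idx.insert(1, "web search engine")` changes only one node, while `idx.search("web")` asks all three. If the index also stored hyperlinks or anchor text, deleting one document could affect entries owned by other nodes, which is the limitation the quote mentions.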

WEEK 10 MUDDIEST POINT

Is there any filter in web crawling?

Thursday, March 13, 2014

WEEK 10 READING NOTES

"The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page’s "PageRank", an objective measure of its citation importance that corresponds well with people’s subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps a great deal."

In the past, people may have thought that any page mentioning the keyword must be relevant to the user, so relevance was evaluated by word frequency. Hence, when search engines first appeared, their working mechanism was far from "artificial intelligence". Search engines before Google, such as AltaVista and Excite, were designed to rank information based on priority, and the ranking could be affected in many different ways. If a page had a huge number of visits or a high keyword frequency, it could be ranked highly even though it was hardly relevant to what the user needed. Obviously, this kind of ranking mechanism made cheating easy.
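
To see how a link-based score differs from counting keywords, here is a rough sketch of computing PageRank by repeated iteration over a small link graph. It is only an illustration under stated assumptions: the function name `pagerank`, the damping factor of 0.85, the fixed number of iterations, and the choice to spread the rank of pages without outgoing links evenly over all pages are mine, and the scores are normalized to sum to 1, which may differ from the exact formulation in the paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}."""
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the same score for every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small base score regardless of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score among the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links shares evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

With `links = {"A": ["C"], "B": ["C"], "C": ["A"]}`, page C ends up with the highest score because two pages link to it, and B ends up lowest because nothing links to it. Repeating a keyword on page B would not change this result, which is why this kind of ranking is harder to cheat than word frequency.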

WEEK 8 MUDDIEST POINT

How are suggested query terms calculated?