What Lays in the Layout:
Using anchor-paragraph arrangements to extract descriptions of Web documents

Einat Amitay

PDF versions:
Thesis (3.4 MB)
Appendix 1 (81 KB)
Appendix 2 (24 KB)
Appendix 3 (59 KB)

Abstract
This thesis describes a new technique for summarising the information found in Web pages as a coherent snippet. The technique relies on two main assumptions: first, that people describe Web pages in their own Web space; and second, that they link to the pages they describe with an anchor that is clearly marked in HTML. In this thesis we identified four different anchor-paragraph arrangements with which people refer to other Web pages, named each arrangement, and explained its function. Based on our findings, we designed an extraction tool, called SnipIt, which uses one of the four arrangements to extract descriptions of Web pages.
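
To make the extraction idea concrete, the following is a minimal sketch (not the thesis's SnipIt implementation) of how one such arrangement, an anchor embedded in the paragraph that describes the linked page, could be harvested. The function name, the exact-URL matching, and the restriction to <p> parents are assumptions made only for this illustration.

    # A minimal sketch, assuming one of the four arrangements: an anchor
    # to the target page sitting inside the paragraph that describes it.
    # Illustrative only; this is not the SnipIt code from the thesis.
    from bs4 import BeautifulSoup

    def candidate_snippets(html, target_url):
        """Return the text of every <p> that contains a link to target_url."""
        soup = BeautifulSoup(html, "html.parser")
        snippets = []
        for anchor in soup.find_all("a", href=True):
            if anchor["href"].rstrip("/") != target_url.rstrip("/"):
                continue
            paragraph = anchor.find_parent("p")   # the anchor-paragraph unit
            if paragraph is not None:
                snippets.append(" ".join(paragraph.get_text().split()))
        return snippets

    # Example: a page that links to example.org and describes it in the same paragraph.
    page = ('<p>For background reading see <a href="http://example.org">this '
            'survey</a>, which covers the topic in depth.</p>')
    print(candidate_snippets(page, "http://example.org"))

In this sketch the text of the surrounding paragraph becomes a candidate description of the linked page.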
According to recent tests performed with the commercial search engine Google, SnipIt is estimated to cover about 5% of all the pages found on the Web, with approximately 4-5 descriptions per page. This marks a substantial advance over the state of the art in Web page description collections: directories such as Yahoo! and the Open Directory Project together currently cover less than 0.5% of the Web, while employing tens of thousands of human editors to maintain their collections. In comparison, SnipIt requires no manual editing and is able to describe at least 30 times as many Web sites.

After identifying, defining, and testing the anchor-paragraph layout arrangements best suited to implementing SnipIt, we designed an application that makes use of the extracted snippets. This application, called InCommonSense, is a mechanism for producing short, coherent snippets that describe Web search results. InCommonSense chooses the best description out of the snippets found by SnipIt. Its selection rules are based on experiments with 746 users who rated descriptions for their quality; a machine learning process was trained and tested on their preferences and choices, and the rules derived from this process were hard-coded into InCommonSense.
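
As an illustration only, the selection step could look like the sketch below. The features and thresholds shown (snippet length, mention of the page's title, sentence-final punctuation) and the scoring scheme are hypothetical assumptions, not the rules actually derived from the 746-user study.

    # A hypothetical sketch of hard-coded selection rules. The real rules in
    # InCommonSense were derived from user ratings via machine learning; the
    # features and weights below are assumptions made only for illustration.
    def score_snippet(snippet, target_title):
        words = snippet.split()
        score = 0.0
        if 10 <= len(words) <= 40:                    # assumed: mid-length preferred
            score += 2.0
        if target_title.lower() in snippet.lower():   # assumed: names the page
            score += 1.0
        if snippet.endswith((".", "!", "?")):         # assumed: complete sentence
            score += 1.0
        return score

    def best_description(snippets, target_title):
        """Pick the highest-scoring candidate, or None if there are none."""
        return max(snippets, key=lambda s: score_snippet(s, target_title), default=None)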

The output of InCommonSense, descriptions for search engine results, was rigorously evaluated against the current output of commercial search engines in an online experiment with over 1000 participants. In terms of ease of interaction, our evaluation shows that InCommonSense is superior to the output of the commercial search engines tested.