| PDF versions: Thesis (3.4 MB), Appendix 1 (81K), Appendix 2 (24K), Appendix 3 (59K)
|
Abstract
This thesis describes a new technique
for summarising the information found in Web pages in a coherent snippet.
This technique relies on two main assumptions. First, people describe other
Web pages in their own Web space. Second, people link to the Web pages
they describe with an anchor that is clearly marked in HTML markup. In this
thesis, we identified four different anchor-paragraph arrangements with which
people refer to other Web pages. We named each arrangement and explained
its function. Based on our findings, we designed an extraction tool, called
SnipIt. SnipIt uses one of the four arrangements we identified to extract
descriptions of Web pages.
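
The idea of pairing an anchor with the paragraph around it can be illustrated
with a minimal sketch. It is not SnipIt itself, and the arrangement it handles
(a link placed inside the paragraph that describes its target) is an assumption
chosen for illustration, not necessarily one of the four arrangements named in
the thesis. The class and function names are hypothetical; only Python's
standard html.parser module is used.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorParagraphExtractor(HTMLParser):
    """Hypothetical helper: map each link target to the text of the
    paragraph (<p> element) that contains the link."""

    def __init__(self):
        super().__init__()
        self.descriptions = defaultdict(list)  # href -> paragraph texts
        self._in_paragraph = False
        self._text = []   # text fragments of the current paragraph
        self._hrefs = []  # link targets seen in the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly closes any paragraph still open.
            self._flush()
            self._in_paragraph = True
        elif tag == "a" and self._in_paragraph:
            href = dict(attrs).get("href")
            if href:
                self._hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_paragraph:
            self._text.append(data)

    def _flush(self):
        # Normalise whitespace, then record the paragraph for every link in it.
        text = " ".join("".join(self._text).split())
        if text:
            for href in self._hrefs:
                self.descriptions[href].append(text)
        self._in_paragraph = False
        self._text = []
        self._hrefs = []

    def close(self):
        super().close()
        self._flush()  # handle a final paragraph that was never closed


def extract_descriptions(html):
    """Return {link target: [candidate description, ...]} for one page."""
    parser = AnchorParagraphExtractor()
    parser.feed(html)
    parser.close()
    return dict(parser.descriptions)
```

Paragraphs without a link yield nothing, and a page described by several
authors would accumulate several candidate descriptions under the same target
once the results from many pages are merged.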
According to recent tests performed
by the commercial search engine Google, SnipIt is estimated to cover about
5% of all the pages found on the Web, with approximately 4-5 descriptions
per page. This estimate marks a great advance over the state of the art
in Web page description collections (directories such as Yahoo! and the Open
Directory Project), which together currently cover less than 0.5% of the
Web while employing tens of thousands of human editors to maintain their collections.
In comparison, SnipIt requires no manual editing, and is able to describe
at least 30 times as many Web sites.
After identifying, defining, and testing
anchor-paragraph layout arrangements for best implementing SnipIt, we designed
an application that makes use of the extracted snippets. This application
is called InCommonSense, and it is a mechanism for producing short coherent
snippets to describe Web search results. InCommonSense chooses the best description
out of the snippets found by SnipIt. InCommonSense is based on experiments
conducted with 746 users who rated descriptions for their quality. Their
preferences and choices were then used as training and test data in a machine learning
process. The rules derived from this process were hard-coded into InCommonSense.
The output of InCommonSense, descriptions
for search engine results, was rigorously evaluated against the current
output of commercial search engines. This was done by way of an online experiment
with over 1000 participants. In terms of ease of interaction, our evaluation
shows that InCommonSense is superior to the output of the commercial search
engines tested.
|