[Tagdb] search keywords vs tags - automated tagging of docs
Enis Soztutar
enis.soz.nutch at gmail.com
Fri Dec 22 08:20:06 GMT 2006
Hi all, continuing the discussion, my experience with both nutch and
tagging systems (especially scuttle) has helped me understand the
differences between them better.
First of all, when crawling and indexing web documents for keyword
extraction, the indexer indexes all the words in the document by first
tokenizing the text. Thus all the words, including those that can be
"tags" and those that cannot, are indexed. As discussed, stop words are
a good example of this. But apart from stop words there are lots of
keywords that are not representative of the content of the document.
Moreover, not all the keywords or tags occur in the text of the site
you are indexing. For example, on Honda's page (www.honda.com), the
phrase "car manufacturer" does not exist. But links to Honda's page
possibly contain that information. Thus a page is also indexed with the
anchor text of the links pointing to it. Likewise, tags of a page
contain info about what the page is about, and may contain words that
do not exist in the text of the page. Moreover, tokenization in tagging
may be different. A page may be tagged with multi-word phrases (in some
tagging systems), whereas keywords are single words.
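
The difference between index terms and tags can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the stop word list, the sample page text, the anchor text and the tags
are all made up, and the tokenizer is a plain regular expression, not
what Lucene or Nutch actually use.

```python
import re

# Tiny illustrative stop word list (an assumption, not Lucene's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}

def tokenize(text):
    """Lowercase the text and split it into single-word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_terms(page_text, anchor_texts):
    """Terms an indexer might keep: page and anchor tokens minus stop words."""
    tokens = tokenize(page_text)
    for anchor in anchor_texts:
        tokens.extend(tokenize(anchor))
    return {t for t in tokens if t not in STOP_WORDS}

def tags_missing_from_page(page_text, tags):
    """Tags (possibly multi-word) whose words do not all occur in the page."""
    page_tokens = set(tokenize(page_text))
    return [tag for tag in tags if not set(tokenize(tag)) <= page_tokens]

# Hypothetical sample data.
page = "Welcome to Honda. Explore the new Civic and Accord."
anchors = ["Honda, the car manufacturer"]
tags = ["car manufacturer", "honda", "japan"]

terms = index_terms(page, anchors)
missing = tags_missing_from_page(page, tags)
```

Here "manufacturer" ends up in the index terms only because of the
anchor text, and the tags "car manufacturer" and "japan" describe the
page without occurring in its own text.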
Another major difference comes from the quality of the web pages that
are tagged. The major problem of most search engines (except the big
ones) is dealing with spam sites. Although lots of spamming methods
exist, the most frequent one is to list thousands of popular words on
the site. Thus the spam site becomes related to most search queries.
However, in my opinion, the web sites entered into a tagging system are
expected to be of higher quality. And I claim that the rank of a page
may be estimated using the number of people who have bookmarked that
page.
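
A minimal sketch of that bookmark-count idea, assuming the tagging
system can export (user, url) pairs. The function names and the sample
data are hypothetical, and counting distinct users is just the simplest
possible estimate, not a tested ranking method.

```python
from collections import defaultdict

def bookmark_scores(bookmarks):
    """Map each URL to the number of distinct users who bookmarked it."""
    users_by_url = defaultdict(set)
    for user, url in bookmarks:
        users_by_url[url].add(user)
    return {url: len(users) for url, users in users_by_url.items()}

def rank_pages(bookmarks):
    """URLs ordered from most to least bookmarked (ties broken by URL)."""
    scores = bookmark_scores(bookmarks)
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical sample data.
bookmarks = [
    ("alice", "http://www.honda.com"),
    ("bob", "http://www.honda.com"),
    ("alice", "http://www.honda.com"),  # same user again, counted once
    ("carol", "http://example.org/spam"),
]
scores = bookmark_scores(bookmarks)
ranking = rank_pages(bookmarks)
```

Counting each user only once per page keeps a single account from
inflating a page's score by bookmarking it repeatedly.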
Well, there are some search engines that bring together the idea of
tagging with indexing and we will see if they prosper.
Enis Soztutar
Nitin Borwankar wrote:
> Increasingly I have been getting interested in the vertical search space
> and have been looking at nutch
> www.nutch.org built on top of Lucene the java text indexing/searching
> library.
>
> A question arises in my mind when I look at tokenization and inverted
> indexes etc... which are the bread and butter of IR and text search.....
>
> What is the fundamental difference between a set of search keywords as
> typed into a search bar vs a set of tags by which I search for something
> on del.icio.us ?
> It seems to me that if one were to throw out the obvious stop words
> etc., then the set of keywords ( tokens ) that say Lucene generates for
> a document are a good first order set of (system generated) tags for the
> document.
>
> Any comments arguments one way or another ?
> This has major implications for automated tagging, so I am really
> curious as to why this won't work.
>
> Nitin
>
>
>