[Tagdb] search keywords vs tags - automated tagging of docs

Enis Soztutar enis.soz.nutch at gmail.com
Fri Dec 22 08:20:06 GMT 2006


Hi all, continuing the discussion: my experience with both Nutch and 
tagging systems (especially Scuttle) has helped me understand the 
differences between them better.


First of all, when crawling and indexing web documents for keyword 
extraction, the indexer first tokenizes the text and then indexes every 
word in the document. Thus all the words, including those that can be 
"tags" and those that cannot, are indexed. As discussed, stop words are 
a good example of this. But apart from stop words, there are lots of 
keywords that are not representative of the content of the document. 
Moreover, not all the keywords or tags occur in the text of the site 
you are indexing. For example, on Honda's page (www.honda.com), the 
phrase "car manufacturer" does not exist. But links to Honda's page 
possibly contain that information. Thus a page is also indexed with the 
anchor text of the links pointing to it. Likewise, the tags of a page 
contain information about what the page is about, and may contain words 
that do not exist in the text of the page. Moreover, tokenization in 
tagging may be different: a page may be tagged with multi-word phrases 
(in some tagging systems), whereas keywords are single words.
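
As a rough illustration of the points above, here is a minimal sketch in 
Python. Everything in it is an assumption made for the example: the page 
text, the anchor text, the tags, the tiny stop word list, and the helper 
names tokenize and index_terms. It does not use or imitate the Nutch or 
Lucene APIs; it only shows how anchor text can add terms that the page 
itself lacks, and how a multi-word tag differs from single-word keywords.

```python
import re

# Assumed, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text and split it into single-word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(page_text, anchor_texts):
    """Collect page tokens plus inbound anchor tokens, minus stop words."""
    terms = set(tokenize(page_text))
    for anchor in anchor_texts:
        terms.update(tokenize(anchor))
    return terms - STOP_WORDS


# Hypothetical data: made-up page text, anchor text, and user tags.
page_text = "Welcome to Honda. Explore the new Civic and Accord."
anchor_texts = ["Honda, the car manufacturer"]
tags = {"car manufacturer", "honda", "automotive"}

page_only = set(tokenize(page_text))
terms = index_terms(page_text, anchor_texts)

# Terms that reach the index only through anchor text.
from_anchors_only = terms - page_only

# Tags that do not correspond to any single indexed term.
tags_not_indexed = {tag for tag in tags if tag not in terms}

print(sorted(from_anchors_only))
print(sorted(tags_not_indexed))
```

In this made-up case "manufacturer" enters the index only through the 
anchor text, and the tag "car manufacturer" has no single-term 
counterpart because the tokenizer splits it into two words.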

Another major difference comes from the quality of the web pages that 
are tagged. A major problem for most search engines (except the big 
ones) is dealing with spam sites. Although lots of spamming methods 
exist, the most frequent one is to list thousands of popular words on 
the site, so that the spam site appears related to most search queries. 
However, in my opinion, the web sites entered into a tagging system can 
be expected to be of higher quality. And I claim that the rank of a page 
may be estimated using the number of people who have bookmarked that 
page. Well, there are some search engines that bring together the idea 
of tagging with indexing, and we will see if they prosper.
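
A minimal sketch of that ranking idea, in Python, might look like the 
following. The record layout (user, url, tags), the sample bookmarks, and 
the function names are all assumptions for illustration; no particular 
tagging system stores or exposes its data this way. The score is simply 
the number of distinct users who bookmarked a URL.

```python
from collections import defaultdict


def bookmark_counts(bookmarks):
    """Count distinct users per URL from (user, url, tags) records."""
    users_by_url = defaultdict(set)
    for user, url, _tags in bookmarks:
        users_by_url[url].add(user)
    return {url: len(users) for url, users in users_by_url.items()}


def rank_by_bookmarks(bookmarks):
    """Order URLs by bookmark count, highest first; ties sorted by URL."""
    counts = bookmark_counts(bookmarks)
    return sorted(counts, key=lambda url: (-counts[url], url))


# Hypothetical bookmark records: (user, url, tags).
bookmarks = [
    ("alice", "http://www.honda.com", ["car manufacturer", "honda"]),
    ("bob", "http://www.honda.com", ["cars"]),
    ("bob", "http://example.org/spam", ["misc"]),
    ("alice", "http://www.honda.com", ["automotive"]),  # repeat by one user
    ("carol", "http://www.honda.com", ["honda"]),
]

counts = bookmark_counts(bookmarks)
ranking = rank_by_bookmarks(bookmarks)
print(ranking)
```

Counting distinct users rather than raw records means a single user 
bookmarking the same page repeatedly does not raise its score.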

Enis Soztutar

Nitin Borwankar wrote:
> Increasingly I have been getting interested in the vertical search space 
> and have been looking at nutch
> www.nutch.org built on top of Lucene the java text indexing/searching 
> library.
>
> A question arises in my mind when I look at tokenization and inverted 
> indexes etc... which are the bread and butter of IR and text search.....
>
> What is the fundamental difference between a set of search keywords as 
> typed into a search bar vs a set of tags by which I search for something 
> on del.icio.us ?
> It seems to me that if one were to throw out the obvious stop words 
> etc., then the set of keywords ( tokens ) that say Lucene generates for 
> a document are a good first order set of (system generated) tags for the 
> document.
>
> Any comments arguments one way or another ?
> This has major implications for automated tagging, so I am really 
> curious as to why this won't work.
>
> Nitin
>


