[Tagdb] search keywords vs tags - automated tagging of docs

ogjunk-tagdb at yahoo.com
Wed Dec 20 21:31:31 GMT 2006


Keywords vs. tags: no major difference.  Stop words can be tricky :)
Using the tokens that Lucene ends up with as tags may be doable, but you'd really want significant terms, not just any token (the term frequency that Lucene tracks comes in handy here).  However, automated tagging kind of defeats the purpose of tags: you want a human brain to produce those.  Any machine can extract the most frequent terms from a document, but those terms are not necessarily the best tags.
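
To make the "significant terms, not just any token" point concrete, here is a rough Python sketch (plain standard library, not Lucene's API) that ranks a document's terms by TF-IDF instead of raw frequency. The stop-word list, the function names and the tiny corpus are all made up for illustration, and TF-IDF is just one common way to approximate "significant".

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def candidate_tags(corpus, index, k=3):
    """Rank the terms of corpus[index] by TF-IDF and return the top k."""
    tokenized = [tokenize(doc) for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency within the one document we want tags for.
    tf = Counter(tokenized[index])
    # A term that occurs in every document gets idf = log(1) = 0.
    scores = {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    "lucene index search lucene query search",
    "python search web framework search",
    "cooking recipes search pasta pasta",
]
print(candidate_tags(docs, 0, k=2))
```

In this toy corpus "search" is as frequent as "lucene" in the first document, but because it shows up in every document it scores zero and drops out of the candidates. Whether the survivors are actually good tags is still a human judgment.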

See http://www.simpy.com/ - it's all built on top of Lucene, and tags are everywhere.

Otis

----- Original Message ----
From: Nitin Borwankar <nitin at borwankar.com>
To: tagdb at lists.tagschema.com
Sent: Wednesday, December 20, 2006 1:10:02 PM
Subject: [Tagdb] search keywords vs tags - automated tagging of docs

Increasingly I have been getting interested in the vertical search space 
and have been looking at Nutch (www.nutch.org), built on top of Lucene, 
the Java text indexing/searching library.

A question arises in my mind when I look at tokenization, inverted 
indexes, etc., which are the bread and butter of IR and text search.

What is the fundamental difference between a set of search keywords as 
typed into a search bar vs. a set of tags by which I search for something 
on del.icio.us?
It seems to me that if one were to throw out the obvious stop words 
etc., then the set of keywords (tokens) that, say, Lucene generates for 
a document is a good first-order set of (system-generated) tags for the 
document.
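
As a minimal sketch of that first-order idea (in plain Python, not Lucene's actual analyzer chain; the stop-word list and the function name are hypothetical), the system-generated tags would simply be the distinct non-stop-word tokens of the document:

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def naive_tags(text):
    """Return the distinct non-stop-word tokens of `text`, in first-seen order."""
    tags = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token not in STOP_WORDS and token not in tags:
            tags.append(token)
    return tags


print(naive_tags("Nutch is built on top of Lucene, the Java text indexing library"))
```

Every surviving token becomes a tag here, with no ranking at all, so a long document would get a very long tag list.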

Any comments or arguments one way or another?
This has major implications for automated tagging, so I am really 
curious as to why this won't work.

Nitin

 
-- 
Nitin Borwankar
Find, Learn, Act .... Greener
http://greener.com
nitin at borwankar.com
510-872-7066
