[Tagdb] search keywords vs tags - automated tagging of docs
ogjunk-tagdb at yahoo.com
Wed Dec 20 21:31:31 GMT 2006
Keywords - tags - no major difference. Stop words can be tricky :)
Using tokens that Lucene ends up with for tags may be doable, but you'd really want significant terms, not just any token (term frequency in Lucene comes in handy here). However, automated tagging kind of defeats the purpose of tags. You want a human brain to produce those. Any machine can extract most frequent terms from a document, but those terms are not necessarily the best tags.
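To make the "significant terms, not just any token" point concrete, here is a rough sketch in plain Python (not Lucene code). It assumes a TF-IDF style score as one possible measure of significance; the stop-word list, the function names and the smoothing formula are all illustrative choices, not anything Lucene or Simpy is known to do.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for", "with"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def most_frequent_terms(doc, k=3):
    """Rank terms by raw frequency within the document only."""
    counts = Counter(tokenize(doc))
    return sorted(counts, key=lambda t: (-counts[t], t))[:k]

def significant_terms(doc, corpus, k=3):
    """Rank terms of `doc` by term frequency times a smoothed inverse
    document frequency computed over `corpus` (assumed to contain `doc`)."""
    tf = Counter(tokenize(doc))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n = len(corpus)
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for s in doc_sets if term in s)
        # A term that occurs in every document gets a score of zero.
        scores[term] = freq * math.log((1 + n) / (1 + df))
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

corpus = [
    "search search search lucene index",
    "search engine crawler",
    "search tags bookmarks",
]
print(most_frequent_terms(corpus[0], k=1))
print(significant_terms(corpus[0], corpus, k=2))
```

In this toy corpus the most frequent term of the first document is "search", but because it occurs in every document it scores zero and drops below "index" and "lucene". Even so, a high score only says a term is distinctive, not that a person would have picked it as a tag.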
See http://www.simpy.com/ - it's all built on top of Lucene, and tags are everywhere.
Otis
----- Original Message ----
From: Nitin Borwankar <nitin at borwankar.com>
To: tagdb at lists.tagschema.com
Sent: Wednesday, December 20, 2006 1:10:02 PM
Subject: [Tagdb] search keywords vs tags - automated tagging of docs
Increasingly I have been getting interested in the vertical search space
and have been looking at Nutch (www.nutch.org), which is built on top of
Lucene, the Java text indexing/searching library.
A question arises in my mind when I look at tokenization and inverted
indexes etc... which are the bread and butter of IR and text search.....
What is the fundamental difference between a set of search keywords as
typed into a search bar vs a set of tags by which I search for something
on del.icio.us ?
It seems to me that if one were to throw out the obvious stop words
etc., then the set of keywords (tokens) that, say, Lucene generates for
a document is a good first-order set of (system-generated) tags for the
document.
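As a minimal illustration of that idea, the sketch below uses plain Python rather than Lucene's analyzers. The stop-word list and the function name candidate_tags are made up for the example, and real analyzers do considerably more (stemming, language-specific rules).

```python
import re

# Hypothetical stop-word list, kept short for the example.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for", "with"}

def candidate_tags(text, stop_words=STOP_WORDS):
    """Return the distinct non-stop-word tokens of `text`,
    in order of first appearance."""
    seen = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token not in stop_words and token not in seen:
            seen.append(token)
    return seen

print(candidate_tags("The index of the Lucene index"))
```

Every surviving token becomes a candidate tag here, so a long document would yield a very long tag set.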
Any comments or arguments one way or another?
This has major implications for automated tagging, so I am really
curious as to why this won't work.
Nitin
--
Nitin Borwankar
Find, Learn, Act .... Greener
http://greener.com
nitin at borwankar.com
510-872-7066
_______________________________________________
Tagdb mailing list
Tagdb at lists.tagschema.com
http://lists.tagschema.com/mailman/listinfo/tagdb