[Tagdb] search keywords vs tags - automated tagging of docs
Nitin Borwankar
nitin at borwankar.com
Wed Dec 20 21:45:27 GMT 2006
ogjunk-tagdb at yahoo.com wrote:
Keywords - tags - no major difference. Stop words can be tricky :)
Using tokens that Lucene ends up with for tags may be doable, but you'd really want significant terms, not just any token (term frequency in Lucene comes in handy here). However, automated tagging kind of defeats the purpose of tags. You want a human brain to produce those. Any machine can extract most frequent terms from a document, but those terms are not necessarily the best tags.
See http://www.simpy.com/ - it's all built on top of Lucene, and tags are everywhere.
Otis
OK, Otis,
Glad you brought that up, because I wanted to set up the discussion of
what I call "intrinsic tags" vs "extrinsic tags".
Intrinsic tags are like Amazon's SIPs (Statistically Improbable
Phrases) - they are intrinsic characteristics of the content.
Intrinsic tags are always a) derived from text in the document and
b) devoid of interpretation or implied meaning.
Extrinsic tags, i.e. folksonomy tags, are a human description or
interpretation - so they can carry many layers of meaning:
* tags I apply to a document could be "workflow tags", e.g. "save this for
later" or "send to Joe"
* or they could be "descriptors that have global meaning" - "adult
content"
* or they could be "descriptors that have group meaning" - "project X
needs this"
* or they could be "descriptors that have private meaning" - "summer
holiday"
If you wanted to bootstrap a large corpus of text into a folksonomy
context, automated tagging would get you in the game and at least allow
rapid navigation of the whole document space albeit *in a very crude way*.
But if you wait for the whole doc space to be manually tagged it could
take a long time or never happen.
So the question is: would this be a viable way to bootstrap large text
corpora into a folksonomy context, i.e. make them usable enough that I
can find *roughly* what I am looking for and then apply my own tags
to it?
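
To make the bootstrapping idea concrete, here is a rough sketch of how
system-generated tags might be picked by term significance rather than
raw frequency. It uses plain TF-IDF in Python rather than Lucene itself;
the stop-word list, the function names and the top_n cutoff are all my
own assumptions for illustration, not anything Lucene or Nutch provides.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def intrinsic_tags(docs, top_n=3):
    """Return {doc_id: [top_n terms ranked by TF-IDF]} for {doc_id: text}."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in tokens.values():
        df.update(set(terms))

    n_docs = len(docs)
    tags = {}
    for doc_id, terms in tokens.items():
        tf = Counter(terms)
        # Terms that occur in every document score zero and sink to the bottom.
        scores = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        tags[doc_id] = ranked[:top_n]
    return tags


if __name__ == "__main__":
    corpus = {
        "d1": "lucene index lucene search",
        "d2": "tags folksonomy tags search",
    }
    print(intrinsic_tags(corpus, top_n=2))
```

The point of weighting by document frequency is that a term common to
the whole corpus ("search" above) is a poor tag even when it is frequent
within one document. The result is still crude, which is why I see it
only as a starting point for human tagging.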
At all times it would be useful to distinguish between system-generated
(intrinsic) tags and user-generated (extrinsic) tags, and to allow
independent navigation over the separate tagging spaces as well as
navigation over the combined space.
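
One possible way to keep the two tagging spaces separate yet navigable
together is sketched below. The class name TagIndex and its methods are
hypothetical, invented here for illustration; this is not an existing
library or a proposed schema, just a minimal in-memory version of the
idea.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical index that stores intrinsic and extrinsic tags apart."""

    def __init__(self):
        # space name -> tag -> set of document ids
        self._spaces = {
            "intrinsic": defaultdict(set),  # system-generated
            "extrinsic": defaultdict(set),  # user-generated
        }

    def add(self, space, tag, doc_id):
        """Attach a tag to a document within one tagging space."""
        if space not in self._spaces:
            raise ValueError("unknown tag space: %r" % space)
        self._spaces[space][tag].add(doc_id)

    def find(self, tag, space=None):
        """Return documents carrying the tag in one space, or in either
        space (the combined view) when space is None."""
        names = [space] if space is not None else list(self._spaces)
        result = set()
        for name in names:
            result |= self._spaces[name].get(tag, set())
        return result


if __name__ == "__main__":
    index = TagIndex()
    index.add("intrinsic", "lucene", "doc1")
    index.add("extrinsic", "lucene", "doc2")
    index.add("extrinsic", "send to Joe", "doc1")
    print(index.find("lucene", "intrinsic"))  # intrinsic space only
    print(index.find("lucene"))               # combined space
```

Because each tag records which space it came from, a user can browse
machine tags alone, human tags alone, or the union of the two.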
Thoughts?
--
Nitin Borwankar
Find, Learn, Act .... Greener
http://greener.com
nitin at borwankar.com
510-872-7066
>----- Original Message ----
>From: Nitin Borwankar <nitin at borwankar.com>
>To: tagdb at lists.tagschema.com
>Sent: Wednesday, December 20, 2006 1:10:02 PM
>Subject: [Tagdb] search keywords vs tags - automated tagging of docs
>
>Increasingly I have been getting interested in the vertical search space
>and have been looking at Nutch (www.nutch.org), which is built on top of
>Lucene, the Java text indexing/searching library.
>
>A question arises in my mind when I look at tokenization and inverted
>indexes etc... which are the bread and butter of IR and text search.....
>
>What is the fundamental difference between a set of search keywords as
>typed into a search bar vs a set of tags by which I search for something
>on del.icio.us ?
>It seems to me that if one were to throw out the obvious stop words
>etc., then the set of keywords (tokens) that, say, Lucene generates for
>a document is a good first-order set of (system-generated) tags for the
>document.
>
>Any comments or arguments one way or another?
>This has major implications for automated tagging, so I am really
>curious as to why this won't work.
>
>Nitin