[Tagdb] search keywords vs tags - automated tagging of docs
ogjunk-tagdb at yahoo.com
Wed Dec 20 23:39:15 GMT 2006
SIPs are really something different from tags, but I guess you can use them here to illustrate the internal/external point. If you look at Simpy and upload some bookmarks, you'll see it, too, extracts some tags from uploaded data (did I just eat what I said earlier? I think so...). I suppose that's automated tagging as well. It helps with just what you described - here is my data, extract something from it, let me browse it and slice it in possibly interesting ways. So, yes, this is a quick way to go from no tags to some extrapolated tags.
The other stuff you are talking about is useful, too - having some set of well-known system-generated tags. For instance, one could look at the extensions of the files people bookmark (in Simpy's case) and group png, gif, jpg, jpeg, bmp, ico, etc. under some sort of 'image' tag.
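The extension grouping described above can be sketched roughly as follows. This is only an illustration in Python, not how Simpy does it: the EXTENSION_TAGS table, the 'video' group, and the system_tags_for function are all made-up names.

```python
from os.path import splitext
from urllib.parse import urlparse

# Hypothetical table: system-generated tag -> file extensions that imply it.
EXTENSION_TAGS = {
    "image": {"png", "gif", "jpg", "jpeg", "bmp", "ico"},
    "video": {"wmv", "mov", "mpg", "mpeg"},
}


def system_tags_for(url):
    """Return the system-generated tags implied by the URL's file extension."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = splitext(path)[1].lstrip(".").lower()
    return {tag for tag, exts in EXTENSION_TAGS.items() if ext in exts}
```

A bookmark for http://example.com/pics/cat.JPG would pick up 'image' this way, and one with an unknown or missing extension would pick up nothing.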
Here is one lovely URL (I know Erik H. loves sexy URLs). Various video formats and videos people think are funny:
http://www.simpy.com/links/search/ext%253A(wmv%2520OR%2520mov%2520OR%2520mpg%2520OR%2520mpeg)%2520AND%2520tags%253Afunny
Of course, one should go a step further than this geeky "ext", and call this "videos" or "funny videos" or some such, and hide nerdy extensions and fielded search.
Otis
----- Original Message ----
From: Nitin Borwankar <nitin at borwankar.com>
To: tagdb at lists.tagschema.com
Sent: Wednesday, December 20, 2006 4:45:27 PM
Subject: Re: [Tagdb] search keywords vs tags - automated tagging of docs
ogjunk-tagdb at yahoo.com wrote:
> Keywords - tags - no major difference. Stop words can be tricky :)
> Using tokens that Lucene ends up with for tags may be doable, but you'd really want significant terms, not just any token (term frequency in Lucene comes in handy here). However, automated tagging kind of defeats the purpose of tags. You want a human brain to produce those. Any machine can extract most frequent terms from a document, but those terms are not necessarily the best tags.
> See http://www.simpy.com/ - it's all built on top of Lucene, and tags are everywhere.
> Otis
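To make the "significant terms, not just any token" point concrete, here is a rough Python sketch that drops stop words and ranks the remaining tokens by a simple TF-IDF weight. It does not use Lucene and is not Lucene's scoring; the stop-word list, tokenizer, smoothing, and the suggest_tags name are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def suggest_tags(doc, corpus, k=3):
    """Rank the document's terms by TF-IDF against the corpus; return the top k."""
    tf = Counter(tokenize(doc))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n = len(doc_sets)

    def score(term):
        # Document frequency: how many corpus documents contain the term.
        df = sum(1 for s in doc_sets if term in s)
        return tf[term] * math.log((1 + n) / (1 + df))

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]
```

With a corpus in which "lucene" shows up in most documents, a term that is frequent inside one document can still rank below rarer terms, which is the difference between the most frequent tokens and the significant ones.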
OK, Otis,
Glad you brought that up because I wanted to set up the discussion for
what I call
"intrinsic tags" vs "extrinsic tags"
Intrinsic tags are like Amazon's SIPs - they are intrinsic
characteristics of the content.
Intrinsic tags are always a) derived from text in the document and
b) devoid of interpretation or implied meaning.
Extrinsic tags i.e. folksonomy tags, are a human description or
interpretation - so they have many layers of meaning
* tags I apply to a document could be "workflow tags", i.e. "save this for
later" or "send to Joe"
* or they could be "descriptors that have global meaning" - "adult
content"
* or they could be "descriptors that have group meaning" - "project X
needs this"
* or they could be "descriptors that have private meaning" - "summer
holiday"
If you wanted to bootstrap a large corpus of text into a folksonomy
context, automated tagging would get you in the game and at least allow
rapid navigation of the whole document space albeit *in a very crude way*.
But if you wait for the whole doc space to be manually tagged it could
take a long time or never happen.
So the question is: would this be a viable way to bootstrap large text
corpora into a folksonomy context, i.e. make them usable enough that I
can now find *roughly* what I am looking for and then apply my own tags
to it?
At all times it would be useful to distinguish between system-generated
(intrinsic) tags and user-generated (extrinsic) tags and allow
independent navigation over the separate tagging spaces as well as allow
navigation over the combined space.
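One way to picture the separate and combined tag spaces is the Python sketch below. It is only an illustration under assumed names (TagIndex, add, find, and the "intrinsic"/"extrinsic" source labels are hypothetical), not a description of any existing system.

```python
from collections import defaultdict


class TagIndex:
    """Keeps system-generated and user-generated tags in separate spaces."""

    SOURCES = ("intrinsic", "extrinsic")

    def __init__(self):
        # source -> tag -> set of document ids
        self._index = {s: defaultdict(set) for s in self.SOURCES}

    def add(self, doc_id, tag, source):
        """Record that doc_id carries tag in the given tag space."""
        if source not in self.SOURCES:
            raise ValueError(f"unknown tag source: {source}")
        self._index[source][tag].add(doc_id)

    def find(self, tag, source=None):
        """Docs carrying the tag in one space, or in both if source is None."""
        sources = self.SOURCES if source is None else (source,)
        found = set()
        for s in sources:
            found |= self._index[s].get(tag, set())
        return found
```

Calling find("lucene", "intrinsic") walks only the system-generated space, while find("lucene") walks the combined one, so the same tag string can be navigated either way.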
Thoughts ?
--
Nitin Borwankar
Find, Learn, Act .... Greener
http://greener.com
nitin at borwankar.com
510-872-7066
>----- Original Message ----
>From: Nitin Borwankar <nitin at borwankar.com>
>To: tagdb at lists.tagschema.com
>Sent: Wednesday, December 20, 2006 1:10:02 PM
>Subject: [Tagdb] search keywords vs tags - automated tagging of docs
>
>Increasingly I have been getting interested in the vertical search space
>and have been looking at Nutch (www.nutch.org), built on top of Lucene,
>the Java text indexing/searching library.
>
>A question arises in my mind when I look at tokenization and inverted
>indexes etc... which are the bread and butter of IR and text search.....
>
>What is the fundamental difference between a set of search keywords as
>typed into a search bar vs a set of tags by which I search for something
>on del.icio.us ?
>It seems to me that if one were to throw out the obvious stop words
>etc., then the set of keywords (tokens) that, say, Lucene generates for
>a document are a good first-order set of (system-generated) tags for the
>document.
>
>Any comments or arguments one way or another?
>This has major implications for automated tagging, so I am really
>curious as to why this won't work.
>
>Nitin
_______________________________________________
Tagdb mailing list
Tagdb at lists.tagschema.com
http://lists.tagschema.com/mailman/listinfo/tagdb