[Tagdb] search keywords vs tags - automated tagging of docs

Nitin Borwankar nitin at borwankar.com
Thu Dec 21 00:49:53 GMT 2006


Gordon Mohr wrote:

>Auto-extracting notable excerpts from a document as 'tags' recalls the 
>shortcuts taken back when full-text inverted-indexes were too expensive.
>
>Of course, today's tech can handle such 'reduced' forms of documents (by 
>auto-extraction or manual tag/labelling) in full-text search engines 
>very easily.
>
>Does this call into question whether any relational schema is needed for 
>tag systems at all?
>  
>

This is the very question that I have been mulling ever since I started 
playing with Lucene and Nutch.

So far my brain has thrown up two different answers :-

A relational DB is not needed if all you want is to find docs based on 
tags and to do boolean ops on docs based on tags - apparently Lucene 
does this kind of thing very efficiently.
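
To make the first case concrete, here is a minimal sketch in plain Python. 
It is not Lucene and does not use its API; it only shows the inverted-index 
idea with boolean ops as set operations. The sample docs, tags and function 
names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: document id -> set of tags attached to it.
DOC_TAGS = {
    "doc1": {"lucene", "search", "java"},
    "doc2": {"lucene", "tagging"},
    "doc3": {"folksonomy", "tagging"},
}


def build_index(doc_tags):
    """Build a tiny inverted index: tag -> set of doc ids carrying it."""
    index = defaultdict(set)
    for doc_id, tags in doc_tags.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index


def docs_with_all(index, *tags):
    """Boolean AND: docs that carry every one of the given tags."""
    sets = [index.get(tag, set()) for tag in tags]
    return set.intersection(*sets) if sets else set()


def docs_with_any(index, *tags):
    """Boolean OR: docs that carry at least one of the given tags."""
    return set().union(*(index.get(tag, set()) for tag in tags))


def docs_without(index, all_docs, tag):
    """Boolean NOT: docs that do not carry the given tag."""
    return set(all_docs) - index.get(tag, set())


INDEX = build_index(DOC_TAGS)
```

For example, `docs_with_all(INDEX, "lucene", "tagging")` picks out the docs 
tagged with both terms, and no join or schema is involved anywhere.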

A relational DB is needed if you also want to join these tagged items to 
user tables and to do aggregate operations on tag instances - e.g. the 
number of users who used tag T, or how many instances of tag S were used 
by user U, or by users U and V, and so on.
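
A rough sketch of the second case, using Python's built-in sqlite3 module. 
The two-table schema (users, taggings), the column names and the sample rows 
are assumptions made up for illustration, not a proposed tag schema.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with a hypothetical users/taggings schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (
            user_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL
        );
        CREATE TABLE taggings (
            user_id INTEGER NOT NULL REFERENCES users(user_id),
            doc_id  TEXT NOT NULL,
            tag     TEXT NOT NULL
        );
        """
    )
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "u"), (2, "v")])
    conn.executemany(
        "INSERT INTO taggings VALUES (?, ?, ?)",
        [
            (1, "doc1", "lucene"),
            (1, "doc2", "lucene"),
            (1, "doc2", "tagging"),
            (2, "doc2", "lucene"),
            (2, "doc3", "tagging"),
        ],
    )
    return conn


def users_who_used(conn, tag):
    """Number of distinct users who used the given tag."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM taggings WHERE tag = ?", (tag,)
    ).fetchone()
    return row[0]


def tag_counts_per_user(conn, tag):
    """Instances of the given tag per user name (join against users)."""
    rows = conn.execute(
        """
        SELECT u.name, COUNT(*)
        FROM taggings AS t
        JOIN users AS u ON u.user_id = t.user_id
        WHERE t.tag = ?
        GROUP BY u.name
        """,
        (tag,),
    ).fetchall()
    return dict(rows)
```

The join against the users table and the GROUP BY are the parts that a pure 
text index does not give you directly.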

So it depends on whether the application is pure text search or not - 
typically in the folksonomy situation we also have the dratted user ;-)

Nitin Borwankar.




>Just treat every tagging-event (human or automated) as a fielded text 
>document and text-index. The degenerate schema, placing all tags in one 
>internally-delimited column, doesn't deserve the bad rap it sometimes 
>gets, if in fact full-text inverted-indexes are the usual way to query.
>
>- Gordon @ Bitzi
>
>Nitin Borwankar wrote:
>  
>
>>OK, Otis,
>>
>>Glad you brought that up because I wanted to set up the discussion for 
>>what I call
>>"intrinsic tags" vs "extrinsic tags"
>>
>>Intrinsic tags are like Amazon's SIPs (Statistically Improbable 
>>Phrases) - they are intrinsic characteristics of the content.
>>Intrinsic tags are always a) derived from text in the document and 
>>b) devoid of interpretation or implied meaning.
>>
>>Extrinsic tags i.e. folksonomy tags, are a human description or 
>>interpretation - so they have many layers of meaning
>>
>>* tags I apply to a document could be "workflow tags", i.e. "save this 
>>for later" or "send to Joe"
>>* or they could be "descriptors that have global meaning"  - "adult 
>>content"
>>* or they could be "descriptors that have group meaning" - "project X 
>>needs this"
>>* or they could be "descriptors that have private meaning" - "summer 
>>holiday"
>>
>>
>>If you wanted to bootstrap a large corpus of text into a folksonomy 
>>context, automated tagging would get you in the game and at least allow 
>>rapid navigation of the whole document space, albeit *in a very crude way*.
>>But if you wait for the whole doc space to be manually tagged, it could 
>>take a long time or never happen.
>>So the question is: would this be a viable way to bootstrap large text 
>>corpora into a folksonomy context, i.e. make them usable enough that I 
>>can now find *roughly* what I am looking for and then apply my own tags 
>>to it?
>>
>>At all times it would be useful to distinguish between system-generated 
>>(intrinsic) tags and user-generated (extrinsic) tags, and to allow 
>>independent navigation over the separate tagging spaces as well as 
>>navigation over the combined space.
>>
>>Thoughts ?
>>
>>    
>>
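
One way to picture the intrinsic/extrinsic split from the quoted message 
above is to record each tagging event with a source field and keep one 
index per tag space plus a combined one. This is only a sketch: the 
TaggingEvent fields, the tagger names and the sample events are all 
hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class TaggingEvent(NamedTuple):
    """One tagging event, human or automated (hypothetical fields)."""
    doc_id: str
    tag: str
    source: str  # "intrinsic" (system-generated) or "extrinsic" (user)
    tagger: str  # who or what applied the tag


# Made-up sample events for illustration only.
EVENTS = [
    TaggingEvent("doc1", "inverted index", "intrinsic", "extractor"),
    TaggingEvent("doc1", "read later", "extrinsic", "user_a"),
    TaggingEvent("doc2", "inverted index", "intrinsic", "extractor"),
    TaggingEvent("doc2", "project X", "extrinsic", "user_b"),
    TaggingEvent("doc3", "project X", "extrinsic", "user_a"),
]


def build_indexes(events):
    """Build tag -> doc-id indexes for each tag space and the combined one."""
    indexes = {
        "intrinsic": defaultdict(set),
        "extrinsic": defaultdict(set),
        "combined": defaultdict(set),
    }
    for event in events:
        indexes[event.source][event.tag].add(event.doc_id)
        indexes["combined"][event.tag].add(event.doc_id)
    return indexes


def find(indexes, tag, space="combined"):
    """Docs carrying a tag within one space: intrinsic, extrinsic, combined."""
    return set(indexes[space].get(tag, set()))
```

With this layout, `find(indexes, "project X", "intrinsic")` comes back empty 
while the same lookup in the extrinsic or combined space returns the docs 
that users tagged, so the spaces can be navigated separately or together.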


-- 
Nitin Borwankar
Find, Learn, Act .... Greener
http://greener.com
nitin at borwankar.com
510-872-7066


