[Tagdb] search keywords vs tags - automated tagging of docs
Gordon Mohr
gojomo at bitzi.com
Fri Dec 22 03:57:07 GMT 2006
Nitin Borwankar wrote:
> Gordon Mohr wrote:
>> Does this call into question whether any relational schema is needed for
>> tag systems at all?
>>
>>
>
> This is the very question that I have been mulling ever since I started
> playing with Lucene and Nutch.
>
> So far my brain has thrown up two different answers :-
>
> Relational DB not needed if all you want is to find docs based on
> tags and also do boolean ops on docs based on tags - apparently Lucene
> does this kind of stuff very efficiently.
>
> Relational DB needed if you now want to join these tagged items to user
> tables and also to do aggregate operations
> on tag instances - e.g. number of users who used tag T, or how many
> instances of tag S were used by user U, ... users U and V .....
These are awkwardly possible from an inverted index -- though perhaps
expensive enough that you'd ration them to occasional batch analysis.
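As a rough illustration only (a toy in-memory index in Python, not
Lucene's API; the sample events and the function names are made up),
here is how the boolean tag query and the aggregates come out when each
tagging event is indexed as a small fielded document:

```python
from collections import defaultdict

# Each tagging event is a small fielded "document": user, target URI, tags.
# These sample events are hypothetical.
events = [
    {"user": "u1", "uri": "http://example.com/a", "tags": ["lucene", "search"]},
    {"user": "u2", "uri": "http://example.com/a", "tags": ["search"]},
    {"user": "u1", "uri": "http://example.com/b", "tags": ["lucene", "java"]},
    {"user": "u3", "uri": "http://example.com/c", "tags": ["lucene", "search"]},
]

# Inverted index: (field, term) -> set of event ids (the postings).
index = defaultdict(set)
for event_id, event in enumerate(events):
    index[("user", event["user"])].add(event_id)
    index[("uri", event["uri"])].add(event_id)
    for tag in event["tags"]:
        index[("tag", tag)].add(event_id)


def events_with_all_tags(*tags):
    """Boolean AND over tags: intersect the postings sets."""
    postings = [index.get(("tag", t), set()) for t in tags]
    return set.intersection(*postings) if postings else set()


def users_of_tag(tag):
    """Aggregate: distinct users of a tag.

    The index only yields matching event ids; the user of every matching
    event still has to be fetched and de-duplicated.
    """
    return {events[i]["user"] for i in index.get(("tag", tag), set())}


def tag_count_for_user(tag, user):
    """Aggregate: how many times a given user applied a given tag."""
    return len(index.get(("tag", tag), set()) & index.get(("user", user), set()))


print(sorted(events_with_all_tags("lucene", "search")))
print(len(users_of_tag("lucene")))
print(tag_count_for_user("lucene", "u1"))
```

The boolean query is a plain set intersection, while the per-tag user
count has to visit every matching event, which is the part I'd expect
to get expensive at scale.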
My hunch is that the most popular views in tagging sites are:
(1) 'most recent' logs by tag, user, target (URI), or grouping
(subscription/watchlist). This *could* be served from the text index,
especially if it is broken up to allow efficiently searching the most
recent additions first. But, I suspect high-traffic sites utilize raw
logs (in files or RDB) of the tag-events in the most-loaded views.
(2) 'popular' collections, globally or by tag. Again, these could be
constructed from scanning the whole text index, but maintaining the
'popular' list and stats in an efficient running manner likely requires
other custom data structures.
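A minimal sketch of what I mean by running structures, in Python. This
is a guess at the shape, not a description of how any real site does
it; RECENT_LIMIT and all the function names are hypothetical:

```python
from collections import Counter, defaultdict, deque

RECENT_LIMIT = 3  # hypothetical cap on the 'most recent' list kept per tag

# Bounded most-recent-first log per tag; old entries fall off the end.
recent_by_tag = defaultdict(lambda: deque(maxlen=RECENT_LIMIT))
# Running counts of tagging events per URI, per tag and globally.
popular_by_tag = defaultdict(Counter)
popular_global = Counter()


def record_event(user, uri, tags):
    """Update both views as each tagging event arrives (no index scan)."""
    popular_global[uri] += 1
    for tag in tags:
        recent_by_tag[tag].appendleft((user, uri))
        popular_by_tag[tag][uri] += 1


def most_recent(tag):
    """Newest-first (user, uri) pairs for a tag."""
    return list(recent_by_tag[tag])


def most_popular(tag=None, n=10):
    """Top-n URIs by number of tagging events, globally or for one tag."""
    counts = popular_global if tag is None else popular_by_tag[tag]
    return counts.most_common(n)


record_event("u1", "a", ["x"])
record_event("u2", "b", ["x"])
record_event("u3", "a", ["x", "y"])
record_event("u4", "c", ["x"])
print(most_recent("x"))
print(most_popular("x", n=1))
```

Note that this counts tagging events rather than distinct users, and
never decays old counts; a real 'popular' view would likely want both.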
In my use of tagging sites, I consult these kinds of views 10x to 100x
more than I do multi-tag 'queries'. Does this match others' usage (or
real site stats)?
- Gordon @ Bitzi
> So it depends on whether the application is pure text search or not -
> typically in the folksonomy situation we also have the dratted user ;-)
>
> Nitin Borwankar.
>
>
>
>
>> Just treat every tagging-event (human or automated) as a fielded text
>> document and text-index. The degenerate schema, placing all tags in one
>> internally-delimited column, doesn't deserve the bad rap it sometimes
>> gets, if in fact full-text inverted-indexes are the usual way to query.
>>
>> - Gordon @ Bitzi
>>
>> Nitin Borwankar wrote:
>>
>>
>>> OK, Otis,
>>>
>>> Glad you brought that up because I wanted to set up the discussion for
>>> what I call
>>> "intrinsic tags" vs "extrinsic tags"
>>>
>>> Intrinsic tags are like Amazon's SIPs (Statistically Improbable
>>> Phrases) - they are intrinsic
>>> characteristics of the content
>>> Intrinsic tags are always a) derived from text in the document b) devoid
>>> of interpretation or implied meaning
>>>
>>> Extrinsic tags i.e. folksonomy tags, are a human description or
>>> interpretation - so they have many layers of meaning
>>>
>>> * tags I apply to a document could be "workflow tags" i.e. "save this for
>>> later" or "send to Joe"
>>> * or they could be "descriptors that have global meaning" - "adult
>>> content"
>>> * or they could be "descriptors that have group meaning" - "project X
>>> needs this"
>>> * or they could be "descriptors that have private meaning" - "summer
>>> holiday"
>>>
>>>
>>> If you wanted to bootstrap a large corpus of text into a folksonomy
>>> context, automated tagging would get you in the game and at least allow
>>> rapid navigation of the whole document space albeit *in a very crude way*.
>>> But if you wait for the whole doc space to be manually tagged it could
>>> take a long time or never happen.
>>> So the question is would this be a viable way to bootstrap large text
>>> corpuses into a folksonomy context, i.e. make them usable enough that I
>>> can now find *roughly* what I am looking for and then apply my own tags
>>> to it.
>>>
>>> At all times it would be useful to distinguish between system-generated
>>> (intrinsic) tags and user-generated (extrinsic) tags and allow
>>> independent navigation over the separate tagging spaces as well as allow
>>> navigation over the combined space.
>>>
>>> Thoughts ?
>>>
>>>
>>>
>> _______________________________________________
>> Tagdb mailing list
>> Tagdb at lists.tagschema.com
>> http://lists.tagschema.com/mailman/listinfo/tagdb
>>
>>
>
>