[Tagdb] Re: LibraryThing guy

Charles Brian Quinn me at seebq.com
Wed Dec 14 15:59:08 GMT 2005


Timothy Spalding wrote:
> 
> I'm minimally interested in the purely "technical" side of tagging,  
> and even less in debates about folksonomy vs. ontology, etc. But I am  
> interested in the algorithmic side—what cool things can be done, and  
> how? How does one derive related tags? How does one compute the  
> contaguinity (smile) between two taggers or tagged items? How can  
> tags impact non-tags? Notably, LibraryThing has more types of data to  
> draw on than some comparable services. Ideally, LibraryThing has all  
> the books-based metadata in a librarian's Marc record—from author,  
> series, original work and so forth to Dewey numbers, LC numbers and  
> LC subject headings. How can tagging dance with other metadata and  
> with professional classifications? I've done some work on all these,  
> but it's been in a vacuum.
> 
> So, hello. I'd love to talk about these topics or others. If this  
> isn't the right place, can anyone suggest another?

Hi.  Another lurker and a fan of the algorithmic "cool things" involved 
with tagging.

Hope this is the right place for discussion, too.  I just wanted to 
throw some comments out there about relatedness.  I've done some 
research (read: googling) on cluster analysis, card-sorting, etc., as 
methods for determining the relatedness of groups of data, but being 
ever the computer-science junkie, I thought I'd continue brainstorming 
aloud here.

So if you've built/imagined a simple tagging system, you've most likely 
got a database in something close to third normal form, with a tag 
table and some type of object_tag_map table to link tags to whatever 
objects you are tagging (web sites, books).  Please see the archives and 
of course Nitin's wonderful blog -- and I know this is a simple 
version, bear with me.  And yes, you've got lots of other metadata and 
non-metadata (is that just data?), like maybe timestamps, counts/total 
uses of tags, and counts/totals of tags per object.
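
For concreteness, here is a minimal sketch of such a schema, assuming 
SQLite through Python's standard sqlite3 module.  Only the tag and 
object_tag_map table names come from the description above; the column 
names and the helper functions (open_db, add_tagging, tag_counts) are 
hypothetical.

```python
import sqlite3

# Hypothetical schema: only "tag" and "object_tag_map" are named in the
# post; every column name here is an assumption made for illustration.
SCHEMA = """
CREATE TABLE tag (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE object_tag_map (
    object_id  INTEGER NOT NULL,  -- the tagged web site, book, etc.
    tag_id     INTEGER NOT NULL REFERENCES tag(id),
    user_id    INTEGER NOT NULL,  -- who applied the tag
    created_at TEXT    NOT NULL,  -- timestamp of the tagging action
    PRIMARY KEY (object_id, tag_id, user_id)
);
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_tagging(conn, object_id, user_id, tag_name, created_at):
    """Record that user_id applied tag_name to object_id."""
    conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (tag_name,))
    (tag_id,) = conn.execute(
        "SELECT id FROM tag WHERE name = ?", (tag_name,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO object_tag_map VALUES (?, ?, ?, ?)",
        (object_id, tag_id, user_id, created_at),
    )


def tag_counts(conn, object_id):
    """Return {tag name: number of users who applied it} for one object."""
    rows = conn.execute(
        """SELECT tag.name, COUNT(*) FROM object_tag_map
           JOIN tag ON tag.id = object_tag_map.tag_id
           WHERE object_id = ? GROUP BY tag.name""",
        (object_id,),
    )
    return dict(rows.fetchall())
```

The per-object tag counts mentioned above fall out of a GROUP BY on the 
map table, so they need not be stored separately unless the query 
becomes too slow.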

So, if you've got the processing power (and the existing data), you 
could run some type of query to group tags with the same relative 
"timestamp" -- meaning a user inserted all these tags "together." 
Perhaps you could create a relations table that simply maps one tag's 
index to a related tag's index.  Now, you could enact a rule to always 
store both directions ("web" -> "design" and "design" -> "web"), which 
saves querying on both columns at the expense of more data.  And 
instead of doing clustering after the fact, perhaps you can build in 
(or, if you're still designing your system, start) inserting relations 
as tags are inserted into your tables.  So all tags assigned to an 
object together have a level of relatedness.  And perhaps it's 
points-based, so that every time someone puts "web" and "design" 
together the "web" <-> "design" relation goes up a "point." 
Alternatively, maybe you are clever and always insert your relations in 
alphabetical (or index-based) order so as not to duplicate relations 
(always "design" -> "web" and never "web" -> "design" in the relations 
table).  So in reality the system is just keeping a tally of how many 
times users have tagged these two words together.
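
The tally described above might be sketched as follows, using the 
alphabetical-order variant so each pair is stored once.  This is an 
in-memory illustration rather than a relations table; the names 
record_tagging and points are hypothetical.

```python
from collections import Counter
from itertools import combinations


def record_tagging(relations, tags):
    """Add one point to every pair of tags applied in one tagging action.

    Pairs are stored in alphabetical order, so ("design", "web") is the
    only key ever used for that relation.
    """
    unique = sorted(set(tags))
    # combinations() over a sorted list yields each pair as (smaller, larger)
    for a, b in combinations(unique, 2):
        relations[(a, b)] += 1


def points(relations, tag_a, tag_b):
    """Look up the tally for a pair, whichever order it is asked in."""
    key = tuple(sorted((tag_a, tag_b)))
    return relations[key]


relations = Counter()
record_tagging(relations, ["web", "design", "css"])
record_tagging(relations, ["design", "web"])
print(points(relations, "web", "design"))
```

Storing one canonical row per pair halves the data but means every 
lookup has to sort its two tags first, which is the trade-off against 
the store-both-directions rule.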

Now, maybe that's part one.  Perhaps part two does cluster analysis on 
the objects.  Perhaps you now write some type of query that says, for 
each object (site, book, etc.), give me all its tags, and then score 
the relatedness of every pair of tags on that object.  You could get 
clever and assign more "relatedness points" to the most "popular" tags 
-- meaning the tags with the highest counts on an object are more 
related, because more people have applied them to that same item.
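
One way part two could look, as a rough sketch: the post does not say 
how popularity should turn into points, so the rule used here (a pair 
earns the smaller of its two tag counts on each object) is purely an 
assumption, as is the function name object_relatedness.

```python
from collections import Counter
from itertools import combinations


def object_relatedness(tag_counts_by_object):
    """Score tag pairs from per-object tag counts.

    tag_counts_by_object maps an object id to {tag: times applied}.
    Assumed rule: on each object a pair earns min(count_a, count_b)
    points, so pairs of popular tags earn more than pairs involving a
    tag that only one person used.
    """
    scores = Counter()
    for counts in tag_counts_by_object.values():
        for a, b in combinations(sorted(counts), 2):
            scores[(a, b)] += min(counts[a], counts[b])
    return scores


data = {
    "book1": {"web": 5, "design": 3, "css": 1},
    "book2": {"web": 2, "design": 2},
}
scores = object_relatedness(data)
print(scores[("design", "web")])
```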

Perhaps part three is taking each user's tag base (their tag cloud) and 
doing relatedness on that.  So every pair of tags a user has used gets 
some level of relatedness, just for being in that user's data.  I know, 
for people who have tags ranging all over the place, perhaps this 
doesn't add too much to the formula, but it could....
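
A sketch of part three, with one assumption added to handle users whose 
tags range all over the place: each user's contribution to a pair is 
divided by (number of distinct tags - 1), so a sprawling tag cloud 
adds less per pair than a focused one.  That damping rule and the name 
user_relatedness are not from the post.

```python
from collections import defaultdict
from itertools import combinations


def user_relatedness(tags_by_user):
    """Score tag pairs from each user's overall tag vocabulary.

    tags_by_user maps a user id to the collection of tags they have used.
    Assumed damping: a user with n distinct tags adds 1 / (n - 1) to
    each of their pairs, so the total they add per tag is always 1.
    """
    scores = defaultdict(float)
    for tags in tags_by_user.values():
        unique = sorted(set(tags))
        if len(unique) < 2:
            continue  # a single tag forms no pairs
        weight = 1.0 / (len(unique) - 1)
        for a, b in combinations(unique, 2):
            scores[(a, b)] += weight
    return dict(scores)


clouds = {
    "alice": ["web", "design"],
    "bob": ["web", "design", "cooking"],
}
print(user_relatedness(clouds)[("design", "web")])
```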

Part four:  how about continguatajieity (what he said above) between 
users -- helping your relations come along....
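
The post leaves the measure between users open.  One common choice, 
swapped in here purely as an illustration, is the Jaccard similarity of 
two users' tag sets: the size of the overlap divided by the size of the 
union.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity of two users' tag sets, from 0.0 to 1.0."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # two empty tag sets: treat as no evidence of similarity
    return len(a & b) / len(a | b)


print(jaccard(["web", "design"], ["web", "css"]))
```

A score like this could then weight how much one user's tag pairs count 
toward another user's relations, though that step is not spelled out 
above.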

Perhaps the other parts of the equation revolve around relating tag data 
and non-tag data as mentioned above.  Perhaps you start clustering 
other users' data or object metadata like titles, definitions, etc.

So maybe you come up with some of these metrics, some coefficients for 
the points totals, and a formula to put a number on how much "design" 
relates to "web":

relatedness("design", "web") =
    .5 * (part 1 points)  +  .3 * (part 2 points)  +  .1 * (part 3 points)
  + .1 * (part 4 points)  +  ....
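
As a sketch, the weighted sum is a few lines of code.  The coefficients 
are the ones in the formula; the function name combined_score and the 
sample point values are made up for illustration.

```python
# Coefficients from the formula, for parts one through four.
WEIGHTS = (0.5, 0.3, 0.1, 0.1)


def combined_score(part_points, weights=WEIGHTS):
    """Weighted sum of the per-part points for one pair of tags."""
    if len(part_points) != len(weights):
        raise ValueError("need exactly one points value per weight")
    return sum(w * p for w, p in zip(weights, part_points))


# Hypothetical points for "design" <-> "web" from parts one to four.
print(combined_score((10, 5, 2, 0)))
```

The four parts produce points on different scales, so in practice each 
would probably need normalizing before the coefficients mean much.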

I haven't implemented this -- I am merely throwing it out there.  Any 
takers, commenters, suggestioners, flamers?

-- 
Charles Brian Quinn
www.seebq.com

