[Tagdb] Re: LibraryThing guy
Charles Brian Quinn
me at seebq.com
Wed Dec 14 15:59:08 GMT 2005
Timothy Spalding wrote:
>
> I'm minimally interested in the purely "technical" side of tagging,
> and even less in debates about folksonomy vs. ontology, etc. But I am
> interested in the algorithmic side—what cool things can be done, and
> how? How does one derive related tags? How does one compute the
> contaguinity (smile) between two taggers or tagged items? How can
> tags impact non-tags? Notably, LibraryThing has more types of data to
> draw on than some comparable services. Ideally, LibraryThing has all
> the books-based metadata in a librarian's Marc record—from author,
> series, original work and so forth to Dewey numbers, LC numbers and
> LC subject headings. How can tagging dance with other metadata and
> with professional classifications? I've done some work on all these,
> but it's been in a vacuum.
>
> So, hello. I'd love to talk about these topics or others. If this
> isn't the right place, can anyone suggest another?
Hi. Another lurker and a fan of algorithmic "cool things" involved with
tagging.
I just wanted to throw some comments out there about related-ness.
I've done some research (read: googling) on cluster analysis,
card-sorting, etc., as methods for determining the related-ness of
groups of data, but being an incurable computer-science junkie, I
thought I'd continue brainstorming aloud here. I hope this is the right
place for that discussion.
So if you've built/imagined a simple tagging system, you've most likely
got a database in something like third normal form, with a tag table and
some type of object_tag_map table to link tags to whatever objects
(web sites, books) you're tagging. Please see the archives and of course
Nitin's wonderful blog -- and I know this is a simple version, bear with
me.
And yes, you've got lots of other meta-data and non-meta-data (is that
just data?), like maybe timestamps, counts/total uses of tags, and
counts/totals of tags per object.
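A minimal sketch of that kind of schema, assuming SQLite through
Python's sqlite3 module. The table and column names (tag,
object_tag_map, tagged_at, and so on) and the sample rows are made up
for illustration, not taken from any real system:

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE tag (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE object_tag_map (
    object_id INTEGER NOT NULL,  -- the book, web site, etc.
    tag_id    INTEGER NOT NULL REFERENCES tag(id),
    user_id   INTEGER NOT NULL,
    tagged_at INTEGER NOT NULL   -- Unix timestamp of the tagging action
);
"""


def tag_counts_for_object(conn, object_id):
    """Return {tag name: number of times it was applied to this object}."""
    rows = conn.execute(
        "SELECT t.name, COUNT(*) FROM object_tag_map m "
        "JOIN tag t ON t.id = m.tag_id "
        "WHERE m.object_id = ? GROUP BY t.name",
        (object_id,),
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO tag (id, name) VALUES (?, ?)",
                 [(1, "web"), (2, "design")])
# Two users tag object 10; the first applies both tags in one action.
conn.executemany(
    "INSERT INTO object_tag_map VALUES (?, ?, ?, ?)",
    [(10, 1, 100, 1134575948),
     (10, 2, 100, 1134575948),
     (10, 1, 101, 1134576000)],
)
print(tag_counts_for_object(conn, 10))
```

The per-object counts fall out of a GROUP BY here, so they don't need
their own column unless the query gets too slow.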
So, if you've got the processing power (and the existing data), you
could run some type of query to group tags with the same relative
"timestamp" -- meaning a user inserted all these tags "together."
Perhaps you could create a relations table in which each row simply
maps one tag's index to a related tag's index. Now, you could enact a
rule to always put both relations ("web" -> "design" and
"design" -> "web") to save on a query on both columns at the expense of
more data.
And instead of doing clustering after the fact, perhaps you can build in
(or if you're still designing your system, you can start) inserting
relations as tags are inserted into your tables. So all tags assigned
to an object have a level of relatedness. And perhaps it's points
based, so that every time someone puts "web" and "design" together the
"web" <-> "design" relation goes up a "point."
Alternatively, maybe you are clever and always insert your relations in
alphabetical (or index-based) order so as not to duplicate relations
(always "design" -> "web" and never "web" -> "design" in the relations
table). So in reality the system is just keeping a tally of how many
times users have tagged these two words together.
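A rough sketch of that tally, kept in memory rather than in a relations
table. It assumes each tagging arrives as a (user, object, timestamp,
tag) tuple and that tags sharing the same user, object and timestamp
were entered "together"; the function names and the tuple layout are
hypothetical:

```python
from collections import Counter
from itertools import combinations


def tally_pairs(taggings):
    """Count how often two tags were entered together.

    taggings: iterable of (user, object, timestamp, tag) tuples.
    """
    batches = {}
    for user, obj, ts, tag in taggings:
        batches.setdefault((user, obj, ts), set()).add(tag)

    points = Counter()
    for tags in batches.values():
        # sorted() puts every pair in alphabetical order, so only
        # ("design", "web") is ever stored, never ("web", "design").
        for pair in combinations(sorted(tags), 2):
            points[pair] += 1
    return points


def relatedness(points, a, b):
    """Look up a pair's points regardless of the order it is asked in."""
    return points[tuple(sorted((a, b)))]


taggings = [
    ("alice", "site1", 100, "web"),
    ("alice", "site1", 100, "design"),
    ("bob", "site2", 200, "design"),
    ("bob", "site2", 200, "web"),
    ("bob", "site2", 200, "css"),
    ("bob", "site3", 300, "web"),
]
points = tally_pairs(taggings)
print(relatedness(points, "web", "design"))
```

Storing each pair once halves the rows, at the cost of sorting the two
tags before every lookup.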
Now, maybe that's part one. Perhaps part two does cluster analysis on
the objects. Perhaps you now write some type of query that says, for
each object (site, book, etc.), give me all its tags, and then score the
relatedness of those tags against each other. You could get clever and
assign more "relatedness points" for the most "popular" tags -- meaning
the tags with the highest counts on an object are more related, because
more people have applied them to that item together.
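One way the per-object scoring could look, as a sketch only. The
weighting rule is my assumption: a pair of tags on an object earns as
many points as the count of its less popular tag, so two heavily used
tags score high and a one-off tag drags its pairs down. The input
layout and function name are hypothetical:

```python
from collections import Counter
from itertools import combinations


def object_pair_points(object_tag_counts):
    """Score tag pairs per object, favoring popular tags.

    object_tag_counts: {object: {tag: times applied to that object}}.
    """
    points = Counter()
    for counts in object_tag_counts.values():
        for a, b in combinations(sorted(counts), 2):
            # Assumed weighting: the pair is worth the smaller count.
            points[(a, b)] += min(counts[a], counts[b])
    return points


catalog = {
    "book1": {"web": 5, "design": 3, "css": 1},
    "book2": {"web": 2, "design": 2},
}
part2 = object_pair_points(catalog)
print(part2[("design", "web")])
```

Other weightings (product of the counts, or counts divided by the
object's total) would fit the same loop.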
Perhaps part three is taking each user's tag base (their tag cloud) and
doing relatedness on that. So each of a user's tags has a level of
relatedness to that user's other tags, just for having been added to
your data. I know, for people who have tags ranging all over the place,
perhaps this doesn't add too much to the formula, but it could....
Part four : how about continguatajieity (what he said above) between
users -- helping your relations come along....
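I didn't pin down a measure for this, so as one possible stand-in, here
is a sketch using the Jaccard index on two users' tag sets (shared tags
divided by all tags either of them uses). The choice of Jaccard and the
function name are my assumptions, not something settled above:

```python
def jaccard(tags_a, tags_b):
    """Overlap between two users' tag sets, from 0.0 to 1.0."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # two users with no tags: treat as unrelated
    return len(a & b) / len(a | b)


alice = {"web", "design", "css"}
bob = {"web", "design", "books"}
print(jaccard(alice, bob))
```

Pairs of tags used by two highly similar users could then earn a few
extra points, which is the "helping your relations come along" part.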
Perhaps the other parts of the equation revolve around relating tag data
and non-tag data as mentioned above. Perhaps you start clustering
other users' data or object meta-data like titles, definitions, ... etc.
So maybe you come up with some of these metrics, some coefficients for
the points totals, and a formula to figure out a numeric measure of how
much "design" relates to "web":
.5 * (part 1 points) + .3 * (part 2 points) + .1 * (part 3 points) +
.1 * (part 4 points) + .... =
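The weighted sum is easy to sketch. The coefficients below are the ones
from the formula; the part names are hypothetical labels, and whatever
further terms the "...." stands for are simply left out:

```python
# Coefficients from the formula above; keys are hypothetical labels.
WEIGHTS = {"part1": 0.5, "part2": 0.3, "part3": 0.1, "part4": 0.1}


def combined_score(part_points, weights=WEIGHTS):
    """Weighted sum of the per-part points for one pair of tags.

    part_points: {part label: points}; missing parts count as 0.
    """
    return sum(w * part_points.get(name, 0.0)
               for name, w in weights.items())


# Made-up points for the "design" / "web" pair.
design_web = {"part1": 10, "part2": 5, "part3": 2, "part4": 0}
print(combined_score(design_web))
```

The raw points from each part would probably live on different scales,
so they'd need normalizing (say, to 0..1) before the coefficients mean
much.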
I haven't implemented this -- I am merely throwing it out there. Any
takers, commenters, suggestioners, flamers?
--
Charles Brian Quinn
www.seebq.com