[Tagdb] search keywords vs tags - automated tagging of docs
ogjunk-tagdb at yahoo.com
Sat Dec 23 05:04:27 GMT 2006
He he, didn't I talk about this stuff here about a year ago? :)
I'll point to http://simpy.com yet again. Do a search and look at the related tags on the right.
My answer to your first question is: yes.
My answer to your second question is: somewhat. Those documents just happen to share some term(s), but that may be all they have in common. Put another way, there may be other documents in the corpus that are much closer to each other (think clustering). A better approach would be a little different (go to your links page on Simpy and click on the "similar" link next to each of your bookmarks, or just go to my links and use "similar" there: http://www.simpy.com/user/otis/links ). You could take your k1 document, extract its most salient terms and use those to form a query. Or use the complete document (all its terms) as the query. That would then really give you other documents that are the most similar to the exemplar document.
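To make the "use the document's own terms as a query" idea concrete, here is a minimal sketch in Python. It is not how Simpy or Lucene do it; the toy corpus, the TF-IDF weighting, and the function names top_terms and similar_docs are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: doc id -> list of tokens (already tokenized).
docs = {
    "a": ["lucene", "index", "search", "java", "search"],
    "b": ["lucene", "index", "java", "library"],
    "c": ["tags", "bookmark", "search"],
    "d": ["cooking", "recipe", "pasta"],
}

def top_terms(doc_id, docs, n=3):
    """Return the n highest-scoring TF-IDF terms of one document."""
    total = len(docs)
    # Document frequency: in how many docs does each term occur?
    df = Counter(t for terms in docs.values() for t in set(terms))
    tf = Counter(docs[doc_id])
    scores = {t: tf[t] * math.log(total / df[t]) for t in tf}
    # Sort by score (descending), then alphabetically to break ties.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]

def similar_docs(doc_id, docs, n_terms=3):
    """Use the exemplar's top terms as a query; rank other docs by overlap."""
    query = set(top_terms(doc_id, docs, n_terms))
    hits = {}
    for other, terms in docs.items():
        if other == doc_id:
            continue
        overlap = len(query & set(terms))
        if overlap:
            hits[other] = overlap
    return sorted(hits, key=lambda d: (-hits[d], d))

print(top_terms("a", docs))
print(similar_docs("a", docs))
```

Using every term of the exemplar instead of only the top few amounts to passing a large n_terms; the trade-off is a longer query that also matches on less distinctive words.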
Otis
----- Original Message ----
From: Nitin Borwankar <nitin at borwankar.com>
To: tagdb at lists.tagschema.com
Sent: Friday, December 22, 2006 6:21:25 PM
Subject: Re: [Tagdb] search keywords vs tags - automated tagging of docs
So to continue the thread, I started getting interested in search
engines from the insight that a search engine
is really a recommendation system for web pages and I was already
dabbling in recommendation systems.
PageRank is a recommendation algorithm that recommends pages with the
most incoming links (weighted by the rank of the linking pages).
Given that you are using a set of keywords as a query, a search engine
returns a set of URLs that point to pages
a) that have those keywords in the inverted index
and
b) for which those keywords have some significance in the page ( title, ... )
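As a rough illustration of conditions a) and b), the following Python sketch builds a tiny inverted index and ranks matches higher when a keyword appears in the title. The example pages, the field weights, and the names build_index and search are assumptions for illustration only, not how Nutch or Lucene score documents.

```python
from collections import defaultdict

# Assumed weights: a title hit counts more than a body hit.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

# Hypothetical pages, each with a title and a body field.
pages = {
    "http://example.org/a": {"title": "lucene tutorial",
                             "body": "indexing text with lucene"},
    "http://example.org/b": {"title": "search basics",
                             "body": "lucene is a java library"},
    "http://example.org/c": {"title": "cooking",
                             "body": "pasta recipe"},
}

def build_index(pages):
    """Map each token to {url: accumulated field weight}."""
    index = defaultdict(dict)
    for url, fields in pages.items():
        for field, text in fields.items():
            for token in text.lower().split():
                index[token][url] = index[token].get(url, 0.0) + FIELD_WEIGHTS[field]
    return index

def search(index, keywords):
    """Return URLs containing all keywords, best score first."""
    postings = [index.get(k.lower(), {}) for k in keywords]
    if not postings:
        return []
    urls = set(postings[0])
    for p in postings[1:]:
        urls &= set(p)  # condition a): every keyword must be present
    # condition b): sum the field-weighted significance of each keyword
    scored = {u: sum(p[u] for p in postings) for u in urls}
    return sorted(scored, key=lambda u: (-scored[u], u))

index = build_index(pages)
print(search(index, ["lucene"]))
print(search(index, ["lucene", "java"]))
```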
So given a document D whose entries in the inverted index are the tokens k1, k2, k3, k4:
if I search for the keyword k1 I will get D amongst other docs.
Could I then look at all the docs within the search result ( for k1 ),
look at all the other tokens
in the inverted indexes of these docs, do a frequency count over the
result set, and then take the most popular
ones as "related" keywords?
i.e.
D --> k1, k2, k3, k4
E --> k1, k2, k5, k6
F --> k1, k2, k3, k6
G --> k1, k2, k3, k6
so we have, aside from k1, in the set (E,F,G) the following frequency map
k2 --> 3
k3 --> 2
k4 --> 0
k5 --> 1
k6 --> 3
in sorted order by frequency
k2 --> 3
k6 --> 3
k3 --> 2
k5 --> 1
k4 --> 0
Could we consider k2 and k6 to be "related keywords" or "related tags",
i.e. related to k1?
And could we consider all other docs with k2 and k6 ( but not k1 ) to be "related
docs"?
I am basically doing with the inverted index and keywords what we do with a
database of tags and items to find related
items.
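The counting described above can be sketched in a few lines of Python. The documents D through G and their tokens come from the example; document H, and the function names related_keywords and related_docs, are hypothetical additions so that the "related docs" step has something to find.

```python
from collections import Counter

# Token sets per document, as in the example above.
docs = {
    "D": {"k1", "k2", "k3", "k4"},
    "E": {"k1", "k2", "k5", "k6"},
    "F": {"k1", "k2", "k3", "k6"},
    "G": {"k1", "k2", "k3", "k6"},
    "H": {"k2", "k6", "k7"},  # hypothetical doc that lacks k1
}

def related_keywords(docs, keyword, exemplar):
    """Count the other tokens in every doc matching keyword, skipping the exemplar."""
    counts = Counter()
    for doc_id, tokens in docs.items():
        if doc_id != exemplar and keyword in tokens:
            counts.update(tokens - {keyword})
    return counts

def related_docs(docs, keyword, related):
    """Docs that contain all related keywords but not the original keyword."""
    return sorted(d for d, tokens in docs.items()
                  if keyword not in tokens and related <= tokens)

counts = related_keywords(docs, "k1", "D")
top = max(counts.values())
related = {k for k, c in counts.items() if c == top}

print(counts.most_common())
print(related)
print(related_docs(docs, "k1", related))
```

Taking only the keywords tied for the highest count is one possible cutoff; a threshold or a top-N limit would work just as well, and which is better likely depends on the corpus.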
Nitin
I.e. can I use the inverted index to compute "related" documents ?
Enis Soztutar wrote:
>Hi all, continuing the discussion, my experience with both Nutch and
>tagging systems (especially Scuttle) has made me understand the
>differences between them better.
>
>
>First of all, in crawling and indexing of web documents for keyword
>extraction, the indexer indexes all the words in the document by first
>tokenizing the text. Thus all the words, including those that can be
>"tags" and those that cannot be, are indexed. As discussed, stop words are
>a good example of this. But apart from stop words there are lots of
>keywords that are not representative of the content of the document.
>Moreover, not all the keywords or tags occur in the context of the site
>you are indexing. For example, on Honda's page (www.honda.com), the
>phrase "car manufacturer" does not exist. But links to Honda's page
>possibly contain that information. Thus a page is also indexed with the
>anchor text of the links pointing to it. Likewise, tags of a page contain
>info about what the page is about, and may contain words that do not
>exist in the text of the page. Moreover, tokenization in tagging may be
>different. A page may be tagged with multi-word tags (in some tagging
>systems), whereas keywords are single words.
>
>Another major difference comes from the quality of the web pages that
>are tagged. The major problem of most search engines (except big ones) is
>dealing with spam sites. Although lots of spamming methods exist, the
>most frequent one is to list thousands of popular words in the site. Thus
>the spam site becomes related to most of the search queries. However,
>in my opinion, the web sites entered into a tagging system are expected to
>be of higher quality. And I claim that the rank of a page may be
>estimated using the number of people that have bookmarked that page.
>Well, there are some search engines that bring together the idea of
>tagging with indexing and we will see if they prosper.
>
>Enis Soztutar
>
>Nitin Borwankar wrote:
>
>
>>Increasingly I have been getting interested in the vertical search space
>>and have been looking at Nutch
>>(www.nutch.org), built on top of Lucene, the Java text indexing/searching
>>library.
>>
>>A question arises in my mind when I look at tokenization and inverted
>>indexes etc... which are the bread and butter of IR and text search.....
>>
>>What is the fundamental difference between a set of search keywords as
>>typed into a search bar vs a set of tags by which I search for something
>>on del.icio.us ?
>>It seems to me that if one were to throw out the obvious stop words
>>etc., then the set of keywords ( tokens ) that, say, Lucene generates for
>>a document is a good first-order set of (system-generated) tags for the
>>document.
>>
>>Any comments or arguments one way or another ?
>>This has major implications for automated tagging, so I am really
>>curious as to why this won't work.
>>
>>Nitin
>
>_______________________________________________
>Tagdb mailing list
>Tagdb at lists.tagschema.com
>http://lists.tagschema.com/mailman/listinfo/tagdb
>
>
--
Nitin Borwankar
Find, Learn, Act .... Greener
http://greener.com
nitin at borwankar.com
510-872-7066