[Tagdb] search keywords vs tags - automated tagging of docs
Nitin Borwankar
nitin at borwankar.com
Fri Dec 22 23:21:25 GMT 2006
So to continue the thread, I started getting interested in search
engines from the insight that a search engine
is really a recommendation system for web pages and I was already
dabbling in recommendation systems.
PageRank is in effect a recommendation algorithm: it recommends the
pages with the most (and the most highly ranked) incoming links.
Given a set of keywords as a query, a search engine returns a set of
URLs that point to pages that
a) have those keywords in the inverted index
and
b) give those keywords some significance in the page ( title, ... )
So given a document D with inverted index containing tokens k1, k2, k3, k4:
if I search for the keyword k1 I will get D amongst other docs.
Could I then look at all the docs within the search result ( for k1 ),
look at all the other tokens in the inverted indexes of these docs,
do a frequency count over the result set, and then take the most
popular ones as "related" keywords?
i.e.
D ---> k1, k2, k3, k4
E ---> k1, k2, k5, k6
F ---> k1, k2, k3, k6
G ---> k1, k2, k3, k6
so we have, aside from k1, in the set (E, F, G) the following frequency map
k2 ---> 3
k3 ---> 2
k4 ---> 0
k5 ---> 1
k6 ---> 3
in sorted order by frequency
k2 ---> 3
k6 ---> 3
k3 ---> 2
k5 ---> 1
k4 ---> 0
Could we consider k2 and k6 to be "related keywords" or "related tags",
related to k1? And could we consider all other docs with k2 and k6
( but not k1 ) to be "related docs"?
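
As a minimal sketch of the idea above, in Python over a toy in-memory
index rather than a real Lucene or Nutch index. The names docs,
build_inverted_index, related_keywords and related_docs are made up for
illustration, and document H is a hypothetical extra doc (not in the
example above) added only so that the "related docs" step has something
to return. Tokens that never co-occur with k1 in the counted docs (k4
here) simply do not show up in the counts.

```python
from collections import Counter

# Toy forward index: document id -> set of tokens (D..G as in the example).
docs = {
    "D": {"k1", "k2", "k3", "k4"},
    "E": {"k1", "k2", "k5", "k6"},
    "F": {"k1", "k2", "k3", "k6"},
    "G": {"k1", "k2", "k3", "k6"},
    "H": {"k2", "k6", "k7"},  # hypothetical doc that lacks k1
}


def build_inverted_index(docs):
    """Map each token to the set of documents that contain it."""
    index = {}
    for doc_id, tokens in docs.items():
        for token in tokens:
            index.setdefault(token, set()).add(doc_id)
    return index


def related_keywords(docs, index, keyword, exclude=(), top_n=2):
    """Count the other tokens in the docs matching `keyword`, most frequent first."""
    counts = Counter()
    for doc_id in index.get(keyword, set()) - set(exclude):
        counts.update(docs[doc_id] - {keyword})
    # Sort by descending count, then by token name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


def related_docs(index, keyword, related):
    """Docs that contain every token in `related` but do not contain `keyword`."""
    hits = index.get(keyword, set())
    candidates = None
    for token in related:
        postings = index.get(token, set())
        candidates = postings if candidates is None else candidates & postings
    return sorted((candidates or set()) - hits)


index = build_inverted_index(docs)
# Exclude D, the doc we started from, so the counts cover the set (E, F, G).
top = related_keywords(docs, index, "k1", exclude={"D"})
print(top)
print(related_docs(index, "k1", [token for token, _ in top]))
```

Requiring every related keyword to be present is only one possible
choice; scoring candidate docs by how many related keywords they share
would be a looser alternative.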
I am basically doing with the inverted index and keywords what we do
with a database of tags and items to find related items.
Nitin
I.e. can I use the inverted index to compute "related" documents ?
Enis Soztutar wrote:
>Hi all, continuing the discussion, my experience with both Nutch and
>tagging systems (especially Scuttle) has made me understand the
>differences between them better.
>
>
>First of all, in crawling and indexing of web documents for keyword
>extraction, the indexer indexes all the words in the document by first
>tokenizing the text. Thus all the words, including those that can be
>"tags" and those that cannot be, are indexed. As discussed, stop words
>are a good example of this. But apart from stop words there are lots of
>keywords that are not representative of the content of the document.
>Moreover, not all the keywords or tags occur in the context of the site
>you are indexing. For example, on Honda's page (www.honda.com), the
>phrase "car manufacturer" does not exist. But links to Honda's page
>possibly contain that information. Thus a page is indexed with the
>anchor text of the links pointing to it. Likewise, tags of a page
>contain info about what the page is about, and may contain words that
>do not exist in the text of the page. Moreover, tokenization in tagging
>may be different. A page may be tagged with multi-word tags (in some
>tagging systems), although keywords are single words.
>
>Another major difference comes from the quality of the web pages that
>are tagged. The major problem of most search engines (except big ones)
>is dealing with spam sites. Although lots of spamming methods exist, the
>most frequent one is to list thousands of popular words in the site.
>Thus the spam site becomes related to most search queries. However,
>in my opinion, the web sites entered into a tagging system are expected
>to be of higher quality. And I claim that the rank of a page may be
>estimated using the number of people that have bookmarked that page.
>Well, there are some search engines that bring together the idea of
>tagging with indexing and we will see if they prosper.
>
>Enis Soztutar
>
>Nitin Borwankar wrote:
>
>
>>Increasingly I have been getting interested in the vertical search space
>>and have been looking at Nutch (www.nutch.org), built on top of Lucene,
>>the Java text indexing/searching library.
>>
>>A question arises in my mind when I look at tokenization and inverted
>>indexes etc... which are the bread and butter of IR and text search.....
>>
>>What is the fundamental difference between a set of search keywords as
>>typed into a search bar vs a set of tags by which I search for something
>>on del.icio.us ?
>>It seems to me that if one were to throw out the obvious stop words
>>etc., then the set of keywords ( tokens ) that, say, Lucene generates
>>for a document is a good first-order set of (system generated) tags for
>>the document.
>>
>>Any comments or arguments one way or the other ?
>>This has major implications for automated tagging, so I am really
>>curious as to why this won't work.
>>
>>Nitin
>>
>>
>>
>>
>>
>
>_______________________________________________
>Tagdb mailing list
>Tagdb at lists.tagschema.com
>http://lists.tagschema.com/mailman/listinfo/tagdb
>
>
--
Nitin Borwankar
Find, Learn, Act .... Greener
http://greener.com
nitin at borwankar.com
510-872-7066