[Tagdb] search keywords vs tags - automated tagging of docs

Nitin Borwankar nitin at borwankar.com
Fri Dec 22 23:21:25 GMT 2006


So to continue the thread, I started getting interested in search 
engines from the insight that a search engine
is really a recommendation system for web pages and I was already 
dabbling in recommendation systems.

PageRank is a recommendation algorithm that, roughly speaking, 
recommends the pages with the most incoming links.

Given a set of keywords as a query, a search engine 
returns a set of URLs that point to pages that

a)  have those keywords in the inverted index
and
b)  in which those keywords have some significance ( title, ... )

So given a document D whose indexed tokens are k1, k2, k3, k4, 
if I search for the keyword k1 I will get D amongst other docs.

Could I then look at all the docs within the search result ( for k1 ), 
look at all the other tokens
in the inverted indexes of these docs, do a frequency count over the 
result set, and then take the most popular
ones as "related" keywords?

i.e.

D ---> k1, k2, k3, k4
E ---> k1, k2, k5, k6
F ---> k1, k2, k3, k6
G ---> k1, k2, k3, k6

So, aside from k1, in the set (E,F,G) we have the following frequency map:

k2 --> 3
k3 --> 2
k4 --> 0
k5 --> 1
k6 --> 3

In sorted order by frequency:

k2 --> 3
k6 --> 3
k3 --> 2
k5 --> 1
k4 --> 0
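
A minimal sketch of this frequency count in Python, assuming each document 
is represented simply as a set of tokens. The names doc_tokens and 
related_keywords are illustrative only and are not part of Lucene or Nutch. 
D is excluded to match the (E,F,G) set above; a token such as k4 that never 
co-occurs is simply absent from the result rather than listed with a count of 0.

```python
from collections import Counter

# Hypothetical toy corpus: document id -> set of indexed tokens.
doc_tokens = {
    "D": {"k1", "k2", "k3", "k4"},
    "E": {"k1", "k2", "k5", "k6"},
    "F": {"k1", "k2", "k3", "k6"},
    "G": {"k1", "k2", "k3", "k6"},
}


def related_keywords(query, doc_tokens, exclude_docs=()):
    """Count the tokens that co-occur with `query` across its result set."""
    counts = Counter()
    for doc, tokens in doc_tokens.items():
        # Skip excluded docs and docs that would not match the query.
        if doc in exclude_docs or query not in tokens:
            continue
        # Count every other token once per document.
        counts.update(tokens - {query})
    # Highest frequency first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = related_keywords("k1", doc_tokens, exclude_docs={"D"})
for token, freq in ranked:
    print(token, "-->", freq)
```

Passing exclude_docs=() instead would count D as well, which is arguably 
closer to a real search result for k1.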

Could we consider k2, k6 to be "related keywords" or "related tags", 
related to k1,

and consider all other docs with k2 and k6 ( but not k1 ) to be "related 
docs"?
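
The "related docs" step could be sketched as follows, again only as an 
assumption-laden illustration: the inverted index is modelled as a plain 
dictionary from token to document ids, the docs H and I and the tokens k7 
and k8 are made up so that there is something to find, and 
build_inverted_index and related_docs are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical toy corpus. H and I are invented docs that lack k1.
doc_tokens = {
    "E": {"k1", "k2", "k5", "k6"},
    "F": {"k1", "k2", "k3", "k6"},
    "G": {"k1", "k2", "k3", "k6"},
    "H": {"k2", "k6", "k7"},
    "I": {"k2", "k8"},
}


def build_inverted_index(doc_tokens):
    """Map each token to the set of documents that contain it."""
    index = defaultdict(set)
    for doc, tokens in doc_tokens.items():
        for token in tokens:
            index[token].add(doc)
    return index


def related_docs(query, related, index):
    """Docs containing every keyword in `related` but not `query`."""
    if not related:
        return set()
    # Intersect the posting sets of all related keywords.
    candidates = set.intersection(*(index.get(k, set()) for k in related))
    # Drop the docs the original query would already have returned.
    return candidates - index.get(query, set())


index = build_inverted_index(doc_tokens)
result = related_docs("k1", ["k2", "k6"], index)
print(sorted(result))
```

Requiring all related keywords is the strictest reading of "docs with k2 
and k6"; a looser variant could rank docs by how many related keywords 
they contain.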


I am basically doing with the inverted index and keywords what we do in a 
database with tags and items to find related
items.

Nitin

I.e., can I use the inverted index to compute "related" documents?




Enis Soztutar wrote:

>Hi all, continuing the discussion, my experience with both Nutch and 
>tagging systems (especially Scuttle) has helped me understand the 
>differences between them better.
>
>
>First of all, in crawling and indexing web documents for keyword 
>extraction, the indexer indexes all the words in the document by first 
>tokenizing the text. Thus all the words, including those that can be 
>"tags" and those that cannot be, are indexed. As discussed, stop words are 
>a good example of this. But apart from stop words there are lots of 
>keywords that are not representative of the content of the document. 
>Moreover, not all the keywords or tags occur in the context of the site 
>you are indexing. For example, on Honda's page (www.honda.com), the 
>phrase "car manufacturer" does not exist. But links to Honda's page 
>possibly contain that information. Thus a page is also indexed with the 
>anchor text of the links pointing to it. Likewise, tags of a page contain 
>info about what the page is about, and may contain words that do not 
>exist in the text of the page. Moreover, tokenization in tagging may be 
>different. A page may be tagged with multi-word tags (in some tagging 
>systems), whereas keywords are single words.
>
>Another major difference comes from the quality of the web pages that 
>are tagged. The major problem of most search engines (except big ones) is 
>dealing with spam sites. Although lots of spamming methods exist, the 
>most frequent one is to list thousands of popular words in the site. Thus 
>the spam site becomes related to most search queries. However, 
>in my opinion, the web sites entered into a tagging system are expected to 
>be of higher quality. And I claim that the rank of a page may be 
>estimated using the number of people who have bookmarked that page. 
>Well, there are some search engines that bring together the idea of 
>tagging with indexing and we will see if they prosper.
>
>Enis Soztutar
>
>Nitin Borwankar wrote:
>  
>
>>Increasingly I have been getting interested in the vertical search space 
>>and have been looking at Nutch (www.nutch.org), built on top of Lucene, 
>>the Java text indexing/searching library.
>>
>>A question arises in my mind when I look at tokenization and inverted 
>>indexes etc... which are the bread and butter of IR and text search.....
>>
>>What is the fundamental difference between a set of search keywords as 
>>typed into a search bar vs a set of tags by which I search for something 
>>on del.icio.us ?
>>It seems to me that if one were to throw out the obvious stop words 
>>etc., then the set of keywords ( tokens ) that, say, Lucene generates for 
>>a document are a good first-order set of (system-generated) tags for the 
>>document.
>>
>>Any comments or arguments one way or another?
>>This has major implications for automated tagging, so I am really 
>>curious as to why this won't work.
>>
>>Nitin
>>
>> 
>>  
>>    
>>
>
>_______________________________________________
>Tagdb mailing list
>Tagdb at lists.tagschema.com
>http://lists.tagschema.com/mailman/listinfo/tagdb
>  
>


-- 
Nitin Borwankar
Find, Learn, Act .... Greener
http://greener.com
nitin at borwankar.com
510-872-7066


