[Tagdb] search keywords vs tags - automated tagging of docs

Enis Soztutar enis.soz.nutch at gmail.com
Mon Dec 25 07:17:12 GMT 2006


Hi Nitin,

The thing you describe seems very similar to a subfield of the IR 
literature called relevance feedback (the automatic variant is often 
called pseudo-relevance feedback). Taking some of the documents 
retrieved, the query can be enhanced and expanded with the relevant 
terms in those documents. I think you could look at an IR book or some 
papers for more detail on this.
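
As a rough illustration of the counting scheme in your message quoted 
below, here is a minimal sketch in Python. It is not how Nutch or Lucene 
do it; the DOCS dictionary, the function names and the extra document H 
(added only so that the "related docs" step returns something) are all 
made up for the example.

```python
from collections import Counter

# Hypothetical toy index: document id -> set of tokens.
# D, E, F, G mirror the quoted example; H is an invented extra document.
DOCS = {
    "D": {"k1", "k2", "k3", "k4"},
    "E": {"k1", "k2", "k5", "k6"},
    "F": {"k1", "k2", "k3", "k6"},
    "G": {"k1", "k2", "k3", "k6"},
    "H": {"k2", "k6", "k7"},
}


def related_terms(docs, query, feedback_ids, top_n=2):
    """Count the terms that co-occur with `query` in the feedback documents."""
    counts = Counter()
    for doc_id in feedback_ids:
        terms = docs[doc_id]
        if query in terms:
            # Each document contributes at most 1 to a term's count.
            counts.update(terms - {query})
    # Highest count first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


def related_docs(docs, query, expansion_terms):
    """Documents that contain every expansion term but not the query term."""
    wanted = set(expansion_terms)
    return sorted(
        doc_id
        for doc_id, terms in docs.items()
        if wanted <= terms and query not in terms
    )


if __name__ == "__main__":
    top = related_terms(DOCS, "k1", ["E", "F", "G"])
    print(top)                                            # [('k2', 3), ('k6', 3)]
    print(related_docs(DOCS, "k1", [t for t, _ in top]))  # ['H']
```

Raw counts like these favour words that are common everywhere, so in 
practice one would probably weight the terms (for example by inverse 
document frequency) before picking the top ones.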

Nitin Borwankar wrote:
> So to continue the thread, I started getting interested in search 
> engines from the insight that a search engine
> is really a recommendation system for web pages and I was already 
> dabbling in recommendation systems.
>
> PageRank is a recommendation algorithm that, roughly speaking, 
> recommends pages with the most incoming links (weighted by the rank of 
> the linking pages).
>
> Given that you are using a set of keywords as a query, a search engine 
> returns a set of URLs that point to pages where
>
> a)  those keywords are in the inverted index
> and
> b)  those keywords have some significance in that page ( title, ... )
>
> So given a document D with inverted index containing tokens k1, k2, k3, k4
> If I search for the keyword k1 I will get D amongst other docs.
>
> Could I then look at all the docs within the search result ( for k1 ), 
> look at all the other tokens
> in the inverted indexes of these docs, do a frequency count over the 
> result set, and then take the most popular
> ones as "related" keywords?
>
> i.e.
>
> D ---> k1, k2, k3, k4
> E ---> k1, k2, k5, k6
> F ---> k1, k2, k3, k6
> G ---> k1, k2, k3, k6
>
> so we have aside from k1, in the set (E,F,G) the following frequency map
>
> k2 --> 3
> k3 --> 2
> k4 --> 0
> k5 --> 1
> k6 --> 3
>
> in sorted order by frequency
>
> k2 --> 3
> k6 --> 3
> k3 --> 2
> k5 --> 1
> k4 --> 0
>
> could we consider k2 and k6 to be "related keywords" or "related tags", 
> related to k1 ?
>
> and consider all other docs with k2 and k6 ( but not k1 ) to be "related 
> docs" ?
>
>
> I am basically doing with the inverted index and keywords what we do 
> with a database of tags and items to find related
> items.
>
> Nitin
>
> I.e. can I use the inverted index to compute "related" documents ?
>
>
>
>
> Enis Soztutar wrote:
>
>   
>> Hi all, continuing the discussion, my experience with both Nutch and 
>> tagging systems (especially Scuttle) has helped me understand the 
>> differences between them better.
>>
>>
>> First of all, when crawling and indexing web documents for keyword 
>> extraction, the indexer first tokenizes the text and then indexes every 
>> word in the document. Thus all the words, including those that could be 
>> "tags" and those that could not, are indexed. As discussed, stop words 
>> are a good example of this. But apart from stop words there are lots of 
>> keywords that are not representative of the content of the document. 
>> Moreover, not all the keywords or tags for a site occur in the text of 
>> the site you are indexing. For example, on Honda's page 
>> (www.honda.com), the phrase "car manufacturer" does not appear, but 
>> links to Honda's page quite possibly contain it. This is why a page is 
>> also indexed with the anchor text of the links pointing to it. 
>> Likewise, the tags of a page contain information about what the page is 
>> about, and may contain words that do not exist in the text of the page. 
>> Tokenization may also differ in tagging: in some tagging systems a page 
>> can be tagged with multi-word phrases, whereas keywords are single 
>> words.
>>
>> Another major difference comes from the quality of the web pages that 
>> are tagged. A major problem for most search engines (except the big 
>> ones) is dealing with spam sites. Although lots of spamming methods 
>> exist, the most frequent one is to list thousands of popular words on 
>> the site, so that the spam site appears related to most search queries. 
>> However, in my opinion, the web sites entered into a tagging system can 
>> be expected to be of higher quality, and I claim that the rank of a 
>> page may be estimated from the number of people who have bookmarked 
>> that page. Well, there are some search engines that combine the idea of 
>> tagging with indexing, and we will see if they prosper.
>>
>> Enis Soztutar
>>
>> Nitin Borwankar wrote:
>>  
>>
>>     
>>> Increasingly I have been getting interested in the vertical search space 
>>> and have been looking at Nutch
>>> (www.nutch.org), built on top of Lucene, the Java text indexing/searching 
>>> library.
>>>
>>> A question arises in my mind when I look at tokenization and inverted 
>>> indexes etc... which are the bread and butter of IR and text search.....
>>>
>>> What is the fundamental difference between a set of search keywords as 
>>> typed into a search bar vs a set of tags by which I search for something 
>>> on del.icio.us ?
>>> It seems to me that if one were to throw out the obvious stop words 
>>> etc., then the set of keywords ( tokens ) that, say, Lucene generates for 
>>> a document is a good first-order set of (system generated) tags for the 
>>> document.
>>>
>>> Any comments or arguments one way or another ?
>>> This has major implications for automated tagging, so I am really 
>>> curious as to why this won't work.
>>>
>>> Nitin
>>>
>>>
>>>  
>>>    
>>>
>>>       
>> _______________________________________________
>> Tagdb mailing list
>> Tagdb at lists.tagschema.com
>> http://lists.tagschema.com/mailman/listinfo/tagdb
>>  
>>
>>     
>
>
>   


