[Tagdb] search keywords vs tags - automated tagging of docs
Enis Soztutar
enis.soz.nutch at gmail.com
Mon Dec 25 07:17:12 GMT 2006
Hi Nitin,
The thing you describe seems very similar to a subfield of the IR
literature called relevance feedback. Using a subset of the retrieved
documents, the query can be enhanced and expanded with the relevant
terms found in those documents. I think you could look at an IR book or
some papers for more detail on this.
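
The counting scheme in the quoted message below could be sketched roughly
as follows. This is only an illustration under stated assumptions: the
documents D, E, F, G and tokens k1..k6 come from the quoted example, while
the document H and the function names (build_inverted_index, related_terms,
related_docs) are made up here. It treats every hit for the query as
relevant, with no user judgment involved, which the IR literature usually
calls pseudo (or blind) relevance feedback. It is not how Nutch or Lucene
do anything internally.

```python
from collections import Counter

# Toy corpus: document id -> set of indexed tokens.
# D..G are from the quoted example; H is a hypothetical extra document
# that lacks k1, added only so that "related docs" has something to find.
docs = {
    "D": {"k1", "k2", "k3", "k4"},
    "E": {"k1", "k2", "k5", "k6"},
    "F": {"k1", "k2", "k3", "k6"},
    "G": {"k1", "k2", "k3", "k6"},
    "H": {"k2", "k6", "k7"},
}


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, tokens in docs.items():
        for token in tokens:
            index.setdefault(token, set()).add(doc_id)
    return index


def related_terms(index, docs, query, top_n=2, exclude=()):
    """Count the other tokens in the docs that match `query`.

    Returns the `top_n` (token, count) pairs, highest count first,
    with ties broken alphabetically so the result is deterministic.
    """
    hits = index.get(query, set()) - set(exclude)
    counts = Counter()
    for doc_id in hits:
        counts.update(docs[doc_id] - {query})
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


def related_docs(index, terms, query):
    """Docs that contain every term in `terms` but not `query` itself."""
    if not terms:
        return set()
    candidates = set.intersection(*(index.get(t, set()) for t in terms))
    return candidates - index.get(query, set())


index = build_inverted_index(docs)
# Count over (E, F, G) only, as in the quoted example, by excluding D.
top = related_terms(index, docs, "k1", top_n=2, exclude=("D",))
terms = [token for token, _ in top]
print(top)
print(related_docs(index, terms, "k1"))
```

Raw counts like these favour tokens that are frequent everywhere, so a real
system would probably weight the counts (for example by inverse document
frequency) before picking the expansion terms.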
Nitin Borwankar wrote:
> So to continue the thread, I started getting interested in search
> engines from the insight that a search engine
> is really a recommendation system for web pages and I was already
> dabbling in recommendation systems.
>
> Page Rank is a recommendation algorithm that recommends pages with the
> maximum incoming links.
>
> Given that you are using a set of keywords as a query a search engine
> returns a set of URLs that point to pages that
>
> a) have those keywords in the inverted index
> and
> b) those keywords have some significance in that page ( title, ... )
>
> So given a document D with inverted index containing tokens k1, k2, k3, k4
> If I search for the keyword k1 I will get D amongst other docs.
>
> Could I then look at all the docs within the search result (for k1),
> look at all the other tokens
> in the inverted indexes of these docs, do a frequency count over the
> result set, and then take the most popular
> ones as "related" keywords?
>
> i.e. D --> k1, k2, k3, k4
>
> E --> k1, k2, k5, k6
> F --> k1, k2, k3, k6
> G --> k1, k2, k3, k6
>
> so we have aside from k1, in the set (E,F,G) the following frequency map
>
> k2 --> 3
> k3 --> 2
> k4 --> 0
> k5 --> 1
> k6 --> 3
>
> in sorted order by frequency
>
> k2 --> 3
> k6 --> 3
> k3 --> 2
> k5 --> 1
> k4 --> 0
>
> could we consider k2, k6 to be "related keywords" or "related tags",
> related to k1 ?
>
> and consider all other docs with k2 and k6 ( but not k1 ) to be "related
> docs"
>
>
> I am basically doing with the inverted index and keywords what we do
> with a database of tags and items to find related
> items.
>
> Nitin
>
> I.e. can I use the inverted index to compute "related" documents ?
>
>
>
>
> Enis Soztutar wrote:
>
>
>> Hi all, continuing the discussion, my experience with both Nutch and
>> tagging systems (especially Scuttle) has made me understand the
>> differences between them better.
>>
>>
>> First of all, when crawling and indexing web documents for keyword
>> extraction, the indexer indexes all the words in the document by first
>> tokenizing the text. Thus all the words, including those that can be
>> "tags" and those that cannot be, are indexed. As discussed, stop words
>> are a good example of this. But apart from stop words there are lots of
>> keywords that are not representative of the content of the document.
>> Moreover, not all the keywords or tags occur in the text of the site
>> you are indexing. For example, on Honda's page (www.honda.com), the
>> phrase "car manufacturer" does not exist. But links to Honda's page
>> possibly contain that information. Thus a page is also indexed with the
>> anchor text of the links pointing to it. Likewise, the tags of a page
>> contain info about what the page is about, and may contain words that
>> do not exist in the text of the page. Moreover, tokenization in tagging
>> may be different. A page may be tagged with multi-word tags (in some
>> tagging systems), although keywords are single words.
>>
>> Another major difference comes from the quality of the web pages that
>> are tagged. The major problem of most search engines (except the big
>> ones) is dealing with spam sites. Although lots of spamming methods
>> exist, the most frequent one is to list thousands of popular words on
>> the site. Thus the spam site becomes related to most search queries.
>> However, in my opinion, the web sites entered into a tagging system are
>> expected to be of higher quality. And I claim that the rank of a page
>> may be estimated using the number of people who have bookmarked that
>> page. Well, there are some search engines that bring together the idea
>> of tagging with indexing, and we will see if they prosper.
>>
>> Enis Soztutar
>>
>> Nitin Borwankar wrote:
>>
>>
>>
>>> Increasingly I have been getting interested in the vertical search space
>>> and have been looking at Nutch
>>> (www.nutch.org), built on top of Lucene, the Java text indexing/searching
>>> library.
>>>
>>> A question arises in my mind when I look at tokenization and inverted
>>> indexes etc... which are the bread and butter of IR and text search.....
>>>
>>> What is the fundamental difference between a set of search keywords as
>>> typed into a search bar vs a set of tags by which I search for something
>>> on del.icio.us ?
>>> It seems to me that if one were to throw out the obvious stop words
>>> etc., then the set of keywords (tokens) that, say, Lucene generates for
>>> a document is a good first-order set of (system-generated) tags for the
>>> document.
>>>
>>> Any comments or arguments one way or another?
>>> This has major implications for automated tagging, so I am really
>>> curious as to why this won't work.
>>>
>>> Nitin
>>>
>> _______________________________________________
>> Tagdb mailing list
>> Tagdb at lists.tagschema.com
>> http://lists.tagschema.com/mailman/listinfo/tagdb