[Tagdb] question about solr vs nutch

ogjunk-tagdb at yahoo.com ogjunk-tagdb at yahoo.com
Wed Nov 15 07:40:34 GMT 2006


Hi Nitin,

So are you asking about requests/sec over HTTP?  Because that would be applicable to Solr.
It depends.  Serial or parallel?  How often do you commit your Documents to the underlying index?  Are the disks fast, or are they the bottleneck?
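
To make the batching and commit questions concrete, here is a minimal Python sketch of posting documents to Solr's XML update handler in batches, with a single commit at the end.  The <add>/<doc>/<field> and <commit/> messages are Solr's XML update format; the URL, the batch size, and the helper names (build_add_xml, index_all, post_xml) are assumptions made for illustration, not anything taken from this thread.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(docs):
    """Serialize a batch of documents (a list of dicts) into one <add> message."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            # quoteattr/escape keep field names and values XML-safe.
            parts.append("<field name=%s>%s</field>" % (quoteattr(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def post_xml(url, xml):
    """POST one XML message over HTTP; the update URL is an assumption."""
    request = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_all(docs, url, batch_size=100, post=post_xml):
    """Send docs in batches, commit once at the end, return the request count."""
    requests_sent = 0
    for start in range(0, len(docs), batch_size):
        post(url, build_add_xml(docs[start:start + batch_size]))
        requests_sent += 1
    # One commit for the whole run rather than one per document.
    post(url, "<commit/>")
    return requests_sent + 1
```

With a batch size of 100, indexing 250 documents costs four HTTP requests (three adds plus one commit) rather than several hundred, which is where much of the per-request overhead goes away.  Something like index_all(docs, "http://localhost:8983/solr/update") would be the call, assuming Solr is listening at that address.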

I remember seeing very high numbers for docs/second (indexing) with Solr.  I no longer remember the exact numbers, but I do remember thinking they were higher than one would expect, precisely because Solr acts as a web service.  They should be in Solr's mail archives.  I _think_ the person who mentioned the number was Wunder - Walter Underwood.

Nutch is different.  The network is the bottleneck there, and how many pages/second you can get out of it depends on a lot of factors: your bandwidth, the number of fetching threads, the number of fetchers in the cluster, the distribution of hosts in the fetch list, etc.
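
As a back-of-the-envelope illustration of how those factors interact, the sketch below treats the achievable fetch rate as the smallest of three ceilings: bandwidth, fetcher threads, and per-host politeness.  This is a simplified model assumed for illustration, not how Nutch itself computes anything, and every parameter name and number in it is hypothetical.

```python
def fetch_rate_limits(bandwidth_bytes_per_sec, avg_page_bytes,
                      threads_per_fetcher, fetchers, avg_fetch_secs,
                      distinct_hosts, per_host_delay_secs):
    """Return rough pages/second ceilings for each factor (assumed model)."""
    return {
        # How many average-sized pages fit through the pipe each second.
        "bandwidth": bandwidth_bytes_per_sec / avg_page_bytes,
        # Each thread completes one fetch every avg_fetch_secs.
        "threads": threads_per_fetcher * fetchers / avg_fetch_secs,
        # Each host is hit at most once per politeness delay.
        "hosts": distinct_hosts / per_host_delay_secs,
    }


def estimate_pages_per_sec(**factors):
    """The overall estimate is the tightest of the individual ceilings."""
    limits = fetch_rate_limits(**factors)
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck
```

For example, a fetch list spread over only 50 hosts with a 5 second per-host delay caps the crawl at about 10 pages/second in this model, no matter how much bandwidth or how many threads are available, which is why the distribution of hosts in the fetch list matters so much.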

Otis

----- Original Message ----
From: Nitin Borwankar <nitin at borwankar.com>
To: ogjunk-tagdb at yahoo.com
Cc: tagdb at lists.tagschema.com
Sent: Wednesday, November 15, 2006 12:47:07 AM
Subject: Re: [Tagdb] question about solr vs nutch

ogjunk-tagdb at yahoo.com wrote:

>That's a bit of an apples and oranges comparison.  Ian already pointed out the most obvious/basic/biggest difference.  They are meant to solve different problems.  Moreover, if you play with Nutch, you will see that it's a rather complex and ambitious piece of software.  Solr is a lot smaller (code-wise) and simpler.  Again, it's hard to compare them, because they are really two pretty different things, even though they both do text indexing and searching.
>  
>

OK, I'll ask the question elsewhere, but what I was asking about was the 
overhead of the web-service submission, not the backend or the functional 
differences.

Nitin

>Otis (Lucene/Solr/Nutch developer)
>
>----- Original Message ----
>From: Nitin Borwankar <nitin at borwankar.com>
>To: tagdb at lists.tagschema.com
>Sent: Tuesday, November 14, 2006 6:10:13 PM
>Subject: [Tagdb] question about solr vs nutch
>
>Hi all,
>
>As there are some experts in text indexing on the list, I thought this 
>might be the best place to ask ....
>I see that solr ( http://incubator.apache.org/solr/ ) is an enterprise 
>search engine based on Lucene with a web-service api for submitting docs 
>to be indexed.
>Also that Nutch ( www.nutch.org )  is another search engine based on 
>Lucene which directly stores docs to disk before indexing.
>What is the performance hit of submitting docs by web-service in 
>comparison to the nutch approach, if this comparison makes sense at all?
>My interest is in the fielded search capabilities of solr, applied to 
>either LAN based docs or docs crawled from the web, but I am concerned 
>about the performance hit of web-service submission + XML overhead 
>compared to direct disk writes.
>
>Any enlightening thoughts?
>
>Nitin Borwankar





