One of the biggest bangs you can get out of indexing is to multi-
thread it pretty heavily. Solr can accept lots of simultaneous
connections.
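For example, a client-side thread pool where each worker POSTs its own
batches can keep Solr busy. A rough, untested Java sketch (the thread
count is arbitrary, loadBatches() is a stand-in for pulling records out
of your database, and postBatch() is the helper sketched a little
further down):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // Four client-side worker threads, each POSTing its own batches to Solr.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Stand-in for pulling records from your database and turning them
        // into batches of <doc> XML fragments.
        List<List<String>> batches = loadBatches();

        for (List<String> batch : batches) {
            pool.submit(() -> {
                try {
                    BatchPoster.postBatch(batch);   // helper sketched further down
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        // Commit once at the end rather than from every thread.
    }

    private static List<List<String>> loadBatches() {
        return new ArrayList<>();   // placeholder
    }
}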
Folks worried about the HTTP communication are typically those that
are new to Solr and see that as a bottleneck without measuring.
Those that have done the measuring have found that HTTP is not really a
significant factor in indexing performance.
When Lucene 2.3 is dropped into Solr (coming soon), indexing speed
will improve even more, thanks to core improvements in Lucene's
indexing internals.
But POSTing more than one document at a time is also good advice.
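The XML update handler accepts any number of <doc> elements inside a
single <add>, so you can buffer documents client-side and send them in,
say, groups of 100. A rough, untested sketch (the update URL and the
document fragments are assumptions; with named cores the URL is
/solr/<core>/update):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

public class BatchPoster {
    // Classic update URL; adjust for your installation.
    private static final String UPDATE_URL = "http://localhost:8983/solr/update";

    // Wrap several pre-built <doc> fragments in one <add> and POST them together.
    static void postBatch(List<String> docXmlFragments) throws Exception {
        StringBuilder body = new StringBuilder("<add>");
        for (String doc : docXmlFragments) {
            body.append(doc);   // e.g. "<doc><field name=\"id\">1</field></doc>"
        }
        body.append("</add>");

        HttpURLConnection conn =
                (HttpURLConnection) new URL(UPDATE_URL).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(body.toString().getBytes("UTF-8"));
        out.close();
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("Solr update failed: HTTP "
                    + conn.getResponseCode());
        }
        conn.disconnect();
    }
}

Once all the batches are in, POST a single <commit/> to the same URL.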
Erik
On Nov 23, 2007, at 7:03 AM, Ewout Van Troostenberghe wrote:
>>> How do you fill the index? Our main database has about 700,000
>>> records and I don't know if I should build one huge XML file and
>>> feed that into SOLR or use a script that sends one record at a
>>> time with a commit after every 1000 records or so. Or do something
>>> in between and split it into chunks of a few thousand records
>>> each? What are your experiences? What if a record gives an error?
>>> Will the whole file be rejected or just that one record?
>> There is a Java command line tool, or you can look at VuFind's
>> solution. If you can, I suggest preferring a pure Java
>> solution that writes directly to the Solr index (with the Solr
>> API), because it is much, much quicker than the PHP
>> (Rails, Perl) solutions, which go through the web service
>> (and pay for the PHP parsing and HTTP request overhead).
>> The PHP solution does nothing with Solr directly; it
>> uses the web service, and all of that code could be rewritten
>> in Perl.
>
> When you want to use a scripting language to fill the Solr index,
> rather than using the Solr API directly, you should consider buffering
> as an intermediate solution. It can speed up indexing by orders of
> magnitude. Create your XML in the script and keep the documents in
> memory until you have 50 or 100 of them, then post these together.
>
> Attached is a small Ruby script we use to do Solr indexing. It reads
> YAML records from standard input, does some processing (buried in our
> libraries), buffers the result, and posts after 100 records are
> gathered.
>
> Regards,
> Ewout
> <index_solr_fast.rb>
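For anyone going the pure-Java route mentioned in the quoted advice, a
current SolrJ client could look roughly like the sketch below (the core
name "mycore" and the field names are made up; adjust to your schema).
It still talks to Solr over HTTP, but as noted at the top of this
message, that is rarely the bottleneck once you batch and keep commits
rare.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrJIndexer {
    public static void main(String[] args) throws Exception {
        // Assumed core URL; change to match your installation.
        SolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        try {
            List<SolrInputDocument> buffer = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {          // stand-in for your 700,000 records
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title", "Record " + i);
                buffer.add(doc);
                if (buffer.size() == 100) {           // send batches, not single documents
                    client.add(buffer);
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                client.add(buffer);
            }
            client.commit();                          // one commit at the end
        } finally {
            client.close();
        }
    }
}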