On Nov 22, 2007, at 12:11 PM, Binkley, Peter wrote:
> There is a way to pass Solr a path to a file that it can read from
> disk
> rather than posting the file. I hunted a bit in the wiki and couldn't
> find it, though; it may still be a patch you have to apply.

Solr ships with examples that can be posted in using post.sh or
post.jar.  See the README.txt in the example directory.  You can run
it either as:

        post.sh *.xml

   - or -

        java -jar post.jar *.xml

As far as I know there is no way to avoid POSTing the XML - there is
no direct import of an XML file without HTTP (short of getting down
and dirty and writing against the embedded Solr API, which is
discouraged for several reasons).

You can also bring data into Solr using the CSV importer.  I highly
recommend folks take a good look at this route.  It's clean, easy, fast:
<http://wiki.apache.org/solr/UpdateCSV>
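As a sketch of that CSV route (the `id`/`title`/`author` field names, the
record values, and the core URL in the comment are assumptions for
illustration, not from the thread), you could write the records out as a
CSV file with a header row and then send it to the CSV update handler:

```python
import csv
import io

# Hypothetical records - in practice these would come from your database.
records = [
    {"id": "1", "title": "Solr in Action", "author": "Smith, James"},
    {"id": "2", "title": "Lucene Basics", "author": "Miller, Steve"},
]

def records_to_csv(records, fields):
    """Write records in the shape Solr's CSV handler expects:
    a header row naming the fields, then one row per document."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

csv_data = records_to_csv(records, ["id", "title", "author"])

# The resulting file could then be POSTed to the CSV handler, e.g.:
#   curl 'http://localhost:8983/solr/update/csv?commit=true' \
#        --data-binary @records.csv \
#        -H 'Content-type:text/plain; charset=utf-8'
with open("records.csv", "w", encoding="utf-8") as f:
    f.write(csv_data)
```

The wiki page above documents the handler's parameters (separator,
encapsulator, field splitting for multi-valued fields, and so on).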

For 700,000 records, a good first step is to convert the data to CSV
and feed it into Solr: create a CSV file on the file system with all
those records and use the CSV importer.  I think you'll find that the
absolute fastest way to bring data in.  Another good way is to POST
the XML in batches, say 100 or 1000 records per POST.  Also be aware
of how you have Solr configured for commits - for a bulk import,
don't commit until the end, so that Solr can operate most
efficiently.  You can watch the Solr stats page to see how many
documents have arrived.
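A minimal sketch of that batched-POST approach, assuming the default
example update URL and illustrative field names (only the XML building is
exercised here; the HTTP calls of course need a running Solr):

```python
import urllib.request
from xml.sax.saxutils import escape

# Assumed default URL from the Solr example setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def batch_to_add_xml(batch):
    """Render a batch of records (dicts of field name -> value)
    as a single Solr <add> update message."""
    docs = []
    for rec in batch:
        fields = "".join(
            '<field name="%s">%s</field>' % (escape(name), escape(value))
            for name, value in rec.items()
        )
        docs.append("<doc>%s</doc>" % fields)
    return "<add>%s</add>" % "".join(docs)

def post_xml(xml):
    """POST one update message to Solr over HTTP."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    return urllib.request.urlopen(req).read()

def bulk_import(records, batch_size=1000):
    """POST records in batches; commit only once, at the very end."""
    for i in range(0, len(records), batch_size):
        post_xml(batch_to_add_xml(records[i:i + batch_size]))
    post_xml("<commit/>")  # single commit after the bulk import
```

The single `<commit/>` at the end is the point Erik makes above: holding
off the commit until the whole batch is in lets Solr work most
efficiently.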

        Erik



>
> Peter
>
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On
> Behalf Of
> Michael Lackhoff
> Sent: Thursday, November 22, 2007 1:03 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] Getting started with SOLR
>
> Hello,
>
> I am just getting my feet wet with SOLR and have a couple of questions
> about how others have done certain things.
>
> I created a schema.xml where basically every field is of type
> "text" for
> the beginning. Do you use specialized types for authors or ISBNs or
> other fields?
> How do you handle multi-value fields? Do you feed everything into a
> single field (like "Smith, James ; Miller, Steve", as I have seen in a
> pure Lucene implementation of a colleague), or do you use the
> multiValued feature of SOLR?
>
> What about boosting? I thought of giving the current year a
> boost="3.0"
> and then 0.1 less for every year the title is older, down to 1.0 for a
> 21-year-old book. The idea is to have a sort that tends to promote
> recent titles but still respects other aspects. Does this sound
> reasonable or are there other ideas? I would be very interested in an
> actual boosting-scheme from where I could start.
>
> We have a couple of databases that should eventually be indexed. Do you
> build one huge database with an additional "database" field, or is it
> better to have every database in its own SOLR instance?
>
> How do you fill the index? Our main database has about 700,000 records
> and I don't know if I should build one huge XML-file and feed that
> into
> SOLR or use a script that sends one record at a time with a commit
> after
> every 1000 records or so. Or do something in between and split it into
> chunks of a few thousand records each? What are your experiences? What
> if a record gives an error? Will the whole file be rejected or just
> that one record?
> Are there alternatives to the HTTP gateway?
> Are there any Perl scripts around that could help? I built a little
> script that uses LWP to feed my test records into the database. It
> works, but I don't have any error handling yet and the XML creation is
> quick and dirty, so if there is something more mature I would like to
> use that.
>
> Any other ideas, further reading, experiences...?
>
> I know these are a lot of questions, but after the conference last
> year I think there is lots of expertise in this group, and perhaps I
> can avoid a few beginner mistakes with your help.
>
> thanks in advance
> - Michael