On Nov 23, 2007, at 7:50 AM, Michael Lackhoff wrote:
>> You can also bring data into Solr using the CSV importer.  I highly
>> recommend folks take a good look at this route.  It's clean, easy,
>> fast:
>> <http://wiki.apache.org/solr/UpdateCSV>
>
> That sounds like what I need. Only problem I see: what about escapes?
> I don't know my data well enough to be sure that any possible delimiter
> will never occur within the data. Most exotic characters will probably
> be errors but I still don't want SOLR to choke on them.
> Can I use escapes for separator and/or encapsulator? If so, is it \" or
> "" (backslash or doubling)? I found nothing in the docs about it.

I prefer tab-delimited files myself.  Tabs are otherwise worthless
characters that hold great value as a field separator, and only as a
field separator.  Ever.

At the bottom of that wiki page you'll see how to do it with
tab-delimited files.  But as you're creating that data, ensure that
your field data contains no tabs except as separators.  Then you're in
business.  That beats having to worry about quotes and commas and
escaping.
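
To make the "no tabs in the field data" rule concrete, here is a minimal
Python sketch of how the export side might clean values before writing
them out.  The function names (to_tsv_line, write_tsv) and the choice to
replace stray tabs and line breaks with a single space are assumptions of
mine for illustration, not something the CSV importer requires.  If you
also configure an escape character on the Solr side, that character may
need its own handling in your data.

```python
def to_tsv_line(fields):
    """Join field values with tabs, after replacing any tab, CR, or LF
    found inside a value with a space (hypothetical cleanup policy)."""
    cleaned = []
    for value in fields:
        text = "" if value is None else str(value)
        for ch in ("\t", "\r", "\n"):
            text = text.replace(ch, " ")
        cleaned.append(text)
    return "\t".join(cleaned)


def write_tsv(rows, out):
    """Write an iterable of rows (each a sequence of values) to a
    text file object, one tab-delimited record per line."""
    for row in rows:
        out.write(to_tsv_line(row) + "\n")
```

Usage would be something like opening the output file with
open(path, "w", encoding="utf-8", newline="") and passing it to
write_tsv along with your records.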

>> For 700,000 records, one first nice step to try is to convert that
>> data into CSV and feed it into Solr.  Create a CSV file on the file
>> system with all those records and use the CSV importer.  I think
>> you'll find that the absolute fastest way to bring data in.   But
>
> It even looks like the direct way (almost) without HTTP since the file
> is read directly from the file system and doesn't have to be squeezed
> through a socket connection.

Right - you can feed the CSV data to it as a path to a local (to the
Solr server) file, or stream the data in via HTTP.
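
As a rough sketch of those two options, the Python below only builds
the requests and does not send them.  The base URL, port, and file path
are placeholders I made up.  The separator, commit, and stream.file
parameters are the ones I remember from the UpdateCSV wiki page, so
check them against your Solr version.  Reading a server-local file also
assumes remote streaming is enabled in your Solr configuration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def csv_update_url(base_url, local_path=None, separator="\t", commit=True):
    """Build a CSV update URL. If local_path is given, Solr is asked to
    read that file from its own file system via stream.file."""
    params = {
        "separator": separator,  # a tab is encoded as %09
        "commit": "true" if commit else "false",
    }
    if local_path is not None:
        params["stream.file"] = local_path
    return base_url + "?" + urlencode(params)


def csv_stream_request(base_url, data, separator="\t", commit=True):
    """Build (but do not send) a POST request that streams the
    delimited data itself in the HTTP body."""
    url = csv_update_url(base_url, separator=separator, commit=commit)
    headers = {"Content-Type": "text/plain; charset=utf-8"}
    return Request(url, data=data.encode("utf-8"), headers=headers,
                   method="POST")


# Placeholder endpoint and path, for illustration only.
BASE = "http://localhost:8983/solr/update/csv"
local_url = csv_update_url(BASE, local_path="/tmp/books.tsv")
stream_req = csv_stream_request(BASE, "id\ttitle\n1\tExample\n")
```

Sending stream_req with urllib.request.urlopen would perform the HTTP
upload, while fetching local_url tells Solr to read the file itself.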

But again, don't concern yourself too much at this stage with the
overhead of HTTP.  Most people have found that it is not the
bottleneck in large indexing jobs, especially when the indexer code
runs on the Solr server itself or on the same local network.  But the
CSV route takes that out of the picture and provides a very clean and
flexible conduit into Solr.

I call it "column separated values", now that I've finally understood
the real reason for the existence of tabs.

        Erik