On Nov 23, 2007, at 7:50 AM, Michael Lackhoff wrote:
>> You can also bring data into Solr using the CSV importer. I highly
>> recommend folks take a good look at this route. It's clean, easy,
>> fast:
>> <http://wiki.apache.org/solr/UpdateCSV>
>
> That sounds like what I need. Only problem I see: what about
> escapes? I don't know my data good enough to be sure that any
> possible delimiter will never occur within the data. Most exotic
> characters will probably be errors but I still don't want SOLR to
> choke on it.
> Can I use escapes for separator and/or encapsulator? If so is it \" or
> "" (backslash or doubling)? I found nothing in the docs about it.

I prefer tab-delimited files myself. Tabs are those worthless
characters that actually hold great value as a separator, only as a
field separator. Ever. At the bottom of that wiki page you'll see how
to do it with tab delimited files. But as you're creating that data,
ensure that your field data is void of tabs except as a separator.
Then you're in business. That beats having to worry about quotes and
commas and escaping.

>> For 700,000 records, one first nice step to try is to convert that
>> data into CSV and feed it into Solr. Create a CSV file on the file
>> system with all those records and use the CSV importer. I think
>> you'll find that the absolute fastest way to bring data in. But
>
> It even looks like the direct way (almost) without HTTP since the file
> is read directly from the file system and doesn't have to be squeezed
> through a socket connection.

Right - you can feed the CSV data to it as a path to a local (to the
Solr server) file, or stream the data in via HTTP. But again, don't
concern yourself too much at this stage with the overhead of HTTP.
Most have found it not to be the bottleneck in big indexing,
especially since the indexer code is running on the Solr server itself
or on a local network.
But the CSV route takes that out of the picture and provides a very
clean and flexible conduit into Solr. I call it "column separated
values", now that I've finally understood the real reason for the
existence of tabs.

Erik
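
To make the tab-delimited approach concrete, here is a minimal Python sketch. It is an illustration under assumptions, not code from Solr or from the wiki page: the helper names (clean_field, to_tsv_line, write_tsv, build_update_url), the base URL and the file path are all hypothetical. The request parameters (separator, stream.file, stream.contentType, commit) are the ones I understand the UpdateCSV page to describe, so check them against your Solr version before relying on them.

```python
import re
from urllib.parse import urlencode

# Tabs, carriage returns and newlines would break a one-record-per-line,
# tab-separated layout, so each run of them is collapsed to a single space.
_BREAKING = re.compile(r"[\t\r\n]+")


def clean_field(value):
    """Return the value as text with no tab or line-break characters."""
    return _BREAKING.sub(" ", str(value))


def to_tsv_line(fields):
    """Join cleaned fields with tabs; tabs now appear only as separators."""
    return "\t".join(clean_field(f) for f in fields)


def write_tsv(path, header, rows):
    """Write a header line plus one tab-separated line per record."""
    with open(path, "w", encoding="utf-8", newline="") as out:
        out.write(to_tsv_line(header) + "\n")
        for row in rows:
            out.write(to_tsv_line(row) + "\n")


def build_update_url(base_url, server_side_path):
    """Build a CSV-update URL for a file that is local to the Solr server.

    Assumption: parameter names follow the UpdateCSV wiki page.
    urlencode turns the tab separator into %09.
    """
    params = {
        "stream.file": server_side_path,
        "stream.contentType": "text/plain;charset=utf-8",
        "separator": "\t",
        "commit": "true",
    }
    return base_url + "?" + urlencode(params)
```

For example, write_tsv("/tmp/records.tsv", ["id", "name"], rows) followed by build_update_url("http://localhost:8983/solr/update/csv", "/tmp/records.tsv") gives a file and a URL that can be requested with curl or any HTTP client. The cleaning step replaces stray tabs rather than escaping them, which is what keeps the separator unambiguous. One caveat: a field that starts with a double quote may still be read as encapsulated by default, so the encapsulator setting is worth checking.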