LISTSERV 16.5 - CODE4LIB Archives

First, with Solr 5, it’s this easy:

   bin/post -c collection_name /path/to/file.doc

Under the covers, that (currently) uses the SimplePostTool, which has historically shipped with Solr as example/exampledocs/post.jar.

You can also use that tool directly.  Here are some details:

$ cd example/exampledocs
$ java -jar post.jar -h
SimplePostTool version 5.0.0
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
…

  java -Durl=http://localhost:8983/solr/techproducts/update/extract -Dparams=literal.id=pdf1 -jar post.jar solr-word.pdf

You can use curl too; see <https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika> for more details.  Something like this:

    curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "[log in to unmask]"
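
curl's -F option uploads a file when it is given a form field of the shape name=@path. Below is a minimal sketch for posting files this way, one request per file. The names SOLR_URL, extract_url, and post_doc, and the form field name "myfile", are illustrative assumptions, not part of Solr. The sketch also assumes file names are safe to use as ids without URL-encoding.

```shell
#!/bin/sh
# Sketch only: SOLR_URL, extract_url, post_doc and the "myfile" field name
# are illustrative choices, not part of Solr or of the original message.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/techproducts}"

# Build the Solr Cell URL for one document id.
# Assumes the id contains no characters that need URL-encoding.
extract_url() {
  printf '%s/update/extract?literal.id=%s&commit=true\n' "$SOLR_URL" "$1"
}

# Upload one file, using its base name as the unique id.
# With DRY_RUN set to a non-empty value, the command is printed, not run.
post_doc() {
  id=$(basename "$1")
  ${DRY_RUN:+echo} curl "$(extract_url "$id")" -F "myfile=@$1"
}
```

To post every Word file in a folder, something like `for f in /path/to/docs/*.doc; do post_doc "$f"; done` should work. Committing on every request is simple but slow for a few hundred files; dropping commit=true and committing once at the end is a common alternative.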

You’ll need to have /update/extract defined in your solrconfig.xml.

One interesting thing you can do is add &extractOnly=true, and it’ll return the XHTML version that Tika builds internally.  This could be leveraged for troubleshooting or, maybe even more fun, for letting Solr do the parsing/extraction while your code deals with the parsed result rather than indexing it directly.  A paste of that is below.

    Erik

# wt=ruby&indent=on makes the output look a lot nicer!  -out yes causes the tool to output what Solr returns (which normally isn’t useful to see)

$ bin/post -c products -params "extractOnly=true&wt=ruby&indent=on" -out yes example/exampledocs/solr-word.pdf 
java -classpath /Users/erikhatcher/solr-5.0.0/dist/solr-core-5.0.0.jar -Dauto=yes -Dparams=extractOnly=true&wt=ruby&indent=on -Dout=yes -Dc=products -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/solr-word.pdf
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/products/update?extractOnly=true&wt=ruby&indent=on...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file solr-word.pdf (application/pdf) to [base]/extract
{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>10},
  ''=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2008-11-13T13:35:51Z"/>
<meta name="pdf:PDFVersion" content="1.3"/>
<meta name="xmp:CreatorTool" content="Microsoft Word"/>
<meta name="stream_content_type" content="application/pdf"/>
<meta name="Keywords" content="solr, word, pdf"/>
<meta name="subject" content="solr word"/>
<meta name="AAPL:Keywords" content="solr, word, pdf"/>
<meta name="dc:creator" content="Grant Ingersoll"/>
<meta name="dcterms:created" content="2008-11-13T13:35:51Z"/>
<meta name="Last-Modified" content="2008-11-13T13:35:51Z"/>
<meta name="dcterms:modified" content="2008-11-13T13:35:51Z"/>
<meta name="dc:format" content="application/pdf; version=1.3"/>
<meta name="Last-Save-Date" content="2008-11-13T13:35:51Z"/>
<meta name="meta:save-date" content="2008-11-13T13:35:51Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="solr-word"/>
<meta name="modified" content="2008-11-13T13:35:51Z"/>
<meta name="cp:subject" content="solr word"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="stream_size" content="21052"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="creator" content="Grant Ingersoll"/>
<meta name="meta:author" content="Grant Ingersoll"/>
<meta name="dc:subject" content="solr, word, pdf"/>
<meta name="meta:creation-date" content="2008-11-13T13:35:51Z"/>
<meta name="created" content="Thu Nov 13 13:35:51 UTC 2008"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2008-11-13T13:35:51Z"/>
<meta name="resourceName" content="/Users/erikhatcher/solr-5.0.0/example/exampledocs/solr-word.pdf"/>
<meta name="meta:keyword" content="solr, word, pdf"/>
<meta name="Author" content="Grant Ingersoll"/>
<meta name="producer" content="Mac OS X 10.5.5 Quartz PDFContext"/>
<title>solr-word</title>
</head>
<body>
        <div class="page">
<p/>
<p>This is a test of PDF and Word extraction in Solr, it is only a test.  Do not panic. </p>
<p/>
</div>
</body>
</html>
',
  'null_metadata'=>[
    'date',['2008-11-13T13:35:51Z'],
    'pdf:PDFVersion',['1.3'],
    'xmp:CreatorTool',['Microsoft Word'],
    'stream_content_type',['application/pdf'],
    'Keywords',['solr, word, pdf'],
    'subject',['solr word'],
    'AAPL:Keywords',['solr, word, pdf'],
    'dc:creator',['Grant Ingersoll'],
    'dcterms:created',['2008-11-13T13:35:51Z'],
    'Last-Modified',['2008-11-13T13:35:51Z'],
    'dcterms:modified',['2008-11-13T13:35:51Z'],
    'dc:format',['application/pdf; version=1.3'],
    'title',['solr-word'],
    'Last-Save-Date',['2008-11-13T13:35:51Z'],
    'meta:save-date',['2008-11-13T13:35:51Z'],
    'pdf:encrypted',['false'],
    'dc:title',['solr-word'],
    'modified',['2008-11-13T13:35:51Z'],
    'cp:subject',['solr word'],
    'Content-Type',['application/pdf'],
    'stream_size',['21052'],
    'X-Parsed-By',['org.apache.tika.parser.DefaultParser',
      'org.apache.tika.parser.pdf.PDFParser'],
    'creator',['Grant Ingersoll'],
    'meta:author',['Grant Ingersoll'],
    'dc:subject',['solr, word, pdf'],
    'meta:creation-date',['2008-11-13T13:35:51Z'],
    'created',['Thu Nov 13 13:35:51 UTC 2008'],
    'xmpTPg:NPages',['1'],
    'Creation-Date',['2008-11-13T13:35:51Z'],
    'resourceName',['/Users/erikhatcher/solr-5.0.0/example/exampledocs/solr-word.pdf'],
    'meta:keyword',['solr, word, pdf'],
    'Author',['Grant Ingersoll'],
    'producer',['Mac OS X 10.5.5 Quartz PDFContext']]}
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/products/update?extractOnly=true&wt=ruby&indent=on...
Time spent: 0:00:00.036
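
The same extract-only request can be sketched with curl. The parameters are the ones passed to bin/post above; the function name extract_only, the SOLR_URL variable, and the form field name "myfile" are assumptions made for illustration, and the response keys may differ from the paste depending on how the file is uploaded.

```shell
#!/bin/sh
# Sketch only: extract_only, SOLR_URL and "myfile" are illustrative names.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/products}"

# Ask Solr Cell to parse a file and return Tika's XHTML without indexing it.
# With DRY_RUN set to a non-empty value, the command is printed, not run.
extract_only() {
  ${DRY_RUN:+echo} curl \
    "$SOLR_URL/update/extract?extractOnly=true&wt=ruby&indent=on" \
    -F "myfile=@$1"
}
```

Calling `extract_only example/exampledocs/solr-word.pdf` would then send the request, and setting DRY_RUN=1 first shows the command without contacting Solr.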



> On Feb 10, 2015, at 11:12 AM, Eric Lease Morgan wrote:
> 
> Can somebody point me to a good tutorial on how to index Word documents using Solr?
> 
> I have a few hundred Microsoft Word documents I want to search. Through the use of the Tika library it seems as if I ought to be able to index my Word documents directly into Solr, but none of the tutorials I have found on the Web are complete. Missing directories. Missing files. Documentation for versions unreleased. Etc.
> 
> Put another way, Tika can create a (nice) XHTML file complete with some useful metadata that can all be fed to Solr for indexing, but I can barely get out of the starting gate. Have you indexed Word documents using Solr, and if so, then how? 
> 
> —
> Eric Morgan