First, with Solr 5, it’s this easy:
bin/post -c collection_name /path/to/file.doc
Under the covers, that (currently) uses SimplePostTool, which has historically shipped with Solr as example/exampledocs/post.jar.
You can use that tool directly. Here are some details:
$ cd example/exampledocs
$ java -jar post.jar -h
SimplePostTool version 5.0.0
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
…
java -Durl=http://localhost:8983/solr/techproducts/update/extract -Dparams=literal.id=pdf1 -jar post.jar solr-word.pdf
You can use curl too; see <https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika> for more details. Something like this:
curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "[log in to unmask]"
You’ll need to have /update/extract defined in your solrconfig.xml.
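If you would rather script this than shell out to curl, here is a minimal Python sketch of the same request using only the standard library. It sends the raw file bytes as the request body with a Content-Type header instead of a multipart form. The helper names (extract_url, index_file) are made up for illustration, and the base URL and collection are simply taken from the examples above, so treat it as a starting point rather than a tested client.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a local Solr instance, as in the curl example above.
SOLR_BASE = "http://localhost:8983/solr"


def extract_url(collection, doc_id, commit=True, base=SOLR_BASE):
    """Build the /update/extract URL with a literal.id parameter."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return f"{base}/{collection}/update/extract?{urlencode(params)}"


def index_file(path, collection, doc_id, content_type):
    """POST the raw file bytes to /update/extract and return the response text.

    content_type is the MIME type of the file, e.g. "application/pdf".
    """
    request = Request(
        extract_url(collection, doc_id),
        data=Path(path).read_bytes(),
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Something like index_file("solr-word.pdf", "techproducts", "doc1", "application/pdf") would then index one file; looping over a directory of Word documents is a small step from there.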
One interesting thing you can do is add &extractOnly=true, and it’ll return the XHTML version that Tika builds internally. This can be useful for troubleshooting or, maybe even more fun, for letting Solr do the parsing/extraction while your own code deals with the parsed result rather than indexing it directly. A paste of that is below.
Erik
# wt=ruby&indent=on makes the output look a lot nicer! -out yes causes the tool to output what Solr returns (which normally isn’t useful to see)
$ bin/post -c products -params "extractOnly=true&wt=ruby&indent=on" -out yes example/exampledocs/solr-word.pdf
java -classpath /Users/erikhatcher/solr-5.0.0/dist/solr-core-5.0.0.jar -Dauto=yes -Dparams=extractOnly=true&wt=ruby&indent=on -Dout=yes -Dc=products -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/solr-word.pdf
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/products/update?extractOnly=true&wt=ruby&indent=on...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file solr-word.pdf (application/pdf) to [base]/extract
{
'responseHeader'=>{
'status'=>0,
'QTime'=>10},
''=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date"
content="2008-11-13T13:35:51Z"/>
<meta name="pdf:PDFVersion"
content="1.3"/>
<meta name="xmp:CreatorTool"
content="Microsoft Word"/>
<meta name="stream_content_type"
content="application/pdf"/>
<meta name="Keywords"
content="solr, word, pdf"/>
<meta name="subject"
content="solr word"/>
<meta name="AAPL:Keywords"
content="solr, word, pdf"/>
<meta name="dc:creator"
content="Grant Ingersoll"/>
<meta name="dcterms:created"
content="2008-11-13T13:35:51Z"/>
<meta name="Last-Modified"
content="2008-11-13T13:35:51Z"/>
<meta
name="dcterms:modified" content="2008-11-13T13:35:51Z"/>
<meta
name="dc:format" content="application/pdf; version=1.3"/>
<meta
name="Last-Save-Date" content="2008-11-13T13:35:51Z"/>
<meta
name="meta:save-date" content="2008-11-13T13:35:51Z"/>
<meta
name="pdf:encrypted" content="false"/>
<meta name="dc:title"
content="solr-word"/>
<meta name="modified"
content="2008-11-13T13:35:51Z"/>
<meta name="cp:subject"
content="solr word"/>
<meta name="Content-Type"
content="application/pdf"/>
<meta name="stream_size"
content="21052"/>
<meta name="X-Parsed-By"
content="org.apache.tika.parser.DefaultParser"/>
<meta
name="X-Parsed-By"
content="org.apache.tika.parser.pdf.PDFParser"/>
<meta
name="creator" content="Grant Ingersoll"/>
<meta name="meta:author"
content="Grant Ingersoll"/>
<meta name="dc:subject"
content="solr, word, pdf"/>
<meta name="meta:creation-date"
content="2008-11-13T13:35:51Z"/>
<meta name="created"
content="Thu Nov 13 13:35:51 UTC 2008"/>
<meta
name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date"
content="2008-11-13T13:35:51Z"/>
<meta name="resourceName"
content="/Users/erikhatcher/solr-5.0.0/example/exampledocs/solr-word.pdf"/>
<meta
name="meta:keyword" content="solr, word, pdf"/>
<meta name="Author"
content="Grant Ingersoll"/>
<meta name="producer" content="Mac OS X 10.5.5 Quartz PDFContext"/>
<title>solr-word</title>
</head>
<body>
<div class="page">
<p/>
<p>This is a test of PDF and Word extraction in Solr, it is only a test. Do not panic. </p>
<p/>
</div>
</body>
</html>
',
'null_metadata'=>[
'date',['2008-11-13T13:35:51Z'],
'pdf:PDFVersion',['1.3'],
'xmp:CreatorTool',['Microsoft Word'],
'stream_content_type',['application/pdf'],
'Keywords',['solr, word, pdf'],
'subject',['solr word'],
'AAPL:Keywords',['solr, word, pdf'],
'dc:creator',['Grant Ingersoll'],
'dcterms:created',['2008-11-13T13:35:51Z'],
'Last-Modified',['2008-11-13T13:35:51Z'],
'dcterms:modified',['2008-11-13T13:35:51Z'],
'dc:format',['application/pdf; version=1.3'],
'title',['solr-word'],
'Last-Save-Date',['2008-11-13T13:35:51Z'],
'meta:save-date',['2008-11-13T13:35:51Z'],
'pdf:encrypted',['false'],
'dc:title',['solr-word'],
'modified',['2008-11-13T13:35:51Z'],
'cp:subject',['solr word'],
'Content-Type',['application/pdf'],
'stream_size',['21052'],
'X-Parsed-By',['org.apache.tika.parser.DefaultParser',
'org.apache.tika.parser.pdf.PDFParser'],
'creator',['Grant Ingersoll'],
'meta:author',['Grant Ingersoll'],
'dc:subject',['solr, word, pdf'],
'meta:creation-date',['2008-11-13T13:35:51Z'],
'created',['Thu Nov 13 13:35:51 UTC 2008'],
'xmpTPg:NPages',['1'],
'Creation-Date',['2008-11-13T13:35:51Z'],
'resourceName',['/Users/erikhatcher/solr-5.0.0/example/exampledocs/solr-word.pdf'],
'meta:keyword',['solr, word, pdf'],
'Author',['Grant Ingersoll'],
'producer',['Mac OS X 10.5.5 Quartz PDFContext']]}
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/products/update?extractOnly=true&wt=ruby&indent=on...
Time spent: 0:00:00.036
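To give a feel for the "let your code deal with the parsed result" idea, here is a rough Python sketch that picks apart an extractOnly response like the one above. It assumes you requested wt=json instead of wt=ruby and have already decoded it into a dict, and that the JSON has the same shape as the Ruby output: one key holding the XHTML string and one key ending in _metadata holding a flat name/values list. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XHTML that Tika produces (see the <html> element above).
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def split_extract_response(response):
    """Return (xhtml, metadata) from a decoded extractOnly response dict.

    Assumes metadata arrives as a flat list: name, [values], name, [values], ...
    """
    xhtml, metadata = None, {}
    for key, value in response.items():
        if key == "responseHeader":
            continue
        if key.endswith("_metadata"):
            # Pair up even-indexed names with odd-indexed value lists.
            metadata = dict(zip(value[0::2], value[1::2]))
        else:
            xhtml = value
    return xhtml, metadata


def body_paragraphs(xhtml):
    """Return the non-empty paragraph texts found in the XHTML."""
    # Parse as bytes so the encoding declaration in the XML prolog is honored.
    root = ET.fromstring(xhtml.strip().encode("utf-8"))
    paragraphs = []
    for p in root.iter(f"{XHTML_NS}p"):
        text = "".join(p.itertext()).strip()
        if text:
            paragraphs.append(text)
    return paragraphs
```

From there you could feed the paragraphs and whichever metadata fields you care about into your own documents and send those to /update yourself.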
> On Feb 10, 2015, at 11:12 AM, Eric Lease Morgan <[log in to unmask]> wrote:
>
> Can somebody point me to a good tutorial on how to index Word documents using Solr?
>
> I have a few hundred Microsoft Word documents I want to search. Through the use of the Tika library it seems as if I ought to be able to index my Word documents directly into Solr, but none of the tutorials I have found on the Web are complete. Missing directories. Missing files. Documentation for versions unreleased. Etc.
>
> Put another way, Tika can create a (nice) XHTML file complete with some useful metadata that can all be fed to Solr for indexing, but I can barely get out of the starting gate. Have you indexed Word documents using Solr, and if so, then how?
>
> —
> Eric Morgan