First, with Solr 5, it’s this easy:

    bin/post -c collection_name /path/to/file.doc

Under the covers, that (currently) uses the SimplePostTool that has historically shipped with Solr as example/exampledocs/post.jar. You can use that tool directly. Here are some details:

    $ cd example/exampledocs
    $ java -jar post.jar -h
    SimplePostTool version 5.0.0
    Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
    …
    java -Durl=http://localhost:8983/solr/techproducts/update/extract -Dparams=literal.id=pdf1 -jar post.jar solr-word.pdf

You can use curl too; see <https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika> for more details, but something like this:

    curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "[log in to unmask]"

You’ll need to have /update/extract defined in your solrconfig.xml.

One interesting thing you can do is add &extractOnly=true, and it’ll return the XHTML version that Tika builds internally. This could be leveraged for troubleshooting or, perhaps more fun, for letting Solr do the parsing/extraction while your code deals with the parsed result rather than indexing it directly. A paste of that is below.

    Erik

# wt=ruby&indent=on makes the output look a lot nicer!
# -out yes causes the tool to output what Solr returns (which normally isn’t useful to see)
$ bin/post -c products -params "extractOnly=true&wt=ruby&indent=on" -out yes example/exampledocs/solr-word.pdf
java -classpath /Users/erikhatcher/solr-5.0.0/dist/solr-core-5.0.0.jar -Dauto=yes -Dparams=extractOnly=true&wt=ruby&indent=on -Dout=yes -Dc=products -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/solr-word.pdf
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/products/update?extractOnly=true&wt=ruby&indent=on...
Entering auto mode.
File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file solr-word.pdf (application/pdf) to [base]/extract
{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>10},
  ''=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2008-11-13T13:35:51Z"/>
<meta name="pdf:PDFVersion" content="1.3"/>
<meta name="xmp:CreatorTool" content="Microsoft Word"/>
<meta name="stream_content_type" content="application/pdf"/>
<meta name="Keywords" content="solr, word, pdf"/>
<meta name="subject" content="solr word"/>
<meta name="AAPL:Keywords" content="solr, word, pdf"/>
<meta name="dc:creator" content="Grant Ingersoll"/>
<meta name="dcterms:created" content="2008-11-13T13:35:51Z"/>
<meta name="Last-Modified" content="2008-11-13T13:35:51Z"/>
<meta name="dcterms:modified" content="2008-11-13T13:35:51Z"/>
<meta name="dc:format" content="application/pdf; version=1.3"/>
<meta name="Last-Save-Date" content="2008-11-13T13:35:51Z"/>
<meta name="meta:save-date" content="2008-11-13T13:35:51Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="solr-word"/>
<meta name="modified" content="2008-11-13T13:35:51Z"/>
<meta name="cp:subject" content="solr word"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="stream_size" content="21052"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="creator" content="Grant Ingersoll"/>
<meta name="meta:author" content="Grant Ingersoll"/>
<meta name="dc:subject" content="solr, word, pdf"/>
<meta name="meta:creation-date" content="2008-11-13T13:35:51Z"/>
<meta name="created" content="Thu Nov 13 13:35:51 UTC 2008"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2008-11-13T13:35:51Z"/>
<meta name="resourceName" content="/Users/erikhatcher/solr-5.0.0/example/exampledocs/solr-word.pdf"/>
<meta name="meta:keyword" content="solr, word, pdf"/>
<meta name="Author" content="Grant Ingersoll"/>
<meta name="producer" content="Mac OS X 10.5.5 Quartz PDFContext"/>
<title>solr-word</title>
</head>
<body>
<div class="page">
<p/>
<p>This is a test of PDF and Word extraction in Solr, it is only a test. Do not panic. </p>
<p/>
</div>
</body>
</html>
',
  'null_metadata'=>[
    'date',['2008-11-13T13:35:51Z'],
    'pdf:PDFVersion',['1.3'],
    'xmp:CreatorTool',['Microsoft Word'],
    'stream_content_type',['application/pdf'],
    'Keywords',['solr, word, pdf'],
    'subject',['solr word'],
    'AAPL:Keywords',['solr, word, pdf'],
    'dc:creator',['Grant Ingersoll'],
    'dcterms:created',['2008-11-13T13:35:51Z'],
    'Last-Modified',['2008-11-13T13:35:51Z'],
    'dcterms:modified',['2008-11-13T13:35:51Z'],
    'dc:format',['application/pdf; version=1.3'],
    'title',['solr-word'],
    'Last-Save-Date',['2008-11-13T13:35:51Z'],
    'meta:save-date',['2008-11-13T13:35:51Z'],
    'pdf:encrypted',['false'],
    'dc:title',['solr-word'],
    'modified',['2008-11-13T13:35:51Z'],
    'cp:subject',['solr word'],
    'Content-Type',['application/pdf'],
    'stream_size',['21052'],
    'X-Parsed-By',['org.apache.tika.parser.DefaultParser',
      'org.apache.tika.parser.pdf.PDFParser'],
    'creator',['Grant Ingersoll'],
    'meta:author',['Grant Ingersoll'],
    'dc:subject',['solr, word, pdf'],
    'meta:creation-date',['2008-11-13T13:35:51Z'],
    'created',['Thu Nov 13 13:35:51 UTC 2008'],
    'xmpTPg:NPages',['1'],
    'Creation-Date',['2008-11-13T13:35:51Z'],
    'resourceName',['/Users/erikhatcher/solr-5.0.0/example/exampledocs/solr-word.pdf'],
    'meta:keyword',['solr, word, pdf'],
    'Author',['Grant Ingersoll'],
    'producer',['Mac OS X 10.5.5 Quartz PDFContext']]}
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/products/update?extractOnly=true&wt=ruby&indent=on...
Time spent: 0:00:00.036

> On Feb 10, 2015, at 11:12 AM, Eric Lease Morgan <[log in to unmask]> wrote:
>
> Can somebody point me to a good tutorial on how to index Word documents using Solr?
>
> I have a few hundred Microsoft Word documents I want to search. Through the use of the Tika library it seems as if I ought to be able to index my Word documents directly into Solr, but none of the tutorials I have found on the Web are complete. Missing directories. Missing files. Documentation for versions unreleased. Etc.
>
> Put another way, Tika can create a (nice) XHTML file complete with some useful metadata that can all be fed to Solr for indexing, but I can barely get out of the starting gate. Have you indexed Word documents using Solr, and if so, then how?
>
> --
> Eric Morgan
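
P.S. On the extractOnly idea from the reply (letting Solr/Tika do the extraction and having your own code deal with the parsed result), here is a minimal Python sketch of what that could look like. It is an illustration under assumptions, not a tested recipe: the function names (build_extract_url, parse_tika_xhtml, extract_only) are made up for this example, it assumes /update/extract accepts the raw file as the POST body with a matching Content-Type (as the "POSTing file ... to [base]/extract" line in the paste suggests), and it assumes the XHTML comes back as the first string-valued entry in the JSON response, since the key name appears to vary.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by the XHTML that Tika emits (visible in the paste above).
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def build_extract_url(solr_base, collection, params=None):
    """Build an /update/extract URL with extractOnly=true and wt=json.

    `params` is an optional dict of extra query parameters (hypothetical helper).
    """
    query = {"extractOnly": "true", "wt": "json"}
    query.update(params or {})
    base = solr_base.rstrip("/")
    return f"{base}/{collection}/update/extract?{urllib.parse.urlencode(query)}"


def parse_tika_xhtml(xhtml):
    """Return (metadata, text) parsed from a Tika XHTML string.

    metadata maps each <meta name> to a list of its content values, because
    a name such as X-Parsed-By can occur more than once.
    """
    root = ET.fromstring(xhtml.strip().encode("utf-8"))
    metadata = {}
    for meta in root.iter(XHTML_NS + "meta"):
        metadata.setdefault(meta.get("name"), []).append(meta.get("content"))
    body = root.find(XHTML_NS + "body")
    # Collapse runs of whitespace so the text reads as one clean string.
    text = " ".join("".join(body.itertext()).split()) if body is not None else ""
    return metadata, text


def extract_only(path, content_type, url):
    """POST a file to Solr and return the XHTML string from the JSON response.

    Assumption: the XHTML is the first string value outside responseHeader.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    for key, value in payload.items():
        if key != "responseHeader" and isinstance(value, str):
            return value
    raise ValueError("no XHTML found in Solr response")


# Example use against a running Solr (not executed here):
#   url = build_extract_url("http://localhost:8983/solr", "techproducts")
#   xhtml = extract_only("solr-word.pdf", "application/pdf", url)
#   metadata, text = parse_tika_xhtml(xhtml)
```

Keeping the parsing step separate from the HTTP call means you can try parse_tika_xhtml on XHTML saved from a bin/post run before wiring anything up to a live Solr.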