These requirements fit Swish-e [1] to a "T". I've used it to index
millions of XML records [2], and there are no particular requirements
for the XML -- it just needs to be well-formed. You can have it
automatically detect and index XML fields as well as index all words
across all fields. This is all handled by a very simple text config
file. The only downside is you will need to write the user interface
(CGI) in your favorite language to interact with Swish-e.

For example, here is my entire config file for Current Cites [3],
where I store citations in my own XML format:

DefaultContents XML*
UndefinedMetaTags auto
IndexDir /home/tennantr/public_html/currentcites/cites/
ReplaceRules remove /home/tennantr/public_html/currentcites/cites/
PropertyNames creator title description booktitle source
IndexOnly .xml

The first line tells Swish-e to expect XML, and "UndefinedMetaTags auto"
tells it to keep track of any XML tag it sees. The next two lines
tell it where the files are and remove the path from the index, so
each file comes back in the results without the server path included.
The "PropertyNames" line defines which elements are actually stored in
the index, which I can then retrieve directly in the search results
for display to the user. The "IndexOnly .xml" line tells Swish-e to
ignore anything without that filename extension. Nothing could be
easier.
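
As a rough illustration of the user-interface side mentioned above,
here is a minimal Python sketch of a wrapper that a CGI script could
call. It is not the code behind Current Cites. The config file name
"swish.conf", the index file name "index.swish-e", and the helper
function names are assumptions made for the example; the "-c", "-f",
"-w", "-m", "-H" and "-x" switches are standard Swish-e command-line
options, and "title" and "creator" are two of the properties listed
on the PropertyNames line.

```python
import subprocess

SWISH_BIN = "swish-e"          # assumed to be on the PATH
CONFIG_FILE = "swish.conf"     # hypothetical name for the config shown above
INDEX_FILE = "index.swish-e"   # assumed index file name

# Output template for -x: tab-separated rank, path, and two stored properties.
# Swish-e itself interprets the \t and \n escapes, so a raw string is used.
RESULT_FORMAT = r"<swishrank>\t<swishdocpath>\t<title>\t<creator>\n"


def build_index_command(config_file=CONFIG_FILE, index_file=INDEX_FILE):
    """Argument list that (re)builds the index from the config file."""
    return [SWISH_BIN, "-c", config_file, "-f", index_file]


def build_search_command(query, index_file=INDEX_FILE, max_results=20):
    """Argument list that runs one search against the index."""
    return [
        SWISH_BIN,
        "-f", index_file,
        "-w", query,
        "-m", str(max_results),
        "-H", "0",               # suppress header lines
        "-x", RESULT_FORMAT,
    ]


def parse_results(output):
    """Turn tab-separated result lines into a list of dicts."""
    results = []
    for line in output.splitlines():
        fields = line.split("\t")
        # Skip anything that is not a result row, such as "err:" messages
        # or a closing "." line.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        rank, path, title, creator = fields
        results.append(
            {"rank": int(rank), "path": path, "title": title, "creator": creator}
        )
    return results


def search(query):
    """Run a search and return parsed results (requires Swish-e installed)."""
    # Passing a list (no shell) keeps user input from being shell-interpreted.
    proc = subprocess.run(
        build_search_command(query), capture_output=True, text=True, check=False
    )
    return parse_results(proc.stdout)
```

A CGI script would read the query from the form, call search(), and
print each result as a link built from the returned path. Because the
user's text is handed to Swish-e as a single argument rather than
through a shell, it cannot inject extra commands, although Swish-e
will still interpret it as search syntax.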
Roy

[1] http://swish-e.org/
[2] http://roytennant.com/proto/hathi/
[3] http://lists.webjunction.org/currentcites/

On Wed, Mar 16, 2011 at 8:00 AM, Edward M. Corrado <[log in to unmask]> wrote:
> Hi,
>
> I [will soon] have a small set (< 1000 records) of Dublin Core
> metadata published in OAI_DC format that I want to be searchable via a
> Web browser.  Normally we would use Ex Libris's Primo for this, but
> this particular set of data may have some confidential information and
> our repository only has minimal built in search functions. While we
> still may go with Primo for these records, I am looking at other
> possibilities. The requirements as I see them are:
>
> 1) Can ingest records in OAI_DC format
> 2) Allow remote end-users who are familiar with the collection to
> search these ingested records via a Web browser.
> 3) Search should be keyword anywhere or individual fields although it
> does not need to have every whizzbang feature out there. In other
> words, basic search features are fine.
> 4) Should support the ability to link to the display copy in our
> repository (probably goes without saying)
> 5) Should be simple to install and maintain (Thus, at least in my
> mind, eliminating something like Blacklight)
> 6) Preferably a LAMP application although a Windows server based
> solution is a possibility as well
> 7) Preferably Open Source, or at least no- or low-cost
>
> I haven't been able to find anything searching the Web, but it seems
> like something people may have done before. Before I re-invent the
> wheel or shoe-horn something together, does anyone have any
> suggestions?
>
> Edward
>