> What's the right place for such a piece of code? Solrmarc seems the
> obvious place to me. As it has been described to me so far, this
> doesn't seem like an issue affecting people outside the library realm,
> which makes it seem too niche and community-specific to get it built
> into the lucene codebase, but I could be wrong about that. Maybe it
> would be better as a lucene contrib library?
I actually think the most logical place for this to be implemented would be at the Solr level, as a filter. I say this for two main reasons:
1.) One of the great strengths of Solr is that you can apply the same text analysis logic to input at index time and at query time. The SolrMarc approach gives you only one solution: expand each term into multiple entries in the Solr index. Using a Solr filter allows that same approach, but it also opens up other options -- e.g. normalize all the text to a particular format in the index, and then normalize queries to match it using the same algorithm, thus reducing index size (see the sketch after this list). I know very little about Chinese Romanization, so I don't really know which strategy would make the most sense... but Solr filters will give you more flexibility.
2.) One of the issues the VuFind community will be tackling before too long is dealing with records from formats other than MARC. Presumably Romanization problems are not limited strictly to MARC. It seems to make the most sense to address the problem at the index level rather than in a tool tied to a particular input format.
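To make reason 1 a bit more concrete: at the Lucene/Solr level this would probably take the shape of a custom TokenFilter. The sketch below is untested, and the class name and the normalize() logic are placeholders (I don't know the actual Romanization rules), but it shows where that logic would plug in:

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /**
     * Rough sketch: rewrite each term into a single canonical
     * Romanized form so that index-time and query-time text
     * end up in the same shape.
     */
    public final class RomanizationNormalizationFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public RomanizationNormalizationFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // Replace the current term with its normalized form.
            String normalized = normalize(termAtt.toString());
            termAtt.setEmpty().append(normalized);
            return true;
        }

        // Placeholder: the real Romanization rules would go here.
        private String normalize(String term) {
            return term.toLowerCase();
        }
    }

Wrapped in a small TokenFilterFactory and listed in both the <analyzer type="index"> and <analyzer type="query"> chains of the relevant fieldType in schema.xml, the same rules would then apply automatically on both sides.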
- Demian