Huh. Since the highlighter only needs to run on the documents in the page of results actually returned (10-50?), I wouldn't think the total number of documents would matter much (I certainly could be wrong), but the total size of each document's stored field definitely has a known performance impact on the highlighter. Maybe some day I'll have the time, or a local requirement, to investigate; I wonder if there'd be a way to write a custom highlighting component optimized for the EAD use case, or for the general case of "identify matching section(s) in XML", that would do better.
I'm less nervous about custom components that do not require patches to Solr than I am about patches to Solr core that are not (yet?) included in a tagged Solr release or in trunk.
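To make the "identify matching section(s) in XML" idea a bit more concrete, here is a rough Python sketch of the logic only. It is not a Solr component and not anything Solr ships: the sample document, the matching_sections function, and the choice of c01 as the section element are all made up for illustration, and it does naive substring matching rather than anything like Solr's analysis chain.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified EAD-like sample. Real EAD files have
# namespaces, nested component levels, and far more structure than this.
SAMPLE_EAD = """<ead>
  <archdesc>
    <dsc>
      <c01 id="series1"><did><unittitle>Correspondence</unittitle></did>
        <scopecontent><p>Letters about railroad construction.</p></scopecontent>
      </c01>
      <c01 id="series2"><did><unittitle>Photographs</unittitle></did>
        <scopecontent><p>Portraits and landscapes.</p></scopecontent>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def matching_sections(xml_text, terms, section_tag="c01"):
    """Return (id, title) for each section whose text contains every term.

    Matching is a case-insensitive substring test, which is an assumption
    made for this sketch; it ignores stemming, tokenization, and phrases.
    """
    root = ET.fromstring(xml_text)
    wanted = [t.lower() for t in terms]
    hits = []
    for section in root.iter(section_tag):
        # Collect all text nested anywhere under this section.
        text = " ".join(section.itertext()).lower()
        if all(t in wanted_text for t, wanted_text in ((t, text) for t in wanted)):
            title = section.findtext("did/unittitle", default="")
            hits.append((section.get("id"), title))
    return hits


print(matching_sections(SAMPLE_EAD, ["railroad"]))
```

Something like this could run client-side over only the 10-50 documents on the returned page, which is the same reason the number of documents in the index shouldn't matter much; whether it would actually beat the stock highlighter on large EAD files is untested.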
With some of the stuff I'm working with, RAM sometimes seems to have unexpected impacts on performance too. From thinking about what it does, and from looking at my cache hit/miss/eviction statistics, I didn't really have reason to think that lack of RAM was what was slowing down my StatsComponent use, but adding RAM seems to help a lot. I need a hardware upgrade to be able to add enough RAM and avoid swap before I can be sure that what I think I'm seeing about RAM effects on performance is what I'm really seeing, but I think so. I wonder if throwing monster amounts of RAM at Solr and increasing certain relevant caches a lot would have an impact on highlighter performance.
I've thought about using the highlighter in that way on MARC documents to provide matching snippets, à la Google, on the hits page -- the fact that MARC documents aren't "full text", but are lists of structured (well, you know, they try :) ) fields, means that you can't just use the highlighter out of the box and get a reasonable snippet to show the user. But if you could use it to identify which _fields_ matched the query, and then throw each matching field (or the first N) through a display mapper that labels it and formats it appropriately (my as-of-yet not publicly released MARC mapping Ruby framework could handle that nicely), that could perhaps provide a nice "hit snippet". A large MARC document is probably still smaller than a typical EAD document, so it might have a greater chance of success.
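A minimal sketch of that "which fields matched" step, in Python, assuming the query was sent with hl=true and an hl.fl list of the relevant fields, so that the JSON response carries a "highlighting" section keyed by document id and then by field name. The field names (title_t, subject_t), the FIELD_LABELS table, and the hit_snippets function are hypothetical stand-ins; the real labeling and formatting would be the display mapper's job.

```python
# Sample data in the shape of a Solr JSON response with highlighting on.
# The ids, field names, and snippet text are invented for illustration.
SAMPLE_RESPONSE = {
    "response": {"docs": [{"id": "rec1"}, {"id": "rec2"}]},
    "highlighting": {
        "rec1": {
            "title_t": ["A history of <em>railroads</em>"],
            "subject_t": ["<em>Railroads</em> -- United States"],
        },
        "rec2": {},  # hit, but no highlighted field came back
    },
}

# Hypothetical display labels; a real mapper would do much more.
FIELD_LABELS = {"title_t": "Title", "subject_t": "Subject"}


def hit_snippets(solr_response, labels, max_fields=2):
    """Map each doc id to labeled snippets for its first N matching fields."""
    highlighting = solr_response.get("highlighting", {})
    result = {}
    for doc in solr_response["response"]["docs"]:
        matched = highlighting.get(doc["id"], {})
        lines = []
        for field, fragments in matched.items():
            if not fragments:
                continue  # field listed but with no fragments
            label = labels.get(field, field)  # fall back to the raw name
            lines.append(f"{label}: {fragments[0]}")
        result[doc["id"]] = lines[:max_fields]
    return result


for doc_id, lines in hit_snippets(SAMPLE_RESPONSE, FIELD_LABELS).items():
    print(doc_id, lines)
```

Only the presence of a field in the highlighting section is really being used here; the snippet text itself is a bonus, and could be swapped for the full stored field value run through the mapper.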
From: Code for Libraries On Behalf Of Bess Sadler
Sent: Saturday, August 07, 2010 12:41 AM
Subject: Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
On Aug 6, 2010, at 8:07 PM, Jonathan Rochkind wrote:
> I've been brainstorming other weird ways to do this. This one is totally wacky and possibly a bad idea, but I'll throw it out there anyway. What if you only indexed the entire EAD as one document, BUT threw the entire EAD in a stored field, and used Solr highlighting on that field? NOT to show the highlighter results to the user, but to sort of trick the highlighter, using hl.fragmenter/fragmentsBuilder (possibly with a custom component in a jar), into telling you _which_ sub-sections of the EAD matched, and your software could then display the matching sub-sections (possibly with direct links to display) in the search results, under the actual document hit.
Hi, Jonathan. I don't think this is a crazy idea, and in fact it is one of the approaches that Matt M. and I tried during our NWDA project. However, we found that it wasn't scalable. The highlighter was way too slow with the number of documents and fragments we were throwing at it. It wasn't even a huge number of documents, so we abandoned that idea. However, it's still a really elegant solution if only it were performant. Let me know if you decide to give it a try.