You can see another example of Blacklight being used to search and display EAD guides at
I've used Solr and/or Lucene for EAD documents a few times. Here are some observations:
> I've also heard about scalability issues with Solr and large XML documents,
> but I've never seen benchmarks.
Solr is incredibly scalable, so describing this as a Solr scalability issue isn't really accurate. It would be more accurate to say that Solr is designed for searching, while most people looking for an EAD solution are trying to get it to do a lot more than that. The problem is that you want to be able to discover and view an EAD guide at several levels, right? You want discovery at the collection level, at the item level, and presumably at the level of individual sections of the EAD document (e.g., the biographical history). Solr and Lucene, though, really only know how to tell you whether a given document in the index matches the query you've entered. So if you want discovery at each of those levels, you have to index your document once to represent the collection, again for each section you want to be independently discoverable, and again for each item you want to be discoverable.
Creating a UI that represents a single EAD guide, now transformed into potentially hundreds or thousands of independently discoverable items and sections, is quite challenging. I liked what Matt Mitchell and I did for the Northwest Digital Archives, but I'm always interested in other ways one might approach this.
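To make the "index it once per level" idea concrete, here is a rough Python sketch. It is not the code we ran for NWDA. The sample EAD is a made-up, heavily simplified fragment, and the field names (`level`, `guide_id`, `text`), the `SECTIONS` list, and the ID scheme are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified EAD fragment. Real EAD guides are much
# larger and more deeply nested than this.
SAMPLE_EAD = """
<ead>
  <eadheader><eadid>example_1</eadid></eadheader>
  <archdesc level="collection">
    <did><unittitle>Example Collection</unittitle></did>
    <bioghist><p>Biographical note text.</p></bioghist>
    <scopecontent><p>Scope and content text.</p></scopecontent>
    <dsc>
      <c01 level="item"><did><unittitle>Letter, 1931</unittitle></did></c01>
      <c01 level="item"><did><unittitle>Photograph, 1940</unittitle></did></c01>
    </dsc>
  </archdesc>
</ead>
"""

# Assumption: the sections we want to be independently discoverable.
SECTIONS = ("bioghist", "scopecontent")


def text_of(element):
    """Return all text under an element with whitespace normalized."""
    return " ".join("".join(element.itertext()).split())


def ead_to_index_docs(ead_xml):
    """Turn one EAD guide into many flat index documents (plain dicts)."""
    root = ET.fromstring(ead_xml.strip())
    guide_id = root.findtext("eadheader/eadid").strip()
    archdesc = root.find("archdesc")
    title = text_of(archdesc.find("did/unittitle"))

    # One document representing the whole collection.
    docs = [{"id": guide_id, "level": "collection", "guide_id": guide_id,
             "title": title, "text": text_of(archdesc)}]

    # One document per section that should be discoverable on its own.
    for tag in SECTIONS:
        section = archdesc.find(tag)
        if section is not None:
            docs.append({"id": f"{guide_id}-{tag}", "level": "section",
                         "guide_id": guide_id, "title": title,
                         "text": text_of(section)})

    # One document per item, each pointing back at its parent guide.
    for n, item in enumerate(archdesc.iter("c01"), start=1):
        docs.append({"id": f"{guide_id}-item{n}", "level": "item",
                     "guide_id": guide_id,
                     "title": text_of(item.find("did/unittitle")),
                     "text": text_of(item)})
    return docs


docs = ead_to_index_docs(SAMPLE_EAD)
```

Even this toy guide becomes five index documents (one collection, two sections, two items), which is where the UI problem comes from: every one of them can show up as a separate hit unless you do something about it.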
We indexed each EAD guide as a separate Lucene document per EAD section, then collapsed those documents under the main EAD title in the search results. That way, when you search for an archival collection you see the EAD guide represented only once, but each section of it is still independently viewable and bookmarkable:
Here is the guide for the Bing Crosby Historical Society in a search result:
But in order to look at the guide, you have to look at a specific part of it: http://nwda.projectblacklight.org/catalog/bcc_1-summary
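The collapsing step can be sketched in a few lines of Python. Again, this is only an illustration under assumptions, not what the NWDA site actually runs: the hit dictionaries and their `guide_id` / `guide_title` fields are hypothetical, and the grouping here happens client-side on an already-ranked hit list. Solr also has a server-side result grouping feature (`group=true` with `group.field`) that could serve a similar purpose.

```python
def collapse_by_guide(hits):
    """Group ranked section-level hits under their parent EAD guide.

    `hits` is assumed to be a relevance-ordered list of dicts with the
    hypothetical keys "id", "guide_id", and "guide_title". Each guide
    appears once in the result, in the position of its best-ranked hit.
    """
    groups = {}  # dicts preserve insertion order in Python 3.7+
    for hit in hits:
        group = groups.setdefault(
            hit["guide_id"],
            {"guide_id": hit["guide_id"],
             "title": hit["guide_title"],
             "matches": []},
        )
        # Keep the section-level IDs so each part stays linkable.
        group["matches"].append(hit["id"])
    return list(groups.values())


# Hypothetical ranked hits: two sections of one guide, one of another.
sample_hits = [
    {"id": "example_1-bioghist", "guide_id": "example_1",
     "guide_title": "Example Collection"},
    {"id": "example_2-summary", "guide_id": "example_2",
     "guide_title": "Another Collection"},
    {"id": "example_1-scopecontent", "guide_id": "example_1",
     "guide_title": "Example Collection"},
]

collapsed = collapse_by_guide(sample_hits)
```

The result list shows each guide title once, while `matches` keeps the per-section IDs around so the UI can still link straight to the part of the guide that matched.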
Additionally, we treated each item as a first-class, independently discoverable object, but still linked each one back to the section of the EAD document it came from:
Matt and I were thinking it would be nice to let Blacklight handle all of the display of the EAD too, which is why we stored a lot of EAD markup in the Solr document. That can potentially cause scalability problems, because Lucene is not a database but we were treating it like one. This works, but it's a bit of a hack.