Hi Dan,

Our response is fairly lengthy, so I put it in this document: https://docs.google.com/document/d/1c8x-BeA7K13JvCI34Nxk3E_Ha9iP84EXcBYiAdAJVSA/edit?usp=sharing

If you want to talk further about staging, preservation, or other issues, feel free to contact me <[log in to unmask]> or Bradley Daigle <[log in to unmask]> at APTrust.

Andrew Diamond
Lead Developer, APTrust

On Mon, Jun 15, 2020 at 11:18 AM Amy Kirchhoff <[log in to unmask]> wrote:

> Hi Dan ~
>
> Your experience rings true (maybe in life, not just in preservation; we all manage to fill up our humongous staging areas).
>
> We have implemented a few things at Portico over time. The biggest change was one we made several years ago, when we moved to what we call “straight-to-ingest,” or S2I. Portico preserves electronic publications: journals, books, digitized collections, and even newer, database-like content. We are a not-for-profit with limited resources, and we were running into efficiency and cost-effectiveness issues. We can afford a certain amount of capacity to handle “problem content,” and the amount of content falling into this category was greater than our capacity. This problem content would remain in our processing space until we could address it, and we were simply not able to get to it all in a timely fashion. Our processing space was not designed for secure, long-term preservation; it was designed for transactional activities. During a project to identify efficiencies, we realized that much of the content we were identifying as “problem” content was just reality: sometimes journal articles are published with missing images! If that is the way they were published, then it is perfectly appropriate for them to be preserved that way. In addition, there are whole categories of this content where, despite the problems, we had at least one complete rendition of the article or book.
> For example, if we are sent both the XML of an article and a PDF, even if we are missing an image file, we have the PDF, which is a perfectly acceptable and complete rendition.
>
> Enter S2I, which lets us move some of this problematic content into the Archive without needing to resolve the problems before preserving it. We now grade content as we preserve it: ‘A’ means we believe we have everything needed and know of no problems in the content; ‘B’ means there are some problems, but we have at least one complete rendition of the item; and so on. We may eventually choose to preserve items graded ‘C’, ‘D’, and ‘F’, but at the moment we do not; they remain in our processing area until we address the problems. Along with grading content, we also implemented some significant improvements to our error tracking. We now have a new file, part of every archival unit, that records the details of any problems we have with the content.
>
> We released the changes to support S2I and are rolling it out scenario by scenario. For example, we knew that one of our biggest problems was missing images, so we started there. Another big category is XML that does not validate against the publisher’s DTD (if we can extract the title, authors, DOI, and a few other items from it with regular expressions, we will go ahead and preserve the content, provided we have a good PDF of the article). We have tens of additional scenarios we could roll out. Content in each of the implemented scenarios moves through our processing space and is preserved in the archive with a record of its grade and errors. This gets it out of our processing space and into the archive, which is a much more appropriate long-term location. With the detailed error-message tracking, we now have the information we need to identify content with specific problems and pull it out of the archive for reprocessing, as appropriate.
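[Ed. note: the grading scheme described above might be sketched roughly as follows. This is a hypothetical illustration, not Portico's actual implementation; the `ArchivalUnit` type, the PDF-based test for a "complete rendition," and the grade logic are all assumptions made for the example.]

```python
from dataclasses import dataclass, field

@dataclass
class ArchivalUnit:
    """Hypothetical archival unit: its files plus any recorded problems."""
    files: list
    problems: list = field(default_factory=list)

def grade(unit: ArchivalUnit) -> str:
    """Assign a letter grade in the spirit of the S2I scheme:
    'A' = no known problems; 'B' = problems, but at least one complete
    rendition (approximated here as 'a PDF is present'); 'C' otherwise."""
    if not unit.problems:
        return "A"
    if any(f.lower().endswith(".pdf") for f in unit.files):
        return "B"
    return "C"
```

Under these assumptions, an article delivered as XML plus PDF but missing an image file would grade ‘B’: it has a recorded problem, yet the PDF provides a complete rendition.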
> We can also write rules to alert us to anomalies. For example, say publisher A has a historic pattern of 10% of its content being marked ‘B’ because of missing images. If we see that pattern changing, say it rises to 50%, we can get an alert, take a look, and prioritize a deep review of that publisher.
>
> Note that not all problems are publisher-introduced; sometimes the problem is ours. For example, we code our XML transformations very conservatively: we do not write a rule for every element in the DTD, just those we have seen in action. Thus, if during processing we encounter an element for which we do not have a rule, that is also a ‘problem’.
>
> We are not pushing all content into the archive willy-nilly, but S2I has greatly improved our ability to get content into the archive and out of our processing system, and to track any problems we have with the content.
>
> You can read more about our S2I project here:
>
> · http://doi.org/10.17605/OSF.IO/VW7RJ
> · https://www.dpconline.org/blog/idpd/taming-the-pre-ingest-processing-monster
>
> Another tactic we have used for staging areas (as opposed to processing areas): when we have identified large amounts of content that needs to stay in our staging area for a good amount of time, we will sometimes off-load it into Glacier. For example, we have one large publisher that sent us their content multiple times. It was going to take us a long while to confirm it had all made it into the Archive, and we did not want to delete it until we had that confirmation; this was prime content to move out of our staging area and into Glacier.
>
> I’m happy to answer questions (or find the right person on staff to answer questions!).
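[Ed. note: an anomaly rule like the missing-images alert described above amounts to a simple ratio check. The 10%/50% figures come from the example in the message; the function names and the 2x trigger factor are illustrative assumptions, not Portico's actual rule.]

```python
def grade_b_rate(grades: list) -> float:
    """Fraction of a publisher's items graded 'B' (e.g. for missing images)."""
    return grades.count("B") / len(grades) if grades else 0.0

def needs_review(historic_rate: float, current_rate: float,
                 factor: float = 2.0) -> bool:
    """Alert when the current 'B' rate rises well above the publisher's
    historic pattern. The 2x factor is an assumed threshold chosen
    purely for illustration."""
    return current_rate > factor * historic_rate

# Publisher A historically has 10% of content marked 'B' for missing
# images; a batch at 50% would trip the alert and prompt a deep review.
```

For example, `needs_review(0.10, 0.50)` returns `True`, while a batch at 12% (`needs_review(0.10, 0.12)`) would not trigger an alert.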
> ~ Amy (Portico, Archive Service Product Manager)
>
> *From:* The NDSA organization list <[log in to unmask]> *On Behalf Of* Noonan, Dan
> *Sent:* Friday, June 12, 2020 2:48 PM
> *To:* [log in to unmask]
> *Subject:* [NDSA-ALL] Workflow capacity and "temporary" storage
>
> Hi All: This is a query that I think is valuable for our larger digital preservation community, so please reply to the whole list.
>
> A few years back we commissioned a new shared drive to be the staging area for content, where processing and metadata creation happen prior to ingest into our Digital Collections (Fedora/Hyrax) system. We originally expected it to be about 5-10 TB of fluid space, with things coming in being processed and ingested, and local copies disposed of, freeing up space. Unfortunately, it has ballooned to 30+ TB. Some of that comes from a pause we had on ingest leading up to the Hyrax upgrade while we were without a metadata librarian, and part of it has come from a significant amount of AV digitization that we had the funding to do, but not the human resources to push through the process post-digitization.
>
> So the series of questions I have been tasked with is to find out:
>
> - How do more mature digital libraries manage temporary storage for processing?
> - Do they also have a mismatch between their ambitions and their capacity?
> - If not, what processes have they developed to keep them in sync?
> - What prioritization metrics do you use?
> - Are these adaptable processes for other institutions?
>
> Please let me know your thoughts – Thanks – Dan
>
> [image: The Ohio State University]
>
> *Daniel W. Noonan*
> Associate Professor
> Digital Preservation Librarian
> University Libraries | Digital Programs
> 320A 18th Avenue Library | 175 West 18th Avenue, Columbus, OH 43210
> 614.247.2425 Office
> [log in to unmask] go.osu.edu/noonan @DannyNoonan1962 <https://twitter.com/DannyNoonan1962>
> http://orcid.org/0000-0002-7021-4106
>
> Pronouns: he/him/his ~ Honorific: Mr.
>
> *Buckeyes consider the environment before printing.*
>
> *Campus Campaign Fund: 483229 Rare Books and Manuscripts fund for LGBTQ*

########################################################################
to manage your NDSA-ALL subscription, visit ndsa.org/ndsa-all
########################################################################