Hi Dan,

Our response is fairly lengthy, so I put it in this document: https://docs.google.com/document/d/1c8x-BeA7K13JvCI34Nxk3E_Ha9iP84EXcBYiAdAJVSA/edit?usp=sharing

If you want to talk further about staging, preservation, or other issues, feel free to contact me <[log in to unmask]> or Bradley Daigle <[log in to unmask]> at APTrust.

Andrew Diamond
Lead Developer, APTrust

On Mon, Jun 15, 2020 at 11:18 AM Amy Kirchhoff <[log in to unmask]> wrote:

> Hi Dan ~
>
> Your experience rings true (maybe in life, not just in preservation; we all manage to fill up our humongous staging areas).
>
> We have implemented a few things at Portico over time. The biggest change was one we made several years ago, when we moved to what we call “straight-to-ingest,” or S2I. Portico preserves electronic publications: journals, books, digitized collections, and even newer, database-like content. We are a not-for-profit with limited resources, and we were running into efficiency and cost-effectiveness issues. We can afford a certain amount of capacity to handle “problem content,” and the amount of content falling into this category was greater than our capacity. This problem content would remain in our processing space until we could address it, and we were simply not able to get to it all in a timely fashion. Our processing space was not designed for secure, long-term preservation; it was designed for transactional activities. During a project to identify efficiencies, we realized that much of the content we were identifying as “problem” content was just reality: sometimes journal articles are published with missing images! If that is the way they were published, then it is perfectly appropriate for them to be preserved that way. In addition, there are whole categories of this content where, despite the problems, we had at least one complete rendition of the article or book.
> For example, if we are sent both the XML of an article and a PDF, even if we are missing an image file, we have the PDF, which is a perfectly acceptable and complete rendition.
>
> Enter S2I, which lets us move some of this problematic content into the Archive without needing to resolve the problems before preserving it. We now grade content as we preserve it: ‘A’ means we believe we have everything needed and know of no problems in the content; ‘B’ means there are some problems, but we have at least one complete rendition of the item; and so on. We may eventually choose to preserve items graded ‘C’, ‘D’, and ‘F’, but at the moment we do not; they remain in our processing area until we address the problems. Along with grading content, we also implemented some significant improvements to our error tracking. We now have a new file, part of every archival unit, that records the details of any problems we have with the content.
>
> We released the changes to support S2I and are rolling it out scenario by scenario. For example, we knew that one of our biggest problems was missing images, so we started there. Another big category is XML that does not validate against the publisher’s DTD (if we can extract the title, authors, DOI, and a few other items from it with regular expressions, we will go ahead and preserve the content, provided we have a good PDF of the article). We have tens of additional scenarios we could roll out. Content in each of the implemented scenarios moves through our processing space and is preserved in the archive with a record of its grade and errors. This gets it out of our processing space and into the archive, which is a much more appropriate long-term location. With the detailed error-message tracking, we now have the information we need to identify content with specific problems and pull it out of the archive for reprocessing, as appropriate.
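[Ed. note: the grading scheme described above might be sketched roughly as follows. This is a hypothetical illustration, not Portico's actual implementation; the `ArchivalUnit` type, the PDF-based test for a "complete rendition," and the grade logic are all assumptions made for the example.]

```python
from dataclasses import dataclass, field

@dataclass
class ArchivalUnit:
    """Hypothetical archival unit: its files plus any recorded problems."""
    files: list
    problems: list = field(default_factory=list)

def grade(unit: ArchivalUnit) -> str:
    """Assign a letter grade in the spirit of the S2I scheme:
    'A' = no known problems; 'B' = problems, but at least one complete
    rendition (approximated here as 'a PDF is present'); 'C' otherwise."""
    if not unit.problems:
        return "A"
    if any(f.lower().endswith(".pdf") for f in unit.files):
        return "B"
    return "C"
```

Under these assumptions, an article delivered as XML plus PDF but missing an image file would grade ‘B’: it has a recorded problem, yet the PDF provides a complete rendition.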
> We can also write rules to alert us to anomalies. For example, say publisher A has a historic pattern of 10% of its content being marked ‘B’ because of missing images. If we see that pattern changing, say it rises to 50%, we can get an alert, take a look, and prioritize a deep review of that publisher.
>
> Note that not all problems are publisher-introduced; sometimes the problem is ours. For example, we code our XML transformations very conservatively: we do not write a rule for every element in the DTD, just those we have seen in action. Thus, if during processing we encounter an element for which we do not have a rule, that is also a ‘problem’.
>
> We are not pushing all content into the archive willy-nilly, but S2I has greatly improved our ability to get content into the archive and out of our processing system, and to track any problems we have with the content.
>
> You can read more about our S2I project here:
>
> · http://doi.org/10.17605/OSF.IO/VW7RJ
> · https://www.dpconline.org/blog/idpd/taming-the-pre-ingest-processing-monster
>
> Another tactic we have used for staging areas (as opposed to processing areas): when we have identified large amounts of content that needs to stay in our staging area for a good amount of time, we will sometimes off-load it into Glacier. For example, we have one large publisher that sent us their content multiple times. It was going to take us a long while to confirm it had all made it into the Archive, and we did not want to delete it until we had that confirmation; this was prime content to move out of our staging area and into Glacier.
>
> I’m happy to answer questions (or find the right person on staff to answer questions!).
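[Ed. note: an anomaly rule like the missing-images alert described above amounts to a simple ratio check. The 10%/50% figures come from the example in the message; the function names and the 2x trigger factor are illustrative assumptions, not Portico's actual rule.]

```python
def grade_b_rate(grades: list) -> float:
    """Fraction of a publisher's items graded 'B' (e.g. for missing images)."""
    return grades.count("B") / len(grades) if grades else 0.0

def needs_review(historic_rate: float, current_rate: float,
                 factor: float = 2.0) -> bool:
    """Alert when the current 'B' rate rises well above the publisher's
    historic pattern. The 2x factor is an assumed threshold chosen
    purely for illustration."""
    return current_rate > factor * historic_rate

# Publisher A historically has 10% of content marked 'B' for missing
# images; a batch at 50% would trip the alert and prompt a deep review.
```

For example, `needs_review(0.10, 0.50)` returns `True`, while a batch at 12% (`needs_review(0.10, 0.12)`) would not trigger an alert.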
> ~ Amy (Portico, Archive Service Product Manager)
>
> *From:* The NDSA organization list <[log in to unmask]> *On Behalf Of* Noonan, Dan
> *Sent:* Friday, June 12, 2020 2:48 PM
> *To:* [log in to unmask]
> *Subject:* [NDSA-ALL] Workflow capacity and "temporary" storage
>
> Hi All: This is a query that I think is valuable for our larger digital preservation community, so please reply to the whole list.
>
> A few years back we commissioned a new shared drive to be the staging area for content, where processing and metadata creation happen prior to ingest into our Digital Collections (Fedora/Hyrax) system. We originally expected it to be about 5-10 TB of fluid space, with things coming in being processed and ingested, and local copies disposed of, freeing up space. Unfortunately, it has ballooned to 30+ TB. Some of that comes from a pause we had on ingest leading up to the Hyrax upgrade while we were without a metadata librarian, and part of it has come from a significant amount of AV digitization that we had the funding to do, but not the human resources to push through the process post-digitization.
>
> So the series of questions I have been tasked with is to find out:
>
> - How do more mature digital libraries manage temporary storage for processing?
> - Do they also have a mismatch between their ambitions and their capacity?
> - If not, what processes have they developed to keep them in sync?
> - What prioritization metrics do you use?
> - Are these adaptable processes for other institutions?
>
> Please let me know your thoughts – Thanks – Dan
>
> [image: The Ohio State University]
>
> *Daniel W. Noonan*
> Associate Professor
> Digital Preservation Librarian
> University Libraries | Digital Programs
> 320A 18th Avenue Library | 175 West 18th Avenue, Columbus, OH 43210
> 614.247.2425 Office
> [log in to unmask] go.osu.edu/noonan @DannyNoonan1962 <https://twitter.com/DannyNoonan1962>
> http://orcid.org/0000-0002-7021-4106
>
> Pronouns: he/him/his ~ Honorific: Mr.
>
> *Buckeyes consider the environment before printing.*
>
> *Campus Campaign Fund: 483229 Rare Books and Manuscripts fund for LGBTQ*

########################################################################
to manage your NDSA-ALL subscription, visit ndsa.org/ndsa-all
########################################################################