Jonathan,

Likewise, that paragraph reads with the same accuracy after the following
alterations:

s/UTF-8|Unicode/PDF/
s/encoding/version/

I think the key thing is that garbage in == garbage out, but I feel
happier with garbage that was meant to have been Unicode at some point
than with a PDF that was made by a Word->PDF printer driver that craps out
on large files, and does so silently.

My experiences with PDF versions: PDF 1.3 and earlier is evil; 1.4 is not
too bad, aside from colour issues and its ham-fisted way of attempting to
shove CMYK info into itself; 1.6 has issues for some people that I am
still trying to isolate; 1.7 is rare as hen's teeth; and PDF/A as a spec
seems to be okay, but I've only seen a few of those in the wild, and only
from OpenOffice. It would be interesting to see how OOo's idea of PDF/A
stacks up against Adobe's. And there is PDF/X(-3?) or similar, which I've
only ever seen on an options panel, before swiftly ignoring it.

And on a final note, there have been PDF files that are only 10 years old
and yet are useless to me; I can't wheedle anything out of them. However,
I have resurrected a TeX-based thesis from an earlier period without
difficulty, and created a PDF/A from the source.

Bottom line is that it's best to preserve the source materials as well as
the final disseminations - you can't always guarantee a viewer will work
as expected.

The trend is that newer PDF versions are better, but be very, very wary of
hidden DRM. If memory serves, an eBook publisher lost 1/4(?) of their
stock due to losing the mechanism to unlock it. Let's not have that happen
to repositories...

Ben

2009/6/15 Jonathan Rochkind:
> Fair enough. Asking someone to give you a UTF-8 (or other Unicode
> encoding) plain text file though -- you better try to heuristically
> check the encoding before ingesting it, and plan on a lot of failures.
> Typical users using typical consumer software (which tends to be
> somewhat unpredictable with character encodings) can't be trusted to
> give you a UTF-8 encoding just because you specify it, or to have any
> idea what this means or how to do it.
>
> And checking in an automated fashion whether the 'true' encoding of a
> plain text file is what it's advertised as is heuristic at best, and not
> going to be perfect.
>
> And you're still going to have trouble with complicated mathematical
> formulas, molecular diagrams, other diagrams, etc.
>
> Jonathan
>
> Doran, Michael D wrote:
>>> As far as electronic formats go, I think PDF is as good as anything --
>>> except maybe plain ASCII text, which is not nearly as useable (and
>>> doesn't allow diagrams, mathematical equations, non-English letters,
>>> etc).
>>
>> There is no requirement that plain text be limited to the ASCII
>> character set repertoire. Although once they were almost synonymous,
>> that is no longer the case [1]. Plain text can encompass anything and
>> everything in the Unicode character set. That includes non-Roman
>> scripts, mathematical symbols, yada, yada, yada.
>>
>> -- Michael
>>
>> [1] http://en.wikipedia.org/wiki/Plain_text
>>
>> # Michael Doran, Systems Librarian
>> # University of Texas at Arlington
>> # 817-272-5326 office
>> # 817-688-1926 mobile
>> # http://rocky.uta.edu/doran/
>>
>>> -----Original Message-----
>>> From: Code for Libraries On Behalf Of Jonathan Rochkind
>>> Sent: Monday, June 15, 2009 9:13 AM
>>> Subject: Re: [CODE4LIB] Durability of PDFs
>>>
>>> The bet is that PDFs are so popular that _someone_ (the archival
>>> community if no-one else, but probably someone else) will ensure that
>>> they continue to be readable somehow.
>>>
>>> These are real non-trivial issues in electronic archiving though, issues
>>> that the archival community.
>>> It is generally a safe assumption that good electronic archiving over
>>> the decades-and-more term requires some regular attention by an
>>> electronic archivist to make sure that files remain readable, and are
>>> converted to new formats when necessary, as well as attention to
>>> avoiding actual bit-level corruption of files. You can't necessarily
>>> just dump files on a HD and ignore them and expect they'll be readable
>>> in 100 years, that much is true -- and true pretty much regardless of
>>> the particular electronic format you choose.
>>>
>>> As far as electronic formats go, I think PDF is as good as anything --
>>> except maybe plain ASCII text, which is not nearly as useable (and
>>> doesn't allow diagrams, mathematical equations, non-English letters,
>>> etc). I don't know why your colleague has decided that "30-40 years" is
>>> the horizon after which PDF specifically will become "unreadable"; this
>>> seems like just a wild guess to me, but it would be interesting to see
>>> if he has any particular evidence to back up this claim.
>>>
>>> So there are real issues with electronic archiving, but unless they
>>> lead you to refuse to accept electronic submissions at all, you're just
>>> going to have to deal with them. It's not really an issue of PDF
>>> specifically, but it is true that "just dump files on a HD and forget
>>> about them and assume they'll be readable in 100 years" is not a
>>> particularly safe electronic archiving strategy.
>>>
>>> Jonathan
>>>
>>> Mike Taylor wrote:
>>>> Dear CODE4LIB colleagues,
>>>>
>>>> In one of my alternative incarnations, I am a zoological taxonomist.
>>>> One of the big issues for taxonomy right now is whether to accept as
>>>> nomenclaturally valid papers that are published only in electronic
>>>> form, i.e. not printed on paper by a publisher.
>>>>
>>>> In a discussion of this matter, a colleague has claimed:
>>>>
>>>>> [PDF files will not become unreadable] in the next 30-40 years.
>>>>> Possibly not in the 20 years that will follow. After that, when only
>>>>> 30-year and older documents are in the PDF format, the danger will
>>>>> increase that this information will not be readable any more. It is
>>>>> generally considered as quite unlikely that PDF will be readable in
>>>>> 100 years.
>>>>
>>>> I would appreciate any comments that anyone on this list has on the
>>>> likelihood that PDF will be unreadable in 100 years.
>>>>
>>>> Many thanks,
>>>>
>>>>  _/|_    ________________________________________________________________
>>>> /o ) \/  Mike Taylor    http://www.miketaylor.org.uk
>>>> )_v__/\  "Can't someone act COMPLETELY OUT OF CHARACTER without arousing
>>>>          suspicion?" -- Bob the Angry Flower, www.angryflower.com