Jonathan,

Likewise, that paragraph reads just as accurately with the
following alterations:

s/UTF-8|Unicode/PDF/
s/encoding/version/

I think the key thing is that garbage in == garbage out, but I feel
happier with garbage that was meant to have been Unicode at some
point, compared to a PDF that was made by a Word->PDF printer driver
that craps out on large files, but does so silently.
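
For what it's worth, the cheapest form of the encoding check Jonathan
describes is a strict decode. This is only a minimal Python sketch: the
function name looks_like_utf8 is made up for illustration, and a pass
means the bytes are valid UTF-8, not that the author meant UTF-8 (pure
ASCII passes as well).

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        # Strict mode raises on any byte sequence that is not valid UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Usage: read the submitted file as raw bytes, then test it before ingesting.
# with open("submission.txt", "rb") as fh:   # the path is an example only
#     ok = looks_like_utf8(fh.read())
```

A failure here is reliable; a success is still a heuristic, which is
the point made in the quoted paragraph.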

My experiences with PDF versions:

PDF 1.3 and earlier is evil. 1.4 is not too bad, aside from colour issues
and its ham-fisted way of attempting to shove CMYK info into itself.
1.6 causes problems for some people that I am still trying to isolate.
1.7 is rare as hen's teeth. PDF/A as a spec seems to be okay, but I've
only seen a few of those in the wild, and only from OpenOffice. It
would be interesting to see how OOo's idea of PDF/A stacks up against
Adobe's.
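
If you want to sort a pile of files by the versions above, the header
is the first place to look. A rough Python sketch, with some
assumptions: the function names are invented for illustration, the
header is searched for in the first 1024 bytes (which is what I
understand Acrobat tolerates, rather than strictly the first line), and
the /Version entry that the document catalog may carry from 1.4 onwards
is not consulted, so the result is the declared header version only.

```python
import re
from typing import Optional

# Matches the "%PDF-x.y" marker that opens a PDF file.
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(head: bytes) -> Optional[str]:
    """Return the version declared in the header, e.g. "1.4", or None."""
    match = _HEADER.search(head[:1024])
    if match is None:
        return None
    return f"{match.group(1).decode()}.{match.group(2).decode()}"


def pdf_file_version(path: str) -> Optional[str]:
    """Read the start of a file and report its declared header version."""
    with open(path, "rb") as fh:
        return pdf_header_version(fh.read(1024))
```

This says nothing about whether the file is well formed, or whether it
really conforms to PDF/A; it only reports what the file claims to be.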

And there is PDF/X(-3?) or similar, which I've only ever seen on an
options panel, before swiftly ignoring it.

And on a final note, there have been PDF files only 10 years old that
are useless to me: I can't wheedle anything out of them. However, I have
resurrected a TeX-based thesis from an earlier period without difficulty,
and created a PDF/A from the source.

Bottom line is that it's best to preserve the source materials as well
as the final disseminations - you can't always guarantee a viewer will
work as expected. The trend is that newer PDF versions are better, but
be very, very wary of hidden DRM. If memory serves, an eBook publisher
lost 1/4(?) of their stock after losing the mechanism to unlock it.
Let's not have that happen to repositories...

Ben

2009/6/15 Jonathan Rochkind <[log in to unmask]>:
> Fair enough.  Asking someone to give you a UTF-8 (or other Unicode encoding)
> plain text file though -- you better try to heuristically check the encoding
> before ingesting it, and plan on a lot of failures. Typical users using
> typical consumer software (which tends to be somewhat unpredictable with
> character encodings) can't be trusted to give you a UTF-8 encoding just
> because you specify it, or  to have any idea what this means or how to do
> it.
> And checking to see if the 'true' encoding of a plain text file is what
> it's advertised as in an automated fashion is heuristic at best, and not
> going to be perfect.
> And you're still going to have trouble with complicated mathematical
> formulas, molecular diagrams, other diagrams, etc.
>
> Jonathan
>
> Doran, Michael D wrote:
>>>
>>> As far as electronic formats go, I think PDF is as good as anything --
>>> except maybe plain ASCII text, which is not
>>> nearly as useable (and doesn't allow diagrams,
>>> mathematical equations, non-English letters, etc).
>>>
>>
>> There is no requirement that plain text be limited to the ASCII character
>> set repertoire.  Although once they were almost synonymous, that is no
>> longer the case [1].  Plain text can encompass anything and everything in
>> the Unicode character set.  That includes non-Roman scripts, mathematical
>> symbols, yada, yada, yada.
>>
>> -- Michael
>>
>> [1] http://en.wikipedia.org/wiki/Plain_text
>>
>> # Michael Doran, Systems Librarian
>> # University of Texas at Arlington
>> # 817-272-5326 office
>> # 817-688-1926 mobile
>> # [log in to unmask]
>> # http://rocky.uta.edu/doran/
>>
>>
>>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Jonathan Rochkind
>>> Sent: Monday, June 15, 2009 9:13 AM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] Durability of PDFs
>>>
>>> The bet is that PDFs are so popular that _someone_ (the archival
>>> community if no-one else, but probably someone else) will ensure that they
>>> continue to be readable somehow.
>>>
>>> These are real non-trivial issues in electronic archiving though, issues
>>> that the archival community.  It is generally a safe assumption that good
>>> electronic archiving over the decades-and-more term requires some regular
>>> attention by an electronic archivist to make sure that files remain
>>> readable, and are converted to new formats when necessary. As well as
>>> attention to avoiding actual bit-level corruption of files. You can't
>>> necessarily just dump files on a HD and ignore them and expect they'll be
>>> readable in 100 years, that much is true -- and true pretty much regardless
>>> of particular electronic format you choose.
>>>
>>> As far as electronic formats go, I think PDF is as good as anything --
>>> except maybe plain ASCII text, which is not nearly as useable (and doesn't
>>> allow diagrams, mathematical equations, non-English letters, etc). I don't
>>> know why your colleague has decided that "30-40 years" is the horizon
>>> after which PDF specifically will become "unreadable", this seems like just
>>> a wild guess to me, but it would be interesting to see if he has any
>>> particular evidence to back up this claim.
>>> So there are real issues with electronic archiving, but unless they lead
>>> you to refuse to accept electronic submissions at all, you're just going to
>>> have to deal with them, it's not really an issue of PDF specifically, but it
>>> is true that "just dump files on a HD and forget about them and assume
>>> they'll be readable in 100 years" is not a particularly safe electronic
>>> archiving strategy.
>>>
>>> Jonathan
>>>
>>> Mike Taylor wrote:
>>>
>>>>
>>>> Dear CODE4LIB colleagues,
>>>>
>>>> In one of my alternative incarnations, I am a zoological taxonomist.
>>>> One of the big issues for taxonomy right now is whether to accept as
>>>> nomenclaturally valid papers that are published only in electronic
>>>> form, i.e. not printed on paper by a publisher.
>>>>
>>>> In a discussion of this matter, a colleague has claimed:
>>>>
>>>>
>>>>>
>>>>> [PDF files will not become unreadable] in the next 30-40 years.
>>>>> Possibly not in the 20 years that will follow. After that, when only
>>>>> 30-year and older documents are in the PDF format, the danger will
>>>>> increase that this information will not be readable any more. It is
>>>>> generally considered as quite unlikely that PDF will be readable in
>>>>> 100 years.
>>>>>
>>>>
>>>> I would appreciate any comments that anyone on this list has on the
>>>> likelihood that PDF will be unreadable in 100 years.
>>>>
>>>> Many thanks,
>>>>
>>>>  _/|_ ___________________________________________________________________
>>>> /o ) \/  Mike Taylor    <[log in to unmask]>    http://www.miketaylor.org.uk
>>>> )_v__/\  "Can't someone act COMPLETELY OUT OF CHARACTER without arousing
>>>>         suspicion?" -- Bob the Angry Flower, www.angryflower.com
>>>>
>>>>
>>
>>
>