Fair enough. But if you ask someone to give you a UTF-8 (or other
Unicode encoding) plain text file, you had better check the encoding
heuristically before ingesting it, and plan on a lot of failures.
Typical users using typical consumer software (which tends to be
somewhat unpredictable with character encodings) can't be trusted to
give you a UTF-8 encoding just because you specify it, or to have any
idea what this means or how to do it.
And checking in an automated fashion whether the 'true' encoding of a
plain text file is what it's advertised as is heuristic at best, and
not going to be perfect.
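
To make that concrete, here is a rough sketch of the kind of check I mean, in Python. The function name check_utf8 and its four labels are made up for illustration, not taken from any library. It only tells you whether the bytes are *valid* UTF-8. It can't prove the submitter meant UTF-8, and it can't tell you what a failing file actually is.

```python
import codecs


def check_utf8(data: bytes) -> str:
    """Classify raw bytes as 'utf-8-bom', 'ascii', 'utf-8', or 'not-utf-8'.

    Hypothetical helper: the labels are illustrative only.
    """
    # A leading UTF-8 byte order mark is a strong hint, but the rest of
    # the file still has to decode cleanly.
    if data.startswith(codecs.BOM_UTF8):
        try:
            data[len(codecs.BOM_UTF8):].decode("utf-8")
        except UnicodeDecodeError:
            return "not-utf-8"
        return "utf-8-bom"

    # Pure 7-bit ASCII is valid UTF-8, but also valid Latin-1, Windows-1252
    # and many others, so it says nothing about what the sender's software
    # will do once a non-ASCII character shows up.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass

    # Strict decoding rejects malformed byte sequences, which is what
    # text saved in a legacy 8-bit encoding usually looks like.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "not-utf-8"
```

For example, check_utf8("café".encode("latin-1")) comes back as 'not-utf-8', while the same string encoded as UTF-8 passes. A file that passes is probably UTF-8, but a short legacy-encoded file can occasionally decode cleanly by accident, so this stays a heuristic.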
And you're still going to have trouble with complicated mathematical
formulas, molecular diagrams, other diagrams, etc.
Jonathan
Doran, Michael D wrote:
>> As far as electronic formats go, I think PDF is as good as
>> anything -- except maybe plain ASCII text, which is not
>> nearly as useable (and doesn't allow diagrams,
>> mathematical equations, non-English letters, etc).
>>
>
> There is no requirement that plain text be limited to the ASCII character set repertoire. Although once they were almost synonymous, that is no longer the case [1]. Plain text can encompass anything and everything in the Unicode character set. That includes non-Roman scripts, mathematical symbols, yada, yada, yada.
>
> -- Michael
>
> [1] http://en.wikipedia.org/wiki/Plain_text
>
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [log in to unmask]
> # http://rocky.uta.edu/doran/
>
>
>
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On
>> Behalf Of Jonathan Rochkind
>> Sent: Monday, June 15, 2009 9:13 AM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] Durability of PDFs
>>
>> The bet is that PDFs are so popular that _someone_ (the archival
>> community if no-one else, but probably someone else) will ensure that
>> they continue to be readable somehow.
>>
>> These are real non-trivial issues in electronic archiving though,
>> issues that the archival community. It is generally a safe assumption
>> that good electronic archiving over the decades-and-more term requires
>> some regular attention by an electronic archivist to make sure that
>> files remain readable, and are converted to new formats when necessary.
>> As well as attention to avoiding actual bit-level corruption of files.
>> You can't necessarily just dump files on a HD and ignore them and
>> expect they'll be readable in 100 years, that much is true -- and true
>> pretty much regardless of the particular electronic format you choose.
>>
>> As far as electronic formats go, I think PDF is as good as anything --
>> except maybe plain ASCII text, which is not nearly as useable (and
>> doesn't allow diagrams, mathematical equations, non-English letters,
>> etc). I don't know why your colleague has decided that "30-40 years"
>> is the horizon after which PDF specifically will become "unreadable",
>> this seems like just a wild guess to me, but it would be interesting to
>> see if he has any particular evidence to back up this claim.
>>
>> So there are real issues with electronic archiving, but unless they
>> lead you to refuse to accept electronic submissions at all, you're just
>> going to have to deal with them, it's not really an issue of PDF
>> specifically, but it is true that "just dump files on a HD and forget
>> about them and assume they'll be readable in 100 years" is not a
>> particularly safe electronic archiving strategy.
>>
>> Jonathan
>>
>> Mike Taylor wrote:
>>
>>> Dear CODE4LIB colleagues,
>>>
>>> In one of my alternative incarnations, I am a zoological taxonomist.
>>> One of the big issues for taxonomy right now is whether to accept as
>>> nomenclaturally valid papers that are published only in electronic
>>> form, i.e. not printed on paper by a publisher.
>>>
>>> In a discussion of this matter, a colleague has claimed:
>>>
>>>
>>>
>>>> [PDF files will not become unreadable] in the next 30-40 years.
>>>> Possibly not in the 20 years that will follow. After that, when only
>>>> 30-year and older documents are in the PDF format, the danger will
>>>> increase that this information will not be readable any more. It is
>>>> generally considered as quite unlikely that PDF will be readable in
>>>> 100 years.
>>>>
>>>>
>>> I would appreciate any comments that anyone on this list has on the
>>> likelihood that PDF will be unreadable in 100 years.
>>>
>>> Many thanks,
>>>
>>>  _/|_    ___________________________________________________________________
>>> /o ) \/  Mike Taylor  <[log in to unmask]>  http://www.miketaylor.org.uk
>>> )_v__/\  "Can't someone act COMPLETELY OUT OF CHARACTER without arousing
>>>          suspicion?" -- Bob the Angry Flower, www.angryflower.com
>>>
>>>
>>>
>
>