LISTSERV 16.5 - CODE4LIB Archives

Hello,

   For a project, I just finished several scripts to generate PDFs from
piles of TIFFs.  The process was:

In a .htaccess file, have requests for files that are not found rewritten
to a script, passing the desired filename to it.

The script would then build the PDF and 'print' it to the requester
along with the correct MIME type. It would also save it to disk at the
path the original request was made to, so that a second request for the
same file would just serve the file with no script processing.
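
A rough sketch of the build-and-save step described above, in Python. This
is not the original script: the function name, the cache layout, and the
build_pdf callable (which stands in for the TIFF-to-PDF conversion) are all
assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path


def get_or_build_pdf(cache_root, rel_path, build_pdf):
    """Return the bytes of the PDF at cache_root/rel_path, building it on a miss.

    build_pdf is a hypothetical callable that takes the document ID (the
    file stem, e.g. "101010") and returns the finished PDF as bytes.
    """
    root = Path(cache_root).resolve()
    target = (root / rel_path).resolve()
    # Refuse anything that escapes the cache folder or is not a .pdf name.
    if root not in target.parents or target.suffix != ".pdf":
        raise ValueError("invalid PDF path: %r" % (rel_path,))

    # Normally the web server serves an existing file itself; this branch
    # only covers a file that appeared between the rewrite and this call.
    if target.exists():
        return target.read_bytes()

    data = build_pdf(target.stem)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file and rename it into place, so a concurrent
    # request never sees a half-written PDF.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    os.replace(tmp_name, target)
    return data
```

The calling script would send the returned bytes with a
"Content-Type: application/pdf" header. Once the file is on disk, the
rewrite rule no longer fires for that URL and the web server handles it as
a normal static file.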

Have a simple batch script delete all files older than X days in the PDF
folder.
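
The cleanup step could look something like the following Python sketch. The
function name and parameters are assumptions, and it uses each file's
modification time as its "age"; the original batch script may have worked
differently.

```python
import time
from pathlib import Path


def delete_old_pdfs(pdf_root, max_age_days, now=None):
    """Delete *.pdf files under pdf_root last modified over max_age_days ago.

    Returns the list of paths that were deleted. The now parameter exists
    only so the cutoff can be pinned in tests; it defaults to the current time.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    deleted = []
    for path in Path(pdf_root).rglob("*.pdf"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted
```

Run from cron (or any scheduler), a call such as
delete_old_pdfs("/path/to/pdf", 30) would keep the folder from growing
without bound; a deleted PDF is simply rebuilt on its next request.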

You could do something similar and, by using different URLs and some
Unix pdf2pdf command processing, make a bunch of URLs that would serve
different DPI versions of the same file, like:

http://foo.bar/pdf/big/101010.pdf
http://foo.bar/pdf/small/101010.pdf
http://foo.bar/pdf/tiny/101010.pdf
http://foo.bar/pdf/verytiny/101010.pdf
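
One way the script might map such URLs to a resolution is sketched below in
Python. The DPI numbers are placeholders (the message names the sizes but
not the resolutions), and the assumption that document IDs are purely
numeric comes only from the example URLs.

```python
import re

# Hypothetical DPI values chosen for illustration only.
SIZE_TO_DPI = {"big": 300, "small": 150, "tiny": 96, "verytiny": 72}

# Assumes IDs are digits only, as in the example URLs.
_PDF_URL = re.compile(r"^/pdf/(?P<size>[a-z]+)/(?P<doc_id>[0-9]+)\.pdf$")


def parse_pdf_url(path):
    """Split a path like /pdf/small/101010.pdf into (dpi, doc_id).

    Returns None for any path that does not match a known size.
    """
    match = _PDF_URL.match(path)
    if match is None or match.group("size") not in SIZE_TO_DPI:
        return None
    return SIZE_TO_DPI[match.group("size")], match.group("doc_id")
```

The resulting DPI would then be handed to whatever command does the actual
downsampling, and the output saved under the requested path so each size is
generated at most once until it is cleaned up.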

If you need any code samples or additional info I welcome your query.

Aaron




-- 
Aaron Addison
Unix Administrator 
W. E. B. Du Bois Library UMass Amherst
413 577 2104



On Wed, 2011-08-03 at 22:51 -0400, Joe Hourcle wrote:
> On Aug 3, 2011, at 7:36 PM, Ranti Junus wrote:
> 
> > Dear All,
> > 
> > My colleague came with this query and I hope some of you could give us some
> > ideas or suggestions:
> > 
> > Our Digital Multimedia Center (DMC) scanning project can produce very large
> > PDF files. They will have PDFs that are about 25MB and some may move into
> > the 100MB range. If we provide a link to a PDF that large, a user may not
> > want to try to download it even though she really needs to see the
> > information. In the past, DMC has created lower-quality, smaller versions
> > of the original file to reduce the size. Some thoughts have been tossed
> > around to reduce the duplication or the work (e.g. no more creating the
> > lower quality PDF manually.)
> > 
> > They are wondering if there is an application that we could point the end
> > user to, who might need it due to poor internet access, that if used will
> > simplify the very large file transfer for the end user. Basically:
> > - a client software that tells the server to manipulate and reduce the file
> > on the fly
> > - a server app that would do the actual manipulation of the file and then
> > deliver it to the end user.
> > 
> > Personally, I'm not really sure about the client software part. It makes
> > more sense to me (from the user's perspective) that we provide a "download
> > the smaller size of this large file" link that would trigger the server-side
> > apps to manipulate the big file. However, we're all ears for any suggestions
> > you might have.
> 
> 
> I've been dealing with related issues for a few years, and if you have
> the file locally, it's generally not too difficult to have a CGI or similar
> that you can call that will do some sort of transformation on the fly.
> 
> Unfortunately, what we've run into is that in some cases, in part because
> it tends to be used by people with slow connections, and for very large
> files, they'll keep restarting the process, and because it's generated
> on the fly, the webserver can't just pick up where it left off, so it has to
> re-start the process.
> 
> The alternative is to write it out to disk, and then let the webserver
> handle it as a normal file.  Depending on how many of these you're
> dealing with, you may have to have something manage the scratch
> space and remove the generated files that haven't been viewed in
> some time.
> 
> What I've been hoping to do is:
> 
> 	1. Assign URLs to all of the processed forms, of the format:
> 		http://server/processing/ID
> 		(where 'ID' includes some hashing in it, so it's not 10mil files in a directory)
> 
> 	2. Write a 404 handler for each processing type, so that
> 		should a file not exist in that directory, it will:
> 		(a) verify that the ID is valid, otherwise, return a 404.
> 		(b) check to see if the ID's being processed, otherwise, kick
> 			off a process for the file to be generated
> 		(c) return a 503 status.
> 
> Unfortunately, my initial testing (years ago) suggested that no
> clients at the time properly handled 503 responses (effectively,
> try back in (x) minutes, and you give 'em a time)
> 
> The alternative is to just basically sleep for a period of time, and
> then return the file once it's been generated ... but for ones
> that take some time (some of my processing might take hours,
> as the files that it needs as input are stored near-line, and we're
> at the mercy of a tape robot)
> 
> ...
> 
> You might also be able to sleep and then use one of the various
> 30x status codes, but I don't know what a client might do if you
> returned the same URL.  (they might abort, to prevent looping)
> 
> -Joe