On Aug 3, 2011, at 7:36 PM, Ranti Junus wrote:

> Dear All,
> 
> My colleague came to me with this query, and I hope some of you could give
> us some ideas or suggestions:
> 
> Our Digital Multimedia Center (DMC) scanning project can produce very large
> PDF files. They will have PDFs that are about 25 MB, and some may move into
> the 100 MB range. If we provide a link to a PDF that large, a user may not
> want to try to download it even though she really needs to see the
> information. In the past, the DMC has created lower quality, smaller
> versions of the original file to reduce the size. Some thoughts have been
> tossed around about reducing the duplication of work (e.g., no more
> creating the lower quality PDF manually).
> 
> They are wondering if there is an application we could point the end user
> to, for those who might need it due to poor internet access, that would
> simplify the transfer of these very large files. Basically:
> - client software that tells the server to manipulate and reduce the file
> on the fly
> - a server app that would do the actual manipulation of the file and then
> deliver it to the end user.
> 
> Personally, I'm not really sure about the client software part. It makes
> more sense to me (from the user's perspective) to provide a "download a
> smaller version of this large file" link that would trigger the server-side
> app to manipulate the big file. However, we're all ears for any suggestions
> you might have.


I've been dealing with related issues for a few years, and if you have
the file locally, it's generally not too difficult to set up a CGI script
or similar that you can call to do some sort of transformation on the fly.
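
For instance, a minimal sketch of that kind of CGI (assuming the master
PDFs sit on local disk and Ghostscript is installed; the directory path and
the /ebook quality setting are just placeholders) might look like:

#!/usr/bin/env python3
# Hypothetical CGI sketch: shrink a locally stored PDF on the fly with
# Ghostscript and stream the reduced copy back to the client.
import os, subprocess, sys

PDF_DIR = "/var/library/pdf-masters"   # assumed location of the originals

def main():
    pdf_id = os.path.basename(os.environ.get("QUERY_STRING", ""))
    src = os.path.join(PDF_DIR, pdf_id + ".pdf")
    out = sys.stdout.buffer
    if not os.path.isfile(src):
        out.write(b"Status: 404 Not Found\r\n\r\n")
        return
    out.write(b"Content-Type: application/pdf\r\n\r\n")
    out.flush()
    # /ebook downsamples images to roughly 150 dpi; /screen is smaller still
    subprocess.run(["gs", "-q", "-dBATCH", "-dNOPAUSE",
                    "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/ebook",
                    "-sOutputFile=-", src], stdout=out)

if __name__ == "__main__":
    main()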

Unfortunately, what we've run into is that in some cases (in part because
this tends to be used by people with slow connections, and for very large
files), users will keep restarting the download, and because the file is
generated on the fly, the webserver can't just pick up where it left off;
it has to restart the whole process.

The alternative is to write it out to disk, and then let the webserver
handle it as a normal file.  Depending on how many of these you're
dealing with, you may need something to manage the scratch space
and remove generated files that haven't been viewed in some time.
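
As a rough illustration of that cleanup, assuming the derivatives all live
in one scratch directory and "viewed" can be approximated by the file's
access time (so this won't help on filesystems mounted noatime):

#!/usr/bin/env python3
# Hypothetical cron job: delete derivative PDFs that haven't been read
# in the last 30 days, judged by atime.
import os, time

SCRATCH_DIR = "/var/scratch/pdf-derivatives"   # assumed path
MAX_IDLE = 30 * 24 * 3600                      # 30 days, in seconds

now = time.time()
for entry in os.scandir(SCRATCH_DIR):
    if entry.is_file() and now - entry.stat().st_atime > MAX_IDLE:
        os.remove(entry.path)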

What I've been hoping to do is:

	1. Assign URLs to all of the processed forms, of the format:
		http://server/processing/ID
		(where 'ID' includes some hashing in it, so it's not 10mil files in a directory)

	2. Write a 404 handler for each processing type (rough sketch
		below), so that should a file not exist in that directory, it will:
		(a) verify that the ID is valid; otherwise, return a 404.
		(b) check whether the ID is already being processed; if not,
			kick off a process to generate the file
		(c) return a 503 status.
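
A rough sketch of that 404 handler, written as a WSGI app purely for
illustration (the ID pattern, the scratch path, the generate_derivative.sh
helper, and the Retry-After value are all assumptions):

# Hypothetical WSGI sketch of the 404 handler: if the derivative already
# exists the webserver serves it directly and never calls this; otherwise
# validate the ID, kick off generation if needed, and answer 503.
import os, re, subprocess

SCRATCH_DIR = "/var/scratch/pdf-derivatives"       # assumed path
VALID_ID = re.compile(r"^[0-9a-f]{2}/[0-9a-f]+$")  # assumed hashed-ID layout

def application(environ, start_response):
    doc_id = environ.get("PATH_INFO", "").lstrip("/")
    if not VALID_ID.match(doc_id):                  # (a) bogus ID: real 404
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"No such document.\n"]
    out_path = os.path.join(SCRATCH_DIR, doc_id + ".pdf")
    lock = out_path + ".inprogress"
    if not os.path.exists(lock):                    # (b) not already running
        os.makedirs(os.path.dirname(lock), exist_ok=True)
        open(lock, "w").close()
        # stand-in for whatever actually builds the reduced PDF
        subprocess.Popen(["/usr/local/bin/generate_derivative.sh", doc_id])
    # (c) tell the client to come back later
    start_response("503 Service Unavailable",
                   [("Retry-After", "300"), ("Content-Type", "text/plain")])
    return [b"Your copy is being generated; please try again shortly.\n"]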

Unfortunately, my initial testing (years ago) suggested that no
clients at the time properly handled 503 responses (effectively,
"try back in x minutes," with the time given in the Retry-After header).

The alternative is to just sleep for a period of time, and then
return the file once it's been generated ... but that breaks down for
ones that take a long time (some of my processing might take hours,
as the files it needs as input are stored near-line, and we're
at the mercy of a tape robot).
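
That sleep-and-wait approach might look something like the sketch below
(the poll interval and timeout are arbitrary, and it falls over in exactly
the hours-long, tape-robot case above):

import os, time

def wait_for_file(path, poll=10, timeout=600):
    """Block until the generated file shows up, then return its path.

    Returns None on timeout so the caller can fall back to a
    'try again later' page instead of holding the connection open."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path) and not os.path.exists(path + ".inprogress"):
            return path
        time.sleep(poll)
    return None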

...

You might also be able to sleep and then use one of the various
30x status codes, but I don't know what a client would do if you
redirected it back to the same URL.  (It might abort, to prevent looping.)

-Joe