On Tue, May 19, 2009 at 8:26 AM, Boheemen, Peter van wrote:
> Clever idea to put the TicToc stuff 'in the cloud'. How are you going to
> keep it up-to-date ?
By periodically reuploading the entire set (which takes about 15-20
mins), new or changed records can be updated. A changed record is one
with a new RSS feed for the same ISSN + Title combination; the data is
keyed by ISSN+Title. This process can be optimized by only uploading
the delta (you upload .csv files, so the delta can be obtained easily
via comm(1)).
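For illustration, here is a rough Python sketch of the same delta computation that comm(1) would give you on two sorted dumps. The column layout (ISSN, title, feed URL) and the function names are my assumptions for the example, not the actual upload format. A record whose feed URL changed shows up as a "new" row because the whole row differs.

```python
import csv
import io


def load_rows(text):
    """Parse CSV text into a set of row tuples, skipping blank lines."""
    return {tuple(row) for row in csv.reader(io.StringIO(text)) if row}


def delta(old_text, new_text):
    """Return rows that appear in the new dump but not in the old one.

    This mirrors `comm -13 old.csv new.csv` on sorted input: only the
    new or changed records remain and need to be uploaded.
    """
    return sorted(load_rows(new_text) - load_rows(old_text))


if __name__ == "__main__":
    # Hypothetical sample data; the columns are assumed, not the real format.
    old = "1234-5678,Journal A,http://a.example/rss\n"
    new = ("1234-5678,Journal A,http://a.example/rss2\n"
           "9999-0000,Journal C,http://c.example/rss\n")
    for row in delta(old, new):
        print(",".join(row))
```

Note that this only finds additions and changes; rows that disappeared from the new dump are not reported, which matches the removal problem described next.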
Removing records is a bit of a hassle since GAE does not provide an
easy-to-use interface for that. It's possible to wipe an entire table
clean by repeatedly deleting 500 records at a time (the entire set is
about 19,000 records), then doing a fresh import. This can be done by
uploading a "console" application into the cloud
(http://con.appspot.com/console/help/about). Alternatively, smaller
sets of records can be deleted via a "remove" handler, which I haven't
implemented yet. A script would need to POST the records to be removed
to that handler. I'll add it if anybody actually uses this. User
impact is low if old records aren't removed.
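The "delete 500 at a time" loop can be sketched roughly as below. This is deliberately not GAE code: fetch_keys and delete_keys are hypothetical callables standing in for whatever the datastore offers, and the batch size of 500 is just the figure mentioned above.

```python
def wipe_in_batches(fetch_keys, delete_keys, batch_size=500):
    """Delete all records by repeatedly removing one batch at a time.

    fetch_keys(limit) -> list of up to `limit` keys still stored
    delete_keys(keys) -> removes exactly those keys
    Both callables are placeholders for the real datastore calls.
    Returns the total number of records deleted.
    """
    total = 0
    while True:
        keys = fetch_keys(batch_size)
        if not keys:
            return total
        delete_keys(keys)
        total += len(keys)


if __name__ == "__main__":
    # In-memory stand-in for the table, sized like the data set above.
    store = set(range(19000))
    deleted = wipe_in_batches(
        lambda limit: sorted(store)[:limit],
        lambda keys: store.difference_update(keys),
    )
    print(deleted, len(store))
```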
A possible alternative is to have the GAE app periodically verify the
validity of each requested record with a server we'd have to run.
(Pulling the data straight from tictocs.ac.uk doesn't work since it's
larger than what you're allowed to fetch.) This approach would somewhat
defeat the idea of the cloud since we'd have to rely on keeping that
server operational, albeit at a lower degree of availability and load.
Another potential issue is the quota Google provides: you get 10 GBytes
and 1.3M requests free per 24-hour period, then they start charging
you ($0.12 per GByte).
I think I mentioned in my post that I included a non-GAE version of
the server that only requires mod_wsgi. For that standalone version,
keeping the data set up to date is implemented by checking the last
mod time of its local copy - it will reread its data when it detects
a more recent jrss.txt in its current directory, so keeping its data
up to date is as simple as periodically curling
http://www.tictocs.ac.uk/text.php
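The mod-time check amounts to something like the following. This is a simplified sketch of the idea, not the code from the standalone server; the class name and the line-based parsing are assumptions for the example.

```python
import os


class ReloadingData:
    """Keeps a file's lines in memory and rereads them when the file changes."""

    def __init__(self, path):
        self.path = path
        self.mtime = None   # mod time of the copy currently in memory
        self.lines = []

    def get(self):
        """Return the current lines, rereading the file if it is newer."""
        mtime = os.path.getmtime(self.path)
        if self.mtime is None or mtime > self.mtime:
            with open(self.path, encoding="utf-8") as f:
                self.lines = f.read().splitlines()
            self.mtime = mtime
        return self.lines
```

A request handler would call get() on each request, so a cron job that fetches a fresh jrss.txt is all that is needed to refresh the data.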
- Godmar