Hi,

since I brought up the issue of the Google App Engine (GAE) (or
similar services, such as Amazon's EC2 "Elastic Compute Cloud"), I
thought I'd give a brief overview of what it can and cannot do, so
that we can judge its potential use for library services.

GAE is a cloud infrastructure into which developers can upload
applications. These applications are replicated among Google's network
of data centers and they have access to its computational resources.
Each application has access to a certain amount of resources at no
fee; Google recently announced the pricing for applications whose
resource use exceeds the "no fee" threshold [1]. The no fee threshold
is rather substantial: 500MB of persistent storage, and, according to
Google, enough bandwidth and cycles to serve about "5 million page
views" per month.

GAE applications must be written in Python. They run in a sandboxed
environment. This environment limits what applications can do and how
they communicate with the outside world.  Overall, the sandbox is very
flexible - in particular, application developers have the option of
uploading additional Python libraries of their choice with their
application. The restrictions lie primarily in security and resource
management. For instance, you cannot use arbitrary socket connections
(all outside world communication must be through GAE's "fetch" service
which supports HTTP/HTTPS only), you cannot fork processes or threads
(which would use up CPU cycles), and you cannot write to the
filesystem (instead, you must store all of your persistent data in
Google's scalable datastore, which is also known as BigTable).

All resource usage (CPU, Bandwidth, Persistent Storage - though not
memory) is accounted for and you can see your use in the application's
"dashboard" control panel. Resources are replenished on the fly where
possible, as in the case of CPU and Bandwidth. Developers are
currently restricted to 3 applications per account. Making
applications in multiple accounts work in tandem to work around quota
limitations is against Google's terms of use.

Applications are described by a configuration file that maps URI paths
to scripts in a manner similar to how you would use Apache
mod_rewrite.  URIs can also be mapped to explicitly named static
resources such as images. Static resources are uploaded along with
your application and, like the application, are replicated in Google's
server network.
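
To make the mapping idea concrete, here is a small sketch in plain
Python. It is not GAE's actual configuration syntax; it only assumes
that routes are tried in order and that the first matching pattern
wins. The patterns, directory, and script names are made up.

```python
import re

# Hypothetical routing table: (URI pattern, kind, target).
# Patterns, paths, and script names are invented for this sketch.
ROUTES = [
    (r"/images/(.*)", "static", "static/images/"),
    (r"/search", "script", "search.py"),
    (r"/.*", "script", "main.py"),
]


def resolve(path):
    """Return (kind, target) for the first route matching the whole path."""
    for pattern, kind, target in ROUTES:
        match = re.fullmatch(pattern, path)
        if match is None:
            continue
        if kind == "static":
            # Static routes map the rest of the path onto a file in a directory.
            return kind, target + match.group(1)
        return kind, target
    # No route matched at all.
    return None
```

Because the catch-all pattern comes last, the more specific routes
above it get the first chance to match.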

The programming environment is CGI 1.1.  Google suggests, but doesn't
require, the use of supporting libraries on top of this model, such as
a WSGI framework.  These high-level libraries allow applications to be
written in the compact, high-level style one is used to from Python.
In addition to the WSGI framework, you can use template libraries,
such as Django's.  Since the model is CGI 1.1, there are few or no
restrictions on what can be returned - you can return, for instance,
XML or JSON, and you have full control over the Content-Type header
returned.
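
As a rough illustration of WSGI layered on top of CGI, here is a
minimal sketch that uses only Python's standard wsgiref module, not
any GAE-specific library. The JSON payload is invented, and the code
is written for a current Python rather than the version GAE runs.

```python
import json
import wsgiref.handlers


def application(environ, start_response):
    """Minimal WSGI app: answers every request with a small JSON document."""
    body = json.dumps({"path": environ.get("PATH_INFO", "/")}).encode("utf-8")
    # The application picks the Content-Type itself.
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def main():
    # Bridge from the CGI environment (environment variables, stdin/stdout)
    # to the WSGI callable above.
    wsgiref.handlers.CGIHandler().run(application)

# A script run as CGI would end by calling main(); the call is left out
# here so the module can be imported without a CGI environment.
```

Returning XML instead would only mean changing the body and the
Content-Type value.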

The execution model is request-based.  When a client request arrives,
GAE will start a new instance (or reuse an existing instance if
possible), then invoke the main() method. At this point, you have a
set time limit to process this request (though not explicitly stated
in Google's doc, the limit currently appears to be 9 seconds) and
return a result to the client. Note that this per-request limit is a
maximum; you should usually be much quicker in your response. Also
note that any CPU cycles you use during those 9 seconds (but not time
you spend waiting to fetch results from other application tiers) count
against your overall CPU budget.

The key service the GAE runtime libraries provide is the Google
datastore, aka BigTable [2]. You can think of this service as a highly
efficient, persistent store for structured data: a simplified database
that allows the creation, retrieval, updating, and deletion (CRUD) of
entries using keys and, optionally, indices. It provides limited
support for transactions as well. Though it is less powerful than
conventional relational databases - which aren't nearly as scalable -
it can be accessed using GQL, a query language that's similar in
spirit to SQL.  Notably, GQL (or BigTable) does not support JOINs,
which means that you will have to adjust your traditional approach to
database normalization.
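
To show what that adjustment can look like, here is a sketch that uses
plain Python dicts in place of datastore entities. It shows two ways
of living without JOINs: combining records in application code, or
copying the needed field into the referencing record when it is
written. The record layout (books and libraries) is invented.

```python
# Plain dicts stand in for entities of two kinds; the data is invented.
libraries = {
    "lib1": {"name": "Main Library"},
    "lib2": {"name": "Science Library"},
}
books = [
    {"title": "Dune", "library_key": "lib1"},
    {"title": "Cosmos", "library_key": "lib2"},
]


def join_in_application(books, libraries):
    """Emulate a JOIN: look up each book's library by key, one lookup per book."""
    return [(b["title"], libraries[b["library_key"]]["name"]) for b in books]


def denormalize(books, libraries):
    """Copy the library name into each book record at write time,
    so that a later read needs no second lookup."""
    return [dict(b, library_name=libraries[b["library_key"]]["name"])
            for b in books]
```

The second variant trades extra storage, and extra work when a library
is renamed, for cheaper reads.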

The Python binding for the structured data is intuitive and seamless.
You simply declare a Python class for the properties of objects you
wish to store, along with the types of the properties you wish
included, and you can subsequently use a put() or delete() method to
write and delete. Queries will return instances of the objects you
placed in a given table.  Tables are named using the Python classes.
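
The following toy, in-memory stand-in may help picture this style of
binding. It is not the GAE API: the Model base class, the all()
helper, and the Book record are invented for illustration, and the
typed property declarations that GAE asks for are left out.

```python
class Model:
    """Toy in-memory stand-in for a datastore base class (not the GAE API)."""
    _tables = {}    # table name -> {key: instance}
    _next_key = 1   # simple counter used to hand out keys

    def put(self):
        """Store this object in the table named after its class; return its key."""
        table = Model._tables.setdefault(type(self).__name__, {})
        if getattr(self, "key", None) is None:
            self.key = Model._next_key
            Model._next_key += 1
        table[self.key] = self
        return self.key

    def delete(self):
        """Remove this object from its table, if it was stored."""
        table = Model._tables.get(type(self).__name__, {})
        table.pop(getattr(self, "key", None), None)

    @classmethod
    def all(cls):
        """Return every stored instance of this class."""
        return list(Model._tables.get(cls.__name__, {}).values())


class Book(Model):
    # Hypothetical record type; the property names are invented.
    def __init__(self, title, year):
        self.title = title
        self.year = year
        self.key = None
```

With this stand-in, Book("Dune", 1965).put() files the object under a
table called "Book", and Book.all() hands back Book instances rather
than raw rows.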

Google provides a number of additional runtime libraries, such as for
simple image processing a la Google Picasa, for sending email
(subject to resource limits), and for user authentication, solely
using Google accounts. User authentication is optional.

A shortcoming is that the API does not allow you to query how much
storage or other resources you've used (such as the number of emails
sent). Instead, a runtime error is thrown when you exceed your limit.
This may make the creation of applications that attempt to stay under
the "no fee" threshold difficult. [3]

There's also an SDK that replicates their environment (minus the
resource restrictions, and minus the scalable data store) locally.
Applications developed in this environment can be uploaded and
deployed using a single command and will appear momentarily at the
.appspot.com domain name. In my test application, this worked
flawlessly.

 - Godmar

[1] http://www.insideria.com/2008/05/google-app-engine-is-open-and.html
[2] http://labs.google.com/papers/bigtable.html
[3] It may be possible to scrape it off the dashboard, which displays
your current use.  The dashboard also shows all objects currently
stored in the persistent store.