Hi Jonathan,

It is indeed working with the ProxyPass directive in Apache.
Now Google sees the IP address of the server, and apparently this does not create too much traffic. However, when it is busy, users might see the "We're sorry" page when they click on the link the API creates and travel through the NAT gateway. At least they can then enter the captcha and continue.
Today I have extended the service to books that do not have an ISBN. Google also accepts LCCN and OCLC numbers as identifiers for books, but neither number is present in our catalog. All of our books are in WorldCat, so there must be a link. I asked OCLC PICA (the Dutch branch of OCLC) to provide me with a service that returns an OCLC number when I present it with our national catalog number. They were very cooperative (http://webquery.blogspot.com/2008/03/hooray-for-oclc-pica-customer-response.html) and built this service (at first it returned plain text, but at my request they changed it into a true XML service). So now I call this service for the OCLC number and use that to invoke the Google Books API when an ISBN is missing from the catalog record. It works fine.
I only find that Google's service is very slow. When I watch the responses with Firebug, I see that the Google API takes about 10 to 20 times as long (130-250 ms) as the local parts of the page and twice as long as an Amazon book cover lookup. However, this is when all goes well. Around midday the response time slows down to 1.6 seconds, and at some moments to over 30 seconds. Five minutes later it can be back to normal. I checked books.google.com at such a moment and it did not respond at all. I guess they have heavily underpowered Google Books. Have you noticed this as well?
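
For anyone wanting to try the same thing, the ISBN-to-OCLC fallback described above could be sketched roughly as follows. This is only an illustration, not our production code: `lookup_oclc` is a hypothetical stand-in for the OCLC PICA service (whose address and response format are not shown here), the `/googlebooks` path assumes a reverse proxy like ours, and the `jscmd=viewapi` and `bibkeys` parameters are what I understand the Google Book Search viewability API to accept, so please check them against Google's documentation.

```python
from typing import Callable, Optional
from urllib.parse import urlencode


def bibkey_for_record(
    isbn: Optional[str],
    catalog_number: str,
    lookup_oclc: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return an identifier key for a catalog record, or None if there is none.

    lookup_oclc is a hypothetical callable standing in for the service that
    maps a national catalog number to an OCLC number.
    """
    if isbn and isbn.strip():
        # Prefer the ISBN; drop hyphens and spaces so only the digits remain.
        return "ISBN:" + isbn.replace("-", "").replace(" ", "")
    # Only call the (remote) OCLC lookup when the record has no ISBN.
    oclc_number = lookup_oclc(catalog_number)
    if oclc_number and oclc_number.strip():
        return "OCLC:" + oclc_number.strip()
    return None


def build_lookup_url(bibkey: str, base: str = "/googlebooks") -> str:
    """Build the request URL that goes through the local reverse proxy."""
    # Parameter names are an assumption, see the note above the code.
    query = urlencode({"jscmd": "viewapi", "bibkeys": bibkey})
    return f"{base}?{query}"
```

The lookup function is passed in as an argument so that the remote call is only made when the ISBN is missing, and so that it can be replaced by a stub when testing.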

Google got in touch with me about the problem and asked me where they could see the service. That won't help, since they cannot get past our NAT gateway. However, I will contact them, also about the poor response. The way we have implemented it now, the performance of our full record presentation is heavily influenced by the Google Books response times.
I haven't had time to get back to them, because I have been busy organizing the yearly European Library Automation Group (ELAG) meeting which we will host next week. http://library.wur.nl/elag2008

I'll CC this message to the list; it may be of use to others, and I wonder how others experience the Google Books performance.

Peter

Drs. P.J.C. van Boheemen
Hoofd Applicatieontwikkeling en beheer - Bibliotheek Wageningen UR
Head of Application Development and Management - Wageningen University and Research Library
tel. +31 317 48 25 17    http://library.wur.nl
Please consider the environment before printing this e-mail

________________________________

From: Jonathan Rochkind [mailto:[log in to unmask]]
Sent: Tue 8-4-2008 18:33
To: Boheemen, Peter van
Subject: Re: [CODE4LIB] Restricted access fo free covers from Google :)



Hi Pete, I'd be interested in an update on this. Is your ProxyPass with
Apache to access Google Books search API still working well for you, and
not running into Google traffic limiters?    You haven't actually
communicated with Google on this to set up something special or
anything, have you?

I'm interested in trying a similar thing here.

Jonathan

Boheemen, Peter van wrote:
> I don't think I do anything sophisticated like X-Forwarded-For. I just have a ProxyPass directive in the Apache configuration telling it to reverse proxy a directory to Google:
>
> ProxyPass /googlebooks http://books.google.com/books
>
> But what if Google did do something with an X-Forwarded-For header? It still cannot see where the actual user is located. Behind a NAT, addresses in the 10.0.0.0/8 range are usually used. In fact, it does not matter which IP addresses are used behind the NAT: since they are not exposed to the outside world, the only requirement is that they are unique within the network behind the NAT.
>
> Anyway, since we only hit Google Books from the server when a user asks for display of a full record, I hardly expect that this will trip the Google triggers. I suspect that the few thousand PCs on the university campus hitting Google cause the problem, to which Google Books in particular reacts. (I can still search Google when Google Books rejects access from my IP address.)
> I'll keep you informed.
>
> Peter
>
>
> Drs. P.J.C. van Boheemen
> Hoofd Applicatieontwikkeling en beheer - Bibliotheek Wageningen UR
> Head of Application Development and Management - Wageningen University and Research Library
> tel. +31 317 48 25 17    http://library.wur.nl
> Please consider the environment before printing this e-mail
>
> ________________________________
>
> From: Code for Libraries on behalf of Jonathan Rochkind
> Sent: Tue 18-3-2008 18:48
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Restricted access fo free covers from Google :)
>
>
>
> Nice. X-Forwarded-For would also allow Google to deliver availability
> information suitable for the actual location of the end user, if their
> software chooses to pay attention to it. That was the objection to
> server-side API requests voiced to me by a Google person. (By proxying
> everything through the server, you are essentially doing what I wanted
> to do in the first place but Google told me they would not allow. Ironic
> if you have more luck with that than with the actual client-side AJAXy
> requests that Google said they required!)
>
> Thanks for alerting us to X-Forwarded-For, that's a good idea.
>
> Jonathan
>
> Joe Hourcle wrote:
>
>> On Tue, 18 Mar 2008, Jonathan Rochkind wrote:
>>
>>
>>> Wait, now ALL of your clients' calls are coming from one single IP?
>>> Surely that will trigger Google's detectors, if the NAT did. Keep us
>>> updated though.
>>>
>> I don't know what Peter's exact implementation is, but they might relax
>> the limits when they see an 'X-Forwarded-For' header, or something else
>> to suggest it's coming through a proxy.  It used to be pretty common when
>> writing rate-limiting code to use X-Forwarded-For in place of REMOTE_ADDR
>> so you didn't accidentally ban groups behind proxies.  (Of course, I don't
>> know if the X-Forwarded-For value is something that's not routable (in
>> 10/8), or the NAT IP, so it might still look like 1 IP address behind a
>> proxy.)
>>
>> Also, by using a caching proxy (if the responses are cacheable), the total
>> number of requests going to Google might be reduced.
>>
>> I would assume they'd need to have some consideration for proxies, as I
>> remember the days when AOL's proxy servers channeled all requests through
>> less than a dozen unique IP addresses.  (or at least, those were the only
>> ones hitting my servers)
>>
>> -Joe
>>
>>
>
> --
> Jonathan Rochkind
> Digital Services Software Engineer
> The Sheridan Libraries
> Johns Hopkins University
> 410.516.8886
> rochkind (at) jhu.edu
>
>
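
To make Joe's point about rate limiting concrete, here is a rough sketch of how a server might choose the key it counts requests against. It says nothing about what Google actually does. `REMOTE_ADDR` and `HTTP_X_FORWARDED_FOR` are the usual CGI/WSGI environment names; the function name and the choice to combine a private forwarded address with the proxy address are assumptions made purely for illustration. Keep in mind that a client can put anything it likes in X-Forwarded-For.

```python
import ipaddress
from typing import Mapping


def rate_limit_key(environ: Mapping[str, str]) -> str:
    """Pick the identity to rate-limit on for one request."""
    remote = environ.get("REMOTE_ADDR", "")
    forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
    # The first entry in the header is the address the first proxy saw.
    first = forwarded.split(",")[0].strip()
    if not first:
        # No proxy header: all we know is the connecting address.
        return remote
    try:
        addr = ipaddress.ip_address(first)
    except ValueError:
        # Malformed header value: fall back to the connecting address.
        return remote
    if addr.is_private:
        # A non-routable address (for example in 10/8) is only unique behind
        # one particular proxy, so combine it with the proxy's own address.
        return f"{remote}/{first}"
    return first
```

With a key like this, users behind one proxy or NAT are counted separately whenever the proxy passes their addresses along, which is the case Joe describes. If the proxy only reports its own address, they still look like a single client.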

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu