Thank you Andrew, that is insanely useful.

cheers
stuart

On 30/01/14 12:00, Andrew Anderson wrote:
> When OCLC first announced their purchase of EZproxy a few years ago, we started a low-priority research project to see what the alternatives were and what it would take to bring them into a production-ready state.  The two open source solutions we evaluated were Squid and Apache HTTPd.  We considered other options (e.g. Apache Traffic Server), but limited the research to these two pieces of software since they are already widely used and familiar to most system administrators.
>
> Long story short, Squid did not support URL rewriting in a way that we felt could be supported well: it required either patches to the core C++ server code, an external rewriting process, or an ICAP server implementation.  Some of that has improved a bit since the original evaluation, but the built-in support for URL rewriting may still need some time to mature.  Another aspect of Squid that did not seem to be a good fit was that it is somewhat limited in its authentication mechanisms compared with Apache HTTPd.
>
> So we moved on to evaluating Apache HTTPd with the mod_proxy family of modules.  While Apache HTTPd does not support the same advanced cache federation features as Squid, it has grown to be a robust proxy solution in its own right, and the 2.4 release appears to have all of the required pieces out of the box, including the mod_proxy_html module.  In addition to basic URL rewriting support, you get full HTTP protocol support, mature IPv6 support, GZIP support, just about any authentication mechanism you need, a server on which you can easily self-host content, and a built-in HTTP object cache.
>
> How would it work?
>
> Here’s the current EZproxy stanza for ProQuest:
>
> HTTPHeader X-Requested-With
> HTTPHeader Accept-Encoding
> Title ProQuest
> URL http://search.proquest.com/ip
> DJ proquest.com
> HJ gateway.proquest.com
> DJ umi.com
> HJ fedsearch.proquest.com
> HJ literature.proquest.com
> DJ conquest-leg-insight.com
> DJ conquestsystems.com
> DJ m.search.proquest.com
> DJ media.proquest.com
> NeverProxy order.proquest.com
> NeverProxy rss.proquest.com
>
> Here’s an Apache HTTPd configuration using ProQuest that accomplishes much of the same functionality for the main search.proquest.com interface:
>
> <VirtualHost _default_:80>
>   ServerName search.proquest.com.fqdn
>
>   ProxyRequests Off
>   ProxyVia On
>
>   RewriteEngine On
>   RewriteRule ^/(.*) http://search.proquest.com/$1 [P]
>
>   <Location "/">
>    AllowMethods GET POST OPTIONS
>    ProxyPassReverse http://search.proquest.com/
>    ProxyPassReverseCookieDomain search.proquest.com search.proquest.com.fqdn
>    CacheEnable disk
>    SetOutputFilter INFLATE;DEFLATE
>    Header Append Vary User-Agent env=!dont-vary
>    # Put Authentication directives here
>    ErrorDocument 401 /path/to/login
>    Require valid-user
>   </Location>
> </VirtualHost>
>
> A few notes on this:
>
> - There is no need for NeverProxy: if you do not define a VirtualHost for the hostname, it is not proxied.  So instead of HJ and DJ lines, you add a new VirtualHost block for each hostname that needs to be proxied.  The astute will ask “what about services that have dozens or hundreds of host entries, like Sage?”  Those can be handled by mod_proxy_express (the ProxyExpress* directives) in Apache HTTPd.
>
> - There is no need for HTTPHeader: since Apache HTTPd is a full HTTP proxy/server, it supports all HTTP headers natively.
>
> - Some of the hostnames that are in EZproxy stanzas are not needed, and some are legacy hostnames that are no longer used by the vendor.
>
> - Some of the hostnames that are in EZproxy stanzas are for CDN hosted content that requires no special access (e.g. JavaScript/CSS/graphics assets that make up the vendor’s user interface).  Another example: how many of you have “DJ google.com” in one of your stanzas? Now how many of you registered your IP addresses with Google in any way?  Outside of Google Scholar, I suspect the answers to those questions are “nearly everyone” and “nearly no one”, respectively.
>
> - Some of the hostnames are for things that no sane person would do: How many people run their discovery services through their EZproxy server vs. authenticating their discovery platform by IP address with vendors directly?
>
> - Something that this configuration does that EZproxy does not do is enable object caching.  This can easily save 30-50% of your upstream bandwidth usage (Proxy/ProxySSL in EZproxy can achieve the same result with an external caching proxy server).
>
> - More complex vendor platforms (e.g. Gale Cengage) need ProxyHTML directives and ProxyHTMLURLMap configured, and multiple VirtualHost sections to get them fully working.  These can be a little fun to get working initially.
>
> - Some services need redirects edited to work correctly, and not break out of the proxy:
>
> 	Header edit Location http://vendor/ http://vendor.fqdn/
>
> - Some vendors send wrong HTTP headers for the MIME type, and mod_proxy_html exposes this in some cases as it rewrites the page.  There may be a better way to do this, but this is what I threw together for testing:
>
> 	<Location "/badpath">
> 		ProxyHTMLEnable Off
> 		SetOutputFilter INFLATE;dummy-html-to-plain
> 		ExtFilterOptions LogStdErr Onfail=remove
> 	</Location>
> 	ExtFilterDefine dummy-html-to-plain mode=output intype=text/html outtype=text/plain cmd="/bin/cat -"
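>
> For the many-hostname case raised in the notes above (services like Sage), a rough and untested sketch using mod_proxy_express might look like the following.  The hostnames, file paths and wildcard ServerAlias are placeholders rather than real vendor configuration.  mod_proxy_express only covers the equivalent of ProxyPass/ProxyPassReverse, so cookie domain rewriting and similar adjustments would presumably still need to be handled separately:
>
> 	# express-map.txt holds one "front-end hostname  back-end URL" pair per line, e.g.
> 	#   journal1.vendor.example.fqdn  http://journal1.vendor.example
> 	#   journal2.vendor.example.fqdn  http://journal2.vendor.example
> 	# Build the DBM file from it with:
> 	#   httxt2dbm -i express-map.txt -o express-map
> 	<VirtualHost _default_:80>
> 	  ServerName vendor.example.fqdn
> 	  ServerAlias *.vendor.example.fqdn
>
> 	  ProxyRequests Off
> 	  # Look up the requested Host header in the map and proxy to the matching back end
> 	  ProxyExpressEnable On
> 	  ProxyExpressDBMFile /etc/httpd/conf/express-map
> 	</VirtualHost>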
>
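> For the more complex platforms that need mod_proxy_html, a minimal sketch of one VirtualHost could look like this.  The vendor hostnames are hypothetical, the ProxyHTMLLinks lines are only a small subset of the sample proxy-html.conf shipped with HTTPd 2.4, and each additional vendor hostname would need its own VirtualHost along the same lines:
>
> 	<VirtualHost _default_:80>
> 	  ServerName vendor.example.fqdn
>
> 	  ProxyRequests Off
> 	  RewriteEngine On
> 	  RewriteRule ^/(.*) http://vendor.example/$1 [P]
>
> 	  # Tell mod_proxy_html which attributes contain URLs (2.4 has no built-in defaults)
> 	  ProxyHTMLLinks a href
> 	  ProxyHTMLLinks link href
> 	  ProxyHTMLLinks img src longdesc usemap
> 	  ProxyHTMLLinks script src for
> 	  ProxyHTMLLinks form action
>
> 	  <Location "/">
> 	    ProxyPassReverse http://vendor.example/
> 	    # Ask the vendor for uncompressed pages so the HTML can be rewritten
> 	    RequestHeader unset Accept-Encoding
> 	    ProxyHTMLEnable On
> 	    # Also rewrite URLs found in inline scripts and CSS
> 	    ProxyHTMLExtended On
> 	    # Map absolute vendor URLs back to their proxied equivalents
> 	    ProxyHTMLURLMap http://vendor.example/ http://vendor.example.fqdn/
> 	    ProxyHTMLURLMap http://images.vendor.example/ http://images.vendor.example.fqdn/
> 	  </Location>
> 	</VirtualHost>
>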
> So what’s currently missing in the Apache HTTPd solution?
>
> - Services that use an authentication token (predominantly ebook vendors) need special support written.  I have been entertaining using mod_lua for this to make this support relatively easy for someone who is not hard-core technical to maintain.
>
> - Services that are not IP authenticated, but use one of the Form-based authentication variants.  I suspect that an approach that injects a script tag into the page pointing to JavaScript that handles the form fill/submission might be a sane approach here.  This should also cleanly deal with the ASP.NET abominations that use __VIEWSTATE to store sessions client-side instead of server-side.
>
> - EZproxy’s built-in DNS server (enabled with the “DNS” directive) would need to be handled using a separate DNS server (there are several options to choose from).
>
> - In this setup, standard systems-level management and reporting tools would be used instead of the /admin interface in EZproxy.
>
> - In this setup, the functionality of the EZproxy /menu URL would need to be handled externally.  This may not be a real issue, as many academic sites already use LMS or portal systems instead of EZproxy to direct students to resources, so this feature may not be as critical to replicate.
>
> - And of course, extensive testing.  While the above ProQuest stanza works for the main ProQuest search interface, it won’t work for everyone, everywhere just yet.
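>
> As an illustration of the script-injection idea for form-based logins mentioned above, one possible (untested) approach is mod_substitute.  The /login path and /local/formfill.js are hypothetical names, and the script itself would have to be served locally, which means excluding /local/ from the proxying RewriteRule (for example with "RewriteRule ^/local/ - [L]" placed before it):
>
> 	<Location "/login">
> 	  # Ask the vendor for uncompressed pages so the filter sees plain HTML
> 	  RequestHeader unset Accept-Encoding
> 	  # Only run the substitution on HTML responses
> 	  AddOutputFilterByType SUBSTITUTE text/html
> 	  # Insert a locally hosted script just before the closing head tag
> 	  Substitute "s|</head>|<script src='/local/formfill.js'></script></head>|i"
> 	</Location>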
>
> Bottom line: Yes, Apache HTTPd is a viable EZproxy alternative if you have a system administrator who knows their way around Apache HTTPd, and are willing to spend some time getting to know your vendor services intimately.
>
> All of this testing was done on Fedora 19 for the 2.4 version of HTTPd, which should be available in RHEL7/CentOS7 soon, so about the time that hard decisions are to be made regarding EZproxy vs something else, that something else may very well be Apache HTTPd with vendor-specific configuration files.
>


-- 
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/