On Wed, 9 May 2007, Tom Keays wrote:
> Still, it is worth asking: Has anyone made a stab at this -- ie,
> publically exposing server logs? Are there code examples (any
> real-world, generalizable examples would be welcome). Sorry for
> cross-posting this.
I've done it in the past -- typically using general analytics programs
(e.g., analog), or just parsing out the relevant data with Perl.
The problem is that, a few years ago, spammers started sending bogus
requests to servers to try to get their URLs to show up on stats pages.
In ORA's case, they're only showing the top 20, and they presumably get
lots of requests, so someone would have to hit them pretty hard to get
something to show up.
If you're thinking about exposing your server logs, I'd recommend the
following:
1. Don't give out IP addresses of the requestors
(privacy reasons)
2. Don't put any data supplied by the user agent on a public page,
including HTTP_USER_AGENT, HTTP_REFERER and QUERY_STRING. All
have been used by spammers to insert URLs to try to get links
back to their sites.
3. Filter out all entries with 'error' results (people trying to
probe your system for vulnerabilities, etc.)
4. Filter out all 'intranet' pages or other pages that the general
public shouldn't be going to.
5. Avoid giving information that provides signatures of the CMS
you're using, or other signatures of potential vulnerabilities.
6. Use robots.txt to ask search engines not to index whatever
pages you generate.
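
As a rough sketch of items 1 through 4, assuming the server writes
Apache-style Combined Log Format: the function below keeps only the
request path of successful requests and never looks at the IP address,
referer, user agent or query string. The names public_page_counts and
PRIVATE_PREFIXES, and the prefixes themselves, are made up for
illustration and would need adjusting for a real site.

```python
import re
from collections import Counter

# Combined Log Format:
#   host ident user [time] "request" status size "referer" "user-agent"
# Only the request target and the status are captured; the rest is ignored.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[A-Z]+ (?P<target>\S+)[^"]*" (?P<status>\d{3}) '
)

# Hypothetical prefixes for pages the general public shouldn't see listed.
PRIVATE_PREFIXES = ("/intranet/", "/admin/")


def public_page_counts(lines, top_n=20):
    """Return the top_n most requested public paths as (path, count) pairs."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # unparseable line: skip it rather than guess
        if not match.group("status").startswith("2"):
            continue  # drop errors and redirects (item 3)
        # Discard the query string so QUERY_STRING never reaches the page.
        path = match.group("target").split("?", 1)[0]
        if path.startswith(PRIVATE_PREFIXES):
            continue  # drop intranet-style pages (item 4)
        counts[path] += 1
    return counts.most_common(top_n)
```

Note that the path is still chosen by the client, so this only helps
if the server answers unknown paths with an error rather than a 200.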
For the particular case of generating tag clouds from search results, the
problem lies in that you typically need to use QUERY_STRING if it's a
local search script, and HTTP_REFERER if it's a remote search engine that
linked to you. Neither value can be trusted, since both are supplied by
the client.
In this particular case, I probably wouldn't try a fully automated
approach -- I'd generate the page, but require someone to manually verify
it before it got posted.
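
One way that semi-manual workflow could look, as a sketch only: pull
candidate words out of the query strings, throw away anything that
isn't a plain word, and publish only the terms a person has approved.
The "q" parameter name, the word pattern, and the function names are
all assumptions for the example, not anything a particular search
script is known to use.

```python
import re
from collections import Counter
from urllib.parse import parse_qs

# Assumption: the local search script takes its terms in a "q" parameter.
SEARCH_PARAM = "q"

# Accept only short plain words; URLs and markup contain other characters
# and so never become candidates.
WORD = re.compile(r"[a-z0-9]{2,30}")


def candidate_terms(query_strings):
    """Count plain-word search terms found in raw QUERY_STRING values."""
    counts = Counter()
    for query_string in query_strings:
        for value in parse_qs(query_string).get(SEARCH_PARAM, []):
            for word in value.lower().split():
                if WORD.fullmatch(word):
                    counts[word] += 1
    return counts


def publishable_terms(candidates, approved):
    """Keep only the terms a human reviewer has explicitly approved."""
    return {term: n for term, n in candidates.items() if term in approved}
```

The candidate list goes to the reviewer, and only the output of
publishable_terms would be used to build the public tag cloud.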
-----
Joe Hourcle
(insert some statement here about everything being my personal opinions,
and that I don't speak for any company, organization, etc.)