How would you go about doing some analysis of your website's referrer
data?

I have committed to writing an article for the anniversary issue of
First Monday (as if I don't already have enough to do). Here is the
accepted/proposed title and abstract:

   Ethical issues surrounding freely available information
   found on the Web

   By reverse engineering Google queries and by tracing back
   the referrer values found in Apache log files, the use of
   content made available from infomotions.com is examined and
   ethical questions are asked. While all the content from the
   site is "freely" available under the GNU General Public License,
   the content is not always used in the intended manner. This
   raises interesting questions regarding the time spent making
   the content available, the expense of the hardware and
   network connections, and whether or not the application of
   the content is put to good and moral purposes. This essay
   addresses these and other ethical questions in an attempt to
   come to an understanding regarding the place of information
   and knowledge in an "open" environment.

I find it interesting to watch the content of my access_log scroll by
on my console. I am most interested in the referrer information. Most
of my hits originate as searches against Google. It is fun to feed
these queries back into Google to see what people searched for, what
the searches return, and on which results page my item appears. I
also see that a lot of the hits to my site come from MySpace.com,
where teenage and college-aged girls have incorporated some of my
pictures into their pages. Another common use is on "bulletin board"
systems where someone has used one of my pictures as their avatar. In
these second and third cases, should I expect some sort of
remuneration or at least a link back to infomotions.com?
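
Recovering the search phrase from a Google referrer can be done
mechanically rather than by eye. The following is a minimal sketch in
Python, assuming the phrase is carried in the "q" parameter of the
referring URL; the function name search_terms and the loose "google."
host test are illustrative choices, not anything taken from the logs
themselves.

```python
from urllib.parse import urlsplit, parse_qs


def search_terms(referrer):
    """Return the search phrase from a Google referrer URL, or None.

    Assumption: the phrase is in the "q" query parameter.
    """
    parts = urlsplit(referrer)
    host = parts.hostname or ""
    # Loose test so that www.google.com, google.co.uk, etc. all match.
    if "google." not in host:
        return None
    values = parse_qs(parts.query).get("q")
    return values[0] if values else None


if __name__ == "__main__":
    print(search_terms("http://www.google.com/search?q=alex+catalogue&hl=en"))
```

A referrer of "-" (no referrer recorded) or one from a non-Google
host simply yields None, so the function can be applied to every line
of the log without filtering first.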

Some hits come from really weird places. For example, the search for
"lease" brings back many hits about equipment rental, but sometimes
my name and/or the Alex Catalogue of Electronic Texts is linked from
the equipment rental site. Sort of strange if you ask me. They are
using my name, sort of. ("Is it 'my' name?")

In any event, I plan to take two months of access_log data and
extract the pages being looked at, along with the referrer
information, in order to examine more systematically how the content
on Infomotions is being incorporated into other sites. How would you
suggest I do this?
Presently I plan to extract the necessary information from my logs
and dump it into a flat database file where I will exploit various
incarnations of SQL SELECT statements. Count this. Group that. Sort
this way. Etc. Mind you, I am most interested in the one-off sort of
hits, not just the overall usage.
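
The extract-and-query plan above might be sketched as follows. This
is only one possible approach, assuming Python with its bundled
sqlite3 module standing in for the "flat database file", and assuming
the standard Apache "combined" layout. The table name hits, the file
names, and the function names are all hypothetical, and the regular
expression does not handle request or referrer strings that contain
escaped double quotes.

```python
import re
import sqlite3

# Apache "combined" format:
#   host ident user [time] "request" status bytes "referrer" "user-agent"
LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse(line):
    """Return (time, path, referrer) for one log line, or None."""
    m = LINE.match(line)
    if m is None:
        return None
    request = m.group("request").split()  # e.g. GET /alex/ HTTP/1.1
    path = request[1] if len(request) > 1 else ""
    return (m.group("time"), path, m.group("referrer"))


def load(lines, db):
    """Insert every parsable line into a (hypothetical) hits table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS hits (time TEXT, path TEXT, referrer TEXT)"
    )
    rows = (row for row in map(parse, lines) if row is not None)
    db.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
    db.commit()


def one_off_referrers(db):
    """Referrers that appear exactly once, with the page they led to."""
    return db.execute(
        "SELECT referrer, path FROM hits "
        "WHERE referrer NOT IN ('-', '') "
        "GROUP BY referrer HAVING COUNT(*) = 1 "
        "ORDER BY referrer"
    ).fetchall()


if __name__ == "__main__":
    db = sqlite3.connect("hits.db")
    with open("access_log", errors="replace") as log:
        load(log, db)
    for referrer, path in one_off_referrers(db):
        print(referrer, "->", path)
```

The HAVING COUNT(*) = 1 clause is what isolates the one-off hits;
changing it to ORDER BY COUNT(*) DESC would give the more usual
"top referrers" report instead.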

How would you go about doing this sort of analysis? All I have to
start with are my Apache "combined" access_log files.

--
Eric Lease Morgan
University Libraries of Notre Dame