I have written a few hacks allowing me to do rudimentary text mining against the logs. [1] From readme.txt:
This directory contains a number of files and scripts allowing
one to do a bit of text mining against the Code4Lib conference
IRC log files for 2011. This is just a beginning, and the
directory includes:
* irclog.txt - the raw log file downloaded from
http://irc.code4lib.org/c4l11/static/logs/irclog
* log2db.pl - reads the raw log and outputs a tab-delimited
file with three columns (date, name, text)
* irclog.db - the output of log2db.pl
* count.pl - outputs the number of names (n), increases (i),
decreases (d), URLs (u), and commands (c) found in the log;
useful for seeing what is hot and what is not.
* ngrams.pl - given an integer (n), outputs the most frequent
n-length phrases; useful to see what words and phrases are
used most frequently
* concordance.pl - a KWIC (keyword-in-context) index; the simplest of search engines
* readme.txt - this file
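To make the n-gram and concordance ideas concrete, here is a minimal Python sketch. It is not the code in ngrams.pl or concordance.pl, which are Perl and may work differently. The function names (tokenize, top_ngrams, concordance), the tokenizing rule, and the bracketed output format are all assumptions made for illustration. The input is assumed to be a list of message texts, such as the third column of irclog.db.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_ngrams(lines, n, limit=10):
    """Count n-word phrases within each line and return the most frequent ones."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        # Slide a window of n words across the line; phrases never span lines.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(limit)


def concordance(lines, keyword, width=30):
    """Return every occurrence of keyword with `width` characters of context per side."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for line in lines:
        for match in pattern.finditer(line):
            left = line[max(0, match.start() - width):match.start()]
            right = line[match.end():match.end() + width]
            # Right-align the left context so the keywords line up in a column.
            hits.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return hits


if __name__ == "__main__":
    sample = ["the live stream is up", "is the live stream down"]
    for phrase, count in top_ngrams(sample, 2, limit=5):
        print(count, phrase)
    for hit in concordance(sample, "stream", width=10):
        print(hit)
```

Counting phrases per line keeps a phrase from straddling two unrelated chat messages, which seems a reasonable choice for IRC logs but is a design assumption rather than something the readme states.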
Using these tools one can see that:
* Zoia had the most to say
* mbklein's karma was increased the most
* Zoia's karma was decreased the most
* the most popular URL passed around regarded social activities
* we tried to sing as many as 196 songs, closely followed by anagrams
* 28 of the songs weren't found
* live streams were mentioned frequently
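Tallies like the ones above can be sketched in a few lines of Python. This is not count.pl itself. It assumes the three tab-delimited columns (date, name, text) described in the readme, and it further assumes that karma changes appear in the text as "nick++" and "nick--", a common IRC bot convention that the readme does not spell out. The function name tally and the regular expression are hypothetical.

```python
import re
from collections import Counter

# Assumed karma syntax: a nick immediately followed by ++ or --,
# then whitespace or the end of the message.
KARMA = re.compile(r"(\w+)(\+\+|--)(?!\S)")


def tally(rows):
    """Count messages per speaker and karma increases/decreases per nick."""
    speakers, ups, downs = Counter(), Counter(), Counter()
    for row in rows:
        parts = row.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip lines that do not have date, name, and text
        _date, name, text = parts
        speakers[name] += 1
        for nick, op in KARMA.findall(text):
            (ups if op == "++" else downs)[nick] += 1
    return speakers, ups, downs


if __name__ == "__main__":
    # Assumes irclog.db sits in the current directory.
    with open("irclog.db", encoding="utf-8", errors="replace") as handle:
        speakers, ups, downs = tally(handle)
    print("most talkative:", speakers.most_common(1))
    print("most karma gained:", ups.most_common(1))
    print("most karma lost:", downs.most_common(1))
```

Note that this pattern would also count a message ending in "c++" as a karma increase for "c", so a real tally would probably need a list of known nicks to filter against.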
I have to go shovel snow now...
[1] initial hacks - http://bit.ly/gMO4op
--
Eric Lease Morgan