On Wed, Aug 5, 2015 at 1:07 PM, Harper, Cynthia <[log in to unmask]> wrote:

> Hi all. What are you using to process circ data for ad-hoc queries.  I
> usually extract csv or tab-delimited files - one row per item record, with
> identifying bib record data, then total checkouts over the given time
> period(s).  I have been importing these into Access then grouping them by
> bib record. I think that I've reached the limits of scalability for Access
> for this project now, with 250,000 item records.  Does anyone do this in
> R?  My other go-to software for data processing is RapidMiner free
> version.  Or do you just use MySQL or other SQL database?  I was looking
> into doing it in R with RSQLite (just read about this and sqldf
> http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because I'm sure
> my IT department will be skeptical of letting me have MySQL on my desktop.
> (I've moved into a much more users-don't-do-real-computing kind of
> environment).  I'm rusty enough in R that if anyone will give me some
> start-off data import code, that would be great.
>

As has been mentioned already, it's worth investigating whether OpenRefine
or SQLite is an option for you. If not, I'd be inclined to explore
solutions that don't rely on your local IT dept.
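Since you asked for some start-off R code, here's a rough sketch of the
sqldf approach from the blog post you linked. The file name and columns
(circ_items.tsv, bib_id, title, checkouts) are placeholders for whatever
your extract actually contains:

    # install.packages(c("sqldf", "RSQLite"))  # one-time setup
    library(sqldf)

    # Read the tab-delimited extract; read.delim assumes tabs and a header row
    items <- read.delim("circ_items.tsv", stringsAsFactors = FALSE)

    # Group item rows by bib record and total the checkouts, as in Access
    by_bib <- sqldf("
      SELECT bib_id, title,
             COUNT(*)       AS item_count,
             SUM(checkouts) AS total_checkouts
      FROM items
      GROUP BY bib_id, title
      ORDER BY total_checkouts DESC
    ")

    head(by_bib)

sqldf quietly spins up an in-memory SQLite database behind the scenes, so
you get SQL grouping without installing anything your IT department would
notice.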

It's so easy to spend far more time going through approval, procurement,
and then negotiating local IT security/policies than actually working that
it pays to do a lot of things in the cloud. There are many services out
there, but I like Amazon for occasional needs because you can provision
anything you want in minutes and it's stupid cheap. If all you need is
MySQL for a few minutes now and then, just pay for Relational Database
Service (RDS). If you'd rather have a server and run MySQL on it, get an
EBS-backed EC2 instance (the reason to go this route rather than instance
store is better IO, and your data is all retained if you shut off the
server without taking a snapshot). Depending on your usage, bills of less
than a buck a month are very doable. If you need something that runs 24x7,
other routes will probably be more attractive. Another option is the MySQL
built into cheapo web hosting accounts like Bluehost, though you might
find that your disk IO gets you throttled. Still, it might be worth a shot.
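If you go the RDS route, pointing R at it is just a connection string away.
Here's a minimal sketch using DBI with RMySQL; the endpoint, credentials,
and table name are all made-up placeholders:

    library(DBI)
    library(RMySQL)

    # Hypothetical RDS endpoint and credentials -- substitute your own
    con <- dbConnect(MySQL(),
                     host     = "mydb.abc123.us-east-1.rds.amazonaws.com",
                     user     = "circ_user",
                     password = "not-this",
                     dbname   = "circ")

    totals <- dbGetQuery(con, "
      SELECT bib_id, SUM(checkouts) AS total_checkouts
      FROM items
      GROUP BY bib_id
    ")

    dbDisconnect(con)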

If doing this work on your desktop is acceptable (i.e. other people don't
need access to this service), you might seriously consider just doing it on
a personal laptop that you can install anything you want on. In addition to
MySQL, you can also install VirtualBox, which makes it easy to provision
servers that you can export to other environments or even carry around on
your cell phone.

With regards to some of the specific issues you bring up, 40 minutes for a
query on a database that size is insane, which indicates the tool you have
is not up to the job. Because of the way databases store info, query time
on indexed data grows logarithmically (rather than linearly) with table
size. In plain English, this means even queries on millions of records take
surprisingly little power. Based on what you've described, changing a field
from variable to fixed length might not save you any space and could even
increase it, depending on what you have. In any case, the difference won't
be worth worrying about.
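To make that concrete, here's roughly what loading your extract into an
on-disk SQLite file and indexing it looks like from R (file and column
names are again placeholders):

    library(DBI)
    library(RSQLite)

    # Load the same hypothetical extract into a persistent database file
    items <- read.delim("circ_items.tsv", stringsAsFactors = FALSE)
    con <- dbConnect(SQLite(), "circ.sqlite")
    dbWriteTable(con, "items", items, overwrite = TRUE)

    # A B-tree index is what gives you logarithmic lookups as the table grows
    dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_items_bib ON items (bib_id)")

    # 250,000 rows is tiny by SQLite standards; this should return instantly
    dbGetQuery(con, "SELECT SUM(checkouts) FROM items WHERE bib_id = 12345")
    dbDisconnect(con)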

Whatever solution you go with, I'd recommend learning to provision
resources for yourself when you can find some time. Work is hard enough
when you can't get the resources you need. When you can simply assign them
to yourself, the tools you need are always at hand, so life gets much
easier and more fun.

kyle