Print

Print


I did just bring in my own laptop to see if my problem is unique to my work computer.  I actually have used Amazon AWS, and yes, that might be the best option.  I've been looking into why my MSAccess job is limited to 25% of my CPU time - Maybe Access just can't use multiprocessors.  I'm going to investigate SLQite and OpenRefine on my presonal laptop.

Thanks all!
Cindy Harper

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Kyle Banerjee
Sent: Thursday, August 06, 2015 12:34 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Processing Circ data

On Wed, Aug 5, 2015 at 1:07 PM, Harper, Cynthia <[log in to unmask]> wrote:

> Hi all. What are you using to process circ data for ad-hoc queries.  I 
> usually extract csv or tab-delimited files - one row per item record, 
> with identifying bib record data, then total checkouts over the given 
> time period(s).  I have been importing these into Access then grouping 
> them by bib record. I think that I've reached the limits of 
> scalability for Access for this project now, with 250,000 item 
> records.  Does anyone do this in R?  My other go-to- software for data 
> processing is RapidMiner free version.  Or do you just use MySQL or 
> other SQL database?  I was looking into doing it in R with RSQLite 
> (just read about this and sqldf 
> http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because I'm sure my IT department will be skeptical of letting me have MySQL on my desktop.
> (I've moved into a much more users-don't-do-real-computing kind of 
> environment).  I'm rusty enough in R that if anyone will give me some 
> start-off data import code, that would be great.
>

As has been mentioned already, it's worth investigating whether OpenRefine or sqllite are options for you. If not, I'd be inclined to explore solutions that don't rely on your local IT dept.

It's so easy to spend far more time going through approval, procurement, and then negotiating local IT security/policies than actually working that it pays to do a lot of things on the cloud. There are many services out there, but I like Amazon for occasional need things because you can provision anything you want in minutes and they're stupid cheap. If all you need is mysql for a few minutes now and then, just pay for Relational Database Services. If you'd rather have a server and run mysql off it, get an EBS backed EC2 instance (the reason to go this route rather than instance store is improved IO and your data is all retained if you shut off the server without taking a snapshot). Depending on your usage, bills of less than a buck a month are very doable. If you need something that runs 24x7, other routes will probably be more attractive. Another option is to try the mysql built into cheapo web hosting accounts like bluehost, though you might find that your disk IO gets you throttled. But it might be worth a shot.

If doing this work on your desktop is acceptable (i.e. other people don't need access to this service), you might seriously consider just doing it on a personal laptop that you can install anything you want on. In addition to mysql, you can also install VirtualBox which is a great environment for provisioning servers that you can export to other environments or even carry around on your cell phone.

With regards to some of the specific issues you bring up, 40 minutes for a query on a database that size is insane which indicates the tool you have is not up for the job. Because of the way databases store info, performance degrades on a logarthmic (rather than linear) basis on indexed data. In plain English, this means even queries on millions of records take surprisingly little power. Based on what you've described, changing a field from variable to fixed might not save you any space and could even increase it depending on what you have. In any case, the difference won't be worth worrying about.

Whatever solution you go with, I'd recommend learning to provision yourself resources when you can find some time. Work is hard enough when you can't get the resources you need. When you can simply assign them to yourself, the tools you need are always at hand so life gets much easier and more fun.

kyle