You can use SQLite within R via the sqldf package, which lets you run SQL SELECT statements on data frames in memory. I've used it with datasets in this size range and it was fairly fast.
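A minimal sketch of what that looks like, assuming a CSV export with columns named bib_num and checkouts (placeholder names for illustration):

```r
library(sqldf)

# One row per item record, exported from the ILS as CSV
circ <- read.csv("circ_export.csv", stringsAsFactors = FALSE)

# Group item rows by bib record, totaling checkouts, and keep
# only bibs that circulated at all
by_bib <- sqldf("SELECT bib_num,
                        COUNT(*)       AS items,
                        SUM(checkouts) AS total_checkouts
                 FROM circ
                 GROUP BY bib_num
                 HAVING SUM(checkouts) > 0")
```

sqldf registers the data frame as a table behind the scenes (SQLite by default), so the SQL is the same GROUP BY you'd write in Access.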
Information Discovery Services Team
River Campus Libraries
University of Rochester
Rochester, NY 14627
[log in to unmask]
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Harrison G. Dekker
Sent: Thursday, August 06, 2015 12:08 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Processing Circ data
R would be ideal for the types of data manipulation you describe and would allow you to automate the entire process. If you can share a sample of your data and examples of the types of queries you're running, I'd be glad to help you get started.
If you'd like to keep a relational database in your workflow, check out SQLite. It's a file format rather than a database server, so it won't be an issue for your IT staff. There's a Firefox plug-in that provides basic client functionality, and you can also easily access the tables from R
(directly) or, for that matter, from Access or Excel via ODBC (not what I'd recommend, but it's possible!).
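For example (a sketch only; the file name and table name are placeholders), reading a SQLite table into R with DBI/RSQLite looks something like:

```r
library(DBI)
library(RSQLite)

# "circ.sqlite" is just a file on disk - no server involved
con <- dbConnect(RSQLite::SQLite(), "circ.sqlite")

# Pull a table (or any query result) straight into a data frame
items <- dbGetQuery(con, "SELECT * FROM items LIMIT 100")

dbDisconnect(con)
```

The same connection object also works with dplyr if you'd rather not write SQL at all.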
Head, Library Data Lab
UC Berkeley Libraries
On Thu, Aug 6, 2015 at 6:05 AM, Harper, Cynthia <[log in to unmask]> wrote:
> I have compacted the database, and I'm using the GROUP BY SQL query. I
> think I actually am hitting the 2 GB limit because of all the data I
> have for each row. I'm wondering whether the field I added for reserves
> history notes is being treated as a fixed-length field for every
> record, rather than variable length, even though it's only populated
> for the small number of records that have been put on reserve. I
> suppose if I exported my data in two tables - bib and item data - the
> database would be much more efficient than the flat-file approach I've
> been using.
> Time to turn the mind back on, rather than just taking the lazy approach every time...
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf
> Of Kevin Ford
> Sent: Wednesday, August 05, 2015 5:16 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Processing Circ data
> On the surface, your difficulties suggest you may need to look at a few
> optimization tactics. Apologies if these are things you've already
> considered and addressed - just offering a suggestion.
> This page is for Access 2003, but the items under "Improve query
> performance" should apply - I think - to newer versions also. I'll
> draw specific attention to 1) compacting the database; 2) making sure
> you have an index set up on the bib record number field and the number
> of circs field; and 3) making sure you are using the "Group by" SQL syntax.
> Now, I'm not terribly familiar with Access, so I can't actually help
> you with point-and-click instructions, but the above are common 'gotchas'
> that could be a problem regardless of RDBMS.
> https://support.microsoft.com/en-us/kb/209126
> On 8/5/15 4:01 PM, Harper, Cynthia wrote:
> > Well, I guess it could be bad data, but I don't know how to tell. I
> > think I've done more than this before.
> > I have a "Find duplicates" query that groups by bib record number.
> > That query seemed to take about 40 minutes to process. Then I added a
> > criterion to limit to only records that had >0 circs this year. That
> > query displays the rotating cursor, then says "Not Responding", then
> > the cursor again, and loops through that for hours. Maybe I can find
> > the Access bad data, but I'd be glad to find a more modern data
> > analysis tool. My db is 136,256 KB, but adding that extra query will
> > probably put it over the 2 GB mark. I've tried extracting to a CSV,
> > and that didn't work. Maybe I'll try a Make Table query to a separate db.
> > Or the OpenRefine suggestion sounds good too.
> > Cindy Harper
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf
> > Of Kevin Ford
> > Sent: Wednesday, August 05, 2015 4:23 PM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] Processing Circ data
> > Hi Cindy,
> > This doesn't quite address your issue, but, unless you've hit the
> > 2 GB Access size limit, Access can handle a good deal more than the
> > 250,000 item records ("rows," yes?) you cited.
> > What makes you think you've hit the limit? Slowness, something else?
> > All the best,
> > Kevin
> > 
> > https://support.office.com/en-us/article/Access-2010-specifications-1e521481-7f9a-46f7-8ed9-ea9dff1fa854
> > On 8/5/15 3:07 PM, Harper, Cynthia wrote:
> >> Hi all. What are you using to process circ data for ad-hoc queries?
> >> I usually extract CSV or tab-delimited files - one row per item
> >> record, with identifying bib record data, then total checkouts over
> >> the given time period(s). I have been importing these into Access
> >> and then grouping them by bib record. I think that I've reached the
> >> limits of scalability for Access for this project now, with 250,000
> >> item records. Does anyone do this in R? My other go-to software for
> >> data processing is the RapidMiner free version. Or do you just use
> >> MySQL or another SQL database? I was looking into doing it in R with
> >> RSQLite (just read about this and sqldf:
> >> .../make-r-speak-sql-with-sqldf/) because... I'm rusty enough in R
> >> that if anyone will give me some start-off data import code, that
> >> would be great.
> >> Cindy Harper
> >> E-services and periodicals librarian Virginia Theological Seminary
> >> Bishop Payne Library
> >> 3737 Seminary Road
> >> Alexandria VA 22304
> >> [log in to unmask]<mailto:[log in to unmask]>
> >> 703-461-1794