Hi Cynthia,

R would be ideal for the type of data manipulation you describe and would
let you automate the entire process. If you can share a sample of your
data and examples of the queries you're running, I'd be glad to help you
get started.
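
To give a sense of what that looks like, here's a minimal sketch of the
group-by step in base R. The column names (bib_num, circ_count) and the
file name are invented for the example - yours will differ:

```r
# Toy stand-in for an exported circ file; real column names will differ
circ <- data.frame(bib_num    = c("b100", "b100", "b200", "b300"),
                   circ_count = c(1, 2, 0, 5))

# A real tab-delimited export would be read with something like:
# circ <- read.delim("circ_export.txt", stringsAsFactors = FALSE)

# Total checkouts per bib record (the "group by bib record" step)
totals <- aggregate(circ_count ~ bib_num, data = circ, FUN = sum)

# Keep only bibs with at least one checkout
active <- totals[totals$circ_count > 0, ]
```

Once the data is in a data frame like this, adding further criteria is a
one-line filter rather than a query that runs for hours.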

If you'd like to keep a relational database in your workflow, check out
SQLite. It's a file format rather than a database server, so it won't be
an issue for your IT staff. There's a Firefox plug-in that provides basic
client functionality, and you can also easily access the tables from R
(directly) or, for that matter, from Access or Excel via ODBC (not what
I'd recommend, but it's possible!).
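
A hedged sketch of the R route, using the RSQLite package (table and
column names here are made up for illustration):

```r
library(DBI)
library(RSQLite)

# A toy data frame standing in for the exported item data
circ <- data.frame(bib_num    = c("b100", "b100", "b200"),
                   circ_count = c(2, 3, 0))

# Open (or create) a SQLite database file -- no server required
con <- dbConnect(RSQLite::SQLite(), "circ.sqlite")
dbWriteTable(con, "items", circ, overwrite = TRUE)

# Plain SQL, including GROUP BY, works directly against the file
per_bib <- dbGetQuery(con,
  "SELECT bib_num, SUM(circ_count) AS total_circs
     FROM items GROUP BY bib_num")

dbDisconnect(con)
```

Because SQLite is just a file on disk, you can hand the .sqlite file to
anyone without involving a database administrator.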

Harrison Dekker
Head, Library Data Lab
UC Berkeley Libraries

On Thu, Aug 6, 2015 at 6:05 AM, Harper, Cynthia <[log in to unmask]> wrote:

> I have compacted the database, and I'm using the GROUP BY SQL query. I
> think I actually am hitting the 2 GB limit because of all the data I have
> for each row. I'm wondering whether the field I added for reserves history
> notes is treated as a fixed-length field for every record, rather than
> variable length, even though it only appears for the small number of
> records that have been put on reserve.  I suppose if I exported my data in
> two tables - bib and item data - the database would be much more efficient
> than the flat-file approach I've been using.  Time to turn the mind back
> on, rather than just taking the lazy approach every time...
>
> Cindy
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Kevin Ford
> Sent: Wednesday, August 05, 2015 5:16 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Processing Circ data
>
> On the surface, your difficulties suggest you may need to look at a few
> optimization tactics. Apologies if these are things you've already
> considered and addressed - just offering a suggestion.
>
> This page [1] is for Access 2003, but the items under "Improve query
> performance" should apply - I think - to newer versions also.  I'll draw
> specific attention to 1) compacting the database; 2) making sure you have
> an index set up on the bib record number field and the number-of-circs
> field; and 3) making sure you are using the "GROUP BY" SQL syntax [2].
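>
> As a concrete illustration of (3), the shape of such a query (table and
> column names are invented for the example) would be:
>
> ```sql
> SELECT bib_record_number, SUM(circ_count) AS total_circs
> FROM circ_extract
> GROUP BY bib_record_number;
> ```
>
> With an index on bib_record_number, the database can group without
> scanning and sorting the whole table for every run.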
>
> Now, I'm not terribly familiar with Access so I can't actually help you
> with point/click instructions, but the above are common 'gotchas' that
> could be a problem regardless of RDBMS.
>
> Yours,
> Kevin
>
> [1] https://support.microsoft.com/en-us/kb/209126
> [2] http://www.w3schools.com/sql/sql_groupby.asp
>
>
>
> On 8/5/15 4:01 PM, Harper, Cynthia wrote:
> > Well, I guess it could be bad data, but I don't know how to tell. I
> > think I've done more than this before.
> >
> > I have a "Find duplicates" query that groups by bib record number.  That
> > query seemed to take about 40 minutes to process. Then I added a
> > criterion to limit it to records with >0 circs this year. That query
> > displays the rotating cursor, then says "Not Responding", then the
> > cursor again, and loops through that for hours.  Maybe I can find the
> > bad data in Access, but I'd be glad to find more modern data analysis
> > software.  My db is 136,256 KB.  But adding that extra query will
> > probably put it over the 2 GB mark.  I've tried extracting to a CSV,
> > and that didn't work. Maybe I'll try a make-table query into a
> > separate db.
> >
> > Or the OpenRefine suggestion sounds good too.
> >
> > Cindy Harper
> >
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf
> > Of Kevin Ford
> > Sent: Wednesday, August 05, 2015 4:23 PM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] Processing Circ data
> >
> > Hi Cindy,
> >
> > This doesn't quite address your issue, but, unless you've hit the 2 GB
> > Access size limit [1], Access can handle a good deal more than the
> > 250,000 item records ("rows," yes?) you cited.
> >
> > What makes you think you've hit the limit?  Slowness, something else?
> >
> > All the best,
> > Kevin
> >
> > [1] https://support.office.com/en-us/article/Access-2010-specifications-1e521481-7f9a-46f7-8ed9-ea9dff1fa854
> >
> >
> >
> >
> >
> > On 8/5/15 3:07 PM, Harper, Cynthia wrote:
> >> Hi all. What are you using to process circ data for ad-hoc queries?  I
> >> usually extract CSV or tab-delimited files - one row per item record,
> >> with identifying bib record data, then total checkouts over the given
> >> time period(s).  I have been importing these into Access and grouping
> >> them by bib record. I think I've reached the limits of Access's
> >> scalability for this project now, at 250,000 item records.  Does anyone
> >> do this in R?  My other go-to software for data processing is the
> >> RapidMiner free version.  Or do you just use MySQL or another SQL
> >> database?  I was looking into doing it in R with RSQLite (just read
> >> about this and sqldf:
> >> http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because ...
> >> I'm rusty enough in R that if anyone will give me some start-off data
> >> import code, that would be great.
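> >>
> >> (From that blog post, sqldf appears to let you run SQL directly on a
> >> data frame - a minimal sketch, with invented column names, would be
> >> something like:
> >>
> >> ```r
> >> library(sqldf)
> >> circ <- data.frame(bib_num    = c("b100", "b100", "b200"),
> >>                    circ_count = c(2, 3, 1))
> >> sqldf("SELECT bib_num, SUM(circ_count) AS total
> >>          FROM circ GROUP BY bib_num")
> >> ```
> >>
> >> - but I haven't tried it myself yet.)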
> >>
> >> Cindy Harper
> >> E-services and periodicals librarian
> >> Virginia Theological Seminary
> >> Bishop Payne Library
> >> 3737 Seminary Road
> >> Alexandria VA 22304
> >> [log in to unmask]<mailto:[log in to unmask]>
> >> 703-461-1794
> >>
>