On Tue, Aug 30, 2011 at 12:56 PM, Ken Irwin <[log in to unmask]> wrote:
> I have a feeling it may be time for me to learn some grown-up programming
> skills, and I hope someone here might be able to help.
>
> I'm basically building a big associative array encoding the name of the
> borrowing institution, the patron type (student,faculty,staff,etc), the item
> barcode, and the various bits of data we want for each of these items.
>
> // the first time we see a barcode
> $stats[$inst][$patron_type][$item_barcode][first] = $date;
> $stats[$inst][$patron_type][$item_barcode][min] = $renewals;
> $stats[$inst][$patron_type][$item_barcode][call] = $call_no;
> //subsequent instances of the barcode
> $stats[$inst][$patron_type][$item_barcode][max] = $renewals;
>
> Once I've chewed over all 4million records (40MB) , I spit it out into a
> new MySQL table that has the collated data that I want. Unfortunately this
> system breaks down when I get to so many millions of records.
>
> Is there a more efficient way of doing this kind of data transformation? I
> *could* not keep so much in the big stats array and instead make millions of
> "UPDATE" calls to the MySQL table, but that doesn't sound like a winning
> proposition to me. I imagine that I could also increase the memory
> allotment, but it will eventually get to big too. Or I suppose that I could
> do it all in chunks - right now I'm parsing the whole raw-data SQL table at
> once; I could do one institution's data at a time and buy myself some
> wiggle-room.
Sometimes it can make sense to use the database to do the aggregation; e.g.
CREATE TABLE Summary AS
SELECT inst,patron_type,item_barcode,min(date) "first",
min(call_no),min(renewals) "min_renewals", max(renewals) "max_renewals"
FROM Renewals
GROUP BY inst,patron_type,item_barcode;
|