I've been looking briefly at sushi.py, as a way to orient myself to
collecting stats this way. I'm not intending to single out sushi.py, but
looking at it (mainly the data structure at this point, and not the code
itself), raises some questions about the best approach for collecting SUSHI
data.
sushi.py seems to have a small number of routines; mainly to retrieve the
XML file from a vendor and ingest the data in that file into a MySQL
database. There are only MySQL tables for COUNTER JR1, DR1, DR2, and DR2
reports and they mirror, to a degree, the structure of the item records
returned in the SUSHI xml. Here are the skeletons of 2 of the sushi.py SQL
tables:
counter_jr1
id int,
print_issn varchar,
online_issn varchar,
platform varchar,
item_name text,
data_type varchar,
date_begin datetime,
date_end datetime,
ft_pdf int,
ft_html int,
ft_total varchar
counter_db3
id int,
platform varchar,
item_name text,
data_type varchar,
date_begin datetime,
date_end datetime,
searches int,
sessions int
On the face of it, this seems like a pretty good data structure (although I
have a couple of concerns, that I will get to) but my main question is
whether there is any agreement about how to collect this data? If I were to
dig into some of the other SUSHI packages mentioned in this thread, what
would I find there? Excel-formatted COUNTER reports are simply a table of
columns, representing various fields, such as title (for JR1), platform,
publisher (for JR1), ISSN (for JR1), etc., followed by columns for up to 12
months of the collected year, and then summary data. JR1 reports have
fulltext HTML, PDF, and Total columns. DR1 has two rows, one for searches
and one for sesssions, with YTD totals in the final column. Similar data
structures exist for other COUNTER reports. They rely on the user to
interpret them and probably ought not to inform a decision for structuring
the data in a database. Is there been any best practice for how COUNTER
data is modeled in a database?
There are other COUNTER reports besides those four. For instance, some
journal vendors do indeed report searches and sessions using the DR3
report, but others use the equivalent JR4 report, so I would have expected
sushi.py to have a mechanism to collect these. Does SUSHI only deliver JR1,
DR1, DR2, and DR2 reports, or is this a problem with sushi.py?
Now, one of the selling points for SUSHI is that if a vendor ever advises
that you should re-collect data for a given time period, the xml you
receive is structured such that the act of collecting OUGHT TO update,
rather than duplicate, data previously collected. However in sushi.py's SQL
structure, which gives every row a unique (auto-incremented) ID number,
there would have to be logic applied during the ingest to prevent multiple
instances of data collected from the same vendor for the same time period.
So, that's a concern.
I'm also concerned about what is represented in the ft_pdf, ft_html, and
ft_total fields. In the Excel COUNTER reports, the ft_pdf, ft_html, and
ft_total columns simply tabulate the YTD totals and the only way you would
be able to derive a monthly breakdown would be to collect 12 monthly
reports and analyze the differences from month to month -- something that
most libraries don't do. I have to go back and confirm this, but I don't
think the SUSHI reports are giving a month-only breakdown for those fields,
so I wonder about their inclusion in that table. I guess my question is
what is returned in the SUSHI xml report: monthly or yearly figures for the
ft_pdf, ft_html, and ft_total fields?
Tom
|