LISTSERV 16.5 - CODE4LIB Archives

I've not tried using the LCNAF RDF files, and I've not used RDFLib, but a couple of things from (a relatively small amount of) experience parsing RDF:

Don't try to parse the RDF/XML; use N-Triples instead.
As Kyle mentioned, you might want to use command line tools to strip the N-Triples down to only the data you actually want.
Rapper and the Redland RDF libraries are a good place to start, and have bindings for Perl, PHP, Python and Ruby (http://librdf.org/raptor/rapper.html and http://librdf.org). This StackOverflow Q&A might help with getting started: http://stackoverflow.com/questions/5678623/how-to-parse-big-datasets-using-rdflib
If you want to move between RDF formats, an alternative to Rapper is http://www.l3s.de/~minack/rdf2rdf/ - this succeeded in converting a file of 48 million triples from ttl to ntriples where Rapper failed with an 'out of memory' error (once in ntriples, Rapper can be used for further parsing).
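
As a rough illustration of the strip-it-down approach above, here is a minimal Python sketch (standard library only) that streams N-Triples line by line and keeps only the triples with one predicate. The skos:prefLabel predicate, the example.org URIs and the function names are assumptions for illustration, not taken from the LCNAF files, so check which predicates the real data uses. It also assumes one triple per line with no trailing comments; a proper parser such as Rapper is more forgiving.

```python
import io

# Assumed predicate of interest; the real files may use a different one.
PREF_LABEL = "<http://www.w3.org/2004/02/skos/core#prefLabel>"


def iter_triples(lines):
    """Yield (subject, predicate, object) strings from N-Triples lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Subject and predicate never contain whitespace, so two splits
        # are enough; the remainder is the object plus the final " ."
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # not a complete triple, ignore it
        subject, predicate, rest = parts
        obj = rest.rstrip()
        if obj.endswith("."):
            obj = obj[:-1].rstrip()  # drop the statement terminator
        yield subject, predicate, obj


def pref_labels(lines):
    """Yield (subject, label) pairs for the assumed label predicate."""
    for subject, predicate, obj in iter_triples(lines):
        if predicate == PREF_LABEL:
            yield subject, obj


# Made-up sample data standing in for a real N-Triples file.
sample = io.StringIO(
    "# made-up sample data\n"
    '<http://example.org/n1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Example, Name"@en .\n'
    "<http://example.org/n1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .\n"
)

labels = list(pref_labels(sample))
print(labels)
```

For a real download, pass a file object opened with open(..., encoding="utf-8") instead of the sample. Lines are read one at a time, so memory use should stay flat even for multigigabyte files.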


Some slightly random advice there, but maybe some of it will be useful!

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936

On 30 Sep 2014, at 15:54, Jeremy Nelson <[log in to unmask]> wrote:

> Hi Jean,
> I've found rdflib (https://github.com/RDFLib/rdflib) on the Python side exceedingly simple to work with and use. For example, to load the current BIBFRAME vocabulary as an RDF graph using a Python shell:
> 
> >>> import rdflib
> >>> bf_vocab = rdflib.Graph().parse('http://bibframe.org/vocab/')
> >>> len(bf_vocab)  # Total number of triples
> 1683
> >>> set(bf_vocab.subjects())  # A set of all unique subjects in the graph
> 
> 
> This module offers RDF/XML, Turtle, and N-triples support, along with various options for retrieving and manipulating the graph's subjects, predicates, and objects. I would advise installing the JSON-LD (https://github.com/RDFLib/rdflib-jsonld) extension as well.
> 
> Jeremy Nelson
> Metadata and Systems Librarian
> Colorado College
> 
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Jean Roth
> Sent: Tuesday, September 30, 2014 8:14 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] Python or Perl script for reading RDF/XML, Turtle, or N-triples Files
> 
> Thank you so much for the reply.
> 
> I have not investigated the LCNAF data set thoroughly.  However, my default/ideal is to read in all variables from a dataset.  
> 
> So, I was wondering if anyone had an example Python or Perl script for reading RDF/XML, Turtle, or N-triples files.  A simple/partial example would be fine.
> 
> Thanks,
> 
> Jean
> 
> On Mon, 29 Sep 2014, Kyle Banerjee wrote:
> 
> KB> The best way to handle them depends on what you want to do. You need 
> KB> to actually download the NAF files rather than countries or other 
> KB> small files as different kinds of data will be organized 
> KB> differently. Just don't try to read multigigabyte files in a text 
> KB> editor :)
> KB> 
> KB> If you start with one of the giant XML files, the first thing you'll 
> KB> probably want to do is extract just the elements that are 
> KB> interesting to you. A short string parsing or SAX routine in your 
> KB> language of choice should let you get the information in a format you like.
> KB> 
> KB> If you download the linked data files and you're interested in 
> KB> actual headings (as opposed to traversing relationships), grep and 
> KB> sed in combination with the join utility are handy for extracting 
> KB> the elements you want and flattening the relationships into 
> KB> something more convenient to work with. But there are plenty of other tools that you could also use.
> KB> 
> KB> If you don't already have a convenient environment to work on, I'm a  
> KB> fan of VirtualBox. You can drag and drop things into and out of your 
> KB> regular desktop or even access it directly. That way you can 
> KB> view/manipulate files with the Linux utilities without having to 
> KB> deal with a bunch of clunky file transfer operations involving 
> KB> another machine. Very handy for when you have to deal with multigigabyte files.
> KB> 
> KB> kyle
> KB> 
> KB> On Mon, Sep 29, 2014 at 11:19 AM, Jean Roth <[log in to unmask]> wrote:
> KB> 
> KB> > Thank you!  It looks like the files are available as RDF/XML, 
> KB> > Turtle, or N-triples files.
> KB> >
> KB> > Any examples or suggestions for reading any of these formats?
> KB> >
> KB> > The MARC Countries file is small, 31-79 kb.  I assume a script 
> KB> > that would read a small file like that would at least be a start 
> KB> > for the LCNAF
> KB> >
> KB> >
> KB>