On Sep 18, 2017 3:30 PM, "raffaele messuti" <[log in to unmask]> wrote:

On 18/09/17 21:37, Eric Lease Morgan wrote:
> A cool collection of early English print materials is available at the
> following URL:
>   https://archive.org/details/bplsceep
>
> Again, can I programmatically read the contents of an Internet Archive
> collection?
this tool is what you need:
https://internetarchive.readthedocs.io/en/latest/

to get a list of all items of the collection:
$ ia search -i collection:bplsctpbs > bplsctpbs.txt

the txt file contains one identifier per row

$ wc -l bplsctpbs.txt
     824 bplsctpbs.txt

$ head -n5 bplsctpbs.txt
accountofcountri00dobb_0
accountofenglish01lang
accountofenglish02lang
accountofenglish03lang
admirableeuentss00camu

then you can fetch the metadata of all items
(using GNU parallel: https://www.gnu.org/software/parallel/ )

$ parallel ia metadata {} :::: bplsctpbs.txt > all.json
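
if GNU parallel is not installed, a plain shell loop should give the same result, just slower. this is only a sketch: the function name fetch_all and the IA variable are made up for illustration (IA lets you swap in another command for a dry run), and it assumes the list file has one identifier per row as above.

```shell
#!/bin/sh
# Sequential fallback for the parallel command above.
# IA is a hypothetical override variable; it defaults to the real `ia` CLI.
IA="${IA:-ia}"

# fetch_all FILE: run `ia metadata` once for every identifier listed in FILE.
fetch_all() {
    while IFS= read -r id; do
        # Skip blank rows so the command is never called without an identifier.
        [ -n "$id" ] || continue
        # Redirect stdin so the command cannot consume the rest of the list.
        "$IA" metadata "$id" < /dev/null
    done < "$1"
}
```

usage would be: fetch_all bplsctpbs.txt > all.json
setting IA=echo first prints the commands' arguments instead of contacting archive.org, which is a cheap way to check the list before a long run.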




--
[log in to unmask]