On 18/09/17 21:37, Eric Lease Morgan wrote:
> A cool collection of early English print materials is available at the following URL:
> https://archive.org/details/bplsceep
>
> Again, can I programmatically read the contents of an Internet Archive collection?
The internetarchive command-line tool (ia) is what you need:
https://internetarchive.readthedocs.io/en/latest/
To get a list of all items in the collection:
$ ia search -i collection:bplsctpbs > bplsctpbs.txt
The text file contains one identifier per line:
$ wc -l bplsctpbs.txt
824 bplsctpbs.txt
$ head -n5 bplsctpbs.txt
accountofcountri00dobb_0
accountofenglish01lang
accountofenglish02lang
accountofenglish03lang
admirableeuentss00camu
Then you can fetch the metadata of all items
(using GNU parallel, https://www.gnu.org/software/parallel/):
$ parallel ia metadata {} :::: bplsctpbs.txt > all.json
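If GNU parallel is not installed, the same result can probably be had with a plain shell loop, only slower. Below is a minimal sketch, assuming the identifier list produced above; fetch_metadata is a hypothetical helper name of mine, not part of the ia tool. Either way, all.json ends up holding one JSON document per item rather than a single JSON array, so a reader should process it record by record.

```shell
# Sketch: fetch metadata for every identifier listed in a file, one at a time.
# fetch_metadata is a hypothetical helper name, not something ia provides.
fetch_metadata() {
    idlist=$1
    # Read one identifier per line; also handle a last line with no newline.
    while IFS= read -r id || [ -n "$id" ]; do
        # Skip blank lines.
        [ -n "$id" ] || continue
        # Keep ia away from the loop's stdin so it cannot swallow the list;
        # report failures on stderr and carry on with the next item.
        ia metadata "$id" </dev/null || echo "failed: $id" >&2
    done < "$idlist"
}

# Usage (needs network access and the ia tool installed):
#   fetch_metadata bplsctpbs.txt > all.json
```

The loop keeps going when a single item fails, so one bad identifier does not cost you the whole run.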