LISTSERV 16.5 - CODE4LIB Archives

Hi,

Just a note in case someone wants to do this at scale: Statistics Canada has an
AI tool that converts PDFs to CSV.

https://www.statcan.gc.ca/en/data-science/projects#pdf-extraction



On Wed, Jun 22, 2022, 5:59 AM Owen Stephens <[log in to unmask]> wrote:

> There was some work at the Wellcome Collection several years ago on
> extracting tabular information from digitised materials. A brief review
> suggests that Abbyy FineReader Engine 11 was used to identify tables,
> although there were a number of challenges. How far those challenges were
> overcome wasn't clear to me, but if this is of interest
> there's a post at
>
> https://stacks.wellcomecollection.org/1-million-tables-and-counting-7e7e6c9f76e
> plus a report the Wellcome Collection commissioned at
>
> https://github.com/wellcometrust/wellcomecollection.org/files/2148381/Scoping.MOH.for.data.recovery.report.-.final.pdf
>
> Christy Henshaw at the Wellcome Collection may be able to share some of
> their experience and learning if you reach out to them
> https://twitter.com/chenshaw
>
> Best wishes
>
> Owen
>
> On Tue, 21 Jun 2022 at 19:47, Medina-Smith, Andrea M. (Fed) <
> [log in to unmask]> wrote:
>
> > Hello List,
> >
> > Has anyone had success converting tables in a PDF to CSV? These are scans
> > of paper documents from the 1970s onward. I know this isn’t a super easy
> > conversion, but I would think it’s not impossible either.
> >
> > Thanks,
> > Andrea
> >
> > --
> >
> > Andrea Medina-Smith
> > Data Librarian
> > Information Services Office
> > National Institute of Standards and Technology
> > [log in to unmask]
> > https://orcid.org/0000-0002-1217-701X
> >
> >
> >
>
> --
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Email: [log in to unmask]
>