Very good results, for me, using https://tabula.technology/ (Mac OS X).
Bye. sb

> On 23 Jun 2022, at 21:52, Medina-Smith, Andrea M. (Fed) <[log in to unmask]> wrote:
> 
> Thanks everyone for the pointers. 
> 
> On 6/22/22, 7:05 AM, "Code for Libraries on behalf of Julien Tremblay McLellan" <[log in to unmask] on behalf of [log in to unmask]> wrote:
> 
>    Hi,
> 
>    Just a note in case someone wants to do this at scale: Statistics Canada has an
>    AI tool to convert PDFs to CSV.
> 
>    https://www.statcan.gc.ca/en/data-science/projects#pdf-extraction
> 
> 
> 
>    On Wed, Jun 22, 2022, 5:59 AM Owen Stephens <[log in to unmask]> wrote:
> 
>> There was some work at the Wellcome Collection several years ago on
>> extracting tabular information from digitised materials. A brief review
>> suggests that ABBYY FineReader Engine 11 was used to identify tables,
>> although there were a number of challenges. How far those challenges were
>> overcome wasn't clear to me from that review, but if this is of interest
>> there's a post at
>> 
>> https://stacks.wellcomecollection.org/1-million-tables-and-counting-7e7e6c9f76e
>> plus a report the Wellcome Collection commissioned at
>> 
>> https://github.com/wellcometrust/wellcomecollection.org/files/2148381/Scoping.MOH.for.data.recovery.report.-.final.pdf
>> 
>> Christy Henshaw at the Wellcome Collection may be able to share some of
>> their experience and learning if you reach out to them
>> https://twitter.com/chenshaw
>> 
>> Best wishes
>> 
>> Owen
>> 
>> On Tue, 21 Jun 2022 at 19:47, Medina-Smith, Andrea M. (Fed) <
>> [log in to unmask]> wrote:
>> 
>>> Hello List,
>>> 
>>> Has anyone had success converting tables in a PDF to CSV? These are scans
>>> of paper documents from the 1970s onward. I know this isn’t a super easy
>>> conversion, but I would think it’s not impossible either.
>>> 
>>> Thanks,
>>> Andrea
>>> 
>>> --
>>> 
>>> Andrea Medina-Smith
>>> Data Librarian
>>> Information Services Office
>>> National Institute of Standards and Technology
>>> [log in to unmask]
>>> https://orcid.org/0000-0002-1217-701X
>>> 
>>> 
>>> 
>> 
>> --
>> Owen Stephens
>> Owen Stephens Consulting
>> Web: http://www.ostephens.com/
>> Email: [log in to unmask]
>> 
>