Hello Erich,
With the growing need for multi-format conversion and data-extraction tools
behind LLMs and VLMs (converting PDFs to images for VLM-style OCR
extraction, PDF to Markdown, web document parsing), there is now a rich set
of open-source libraries for processing and parsing PDF data, such as
MarkItDown <https://github.com/microsoft/markitdown>, Nougat
<https://github.com/facebookresearch/nougat>, olmOCR
<https://github.com/allenai/olmocr>, and docling
<https://github.com/docling-project/docling> (my personal favorite).
For example, I ran your file through the olmOCR demo site
<https://olmocr.allen.ai/>, and it produced very clean MARC data:
000 01221cam a2200193 4500
001 178194
005 20210316133917.0
008 060720 xx ||||| eng
245 ___ |a Category sorting box |h [kit].
260 ___ |a Carson, CA : |b Lakeshore.
500 ___ |a Lakeshore product "GG227".
500 ___ |a Targets standards in these areas: sorting & classifying, naming
objects from basic categories, word-object association.
505 0_ |a 30 miniatures -- 10 9" x 9" laminated mats.
520 ___ |a As children identify and sort miniatures into basic categories
like "foods" and "transportation," they strengthen their vocabulary and
develop the skills they need to become successful readers.
10 colorful activity mats are each printed with a different category, and
each features a helpful picture clue, so even nonreaders can sort and
match. Kids simply search for 3 objects that correspond with each category,
and place them on the mat.
521 1_ |a Ages 3 and up.
650 .0 |a Language arts (Primary)
650 .0 |a Early childhood education |x Activity programs.
650 .0 |a Educational games.
856 4_ |u https://libapps.s3.amazonaws.com/customers/230/images/MISC_17.jpg
|y Click here to view an image of this kit's contents
From there it is easy to convert to MARCXML with MarcEdit if needed.
Alternatively, you could continue with an AI-friendly pipeline using an
open-source SLM such as Phi-4 or Qwen3 to stay in a fully Python-scripted
workflow.
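If you stay in Python, the MARCXML step can also be scripted. Here is a
minimal sketch using only the standard library, not tested against your full
file. It assumes the text layout shown above: a three-digit tag, then the
indicators, then `|x` subfields, with wrapped lines belonging to the previous
field. The names `parse_fields` and `to_marcxml` are hypothetical helpers of
my own, not part of any library. Treating non-alphanumeric indicator
characters such as `_` and `.` as blanks is an assumption about the OCR
output, and the sketch does not restore the fixed-width spacing that the
leader and 008 normally carry.

```python
import re
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# A new field starts with a three-digit tag followed by a space.
# Assumption: wrapped text never starts with exactly three digits and a space.
FIELD_START = re.compile(r"^(\d{3}) (.*)$")

SAMPLE = """000 01221cam a2200193 4500
001 178194
245 ___ |a Category sorting box |h [kit].
500 ___ |a Targets standards in these areas: sorting & classifying, naming
objects from basic categories, word-object association.
650 .0 |a Educational games.
"""


def parse_fields(text):
    """Group lines into [tag, body] pairs, folding wrapped lines into the previous field."""
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = FIELD_START.match(line)
        if match:
            fields.append([match.group(1), match.group(2)])
        elif fields:
            fields[-1][1] += " " + line
    return fields


def to_marcxml(text):
    """Build a MARCXML <record> element from the OCR text layout."""
    # The namespace is written as a plain attribute so it appears on output.
    record = ET.Element("record", xmlns=MARC_NS)
    for tag, body in parse_fields(text):
        if tag == "000":
            ET.SubElement(record, "leader").text = body
        elif tag < "010":
            # Control fields (001-009) have no indicators or subfields.
            ET.SubElement(record, "controlfield", tag=tag).text = body
        else:
            # Assumption: "|" only ever introduces a subfield in data fields.
            head, *chunks = body.split("|")
            raw = head.strip()[:2].ljust(2)
            ind1, ind2 = (c if c.isalnum() else " " for c in raw)
            datafield = ET.SubElement(
                record, "datafield", tag=tag, ind1=ind1, ind2=ind2
            )
            for chunk in chunks:
                chunk = chunk.strip()
                if chunk:
                    subfield = ET.SubElement(datafield, "subfield", code=chunk[0])
                    subfield.text = chunk[1:].strip()
    return record


if __name__ == "__main__":
    print(ET.tostring(to_marcxml(SAMPLE), encoding="unicode"))
```

The result should still be validated (for example by opening it in MarcEdit)
before loading, since OCR errors in tags or indicators pass straight through.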
Good luck with your extraction process!
Best regards,
Géraldine
--
Géraldine Geoffroy
*SmartBibl.IA Solutions*
*Document engineering & AI solutions for libraries and documentation services*
On Mon, Jul 28, 2025 at 21:05, Hammer, Erich F wrote:
> Here is a random example.
>
> Don't grind too hard on it; I think we have found a bit of success feeding
> these to M365 CoPilot (which is licensed to us). It's not perfect and
> there is still some cleanup, but that would still be true if the data came
> in perfectly.
>
> Thanks,
> Erich
>
>
> On Monday, July 28, 2025 at 14:11, Wil Blake eloquently inscribed:
>
> Hello Erich, Can you paste an example of the PDF text Marc record into
> this thread? Regards, Wil Blake
>
>
>