LISTSERV 16.5 - CODE4LIB Archives

On Jan 16, 2020, at 9:43 AM, Mike Monaco wrote:

> A colleague and I are planning a workshop on using regular expressions and expect an audience of primarily public services librarians. I was hoping other users here could suggest some applications of regex that would be useful for librarians who are *not* working in technical services or IT (where the applications are much more obvious to me). For example, pointing out that some apps and programs, like Google Docs, can use regex for find/replace, web sites or databases that support regex in searches, and so on. Thanks in advance.


Hmmm... Rudimentary searching:

  * amass a set of plain text files, say Project Gutenberg texts
  * articulate an idea ("word") of interest
  * use grep to search for the idea in the set
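A minimal sketch of those three steps, assuming a throwaway folder with two made-up text files standing in for downloaded Project Gutenberg texts (the folder, the file names, and the search pattern are all hypothetical):

```shell
# Hypothetical stand-in for a folder of Project Gutenberg texts.
corpus=$(mktemp -d)
printf 'Call me Ishmael.\nThe whale surfaced.\nWhaling was hard work.\n' > "$corpus/moby.txt"
printf 'It was the best of times.\n' > "$corpus/tale.txt"

# -r searches the folder recursively, -i ignores case,
# -E enables extended regular expressions, -w matches whole words only,
# -l lists just the names of the files that contain a match.
matches=$(grep -r -i -E -w -l 'whal(e|es|ing)' "$corpus")
echo "$matches"
```

Dropping -l shows the matching lines themselves, which is often the more useful view for a first look.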

Clean/normalize a corpus:

  * amass a set of plain text files, say Project Gutenberg texts
  * download & install BBEdit
  * use BBEdit's "Multi-File Search..." function and regular expressions to do things like remove digits (\d+) from the set or remove two-letter "words" from the set (\b\w\w\b)
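For anyone without BBEdit, roughly the same cleanup can be sketched at the command line. This is only an approximation under stated assumptions: it assumes GNU sed, where [0-9]+ stands in for \d+ (sed has no \d shorthand) and where \b and \w are GNU extensions that may not work in the sed shipped with macOS. The sample sentence is made up.

```shell
# Hypothetical sample standing in for one line of noisy OCR text.
sample='Page 12 of 340: it is an old, old story.'

# First expression: delete runs of digits ([0-9]+ plays the role of \d+).
# Second expression: delete two-character "words" (\b\w\w\b, GNU sed only).
# tr -s then squeezes the runs of spaces left behind into single spaces.
cleaned=$(printf '%s\n' "$sample" | sed -E 's/[0-9]+//g; s/\b\w\w\b//g' | tr -s ' ')
echo "$cleaned"
```

The order matters here: \w also matches digits, so running the two-letter pass first would already remove a two-digit number such as "12".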

Such cleanup works wonders against ugly OCR, and as a bonus, the result will be much more amenable to topic modeling. I suspect Notepad++ includes similar functionality.

From a Linux or Mac OS X command line, count & tabulate all the words & numbers in a file (functional, not perfect, and ugly):

  $ cat file.txt | tr -d '\r' | tr '\n' ' ' | \
    tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | \
    tr -s ' ' '\n' | sort | uniq -c | sort -rn | less

Great for beginning to learn the "aboutness" of a file. For extra credit, remove stop words.
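One possible way to do the extra credit, sketched under the assumption that the stop words live in a plain text file with one lowercase word per line (the file names, the sample text, and the two-word stop list are all hypothetical):

```shell
# Hypothetical sample text and a tiny, hypothetical stop-word list.
workdir=$(mktemp -d)
printf 'The cat and the hat.\nThe cat sat.\n' > "$workdir/file.txt"
printf 'the\nand\n' > "$workdir/stopwords.txt"

# Same idea as the pipeline above, one word per line, plus one extra step:
# grep -v -x -F -f drops every line that exactly equals a stop word.
# tr -s squeezes repeated spaces so they do not become empty "words".
counts=$(tr -d '\r' < "$workdir/file.txt" | tr '\n' ' ' |
  tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' |
  tr -s ' ' '\n' | grep -v -x -F -f "$workdir/stopwords.txt" |
  sort | uniq -c | sort -rn)
echo "$counts"
```

The stop-word file should not contain blank lines, since an empty pattern given to grep matches every line and would filter out everything.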

Emphasize that regular expressions match the syntax ("shape") of words, not their semantics.
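A small illustration of the shape-versus-meaning point, using three made-up lines. The pattern describes what an ISO-style date looks like, so it accepts an impossible date and misses a real one written out in words:

```shell
# Hypothetical lines: a real date, an impossible date, and a date in words.
lines='2020-01-16
9999-99-99
January 16, 2020'

# The pattern is a shape: four digits, hyphen, two digits, hyphen, two digits.
# -x requires the whole line to have that shape.
shaped=$(printf '%s\n' "$lines" | grep -E -x '[0-9]{4}-[0-9]{2}-[0-9]{2}')
echo "$shaped"
```

Both of the first two lines match because they have the right shape; the third does not, even though it names the same day as the first.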

--
Eric Morgan