Hi Matt,
I don't know much about your docx file, but I've also recently been learning & using regular expressions, and I thought I'd send you a link to a handy tool in case you hadn't seen it yet: http://regexr.com/
I've found regexr extremely helpful while trying to create useful regular expressions. You can tweak your regular expression in regexr and instantly see the results. (They provide some default sample text to search, though you're free to type/paste in your own.)
If you hover your cursor over pieces of the regular expression, hints pop up and tell you what each part of the expression does; I've found it useful for learning how regular expressions work. There's also a nice cheatsheet on the left, which sometimes cuts down on how much Googling you need to do.
Also, in case this is potentially helpful... here is a regular expression that matches groups of two or more capital letters: http://regexr.com/3bbet
Perhaps this will do the trick when searching for words that are in all caps? (I make no guarantees; you might need to fiddle with it a bit.)
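In case you would rather try this in a script than in regexr, here is a minimal Python sketch of the same idea. The pattern below is my own assumption of what "two or more capital letters" looks like as a regex and is not necessarily identical to the one saved at the link above; the function name and sample text are made up for illustration. It only handles unaccented A-Z, so accented or non-Latin names would need a different character class.

```python
import re

# Assumed pattern (not necessarily the one saved on regexr):
# word boundary, two or more ASCII capital letters, word boundary.
ALL_CAPS = re.compile(r"\b[A-Z]{2,}\b")


def find_all_caps(text):
    """Return every standalone run of two or more capital letters in text."""
    return ALL_CAPS.findall(text)


# Hypothetical bibliography-style line, used only for illustration.
sample = "SHERMAN, Matt. A Nice Title. Written with JONES in the USA."
print(find_all_caps(sample))
```

Note that this also picks up acronyms such as "USA", so you would probably still need to filter the results by where they appear in each entry.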
As for searching for italicized words, I have no idea how to search for them unless they are surrounded by certain tags or signifiers. For instance, perhaps all italicized words are surrounded by tags like this: <em>Some Nice Title</em>. You could search for all phrases surrounded by those tags. But without a textual signifier like that, it's beyond me.
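If the text did turn out to have markers like that, a rough Python sketch of the search could look like the following. The <em> tags are only my hypothetical example from above, not something I know to be in your file, and the function name and sample text are invented for illustration.

```python
import re

# Non-greedy match for anything between <em> and </em>.
# DOTALL lets a tagged phrase span a line break.
EM_TAGGED = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def find_em_phrases(text):
    """Return the phrases found between <em>...</em> tags, whitespace-trimmed."""
    return [phrase.strip() for phrase in EM_TAGGED.findall(text)]


# Hypothetical sample text, used only for illustration.
sample = "SHERMAN, Matt. <em>Some Nice Title</em>. See also <em>Another Title</em>."
print(find_em_phrases(sample))
```

The non-greedy `.*?` matters here: a greedy `.*` would swallow everything from the first opening tag to the last closing tag on the line.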
Best,
-- Ivan Goldsmith
Web Project Analyst
University of Pennsylvania Libraries
----- Original Message -----
From: "Matt Sherman" <[log in to unmask]>
To: [log in to unmask]
Sent: Tuesday, July 7, 2015 11:56:15 AM
Subject: [CODE4LIB] Regex Question
Hi all,
I am working my way through teaching myself regex to parse an annotated
bibliography docx file and had a question as I can't seem to get a succinct
answer from Google. Is it possible to have regex find words, or in this
case names, displayed in all caps? Also, similarly, is it possible to
have regex find words, or in this case titles, that are italicized? Given
how the document is formatted doing both would be nice so that I could
parse them into a table or database, but I cannot find a clear answer on
that, though I am very new to regex so it is probably jumping into the deep
end on this. Any answers are appreciated.
Matt Sherman