How do I go about munging Wikimedia content?

After realizing that downloadable data dumps of Wikipedia are sorted
by language code, I was able to acquire the 1.6 GB compressed data,
uncompress it, parse it with Parse::MediaWikiDump, and output things
like article title and article text.
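
For what it is worth, the parsing step looks roughly like this (just a
sketch against Parse::MediaWikiDump's Pages interface; the dump file
name is a placeholder):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Parse::MediaWikiDump;

  # open the uncompressed pages-articles dump (file name is a placeholder)
  my $pages = Parse::MediaWikiDump::Pages->new('enwiki-pages-articles.xml');

  # step through each article, printing the title and the raw wikitext
  while ( defined( my $page = $pages->next ) ) {
      next unless $page->namespace eq '';    # main namespace articles only
      print $page->title, "\n";
      print ${ $page->text }, "\n";          # text() returns a reference to the wikitext
  }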

The text contains all sorts of MediaWiki mark-up: [[]], '', #, ==, *,
etc. I suppose someone has already written something that converts
this markup into HTML and/or plain text, but I can't find anything.
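
If nothing turns up, I imagine a crude first pass with regular
expressions might look something like this; it is only a sketch and
handles just a handful of constructs:

  # very rough plain-text conversion; real articles are much messier than this
  sub strip_wikitext {
      my $text = shift;

      $text =~ s/\{\{[^{}]*\}\}//gs;                # drop simple (non-nested) templates
      $text =~ s/\[\[[^|\]]*\|([^\]]*)\]\]/$1/g;    # [[target|label]] becomes label
      $text =~ s/\[\[([^\]]*)\]\]/$1/g;             # [[target]] becomes target
      $text =~ s/'{2,}//g;                          # bold and italic quote runs
      $text =~ s/^=+\s*(.*?)\s*=+\s*$/$1/gm;        # == Heading == becomes Heading
      $text =~ s/^[*#:;]+\s*//gm;                   # list and indentation markers
      $text =~ s/<[^>]+>//g;                        # stray HTML tags
      return $text;
  }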

If you were to get the Wikipedia content, cache it locally, index it,
and provide access to the index, then how would you deal with the
wiki mark-up?

--
Eric Lease Morgan
University Libraries of Notre Dame