Hi Eric
 
The best place to look is probably http://meta.wikimedia.org/wiki/Alternative_parsers 
 
I'm guessing the "non-parser dumper", which uses MediaWiki's internal code to do the rendering, might be a good choice.
 
regards
Dave Pattern
University of Huddersfield
 

________________________________

From: Code for Libraries on behalf of Eric Lease Morgan
Sent: Sun 10/09/2006 14:28
To: [log in to unmask]
Subject: [CODE4LIB] munging wikimedia



How do I go about munging wikimedia content?

After realizing that downloadable data dumps of Wikipedia are sorted
by language code, I was able to acquire the 1.6 GB compressed data,
uncompress it, parse it with Parse::MediaWikiDump, and output things
like article title and article text.
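
For reference, the dump-reading step might look roughly like this with
Parse::MediaWikiDump (a minimal sketch; the dump filename and the
namespace filter are my assumptions, not from the message above):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    # open the uncompressed pages-articles dump (filename is an assumption)
    my $pages = Parse::MediaWikiDump::Pages->new('enwiki-pages-articles.xml');

    while (defined(my $page = $pages->next)) {
        # keep only the main article namespace; skip Talk:, User:, etc.
        next unless $page->namespace eq '';
        print $page->title, "\n";
        my $text = ${ $page->text };   # text() returns a reference to the wikitext
        # ... cache or index $text here
    }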

The text contains all sorts of MediaWiki mark-up: [[]], \\, #, ==, *,
etc. I suppose someone has already written something that converts
this markup into HTML and/or plain text, but I can't find anything.
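
Lacking a ready-made converter, one crude fallback is to strip the most
common constructs with regular expressions. This is only a rough sketch
(the patterns are illustrative, not exhaustive, and will mangle nested
templates and tables), but it may be enough for indexing plain text:

    # reduce wikitext to rough plain text for indexing (assumed patterns)
    sub strip_wikitext {
        my $text = shift;
        $text =~ s/\{\{[^{}]*\}\}//gs;                   # drop simple {{templates}}
        $text =~ s/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/$1/g;  # [[target|label]] -> label
        $text =~ s/\[http\S*\s*([^\]]*)\]/$1/g;          # [http://... label] -> label
        $text =~ s/'{2,}//g;                             # ''italic'' and '''bold'''
        $text =~ s/^=+\s*(.*?)\s*=+$/$1/mg;              # == headings ==
        $text =~ s/^[*#:;]+\s*//mg;                      # list and indent markers
        return $text;
    }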

If you were to get the Wikipedia content, cache it locally, index it,
and provide access to the index, then how would you deal with the
Wiki mark-up?

--
Eric Lease Morgan
University Libraries of Notre Dame


