Python's standard library includes a module (mailbox) that will parse mbox
files. It worked on the test file I downloaded from Gmail. If all you want
are the message bodies, it looks like you can do that in seven lines.
Obviously, one test file doesn't guarantee much of anything for the
Code4Lib jobs mbox files.
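A minimal sketch of that approach, using only the standard library's
mailbox module, might look like the following. The function name
message_bodies and the file name jobs.mbox are assumptions for
illustration, and it runs a little longer than seven lines because it
also walks multipart messages. It has not been tried on the actual
Code4Lib archives.

```python
import mailbox


def message_bodies(path):
    """Yield the first plain-text body of each message in an mbox file."""
    for message in mailbox.mbox(path):
        # walk() visits the message itself and, for multipart mail,
        # every nested part; only text/plain parts are of interest here.
        for part in message.walk():
            if part.get_content_type() != "text/plain":
                continue
            payload = part.get_payload(decode=True)  # bytes, or None
            if payload is None:
                continue
            # Assumption: fall back to UTF-8 when no charset is declared.
            charset = part.get_content_charset() or "utf-8"
            yield payload.decode(charset, errors="replace")
            break  # keep one body per message
```

Calling list(message_bodies("jobs.mbox")) would then give one string per
message that has a plain-text part. HTML-only messages are skipped, and a
message declaring an unknown charset would still raise an error, so real
archives will likely need more care.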
Looking at some of the posts on the web site, it looks like you'll have two
top-level problems with posts / message body content:
- posts that contain more than one job description
- posts that contain no job descriptions, just a link to a job
  description somewhere else.
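As a rough illustration of the second problem, a heuristic like the one
below could flag posts that are probably just a pointer to a job
description elsewhere. Everything in it is an assumption: the function
name looks_link_only, the URL pattern, and the 40-word threshold are
untuned guesses, not something tested against real postings.

```python
import re

# Deliberately simple pattern: anything that starts with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")


def looks_link_only(body, max_words=40):
    """Guess whether a post is only a link to a job description elsewhere."""
    if not URL_PATTERN.search(body):
        return False
    # Count the words left over once the URLs themselves are removed.
    remaining_words = URL_PATTERN.sub(" ", body).split()
    return len(remaining_words) <= max_words
```

Posts that this flags would still need a human look (or a fetch of the
linked page). Splitting posts that hold several job descriptions is a
harder problem and is not attempted here.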
I'm happy to continue this discussion either here or offline, and if
someone sends me an mbox file, I'll see what I can do (in seven lines (-:)).
Graeme Williams
Las Vegas, NV
p.s. I love scraping web pages
On Fri, Jan 22, 2021 at 10:39 AM Monica Maceli <[log in to unmask]> wrote:
> Hi all,
>
> I've done a couple projects mining the data from the code4lib listserv
> (e.g. https://ejournals.bc.edu/index.php/ital/article/view/5893 ). Both
> times the fastest route was finding helpful folks involved in it to provide
> me with a data dump vs. spending time on a scraper.
>
> The most recent work I did was in 2018 - I have a tarball of all the
> message log files for the listserv (some will be job posts and others not)
> which is 2003 through 2018. I believe I asked about this on the c4l Slack
> at the time and Wayne Graham from CLIR kindly helped me out with the data!
> This data is not anonymized (as it was/is publicly available with names
> and emails associated) but I did anonymize the findings for reporting.
>
> Ellen - I'd be happy to chat sometime about how I mined the data for job
> titles and related skills/technologies, feel free to reach out to me
> directly!
>
> Best,
>
> Monica Maceli, Ph.D.
> Associate Professor
> Pratt Institute | School of Information
> 144 W 14th St, 6th Floor, New York, NY, 10011-7301
> www.monicamaceli.com | [log in to unmask]
>
>
> On Fri, Jan 22, 2021 at 1:18 PM Andromeda Yelton <[log in to unmask]> wrote:
>
> > The initial commit in https://github.com/code4lib/shortimer/ was November
> > 2011, which is ten years for some values of ten. Taking a quick and
> > noncomprehensive glance around, I see postings as old as 2005. I don't see
> > an obvious API, but maybe a maintainer could weigh in about data dump
> > possibilities?
> >
> > On Fri, Jan 22, 2021 at 11:28 AM Eric Lease Morgan <[log in to unmask]> wrote:
> >
> > > On Jan 22, 2021, at 11:11 AM, Jill Ellern <[log in to unmask]> wrote:
> > >
> > > > I'm doing some research into systems librarian duties and wondering if
> > > > there is an easy way to get a dump of the code4lib jobs from the last 10
> > > > years? In excel format?
> > >
> > >
> > > Easy? I'd be surprised.
> > >
> > > There are two or three sources of the Code4Lib jobs data:
> > >
> > > 1. the underlying data from the jobs.code4lib.org site
> > >
> > > 2. any one of a number of different Code4Lib mailing list Web archives
> > >
> > > 3. the archived mailbox (mbox) files from the mailing list
> > >
> > > I don't think the jobs site has been around for ten years. Has it? Nor do
> > > I know whether or not the data is archived. If it is, then I'd bet you will
> > > be able to get it in some sort of structured format like JSON or a
> > > delimited format like Excel.
> > >
> > > Scraping different Web archives would require... scraping which,
> > > personally, I run away from.
> > >
> > > Finally, the archived mbox files would be the most comprehensive, but a
> > > programmer would have to parse the mbox (email) files, which is a
> > > specialized task in and of itself. If you want to know where the mbox files
> > > are located, then drop me a line and I'll let you know. Easy.
> > >
> > > Finally, what are the questions you would like to answer? How many systems
> > > librarian jobs have been posted? Where were the jobs? What are the
> > > characteristics of systems librarianship and how have they changed over
> > > time? How much do they pay? Extracting some of this information from the
> > > postings may be difficult, if not heroic in nature.
> > >
> > > --
> > > Eric Morgan
> > > University of Notre Dame
> >
> >
> >
> > --
> > Andromeda Yelton
> > Humanistic Machine Learning for Library Data
> > Lecturer, San José State University iSchool
> > https://andromedayelton.com
> > @ThatAndromeda
> > <http://twitter.com/ThatAndromeda>
> >
>