On May 28, 2019, at 11:01 AM, Peter Murray <[log in to unmask]> wrote:
> The Journal site (journal.code4lib.org) is a lightly modified WordPress site, and the indexing is whatever comes with WordPress. (I would guess it renders the HTML to flat text with no regard for authorship and reference sections.) The issue is a WordPress category, the date is the WordPress post date (I think), and Title is the WordPress title. Author is a field we added to WordPress, and it is just a text field (authors are undistinguished in the field). Abstract is the WordPress summary. I think the RSS feed from the Journal might be a good place to get much of the information, although in some cases (like Author), further processing would be required. We also submit metadata to DOAJ (https://doaj.org/toc/1940-5758), the basis of which comes from a custom plugin; see, for example, http://journal.code4lib.org/issues/issue44/feed/doaj. (The coordinating editor downloads that file, manually checks/corrects XML errors, and uploads it to DOAJ.)
Peter, thank you. At first glance, a more thorough indexing process would be to:
1. regularly retrieve the feed/doaj file
2. parse it
3. save the result as metadata
4. harvest full text
5. index full text this way, that way, and the other way
6. associate the result of Step #5 with the result of Step #3
7. present the result
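
Steps #2, #3, and #6 might be sketched in Python along the following lines. This is only a rough sketch, not how the Journal does anything today. The element names (record, title, authors/author/name, abstract, fullTextUrl) are my assumption about what the feed/doaj file contains, the sample record is made up, and parse_doaj and associate are hypothetical helper names:

```python
import xml.etree.ElementTree as ET

# Made-up sample record; the element names are an assumption about the
# feed/doaj file, not something copied from the Journal's actual output.
SAMPLE = """<records>
  <record>
    <title>An Example Article</title>
    <authors>
      <author><name>Jane Doe</name></author>
      <author><name>John Roe</name></author>
    </authors>
    <abstract>A short abstract.</abstract>
    <fullTextUrl>http://example.org/articles/1</fullTextUrl>
  </record>
</records>"""


def parse_doaj(xml_text):
    """Steps #2 and #3: parse the file into a list of metadata dictionaries."""
    records = []
    for record in ET.fromstring(xml_text).iter("record"):
        records.append({
            "title": (record.findtext("title") or "").strip(),
            # one list item per author, instead of a single text field
            "authors": [(name.text or "").strip() for name in record.iter("name")],
            "abstract": (record.findtext("abstract") or "").strip(),
            "url": (record.findtext("fullTextUrl") or "").strip(),
        })
    return records


def associate(metadata, full_text_by_url):
    """Step #6: join each metadata record to its full text, keyed on URL."""
    return [dict(m, full_text=full_text_by_url.get(m["url"], "")) for m in metadata]


metadata = parse_doaj(SAMPLE)
combined = associate(metadata, {"http://example.org/articles/1": "full text here"})
```

For Step #1, the file itself could presumably be fetched with something like urllib.request.urlopen(url).read() and handed to parse_doaj; Steps #4 and #5 (harvesting and indexing) are left out of the sketch entirely.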
Hmmm... Interesting, and again, thank you.
--
Eric Morgan