The Journal site (journal.code4lib.org) is a lightly modified WordPress site, and the indexing is whatever comes with WordPress. (I would guess it renders the HTML to flat text with no regard for authorship and reference sections.) The issue is a WordPress category, the date is the WordPress post date (I think), and Title is the WordPress title. Author is a field we added to WordPress, and it is just a text field (individual authors are not separated within the field). Abstract is the WordPress summary. I think the RSS feed from the Journal might be a good place to get much of the information, although in some cases (like Author), further processing would be required. We also submit metadata to DOAJ (https://doaj.org/toc/1940-5758), the basis of which comes from a custom plugin; see, for example, http://journal.code4lib.org/issues/issue44/feed/doaj. (The coordinating editor downloads that file, manually checks/corrects XML errors, and uploads it to DOAJ.)
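As a rough illustration of the RSS route, here is a minimal sketch in Python. It assumes the feed is ordinary RSS 2.0 with title, link, pubDate, category, and description elements on each item; I have not checked the Journal's feed against that, so treat the element names as assumptions. The sample feed, the function names (parse_items, split_authors, to_tsv), and the author-splitting rule are all hypothetical and only there to show the shape of the processing.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a real feed; not actual Journal data.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>An Example Article</title>
      <link>https://example.org/articles/1</link>
      <pubDate>Mon, 06 May 2019 12:00:00 +0000</pubDate>
      <category>Issue 44</category>
      <description>A short abstract.</description>
    </item>
  </channel>
</rss>
"""

FIELDS = ["title", "link", "date", "issue", "abstract"]


def parse_items(feed_xml):
    """Return one dict per <item>, assuming plain RSS 2.0 element names."""
    root = ET.fromstring(feed_xml)
    rows = []
    for item in root.iter("item"):
        rows.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "date": (item.findtext("pubDate") or "").strip(),
            # Assumption: the issue shows up as the item's category.
            "issue": (item.findtext("category") or "").strip(),
            "abstract": (item.findtext("description") or "").strip(),
        })
    return rows


def split_authors(author_field):
    """Split a free-text author string on commas, semicolons, '&', and 'and'.

    This is a guess at a workable rule; real author strings will
    probably need manual review.
    """
    parts = re.split(r"\s*(?:,|;|&|\band\b)\s*", author_field)
    return [p for p in parts if p]


def to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_tsv(parse_items(SAMPLE_FEED)))
    print(split_authors("Jane Doe, John Smith and Ann Lee"))
```

The TSV output is the kind of columnar file that can be sorted and grouped. The author splitter is kept separate because the Author value is a single free-text field and would need its own cleanup wherever it turns out to be exposed.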
Hope this helps -- sounds like you are doing some interesting work!
On May 16, 2019, 1:12 PM -0400, Eric Lease Morgan wrote:
> How is Code4Lib Journal indexed? What software is used, and more specifically, what characteristics of each article are included in the index?
> Our journal is pretty cool, but as a library-related journal, I think it can be better. For example, what are the various indexed fields? Maybe we can support faceted browsing? Search results are returned in a very narrative form -- a format that is not very computable. If search results were in some sort of columnar format (TSV, CSV, etc.), sorting and grouping would be possible, as well as analysis.
> Recently, I have been playing a lot with natural language processing, and this has resulted in the extraction of statistically significant keywords, named entities, parts-of-speech, and even the identification of sentences matching a given grammar. All of these things lend themselves to inputs for machine learning processes. In turn, the results of all these things can be re-incorporated into an index of Code4Lib. Thus the index not only supports find & get but also analysis. For a good time, I'd like to give this a go, just as an experiment.
> Is there someplace where I can download a rudimentary metadata file of all Code4Lib articles? At the least, I hope such a metadata file includes fields such as:
> * author(s)
> * title
> * date
> * abstract
> * link to full text
> * issue
> Is there a place where I can get such metadata?
> Eric Morgan