LISTSERV 16.5 - CODE4LIB Archives

For anyone wondering if microdata is being used, and by whom, the links
here give some interesting stats.

kc

-------- Forwarded Message --------
Subject: 	WebDataCommons releases 38.7 billion quads Microdata, Embedded
JSON-LD, RDFa and Microformat data originating from 7.4 million
pay-level-domains
Resent-Date: 	Thu, 11 Jan 2018 09:35:55 +0000
Resent-From: 	[log in to unmask]
Date: 	Thu, 11 Jan 2018 10:35:20 +0100
From: 	Anna Primpeli <[log in to unmask]>
To: 	[log in to unmask], [log in to unmask], [log in to unmask]



Hi All,

we are happy to announce the new release of the WebDataCommons
Microdata, JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the November 2017 version of the Common
Crawl covering 3.2 billion HTML pages which originate from 26 million
websites (pay-level domains).

In summary, we found structured data within 1.2 billion HTML pages out
of the 3.2 billion pages contained in the crawl (38.9%). These pages
originate from 7.4 million different pay-level domains out of the 26
million pay-level-domains covered by the crawl (28.4%).

Approximately 3.7 million of these websites use Microdata, 2.6 million
websites use JSON-LD, and 1.2 million websites make use of RDFa.
Microformats are used by more than 3.3 million websites within the crawl.

 

*Background:* 

More and more websites annotate data describing for instance products,
people, organizations, places, events, reviews, and cooking  recipes
within their HTML pages using markup formats such as Microdata, embedded
JSON-LD, RDFa and Microformat. 
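
As an illustration of what such markup looks like and how it can be read back out, here is a minimal sketch that pulls embedded JSON-LD out of an HTML page using only the Python standard library. It is not the extraction pipeline used by WebDataCommons (which relies on Any23); the class name, the helper function and the sample page are made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD instead of failing the whole page


def extract_jsonld(html):
    """Return a list with one parsed JSON value per embedded JSON-LD block."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample page, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Organization", "name": "Example Library"}
</script>
</head><body><p>Hello</p></body></html>
"""

for item in extract_jsonld(SAMPLE_HTML):
    print(item["@type"], "-", item["name"])
```

Microdata, RDFa and Microformats are spread across element attributes rather than kept in one script block, so they need a different kind of parser.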

The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and
Microformat data from the Common Crawl web corpus, the largest
web corpus that is available to the public, and provides the extracted
data for download. In addition, we publish statistics about the adoption
of the different markup formats as well as the vocabularies that are
used together with each format. We have run yearly extractions since 2012
and provide the dataset series as well as the related statistics at:

http://webdatacommons.org/structureddata/

 

*Statistics about the November 2017 Release:*

Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, and
Microformat data sets as well as the vocabularies that are used together
with each markup format are found at: 

http://webdatacommons.org/structureddata/2017-12/stats/stats.html


*Markup Format Adoption*

The page below provides an overview of the increase in the adoption of
the different markup formats as well as widely used schema.org classes
from 2012 to 2017:

http://webdatacommons.org/structureddata/#toc10

Comparing the statistics from the new 2017 release to the statistics
about the October 2016 release of the data sets

http://webdatacommons.org/structureddata/2016-10/stats/stats.html

we see that the adoption of structured data keeps increasing, while
Microdata remains the dominant markup syntax. Because the two crawls
used different crawling strategies, absolute numbers and certain
relative numbers are hard to compare between the two releases. More
concretely, we observe that the November 2017 Common Crawl corpus is
much deeper for certain domains like blogspot.com and wordpress.com,
while other domains are covered in a shallower way, with fewer URLs
crawled in comparison to the October 2016 Common Crawl corpus.
Nevertheless, it is clear that the growth rate of Microdata and
Microformats is much higher than that of RDFa and embedded JSON-LD.
Although the latter format is widespread, it is mainly used to
annotate metadata for search actions (80% of the domains using JSON-LD),
while only a few domains use it for annotating content information such
as Organizations (25% of the domains using JSON-LD), Persons (4% of the
domains using JSON-LD) or Offers (0.1% of the domains using JSON-LD).


*Vocabulary Adoption*

Concerning vocabulary adoption, schema.org, the vocabulary recommended
by Google, Microsoft, Yahoo!, and Yandex, continues to be the dominant
vocabulary in the context of Microdata: 78% of the webmasters use it,
whereas its predecessor, data-vocabulary, is used by only 14% of the
websites containing Microdata. In the context of RDFa, the Open Graph
Protocol recommended by Facebook remains the most widely used
vocabulary.


*Parallel Usage of Multiple Formats*

Analyzing topic-specific subsets, we discover some interesting trends.
As observed in the previous extractions, content-related information is
mostly described with the Microdata format or, less frequently, with
the JSON-LD format, in both cases using the schema.org vocabulary.
However, we find that 30% of the websites that use JSON-LD annotations
to describe product-related information use Microdata as well as
JSON-LD to cover the same topic. This is not the case for other topics,
such as Hotels or Job Postings, for which webmasters use only one
format to annotate their content.


*Richer Descriptions of Job Postings*

Following the release of the “Google for Jobs” search vertical and the
more detailed guidance by Google on how to annotate job postings
(https://developers.google.com/search/docs/data-types/job-posting), we
see an increase in the number of websites annotating job postings (2017:
7,023, 2016: 6,352). In addition, the job posting annotations tend to
become richer in comparison to the previous years as the number of Job
Posting related properties adopted by at least 30% of the websites
containing job offers has increased from 4 (2016) to 7 (2017). The newly
adopted properties are JobPosting/url, JobPosting/datePosted, and
JobPosting/employmentType.
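
The "adopted by at least 30% of the websites" measure can be sketched as follows. This is only an illustration of the arithmetic, not the code behind the published statistics; the function name, the input layout (website mapped to the set of properties it uses) and the four sample websites are assumptions made for this example.

```python
from collections import Counter


def widely_adopted(properties_by_site, threshold_percent=30):
    """Return the properties used by at least threshold_percent of the websites.

    properties_by_site maps a website (pay-level domain) to the collection of
    properties found on it. Each website counts at most once per property.
    """
    total = len(properties_by_site)
    counts = Counter()
    for props in properties_by_site.values():
        counts.update(set(props))
    # Integer comparison avoids floating-point surprises right at the threshold.
    return sorted(p for p, n in counts.items() if n * 100 >= threshold_percent * total)


# Hypothetical input, for illustration only.
SAMPLE_SITES = {
    "a.example": {"JobPosting/title", "JobPosting/datePosted"},
    "b.example": {"JobPosting/title", "JobPosting/datePosted"},
    "c.example": {"JobPosting/title", "JobPosting/salaryCurrency"},
    "d.example": {"JobPosting/title"},
}

print(widely_adopted(SAMPLE_SITES))
# → ['JobPosting/datePosted', 'JobPosting/title']
```

In the sample, JobPosting/salaryCurrency appears on one website out of four (25%) and therefore falls below the threshold.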

You can find a more extended analysis concerning specific topics, like
Job Posting and Product, here

http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html#extendedanalysis

 

*Download*

The overall size of the November 2017 RDFa, Microdata, Embedded JSON-LD
and Microformat data sets is 38.7 billion RDF quads. For download, we
split the data into 8,433 files with a total size of 858 GB.

http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html

In addition, we have created separate files for over 40 different
schema.org <http://schema.org/> classes, each including all quads
extracted from pages that use a specific schema.org class.

http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html
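
For readers who want to build a similar class-specific subset themselves, here is a rough sketch. It assumes the downloaded files are gzip-compressed N-Quads with the page URL in the graph position (please check the download page for the actual layout), and it uses a simplified regular expression rather than a full N-Quads parser. All function names and the sample quads are made up for this example.

```python
import gzip
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified N-Quads pattern: subject, <predicate>, object, <graph> "."
# Good enough for a sketch; a real parser should handle every escape rule.
QUAD_RE = re.compile(r'^(\S+)\s+<([^>]*)>\s+(.+?)\s+<([^>]*)>\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line does not match."""
    match = QUAD_RE.match(line)
    return match.groups() if match else None


def read_lines(path):
    """Yield text lines from a gzip-compressed file (assumed file format)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def quads_for_class(lines, class_iri):
    """Keep every quad from pages that contain at least one item of class_iri."""
    quads = [q for q in map(parse_quad, lines) if q]
    wanted = "<" + class_iri + ">"
    pages = {g for _, p, o, g in quads if p == RDF_TYPE and o == wanted}
    return [q for q in quads if q[3] in pages]


# Hypothetical sample quads, for illustration only.
SAMPLE_LINES = [
    '_:job1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/JobPosting> <http://example.org/jobs/1> .',
    '_:job1 <http://schema.org/title> "Metadata Librarian"@en <http://example.org/jobs/1> .',
    '_:p1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.org/shop/1> .',
]

for quad in quads_for_class(SAMPLE_LINES, "http://schema.org/JobPosting"):
    print(quad)
```

This version holds all quads of one file in memory, which is fine for a single file but not for the whole corpus. For a real file, pass `read_lines(path)` with the path of a downloaded file instead of `SAMPLE_LINES`.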

 

*Lots of thanks to:* 

+ the Common Crawl project for providing their great web crawl and
thus enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured
data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 
+ the Ministry of Economy, Research and Arts of Baden-Württemberg,
which supported the extraction and analysis of the November 2017 corpus
through the ViCE project.

*General Information about the WebDataCommons Project*

The WebDataCommons project extracts structured data from the Common
Crawl, the largest web corpus available to the public, and provides the
extracted data for public download in order to support researchers and
companies in exploiting the wealth of information that is available on
the Web. Besides the yearly extractions of semantic annotations from
webpages, the WebDataCommons project also provides large hyperlink
graphs, the largest public corpus of WebTables, a corpus of product
data, as well as a collection of hypernyms extracted from billions of
web pages for public download. General information about the
WebDataCommons project is found at 

http://webdatacommons.org/


Have fun with the new data set. 

Cheers, 
Anna Primpeli, Robert Meusel and Chris Bizer