<!DOCTYPE html>
<html><head><title>Web Data Commons</title>
<link rel="stylesheet" href="style.css" type="text/css" media="screen"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-30248817-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body>
<div id="logo" style="text-align:right; background-color: white;"> <a href="http://dws.informatik.uni-mannheim.de"><img src="images/ma-logo.gif" alt="University of Mannheim - Logo"></a> <br></div>
<div itemscope itemtype="http://schema.org/WebObservatory">
<div id="header">
<h1 style="font-size: 250%;"><span itemprop="name">Web Data Commons</span></h1>
</div>
<meta itemprop="url" content="http://webdatacommons.org" />
<div id="tagline"><span itemprop="description">Extracting Structured Data from the Common Crawl</span></div>
<div id="content">
<p>
The Web Data Commons project extracts structured data from the <a href="http://commoncrawl.org/">Common Crawl</a>, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web.
</p>
<h2 id="news">News</h2>
<ul>
<li><strong>2024-02-01: We have released the WDC <a href="structureddata/#results-2023-1">RDFa, Microdata, Microformat, and Embedded JSON-LD</a> data sets extracted from the October 2023 Common Crawl corpus and created multiple <a href="structureddata/2023-12/stats/schema_org_subsets.html">schema.org class-specific subsets</a>. </strong></li>
<li><strong>2024-02-01: We have released the WDC <a href="http://webdatacommons.org/structureddata/schemaorgtables/2023/index.html">Schema.org Table Corpus 2023</a> which contains ~5M tables and is based on the October 2023 WDC schema.org extraction.</strong></li>
<li><strong>2024-01-08: We have released the WDC <a href="structureddata/#results-2023-1">RDFa, Microdata, Microformat, and Embedded JSON-LD</a> data sets extracted from the October 2023 Common Crawl corpus and created multiple <a href="structureddata/2023-12/stats/schema_org_subsets.html">schema.org class-specific subsets</a>. </strong></li>
<li><strong>2023-06-22: We have released <a href="../largescaleproductcorpus/wdc-block/">WDC Block</a>, a benchmark for comparing the performance of blocking methods. WDC Block features a maximal Cartesian product of 200 billion pairs of product offers which were extracted from 3,259 e-shops.</strong></li>
<li><strong>2023-01-25: We have released the WDC <a href="structureddata/#results-2022-1">RDFa, Microdata, Microformat, and Embedded JSON-LD</a> data sets extracted from the October 2022 Common Crawl corpus and created multiple <a href="structureddata/2022-12/stats/schema_org_subsets.html">schema.org class-specific subsets</a>. </strong></li>
<li><strong>2022-12-22: We have released the <a href="https://webdatacommons.org/largescaleproductcorpus/wdc-products/">WDC Products</a> benchmark for fine-grained evaluation of the performance of entity matching methods along three dimensions.</strong></li>
<li><strong>2022-09-22: We have released the WDC <a href="http://webdatacommons.org/structureddata/sotab/">Schema.org Table Annotation Benchmark</a> for evaluating the performance of methods for annotating columns of Web tables with terms from the Schema.org vocabulary.</strong></li>
<li><strong>2022-01-04: We have released the WDC <a href="structureddata/#results-2021-1">RDFa, Microdata, Microformat, and Embedded JSON-LD</a> data sets extracted from the October 2021 Common Crawl corpus and created multiple <a href="structureddata/2021-12/stats/schema_org_subsets.html">schema.org class-specific subsets</a>. </strong></li>
<li><strong>2021-09-10: We have released the <a href="http://webdatacommons.org/largescaleproductcorpus/v2020/index.html">WDC Product Data Corpus V.2020</a>, extracted from the <a
href="2020-12/stats/schema_org_subsets.html">December 2020 WDC schema.org Product and Offer subsets</a>.</strong></li>
<li><strong>2021-03-29: We have released the WDC <a href="http://webdatacommons.org/structureddata/schemaorgtables/">Schema.org Table Corpus</a>, which was created by grouping the December 2020 <a
href="2020-12/stats/schema_org_subsets.html">schema.org class-specific subsets</a> into relational tables.</strong></li>
<li><strong>2021-03-22: The paper <a href="https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_Research/Web-based_Systems/pub/Brinkmann-Bizer-Improving_Hierarchical_Product_Classification_using_domain_specific_laguage_modelling-PKG4Ecommerce2021.pdf">Improving Hierarchical Product Classification using Domain-specific Language Modelling</a> has been accepted at the <a href="https://km-ecomm-www.github.io/2021/">Knowledge Management in e-Commerce workshop</a> held in conjunction with <a href="https://www2021.thewebconf.org/">The Web Conference 2021</a>.</strong></li>
<li><strong>2021-01-21: We have released the WDC <a href="structureddata/2020-12/stats/stats.html">RDFa, Microdata, Microformat, and
Embedded JSON-LD</a> data sets extracted from the September 2020 Common Crawl corpus. </strong></li>
<li><strong>2020-08-24: The paper <a href="https://data.dws.informatik.uni-mannheim.de/largescaleproductcorpus/data/v2/papers/DI2KG2020_Peeters.pdf">Intermediate Training of BERT for Product Matching</a> using <a href="http://webdatacommons.org/largescaleproductcorpus/v2/index.html">Version 2.0</a> of the WDC Product Data Corpus and Gold Standard for Large-Scale Product Matching has been accepted at the <a href="http://di2kg.inf.uniroma3.it/2020/#">DI2KG workshop</a> held in conjunction with <a href="https://vldb2020.org/">VLDB2020</a>.</strong></li>
<li><strong>2020-07-01: We will present the paper <a href="https://data.dws.informatik.uni-mannheim.de/largescaleproductcorpus/data/v2/papers/WIMS2020_Peeters.pdf">Using schema.org Annotations for Training and Maintaining Product Matchers</a> using <a href="http://webdatacommons.org/largescaleproductcorpus/v2/index.html">Version 2.0</a> of the WDC Product Data Corpus and Gold Standard for Large-Scale Product Matching at the <a href="https://wims2020.sigappfr.org/pp/">WIMS2020</a> conference.</strong></li>
<li><strong>2020-03-19: The <a href="http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=100326">CfP</a> for the <a href="https://ir-ischool-uos.github.io/mwpd/">Semantic Web Challenge</a>@<a href="https://iswc2020.semanticweb.org/">ISWC2020</a> "Mining the Web of HTML-embedded Product Data" has been announced. The <a href="http://webdatacommons.org/largescaleproductcorpus/v2/index.html">WDC Product Data Corpus and Gold Standard V2.0</a> will be used as training and evaluation resources for the Product Matching task.</strong></li>
<li><strong>2020-01-13: We have released the WDC <a href="structureddata/2019-12/stats/stats.html">RDFa, Microdata, Microformat, and
Embedded JSON-LD</a> data sets extracted from the November 2019 Common Crawl corpus.</strong></li>
<li><strong>2019-10-23: <a href="http://webdatacommons.org/largescaleproductcorpus/v2/index.html">Version 2.0</a> of the WDC Product Data Corpus and Gold Standard for Large-Scale Product Matching released.</strong></li>
<li><strong>2019-07-19: We have released the <a href="T4LTE/">Web Tables for Long-Tail Entity Extraction (T4LTE)</a> dataset, the first gold standard for the task of long-tail entity extraction from web tables.</strong></li>
<li><strong>2019-07-19: We have released the <a href="TDGT/">Time-Dependent Ground Truth (TDGT)</a>, a dataset covering time-dependent data from various domains.</strong></li>
<li><strong>2019-05-15: Journal Article about <a href="largescaleproductcorpus/papers/Bizer2019_Article_UsingTheSemanticWebAsASourceOf.pdf">Using the Semantic Web as a Source of Training Data</a> has been published by the <a href="https://www.springer.com/computer/database+management+%26+information+retrieval/journal/13222">Datenbank-Spektrum Journal</a>.</strong></li>
<li><strong>2019-02-27: Paper about <a href="https://dl.acm.org/citation.cfm?id=3316609">The WDC training dataset and gold standard for large-scale product matching</a> accepted at <a href="https://sites.google.com/view/ecnlp">ECNLP Workshop</a> at <a href="https://www2019.thewebconf.org/">WWW2019</a> conference in San Francisco.</strong></li>
<li><strong>2019-01-17: We have released a new version of the <a href="structureddata/2018-12/stats/stats.html">RDFa, Microdata, Microformat, and Embedded JSON-LD</a> data corpus extracted from the November 2018 Common Crawl corpus.</strong></li>
<li><strong>2018-12-20: We have released the <a href="http://webdatacommons.org/largescaleproductcorpus/index.html">WDC Training Dataset and Gold Standard for Large-Scale Product Matching</a>.</strong></li>
<li><strong>2018-01-08: We have released a new version of the <a href="structureddata/2017-12/stats/stats.html">RDFa, Microdata, Microformat, and Embedded JSON-LD</a> data corpus extracted from the November 2017 Common Crawl corpus.</strong></li>
<li><strong>2017-06-26: We have released the <a href="https://github.com/olehmberg/winter/">Web Data Integration Framework (WInte.r)</a>, which provides parsers and methods for the integration of Web Tables.</strong></li>
<li><strong>2017-01-17: We have released a new version of the <a href="structureddata/2016-10/stats/stats.html">RDFa, Microdata, Microformat, and Embedded JSON-LD</a> data corpus extracted from the October 2016 Common Crawl corpus.</strong></li>
<li><strong>2016-09-01: We have released a <a href="productcorpus/index.html">Gold Standard for Product Matching and Product Feature Extraction</a>. The gold standards are accompanied by an 11.2 million product data corpus crawled in the first quarter of 2016.</strong></li>
<li><strong>2016-04-25: We have released a new version of the <a href="structureddata/2015-11/stats/stats.html#results-2015-1">RDFa, Microdata, Microformat, and Embedded JSON-LD</a> data corpus extracted from the November 2015 Common Crawl corpus. This corpus for the first time also includes JSON-LD data.</strong></li>
<li><strong>2016-04-13: We have released a <a href="isadb/index.html">web-scale "IsA" database</a> containing over 400 million hypernymy relations extracted from the text of HTML pages.</strong></li>
<li><strong>2015-12-15: Paper about <a href="http://www.wim.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Ritze-etal-ProfilingWebTables.pdf">Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases</a> has been accepted at the <a href="http://www2016.ca/">WWW'16</a> conference in Montréal, Canada.</strong></li>
<li><strong>2015-12-15: Anthelion, a focused crawler for structured data, has been released as a Yahoo open source project. Find the code as well as a more comprehensive description at the <a href="https://github.com/yahoo/anthelion">Yahoo GitHub repository</a> (<a href="http://yahoolabs.tumblr.com/post/135196452221/explore-anthelion-our-open-source-focused-crawler">Yahoo tumblr posting</a>).</strong></li>
<li><strong>2015-11-19: <a href="webtables/index.html#results-2015">WDC Web Table Corpus 2015</a> released consisting of 233 million Web tables extracted from the July 2015 Common Crawl.</strong></li>
<li><strong>2015-08-13: Journal Article about <a href="http://www.webscience-journal.net/webscience/article/view/11">The Graph Structure in the Web - Analyzed on Different Aggregation Levels</a> has been published by the <a href="http://www.webscience-journal.net/webscience/issue/view/1">Journal of Web Science</a>.</strong></li>
<li><strong>2015-04-02: <a href="structureddata/index.html#toc3">RDFa, Microdata, and Microformat</a> data sets extracted from the December 2014 Common Crawl corpus available for download.</strong></li>
<li><strong>2015-04-01: <a href="webtables/goldstandard.html">T2D Gold Standard</a> for comparing matching systems on the task of finding correspondences between Web tables and large-scale knowledge bases released.</strong></li>
<li><strong>2015-03-30: Paper about <a href="http://www.wim.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/MeuselPaulheim-HeuristicsForFixingCommonErrorsInDeployedSchemaOrgMicrodata-ESWC2015.pdf">Heuristics for Fixing Common Errors in Deployed schema.org Microdata</a> accepted at <a href="http://2015.eswc-conferences.org/">ESWC2015</a> conference in Portoroz, Slovenia.</strong></li>
<li><strong>2014-12-04: We have created several <a href="structureddata/2013-11/stats/schema_org_subsets.html">Class-Specific Subsets of the Schema.org Data contained in the Winter 2013 Microdata Corpus</a> (e.g. schema.org/Product and schema.org/Offer) in order to make it easier to work with the data.</strong></li>
<li><strong>2014-08-27: We have released an easy to customize version of the <a href="framework/index.html">WDC Extraction Framework</a> including a tutorial, which explains the usage and customization in detail. See also our <a href="http://blog.commoncrawl.org/2014/08/web-data-commons-extraction-framework-for-the-distributed-processing-of-cc-data/">guest post</a> at the Common Crawl Blog.</strong></li>
<li><strong>2014-08-13: <a href="http://webdatacommons.org/hyperlinkgraph/index.html#toc3">Hyperlink Graph Dataset</a> covering 1.7 billion web pages extracted from the April 2014 Common Crawl corpus available for download.</strong></li>
<li><strong>2014-07-06: Paper about the WebDataCommons Microdata, RDFa and Microformat Dataset Series accepted at <a href="http://iswc2014.semanticweb.org/">ISWC'14</a> conference in Riva del Garda - Trentino, Italy: <a href="https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_Research/Web-based_Systems/pub/Meusel-etal-TheWDCMicrodataRdfaMicroformatsDataSeries-ISWC2014-rbds.pdf" target="_blank"> The WebDataCommons Microdata, RDFa and Microformat Dataset Series</a></strong></li>
<li><strong>2014-04-14: Paper about WDC Pay-Level Domain Graph accepted at <a href="http://www.websci14.org/">WebSci'14</a> conference in Bloomington, USA: <a href="https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_Research/Web-based_Systems/pub/Lehmberg-etal-GraphStructureOfTheWebPLD.pdf" target="_blank">Graph Structure in the Web - Aggregated by Pay-Level Domain</a></strong></li>
<li><strong>2014-04-01: <a href="structureddata/index.html#results-2013-1">RDFa, Microdata, and Microformat</a> data sets extracted from the Winter 2013 Common Crawl corpus available for download.</strong></li>
<li><strong>2014-03-05: Initial release of the <a href="http://webdatacommons.org/webtables/index.html">WDC Web Tables</a> data set consisting of 147 million relational Web tables.</strong></li>
<li><strong>2014-02-12: <a href="http://wwwranking.webdatacommons.org">First open ranking of the World Wide Web</a> is now available. The ranking is based on the <a href="hyperlinkgraph/index.html">WDC Hyperlink Graph</a>. </strong></li>
<li><strong>2014-02-04: Paper about WDC Hyperlink Graph accepted at WWW2014 conference (Web Science Track) in Seoul: <a href="https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_Research/Web-based_Systems/pub/Meusel-etal-GraphStructureOfTheWeb.pdf">Graph Structure in the Web - Revisited</a></strong></li>
<li><strong>2014-01-20: Paper about the integration of product data from the WDC Microdata data set accepted at the DEOS2014 workshop at the WWW2014 conference in Seoul: <a href="https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_Research/Web-based_Systems/pub/petrovski_bryl_bizer_deos2014.pdf" target="_blank"> Integrating Product Data from Websites offering Microdata Markup</a></strong></li>
<li><strong>2013-11-12: Web Data Commons releases large <a href="hyperlinkgraph/index.html">Hyperlink Graph</a> covering 3.5 billion web pages and 128 billion hyperlinks between these pages.</strong></li>
<li><strong>2013-09-02: Paper about the WDC RDFa, Microdata, and Microformat data set accepted at the ISWC2013 conference in Sydney: <a href="https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_Research/Web-based_Systems/pub/Bizer-etal-DeploymentRDFaMicrodataMicroformats-ISWC-InUse-2013.pdf" target="_blank">Deployment of RDFa, Microdata, and Microformats on the Web -- A Quantitative Analysis</a>.</strong></li>
<li><strong>2013-07-12: New analysis available about the <a href="structureddata/2012-08/stats/microdata_products_detail.html">types of products that are offered by e-shops using Microdata markup</a>.</strong></li>
<li><strong>2013-07-05: Yahoo! Research releases Glimmer search engine which enables you to search Web Data Commons data. <a href="https://groups.google.com/forum/?fromgroups=#!topic/web-data-commons/coDFbhRSAQQ" target="_blank">Details</a>.</strong></li>
<li><strong>2012-12-10: <a href="structureddata/index.html#results-2012-1">RDFa, Microdata, and Microformat</a> data sets extracted from the August 2012 Common Crawl corpus available for download.</strong></li>
<li><strong>2012-06-29: We have created a <a href="structureddata/vocabulary-usage-analysis/index.html">new analysis on vocabulary usage</a> in our Microdata and RDFa data set.</strong></li>
<li><strong>2012-06-20: Presentation of the Web Data Commons project and our data extraction framework at the <a href="http://aws.amazon.com/aws-summit-2012/berlin/">AWS Summit 2012 Berlin</a> - <a href="http://www.slideshare.net/hannesmuehleisen/aws-summit-berlin-2012-talk-on-web-data-commons" target="_blank">Slides</a>.</strong></li>
<li><strong>2012-04-16: Paper on Web Data Commons presented at the LDOW 2012 Workshop (<a href="structureddata/index.html#references">References</a>)</strong></li>
<li><strong>2012-03-22: RDFa, Microdata, and Microformat data sets extracted from the February 2012 Common Crawl corpus available for download. </strong></li>
<li><strong>2012-03-13: RDFa, Microdata, and Microformat data sets extracted from the 2009/2010 Common Crawl corpus available for download.</strong></li>
</ul>
<h2 id="dataset">Available Data Sets</h2>
<div itemprop="webObservatoryProject" itemscope itemtype="http://schema.org/WebObservatoryProject">
<meta itemprop="url" content="http://webdatacommons.org/structureddata/index.html"/>
<h3><a href="structureddata/index.html"><span itemprop="name">RDFa, Microdata, and Microformat</span></a></h3>
<p>
More and more websites have started to embed structured data describing products, people, organizations, places, and events into their HTML pages using <span itemprop="method">markup standards such as RDFa, Microdata and Microformats</span>.
The Web Data Commons project extracts this data from several billion web pages. So far the project provides six different data set releases extracted from the Common Crawl 2016, 2015, 2014, 2013, 2012 and 2010. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.
</p>
</div>
<div itemprop="webObservatoryProject" itemscope itemtype="http://schema.org/WebObservatoryProject">
<meta itemprop="url" content="http://webdatacommons.org/webtables/index.html"/>
<h3><a href="webtables/index.html"><span itemprop="name">Web Tables</span></a></h3>
<p>
The Web contains vast amounts of <span itemprop="method">HTML tables</span>. Most of these tables are used for layout purposes, but a fraction of the tables is also quasi-relational, meaning that they contain structured data describing a set of entities, and are thus useful in application contexts such as data search, table augmentation, knowledge base construction, and for various NLP tasks. The WDC Web Tables data set consists of the 147 million relational Web tables that are contained in the overall set of 11 billion HTML tables found in the Common Crawl.
</p>
</div>
<div itemprop="webObservatoryProject" itemscope itemtype="http://schema.org/WebObservatoryProject">
<meta itemprop="url" content="http://webdatacommons.org/hyperlinkgraph/index.html"/>
<h3><a href="hyperlinkgraph/index.html"><span itemprop="name">Hyperlink Graph</span></a></h3>
<p>
We offer a large <span itemprop="method">hyperlink graph</span> that we extracted from the 2012 version of the Common Crawl. The WDC Hyperlink Graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. The graph can help researchers to improve search algorithms, develop spam detection methods and evaluate graph analysis algorithms. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public.
</p>
</div>
<div itemprop="webObservatoryProject" itemscope itemtype="http://schema.org/WebObservatoryProject">
<meta itemprop="url" content="http://webdatacommons.org/isadb/index.html"/>
<h3><a href="isadb/index.html"><span itemprop="name">WebIsA Database</span></a></h3>
<p>
We offer a large <span itemprop="method">IsA database</span> that we extracted from the 2015 version of the Common Crawl. The WDC IsA Database contains more than 400 million hypernymy relations we extracted from the text of HTML pages included in the crawl.
This collection of relations represents a rich source of knowledge and can be used to improve approaches in various application domains. We offer the tuple dataset for public download and an application programming interface to help other researchers programmatically query the database. In addition, a <a href="http://webisadb.webdatacommons.org/webisadb/">demo web application</a> of the database is available.
</p>
</div>
<div itemprop="webObservatoryProject" itemscope itemtype="http://schema.org/WebObservatoryProject">
<meta itemprop="url" content="http://webdatacommons.org/productcorpus/index.html"/>
<span itemprop="name"><b>Product Data Corpora</b></span>
<p>
We offer two product data corpora containing offers from multiple e-shops. The <a href="http://webdatacommons.org/productcorpus/index.html">first corpus</a> consists of 5.6 million product offers from the categories mobile phones,
headphones and televisions and was crawled from 32 popular shopping websites. The corpus is accompanied by a manually verified gold standard for the evaluation and comparison
of product feature extraction and product matching methods. The <a href="http://webdatacommons.org/largescaleproductcorpus/index.html">second corpus</a> consists of more than 26 million product offers originating from 79 thousand websites.
The offers are grouped into 16 million clusters of offers referring to the same product using product identifiers, such as GTINs or MPNs.
</p>
</div>
<h2 id="software">Available Software</h2>
<h3><a href="framework/index.html">Extraction Framework</a></h3>
<p>
The effective processing of large web corpora presents challenges in terms of resources, time and costs. In order to extract the data sets presented above, the Web Data Commons project has developed a framework which provides an easy-to-use basis for the distributed processing of large web crawls using <a href="http://aws.amazon.com/de/">Amazon EC2 cloud services</a>. The framework is published under the terms of the Apache license and can easily be customized to perform other data extraction tasks.
</p>
<h2 id="license">License</h2>
<p>The Web Data Commons extraction framework can be used under the terms of the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache Software License</a>.
</p>
<h2 id="Feedback">Feedback</h2>
<p>Please send questions and feedback to the <a href="http://groups.google.com/group/web-data-commons">Web Data Commons mailing list</a> or post them in our <a href="https://groups.google.com/forum/?fromgroups#!forum/web-data-commons">Web Data Commons Google Group</a>.
</p>
<h2 id="about">About the Web Data Commons Project</h2>
<p>
The Web Data Commons project was started by researchers from <span itemprop="contributer" itemscope itemtype="http://schema.org/Organization"><span itemprop="name">Freie Universität Berlin</span></span> and the <span itemprop="contributer" itemscope itemtype="http://schema.org/Organization"><span itemprop="name">Karlsruhe Institute of Technology (KIT)</span></span> in 2012. The goal of the project is to facilitate research and support companies in exploiting the wealth of information on the Web by extracting structured data from web crawls and providing this data for public download. Today the WDC Project is mainly maintained by the <span itemprop="contributer" itemscope itemtype="http://schema.org/Organization"><a href="http://dws.informatik.uni-mannheim.de/"><span itemprop="subOrganization" itemscope itemtype="http://schema.org/Organization"><span itemprop="name">Data and Web Science Research Group</span></span></a> at the <span itemprop="name">University of Mannheim</span></span>. The project is coordinated by <a href="http://dws.informatik.uni-mannheim.de/en/people/professors/prof-dr-christian-bizer/"><span itemprop="accountablePerson">Christian Bizer</span></a>, who has moved from Berlin to Mannheim.
</p>
<h2 id="credits">Credits</h2>
<p>
Web Data Commons is supported by the EU FP7 projects <a href="http://planet-data.eu">PlanetData</a> and <a href="http://lod2.eu">LOD2</a>, by an <a href="http://aws.amazon.com/education/">Amazon Web Services in Education Grant Award</a>, by the <a href="http://www.dfg.de/">German Research Foundation (DFG)</a>
and by the <a href="https://www.alwr-bw.de/kooperationen/vice/">ViCe</a> research project of the <a href="https://mwk.baden-wuerttemberg.de/de/startseite/">Ministry of Science, Research and Arts of Baden-Württemberg</a>.
</p>
<a href="http://planet-data.eu"><img src="images/pd.gif" alt="PlanetData Logo"></a>
<a href="http://lod2.eu"><img src="images/lod2.gif" alt="LOD2 Logo"></a>
<a href="http://aws.amazon.com/education/"><img src="images/aws.png" alt="AWS Logo"></a>
<a href="http://www.dfg.de/"><img src="images/dfg_logo.gif" alt="DFG Logo"></a>
<a href="https://mwk.baden-wuerttemberg.de/de/startseite/"><img src="images/mwk.png" height="130" alt="MWK_BW Logo"></a>
</div>
</div>
</div>
</body>
</html>