Awesome Open Data software

An awesome list of software tools related to open data: open data catalogs, open spatial data, data ingestion tools, data prep tools, metadata standards and so on.

This awesome list is part of the Common Data Index project and is derived from the registry of data portals at https://registry.commondata.io

Data catalogs

Open data portals

Open source

  • Aleph - Aleph is a tool for indexing large amounts of both documents (PDF, Word, HTML) and structured (CSV, XLS, SQL) data for easy browsing and search.
  • CKAN - CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers hundreds of data portals worldwide.
  • DKAN - DKAN is a community-driven, free and open source open data platform that gives organizations and individuals ultimate freedom to publish and consume structured information.
  • JKAN - A lightweight, backend-free open data portal, powered by Jekyll
  • Magda - A federated, open-source data catalog for all your big data and small data
  • uData - Customizable and skinnable social platform dedicated to (open)data by Etalab
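
Portals built on CKAN expose their catalog through the CKAN Action API, so a dataset search can be scripted. The sketch below only builds a `package_search` URL and reads titles from a response of the usual `success`/`result` shape. The portal address `https://data.example.org/` and the helper names are assumptions made for illustration, and no network request is sent.

```python
from urllib.parse import urlencode


def package_search_url(portal_url, query, rows=10):
    """Build a CKAN Action API search URL (portal_url is a hypothetical portal)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response):
    """Extract dataset titles from a decoded package_search JSON response."""
    if not response.get("success"):
        return []
    # Fall back to the machine name when a dataset has no human-readable title.
    return [d.get("title", d.get("name", "")) for d in response["result"]["results"]]


# Hypothetical portal address, used only to show the URL shape.
url = package_search_url("https://data.example.org/", "air quality", rows=5)
print(url)
```

The decoded JSON body returned by a real portal can be passed to `dataset_titles` unchanged.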

Commercial

  • EntryScape catalog - DCAT AP compliant data catalog
  • Junar - Junar data platform, SaaS product and US-based company
  • OpenDataSoft - The all-in-one platform that empowers your whole team to accelerate wider data usage and generate value in your ecosystems. SaaS data portal from a French company.
  • Socrata - SaaS data platform popular in the US and Canada. Socrata was acquired by Tyler Technologies in 2018 and is now the Data and Insights division of Tyler.
  • Tablion Data Portal - commercial data portal software from Aristotle Metadata, Australia
  • TriplyDb - TriplyDB integrates your organization's data assets into a standards-compliant knowledge graph.

Geodata catalogs

Open source

  • Esri Geoportal server - Geoportal Server is a standards-based, open source product that enables discovery and use of geospatial resources including data and services. Not updated anymore
  • Geoblacklight - A multi-institutional open-source collaboration building a better way to find and share geospatial data
  • Geonetwork - GeoNetwork is a catalog application to manage spatially referenced resources.
  • Geonode - GeoNode is a web-based application and platform for developing geospatial information systems (GIS) and for deploying spatial data infrastructures (SDI).
  • Geoportal.rlp - A complete SDI-Suite for the management of OWS (WMS / WFS, CSW), metadata (iso19139), users, organizations, and licences.
  • Geoserver - GeoServer is an open source server for sharing geospatial data.
  • LizMap - open source web map application from 3liz
  • MapBender - Mapbender is one of the leading open source solutions for creating intuitive and high-performance WebGIS applications.
  • MapProxy - MapProxy is an open source proxy for geospatial data. It caches, accelerates and transforms data from existing map services and serves any desktop or web GIS client.
  • ncWMS - ncWMS is a Web Map Service for displaying environmental data.
  • NextGIS Web - Web GIS framework by NextGIS
  • Open Geoportal - The Open Geoportal (OGP) is a collaboratively developed, open source, federated web application to rapidly discover, preview, and retrieve geospatial data from multiple organizations.
  • OpenDataCube - The Open Data Cube (ODC) is an Open Source Geospatial Data Management and Analysis Software project that helps you harness the power of Satellite data.
  • Oskari - open source geoportal software from Finland's Kadaster, incubating in Open Geo
  • PyCSW - Python based CSW server, actively used as standalone service
  • pygeoapi - pygeoapi is a Python server implementation of the OGC API suite of standards. The project emerged as part of the next generation OGC API efforts in 2018 and provides the capability for organizations to deploy a RESTful OGC API endpoint using OpenAPI, GeoJSON, and HTML. pygeoapi is open source and released under an MIT license.
  • Stac-server - A Node-based STAC API, AWS Serverless, OpenSearch

Commercial

  • ArcGIS Hub - An easy-to-configure cloud platform that organizes people, data, and tools to accomplish initiatives and goals. Esri SaaS product
  • ArcGIS Server - ArcGIS Server is the server software component in ArcGIS Enterprise that makes your geographic information available to other users in your organization, and optionally to any Internet user.
  • Carto - SaaS mapping service that can also be used to create geodata portals
  • ERDAS Apollo - Enables enterprise data management, discovery and delivery for geospatial data
  • Koordinates - Koordinates is a geospatial data management platform inspired by cracking GIS data out of vendor silos.
  • MetaGIS - commercial GIS server/portal from Sweden and popular in Sweden
  • OrbisMAP - Russian geoportal product

Research data repositories

Open source

  • DataCat - DataLad Catalog is a free and open source command line tool, with a Python API, that assists with the automatic generation of user-friendly, browser-based data catalogs from structured metadata.
  • Dataverse - Open source research data repository software
  • Djehuty - The 4TU.ResearchData repository system
  • DSpace - The Software of Choice for Academic, Non-profit & Commercial Organizations Building Open Digital Repositories. Used to create research data repositories too.
  • EPrints - open source research data repository popular in the United Kingdom
  • ERDDAP - ERDDAP is a data server that gives you a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps.
  • Galaxy - open source bioinformatics research management platform
  • InvenioRDM - The turn-key research data management repository
  • IPT - GBIF Integrated Publishing Toolkit (IPT). Data catalog software integrated into GBIF ecosystem.
  • Islandora - digital documents open source repository used by universities and libraries, could include datasets
  • LibreCat - A publication management system. Used to create research data repositories too.
  • MyCoRe - MyCoRe (portmanteau of My Content Repository) is an open source repository software framework for building disciplinary or institutional repositories, digital archives, digital libraries, and scientific journals.
  • MyTardis - MyTardis: Research data management for instrument data
  • NYU Data catalog - The NYU Data Catalog facilitates researchers’ access to large datasets available either publicly or through institutional or individual licensing. It also includes descriptions of internally-generated research datasets from NYU researchers.
  • Samvera - another digital documents open source repository software, actively used by libraries around the world
  • THREDDS Data Server - The THREDDS Data Server (TDS) is a web server that provides metadata and data access for scientific datasets, using OPeNDAP, OGC WMS and WCS, HTTP, and other remote data access protocols.
  • Vufind - VuFind® is a discovery system designed and developed for libraries by libraries. It is also flexible enough to build search interfaces for all kinds of content beyond the library environment.
  • WEKO - WEKO3 is repository software based on Invenio 3.

Commercial

  • Converis - research data management product by Clarivate
  • DataOne Hosted Repo - online catalog and SaaS hosted repositories
  • Elsevier Digital Commons - Elsevier product to manage research output, similar to Elsevier Pure but less complicated.
  • Elsevier Pure - Pure is a Research Information Management System (RIMS) or Current Research Information System (CRIS).
  • Esploro - research outputs management system from Exlibris Group
  • Figshare - cloud based open access repository software and service
  • Omega-PSIR - research management information system from Poland, used by Polish universities
  • Worktribe - Worktribe is a cloud-based platform for research management.

Microdata catalogs

Open source

  • NADA Data Catalog - An open-source software designed for researchers to browse, search, compare, apply for access and download research data.
  • Obiba Mica - Mica is a powerful software application used to create data web portals for large-scale epidemiological studies or multiple-study consortia. Mica2 is the successor of Mica.

Commercial

  • Colectica - Colectica is the fastest way to design, document, and publish your statistical data and survey research using open data standards.

Statistics and indicators databases

Open source

  • OpenSDG - Open SDG. An open source, free-to-reuse platform for managing and publishing data and statistics related to the UN Sustainable Development Goals (SDGs).
  • PxWeb - PxWeb is used for publishing statistics in a database on the web. Since 1 January 2016 it has been free of charge for government agencies and municipalities, international NSIs and international statistical organisations.
  • .Stat Suite - The .Stat Suite is a standard-based, componentised, open source platform for the efficient production and dissemination of high-quality statistical data. The product is based on the General Statistical Business Process Model (GSBPM) and the Statistical Data and Metadata eXchange (SDMX) standards.

Commercial

  • PublishMyData - RDF-based open statistical data product. Acquired by TPX Impact

Metadata catalogs

Open source

  • Fusion Metadata Registry - open source metadata catalog used by European Union authorities and some countries' statistical agencies. Source code is available on request

Standards

Common data standards

  • Apache Parquet - Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++ and Python. It is still uncommon on open data portals but common in public ML data catalogs.

  • Arrow Columnar Format - The Arrow columnar format includes a language-agnostic in-memory data structure specification, metadata serialization, and a protocol for serialization and generic data transport.

  • CDF - CDF is a conceptual data abstraction for storing, manipulating, and accessing multidimensional data sets. The basic component of CDF is a software programming interface that is a device-independent view of the CDF data model. Common for scientific data.

  • CSV - Common Format and MIME Type for Comma-Separated Values (CSV) Files

  • JSON - JSON (JavaScript Object Notation) is a lightweight data-interchange format.

  • NDJSON/JSON lines - NDJSON is a convenient format for storing or streaming structured data that may be processed one record at a time. It works well with unix-style text processing tools and shell pipelines.

  • NETCDF - NetCDF (network Common Data Form) is a set of interfaces for array-oriented data access and a freely distributed collection of data access libraries for C, Fortran, C++, Java, and other languages. The netCDF libraries support a machine-independent format for representing scientific data. Common for scientific data.

  • RDF - The Resource Description Framework (RDF) is a general framework for representing interconnected data on the web. RDF statements are used for describing and exchanging metadata, which enables standardized exchange of data based on relationships.

  • XLS - The Microsoft Excel Binary File format, with the .xls extension and referred to as XLS or MS-XLS, was the default format used for spreadsheets in Excel through Microsoft Office 2003. It is not an open data format since it is proprietary, but it is de facto very common.

  • XLSX - The Office Open XML-based spreadsheet format using .xlsx as a file extension has been the default format produced for new documents by versions of Microsoft Excel since Excel 2007. It is not an open data format since it is proprietary, but it is de facto very common.

  • XML - Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.
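
The NDJSON entry above notes that the format can be processed one record at a time. A minimal sketch of that pattern with the Python standard library follows. The sample records and the `iter_ndjson` helper name are made up for illustration.

```python
import io
import json


def iter_ndjson(stream):
    """Yield one decoded JSON value per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical two-record sample; a real file object works the same way.
sample = io.StringIO('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
records = list(iter_ndjson(sample))
print(records)
```

Because the generator holds only one line in memory at a time, the same function works on files too large to load whole.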

Spatial data standards

  • CSW - Catalogue services support the ability to publish and search collections of descriptive information (metadata) for data, services, and related information objects. Open Geospatial Consortium standard.
  • ESRI Rest API - ArcGIS REST APIs used to query data from ArcGIS Enterprise products
  • FITS - FITS is a file format designed to store, transmit, and manipulate scientific images and associated data.
  • GeoJSON - GeoJSON is a format for encoding a variety of geographic data structures. In 2015, the Internet Engineering Task Force (IETF), in conjunction with the original specification authors, formed a GeoJSON WG to standardize GeoJSON. RFC 7946 was published in August 2016 and is the new standard specification of the GeoJSON format, replacing the 2008 GeoJSON specification.
  • GeoPackage - Specifications in the family of GeoPackage formats (see GeoPackage_family) specify GeoPackages for exchange and GeoPackage SQLite Extensions that permit direct use, without intermediate format translations, of vector geospatial features and/or tile matrix sets of earth images and raster maps at various scales.
  • GeoTIFF - This OGC Standard defines the Geographic Tagged Image File Format (GeoTIFF) by specifying requirements and encoding rules for using the Tagged Image File Format (TIFF) for the exchange of georeferenced or geocoded imagery. Open Geospatial Consortium standard.
  • GML - The OpenGIS® Geography Markup Language Encoding Standard (GML) defines an XML grammar for expressing geographical features.
  • KML - KML is an XML language focused on geographic visualization, including annotation of maps and images.
  • OGC API - Records - OGC API - Records is a multi-part draft specification that offers the capability to create, modify, and query metadata on the Web.
  • ShapeFile - A shapefile is an Esri vector data storage format for storing the location, shape, and attributes of geographic features. It is stored as a set of related files and contains one feature class.
  • TMS - The OGC Tile Matrix Set standard defines the rules and requirements for a tile matrix set as a way to index space based on a set of regular grids defining a domain (tile matrix) for a limited list of scales in a Coordinate Reference System (CRS) as defined in [OGC 08-015r2] Abstract Specification Topic 2: Spatial Referencing by Coordinates.
  • WCS - A Web Coverage Service (WCS) offers multi-dimensional coverage data for access over the Internet. WCS Core specifies a core set of requirements that a WCS implementation must fulfill. Open Geospatial Consortium standard.
  • WFS - The Web Feature Service (WFS) represents a change in the way geographic information is created, modified and exchanged on the Internet. Rather than sharing geographic information at the file level using File Transfer Protocol (FTP), for example, the WFS offers direct fine-grained access to geographic information at the feature and feature property level. Open Geospatial Consortium standard.
  • WMS - The OpenGIS Web Map Service Interface Standard (WMS) provides a simple HTTP interface for requesting geo-registered map images from one or more distributed geospatial databases. Open Geospatial Consortium standard.
  • WMTS - OpenGIS Web Map Tile Service Implementation Standard
  • WPS - The OpenGIS® Web Processing Service (WPS) Interface Standard provides rules for standardizing inputs and outputs (requests and responses) for geospatial processing services, such as polygon overlay.
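
To make the GeoJSON entry above concrete, here is a small sketch that builds a `FeatureCollection` with one point feature. In RFC 7946 a position is ordered longitude first, then latitude. The helper name and the sample place are illustrative assumptions.

```python
import json


def point_feature(lon, lat, properties):
    """Return a GeoJSON Feature with a Point geometry (longitude first)."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


collection = {
    "type": "FeatureCollection",
    # Sample coordinates chosen for illustration only.
    "features": [point_feature(24.9384, 60.1699, {"name": "Helsinki"})],
}
encoded = json.dumps(collection)
print(encoded)
```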

API specifications

  • OpenAPI - The OpenAPI Specification is a specification language for HTTP APIs that provides a standardized means to define your API to others.

Statistics specifications

  • DDI - The Data Documentation Initiative (DDI) is an international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences.
  • SDMX - A global initiative to improve Statistical Data and Metadata eXchange

Specific data standards

  • Fiscal data package - Fiscal Data Package is a lightweight and user-oriented format for publishing and consuming fiscal data. Fiscal data packages are made of simple and universal components. They can be produced from ordinary spreadsheet software and used in any environment.
  • GTFS (General Transit Feed Specification) - defines a common format for public transportation schedules and associated geographic information. GTFS "feeds" let public transit agencies publish their transit data and developers write applications that consume that data in an interoperable way.
  • IATI Standard - The IATI Standard is a set of rules and guidance on how to publish useful development and humanitarian data.
  • Open Contracting Data Standard (OCDS) - The Open Contracting Data Standard (OCDS) enables disclosure of data and documents at all stages of the contracting process by defining a common data model.
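
A GTFS feed is a set of CSV text files, so it can be read with ordinary CSV tooling. The sketch below parses a `stops.txt` table using the standard `stop_id`, `stop_name`, `stop_lat` and `stop_lon` columns. The two stops and the `read_stops` helper are invented for illustration.

```python
import csv
import io

# Hypothetical stops.txt content; real feeds ship this file inside a zip archive.
STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,52.3791,4.9003
S2,Harbour,52.3770,4.9120
"""


def read_stops(text):
    """Map stop_id to a (name, latitude, longitude) tuple."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["stop_id"]: (row["stop_name"], float(row["stop_lat"]), float(row["stop_lon"]))
        for row in reader
    }


stops = read_stops(STOPS_TXT)
print(stops["S1"])
```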

Data containers

  • BagIt - BagIt is a set of hierarchical file layout conventions designed to support storage and transfer of arbitrary digital content. A "bag" consists of a directory containing the payload files and other accompanying metadata files known as "tag" files.
  • BioCompute Objects - BCOs are represented in JSON (JavaScript Object Notation) formatted text, adhering to JSON schema draft-07. The JSON format was chosen because it is both human and machine readable/writable. For a detailed description of JSON see www.json.org.
  • COMBINE - The “COmputational Modeling in BIology NEtwork” (COMBINE) is an initiative to coordinate the development of the various community standards and formats for computational models.
  • DataCrate - Data Crate is based on the BagIt packaging spec, with additional human and machine readable metadata in JSON-LD.
  • Frictionless standards - A Data Package is a simple container format used to describe and package a collection of data (a dataset).
  • ReproZIP - ReproZip can automatically pack your research along with all necessary data files, libraries, environment variables and options into a self-contained bundle.
  • RO-CRATE - RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata. It is based on schema.org annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wider variety of situations, from an individual researcher working with a folder of data, to large data-intensive computational research environments.
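
The BagIt layout described above (a directory holding payload files plus "tag" files) can be sketched with the standard library alone. This is a minimal illustration, assuming BagIt version 1.0 and a SHA-256 payload manifest. It is not a replacement for a dedicated BagIt tool, and the bag name, payload file and `write_minimal_bag` helper are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def write_minimal_bag(bag_dir, payload):
    """Write payload files under data/ plus the bag declaration and manifest."""
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest entry: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


with tempfile.TemporaryDirectory() as tmp:
    bag = write_minimal_bag(Path(tmp) / "example-bag", {"observations.csv": b"id,value\n1,42\n"})
    bag_files = sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file())
    manifest_text = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")

print(bag_files)
```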

Metadata standards

  • Asset Description Metadata Schema, ADMS - a vocabulary for describing semantic assets (metadata or reference data), aimed at European public administrations and services that want to explore, (re-)use or share them
  • CKAN API - de facto metadata standard for most open data portals
  • CSVW - CSV on the Web (CSVW) is a standard for adding metadata that describes the contents and structure of comma-separated values (CSV) data files
  • DataCite Metadata Schema - The DataCite Metadata Schema is a list of core metadata properties chosen for an accurate and consistent identification of a resource for citation and retrieval purposes, along with recommended use instructions.
  • Dataset Publishing Language - Google metadata standard to prepare datasets for the Google Public Data Explorer.
  • DC Packaging Specification - provides protocols for packages to capture not only primary data, but also associated metadata and relationships to other objects (papers, projects, people, etc.) no matter where they are located.
  • DCAT - DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. This document defines the schema and provides examples for its use.
  • DCAT-US - DCAT-US Schema v1.1 (Project Open Data Metadata Schema). The metadata schema specified in this memorandum is based on DCAT, a hierarchical vocabulary specific to datasets.
  • DCAT-AP 1.1 - The DCAT Application profile for data portals in Europe (DCAT-AP) is a specification based on W3C's Data Catalogue vocabulary (DCAT) for describing public sector datasets in Europe. Version 1.1
  • DCAT-AP 2.1.1 - The DCAT Application Profile for data portals in Europe (DCAT-AP) is a specification based on W3C's Data Catalogue vocabulary (DCAT) for describing public sector datasets in Europe. Version 2.1.1
  • DCAT-AP IT - This is version 2.0 of the ontology of the Italian application profile for the metadata that describe catalogues and data of Italian Public Administrations (DCAT-AP_IT).
  • DCAT-AP.de - DCAT-AP.de is the common German metadata model for the exchange of open administrative data. On this platform you will find the current version of the specification documents, sample files and DCAT-AP.de's own vocabularies.
  • Dublin Core - the most common digital objects description standard
  • EU Vocabularies - European Reference data catalogue
  • Executable Research Compendium - An Executable Research Compendium (ERC) is a packaging convention for computational research.
  • Google Search. Dataset (Dataset, DataCatalog, DataDownload) structured data - Google search description on implementation of Schema.org Dataset
  • INSPIRE - According to Article 5(1) of INSPIRE Directive 2007/2/EC, EU Member States shall ensure that metadata are created for the spatial data sets and services corresponding to the themes listed in Annexes I, II and III, and that those metadata are kept up to date.
  • ISO 19115:2003 - ISO 19115:2003 defines the schema required for describing geographic information and services. It provides information about the identification, the extent, the quality, the spatial and temporal schema, spatial reference, and distribution of digital geographic data.
  • Metatab and Metapack - Metatab stores metadata in a spreadsheet, alongside data, ensuring that the metadata is easy to create, easy to read, and cannot be separated from the data. Metapack builds data packages with Metatab metadata.
  • PEP, Portable Encapsulated Projects - PEP, or Portable Encapsulated Projects, is a community effort to make sample metadata reusable. PEPs decouple metadata from analysis
  • Schema.org Dataset - A body of structured information describing some topic(s) of interest.
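
As an illustration of the Schema.org `Dataset`, `DataDownload` types listed above, the sketch below assembles a JSON-LD description that could be embedded in a dataset landing page. The property selection, the helper name and all URLs are assumptions chosen for the example, not a complete or required profile.

```python
import json


def dataset_jsonld(name, description, license_url, download_url, media_type):
    """Build a Schema.org Dataset description with a single distribution."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": media_type,
                "contentUrl": download_url,
            }
        ],
    }


# All values below are placeholders for illustration.
markup = json.dumps(
    dataset_jsonld(
        "City air quality measurements",
        "Hourly air quality readings from municipal sensors.",
        "https://creativecommons.org/licenses/by/4.0/",
        "https://data.example.org/air-quality.csv",
        "text/csv",
    ),
    indent=2,
)
print(markup)
```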

Additional data standards resources

Tools

Data refining

  • OpenRefine - OpenRefine is a free, open source power tool for working with messy data and improving it

Data packaging

  • bdbag - The bdbag utilities are a collection of software programs for working with BagIt packages that conform to the BDBag and BagIt/RO profiles.
  • datalad - DataLad makes data management and data distribution more accessible. To do that, it stands on the shoulders of Git and Git-annex to deliver a decentralized system for data exchange.
  • Frictionless Framework - Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data

Quality management

Data publishing

  • Datasette - An open source multi-tool for exploring and publishing data

Statistics tools

  • RSDMX - Tools for reading SDMX data and metadata in R