fairds

FAIR Dataspaces is a BMBF-funded project that aims to bridge the gap between research and industry-related data infrastructures. This is done primarily by combining data infrastructures from the German National Research Data Infrastructure (NFDI) with industry-driven standards established by Gaia-X.

FAIR Dataspaces is made up of legal and technical experts from academia and industry, and aims to identify ways in which existing data infrastructures can be linked to create added value for the scientific and industrial communities.

Getting started

FAIR-DS uses a few core building blocks to enable data exchange.

  1. A federated catalog to register all available connectors and assets; this is done via the W3C-recommended DCAT catalog vocabulary.
  2. Negotiation of data exchange via the Dataspace Protocol, which allows for sovereign data exchange between independent dataspace participants.

Goals of the project

  • Connect communities from NFDI, EOSC, and Gaia-X.
  • Explore the ethical and legal frameworks needed for data exchange between research and industry.
  • Demonstrate the use and benefits of Gaia-X technologies across different sectors of academia and industry.

Demonstrators

Demonstrators have been developed to show the possible interactions between different data infrastructures. These include:

  • the ACCURIDS demonstrator
  • the ExpandAI demonstrator
  • the Geo Engine demonstrator
  • RuQaD (Reuse Quality-assured Data for Batteries)

FAQ

What is a dataspace?

A data space is a distributed data system based on common guidelines and rules. The users of such a data space can access the data in a secure, transparent, trustworthy, simple and standardized manner. Data owners have control over who has access to their data and for what purpose and under what conditions it can be used. These data spaces enable digital sovereignty, as the data remains in the possession of the company or research institute until it is retrieved.

What technologies are used?

FAIR Dataspaces relies primarily on cloud-focused technologies. For more information, see the Technologies section of this book.

License

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.


Technologies in FAIR Dataspaces

FAIR DS relies heavily on modern cloud-based data infrastructures. Many deployments are powered by the de.NBI cloud, one of Germany's largest community clouds.

Most deployments run as containers in a Kubernetes environment.

Example messages for the EDC Connector

Data exchange with the EDC connector follows the specification of the Dataspace Protocol published by the International Data Spaces Association (IDSA).

Reading data from a data source is a stepwise process that initially requires the provider of the data to make it discoverable to consumers of the data space.

Once consumers have discovered the published asset, they can negotiate to fetch the data.

  1. Create asset [PROVIDER]
  2. Create policy [PROVIDER]
  3. Create contract [PROVIDER]
  4. Fetch catalog [CONSUMER]
  5. Negotiate contract [CONSUMER]
  6. Get contract agreement ID [CONSUMER]
  7. Start the data transfer [CONSUMER]
  8. Get the data [CONSUMER]
    1. Get the data address [CONSUMER]
    2. Get the actual data content [CONSUMER]

The workflow above is called Consumer-PULL in the language of the Eclipse Dataspace Components (EDC). It is the consumer who initiates the data transfer and finally fetches (pulls) the data content as the last step.

The examples below were extracted from the documentation about the EDC samples applications.

The prerequisites documentation describes how to build and run the EDC connector as consumer or provider. The startup configuration determines whether a connector acts as consumer or provider; both share the same source code.

Once a consumer and a provider are running, you can establish a data exchange via HTTP with a REST endpoint as data source using the following steps.

In the examples below, consumer and provider are running on the same machine. The consumer listens on ports 2919x and the provider listens on ports 1919x.

1. Create asset [PROVIDER]

curl -d @path/to/create-asset.json -H 'content-type: application/json' http://localhost:19193/management/v3/assets -s | jq

create-asset.json

{
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/"
  },
  "@id": "myAsset",
  "properties": {
    "name": "REST endpoint",
    "contenttype": "application/json"
  },
  "dataAddress": {
    "type": "HttpData",
    "name": "myAssetAddress",
    "baseUrl": "https://path.of.rest/endpoint",
    "proxyPath": "true",
    "proxyMethod": "true",
    "proxyBody": "true",
    "proxyQueryParams": "true",
    "authKey": "ApiKey",
    "authCode": "{{APIKEY}}",
    "method": "POST"
  }
}

baseUrl is the URL of the REST endpoint. It is the data source from which the data is ultimately fetched.

proxyPath, proxyMethod, proxyBody, proxyQueryParams are flags that allow a provider to behave as a proxy that forwards requests to the actual data source.

In the example above, the provider will act as a proxy for the consumer. The provider will fetch the data from the data source. The REST call from the provider to the data source will use the value of authKey as the header name, together with the value of authCode as the token.

The secret token is known to the provider but not to the consumer.

Example Response

{
  "@type": "IdResponse",
  "@id": "myAsset",
  "createdAt": 1727429239982,
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

2. Create policy [PROVIDER]

curl -d @path/to/create-policy.json -H 'content-type: application/json' http://localhost:19193/management/v3/policydefinitions -s | jq

create-policy.json

{
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  },
  "@id": "myPolicy",
  "policy": {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "permission": [],
    "prohibition": [],
    "obligation": []
  }
}

The policy above does not specify special permissions, prohibitions, or obligations. By default, all access is allowed.

Example response

{
  "@type": "IdResponse",
  "@id": "myPolicy",
  "createdAt": 1727429399794,
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

3. Create contract [PROVIDER]

curl -d @path/to/create-contract-definition.json -H 'content-type: application/json' http://localhost:19193/management/v3/contractdefinitions -s | jq

create-contract-definition.json

{
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/"
  },
  "@id": "1",
  "accessPolicyId": "myPolicy",
  "contractPolicyId": "myPolicy",
  "assetsSelector": []
}

The contract above does not specify any assets to which the policy is connected. By default, it is all assets.

Example response

{
  "@type": "IdResponse",
  "@id": "myPolicy",
  "createdAt": 1727429399794,
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

4. Fetch catalog [CONSUMER]

curl -X POST "http://localhost:29193/management/v3/catalog/request" -H 'Content-Type: application/json' -d @path/to/fetch-catalog.json -s | jq

Example response

{
  "@id": "773be931-40ff-410c-b6cb-346e3f6eaa6f",
  "@type": "dcat:Catalog",
  "dcat:dataset": {
    "@id": "myAsset",
    "@type": "dcat:Dataset",
    "odrl:hasPolicy": {
      "@id": "MQ==:YXNzZXRJZA==:ODJjMTc3NWItYmQ4YS00M2UxLThkNzItZmFmNGUyMTA0NGUw",
      "@type": "odrl:Offer",
      "odrl:permission": [],
      "odrl:prohibition": [],
      "odrl:obligation": []
    },
    "dcat:distribution": [
      {
        "@type": "dcat:Distribution",
        "dct:format": {
          "@id": "HttpData-PULL"
        },
        "dcat:accessService": {
          "@id": "17b36140-2446-4c23-a1a6-0d22f49f3042",
          "@type": "dcat:DataService",
          "dcat:endpointDescription": "dspace:connector",
          "dcat:endpointUrl": "http://localhost:19194/protocol"
        }
      },
      {
        "@type": "dcat:Distribution",
        "dct:format": {
          "@id": "HttpData-PUSH"
        },
        "dcat:accessService": {
          "@id": "17b36140-2446-4c23-a1a6-0d22f49f3042",
          "@type": "dcat:DataService",
          "dcat:endpointDescription": "dspace:connector",
          "dcat:endpointUrl": "http://localhost:19194/protocol"
        }
      }
    ],
    "name": "product description",
    "id": "myAsset",
    "contenttype": "application/json"
  },
  "dcat:distribution": [],
  "dcat:service": {
    "@id": "17b36140-2446-4c23-a1a6-0d22f49f3042",
    "@type": "dcat:DataService",
    "dcat:endpointDescription": "dspace:connector",
    "dcat:endpointUrl": "http://localhost:19194/protocol"
  },
  "dspace:participantId": "provider",
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "odrl": "http://www.w3.org/ns/odrl/2/",
    "dspace": "https://w3id.org/dspace/v0.8/"
  }
}

The consumer finds available assets in the catalog response. In the example above, the relevant identifier for the next steps of the data exchange is the offer id at the JSON path: ./"dcat:dataset"."odrl:hasPolicy"."@id" (here: MQ==:YXNz...NGUw).

5. Negotiate contract [CONSUMER]

curl -d @path/to/negotiate-contract.json -X POST -H 'content-type: application/json' http://localhost:29193/management/v3/contractnegotiations -s | jq

negotiate-contract.json

{
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/"
  },
  "@type": "ContractRequest",
  "counterPartyAddress": "http://localhost:19194/protocol",
  "protocol": "dataspace-protocol-http",
  "policy": {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@id": "MQ==:YXNzZXRJZA==:ODJjMTc3NWItYmQ4YS00M2UxLThkNzItZmFmNGUyMTA0NGUw",
    "@type": "Offer",
    "assigner": "provider",
    "target": "myAsset"
  }
}

The request for negotiating a contract about the upcoming data exchange contains the offer id from the previous catalog response.

Example response

{
  "@type": "IdResponse",
  "@id": "96e58145-b39f-422d-b3f4-e9581c841678",
  "createdAt": 1727429786939,
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

6. Get contract agreement ID [CONSUMER]

curl -X GET "http://localhost:29193/management/v3/contractnegotiations/96e58145-b39f-422d-b3f4-e9581c841678" --header 'Content-Type: application/json' -s | jq

The negotiation of a contract happens asynchronously. At the end of the negotiation process, the consumer can get the agreement ID by calling the contractnegotiations API with the negotiation ID from the previous response (here: 96e58145-b39f-422d-b3f4-e9581c841678).

Example response

{
  "@type": "ContractNegotiation",
  "@id": "96e58145-b39f-422d-b3f4-e9581c841678",
  "type": "CONSUMER",
  "protocol": "dataspace-protocol-http",
  "state": "FINALIZED",
  "counterPartyId": "provider",
  "counterPartyAddress": "http://localhost:19194/protocol",
  "callbackAddresses": [],
  "createdAt": 1727429786939,
  "contractAgreementId": "441353d9-9e86-4011-9828-63b2552d1bdd",
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

7. Start data transfer [CONSUMER]

curl -X POST "http://localhost:29193/management/v3/transferprocesses" -H "Content-Type: application/json" -d @path/to/start-transfer.json -s | jq

start-transfer.json

{
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/"
  },
  "@type": "TransferRequestDto",
  "connectorId": "provider",
  "counterPartyAddress": "http://localhost:19194/protocol",
  "contractId": "441353d9-9e86-4011-9828-63b2552d1bdd",
  "assetId": "myAsset",
  "protocol": "dataspace-protocol-http",
  "transferType": "HttpData-PULL"
}

The consumer initiates the data transfer process with a POST message containing the agreement ID from the previous response (here: 441353d9-9e86-4011-9828-63b2552d1bdd), the asset identifier of the wanted data asset, and the transfer type.

Example response

{
  "@type": "IdResponse",
  "@id": "84eb10cc-1192-408c-a378-9ef407c156a0",
  "createdAt": 1727430075515,
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

8. Get the data [CONSUMER]

8.1. Get the data address [CONSUMER]

curl http://localhost:29193/management/v3/edrs/84eb10cc-1192-408c-a378-9ef407c156a0/dataaddress | jq

The result of starting a transfer process is a DataAddress that the consumer can get by calling the endpoint data reference API (edrs) with the identifier of the started transfer process (here: 84eb10cc-1192-408c-a378-9ef407c156a0).

Example response

{
  "@type": "DataAddress",
  "type": "https://w3id.org/idsa/v4.1/HTTP",
  "endpoint": "http://localhost:19291/public",
  "authType": "bearer",
  "endpointType": "https://w3id.org/idsa/v4.1/HTTP",
  "authorization": "eyJraWQiOiJwdW...7Slukn1OMJw",
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

8.2. Get the data content [CONSUMER]

curl --location --request GET 'http://localhost:19291/public/' --header 'Authorization: eyJraWQiOiJwdWJsaW...7Slukn1OMJw'

The consumer can fetch the data about the wanted data asset by calling the endpoint that is included in the previous data address response (here: http://localhost:19291/public). The previous response also includes the required access token (here: eyJraWQiOiJwdWJsaW...7Slukn1OMJw shortened).

The response of this request is the actual data content from the data source.

The consumer calls the public API endpoint that was shared by the provider. The request is authorized with the token that was shared as well. The provider now fetches the data from the data source based on the configuration of the data asset from the first step, and forwards the response to the consumer.

Example script

The Python script connector.py executes the data exchange process as described above step by step.

It makes use of the JSON files in the folder ./assets/connector/.
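
The script itself is not reproduced here. A condensed sketch of the provider setup and the catalog request (steps 1 to 4), using the management ports and JSON listings from the examples above, could look like the following; the actual connector.py may structure this differently, and the file names under assets/connector/ are assumed:

# Condensed sketch of steps 1-4 (the actual connector.py may differ in structure).
import json
import requests

PROVIDER_MGMT = "http://localhost:19193/management/v3"
CONSUMER_MGMT = "http://localhost:29193/management/v3"

def post_json(url: str, path: str) -> dict:
    """POST the JSON payload stored at `path` and return the parsed response."""
    with open(path) as f:
        payload = json.load(f)
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()

# Provider side: register asset, policy, and contract definition (steps 1-3).
post_json(f"{PROVIDER_MGMT}/assets", "assets/connector/create-asset.json")
post_json(f"{PROVIDER_MGMT}/policydefinitions", "assets/connector/create-policy.json")
post_json(f"{PROVIDER_MGMT}/contractdefinitions", "assets/connector/create-contract-definition.json")

# Consumer side: fetch the catalog (step 4). The remaining steps (negotiation,
# transfer, EDR lookup, data download) follow the same request/response pattern.
catalog = post_json(f"{CONSUMER_MGMT}/catalog/request", "assets/connector/fetch-catalog.json")
offer_id = catalog["dcat:dataset"]["odrl:hasPolicy"]["@id"]  # assumes a single dataset as above
print("discovered offer:", offer_id)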

ACCURIDS Demonstrator

The demonstrator makes use of the ACCURIDS FAIR data registry.

We demonstrate data exchange between industry and another organization, for example the exchange of information related to clinical studies between a hospital and a pharmaceutical company.

The diagram below represents the use case of the demonstrator with two ACCURIDS instances, both acting as consumers and providers of information.

ACCURIDS Demonstrator

The ACCURIDS platform

ACCURIDS logo
www.accurids.com

ACCURIDS is a platform for centralized data governance with a FAIR data registry.

It enables users to uniquely identify their data objects across systems and processes at enterprise and industry scale.

ACCURIDS offers six core features:

Ontology

  • Reuse of standard industry ontologies through ACCURIDS Hub
  • Creation of specific ontologies for business needs

Reference and Master Data

  • Managing reference and master data aligned to ontology definitions

Data Registry

  • Governance of data objects managed in many different systems through a centralized registration process
  • Management of data object identities

Graph

  • Connect data from different sources in knowledge graphs to enable question answering and analytics

Quality

  • Define and monitor data quality rules for different types of data objects

AI

  • Ask business questions in natural language and get answers based on facts from the knowledge graph

ExpandAI demonstrator

ExpandAI logo
www.expandai.de

Demonstrator overview

The goal of this demonstrator is a secure and legally compliant method to collect and manage data from various wearable sensor sources (e.g. smartphone and smartwatch), which can then be used to train AI algorithms. The prototype focuses on integrating AI-based DiGAs (Digitale Gesundheitsanwendungen, i.e. digital health applications) into the healthcare ecosystem by adhering to the FAIR (Findability, Accessibility, Interoperability, and Reusability) and Gaia-X principles. These guidelines ensure that data management is transparent, accessible, and fair. The ability to gather and analyze data about health conditions and influencing factors is crucial for developing new therapies and holistic care strategies.

ExpandAI Demonstrator

At the core of this project is a data integration center, which gathers and standardizes Parkinson patient data from a smartphone (iPhone 12 Pro Max) and smartwatch (Apple Watch Series 8). AI analytics within the center enable near real-time data insights, supporting Parkinson healthcare professionals and researchers in decision-making and advancing medical research. Our prototype proposes a framework for secure data exchange between scientific institutions and industry partners. This collaboration can empower healthcare providers to monitor Parkinson patient health using a simple traffic light system to indicate their condition: green for well, yellow for unwell but not critical, and red for emergencies. This system will be accessible via a web platform for doctors, providing valuable insights while minimizing data sharing. Moreover, limited access to these databases for companies developing DiGAs could enhance AI model training and product development, ultimately advancing personalized healthcare and improving patient outcomes.

Dashboard

The dashboard provides a clear, organized interface for displaying both results and finger-tap data, offering insights into patient motor performance. The use of visual elements like charts and traffic lights effectively communicates the data trends and potential areas of concern. The chart highlights key metrics, thumb and digit movements over time, allowing users to track progress or deterioration. The traffic light system adds a simple yet powerful tool for quick assessments, signaling performance status with an intuitive red-yellow-green scheme.

ExpandAI Demonstrator Dashboard

ExpandAI Demonstrator Dashboard

Technologies used

The FAIR data space enables information exchange between research and industry organizations.

  • Java
  • Spring Boot
  • React
  • Gaia-X standards
  • Docker Containers
  • GitHub
  • REST and OpenAPI

Geo Engine Demonstrator

The Geo Engine FAIR-DS demonstrator realizes multiple use cases that show the capabilities of the NFDI data spaces and address important societal issues like the loss of biodiversity and climate change. Within the FAIR-DS project, the Geo Engine software is extended with FAIR data services and capabilities for data spaces. New dashboards and visualizations are developed to demonstrate added value from the FAIR-DS integration.

Overview of the Geo Engine Platform

Geo Engine Architecture

Geo Engine is a cloud-ready geo-spatial data processing platform. It consists of the backend and several frontends. The backend is subdivided into three subcomponents: services, operators, and data types. Data types specify primitives like feature collections for vector or gridded raster data. The Operators block contains the processing engine and operators. The Services block contains data management and APIs. It also contains external data connectors that allow accessing data from external sources; this is where the connection to the data spaces is established.

Frontends for Geo Engine include geoengine-ui, for building web applications on top of Geo Engine, and geoengine-python, a Python library that can be used in Jupyter Notebooks. Third-party applications like QGIS can access Geo Engine via its OGC interfaces.

All components of Geo Engine are fully containerized and Docker-ready. A Gaia-X-compatible self-description of the service is available. Geo Engine builds upon several technologies, including GDAL, Apache Arrow, Angular, and OpenLayers.

Aruna Data Connector

The ArunaDataProvider implements the Geo Engine data provider interface. It uses the Aruna RPC API to browse, find, and access data from the NFDI4Biodiversity data space using the Aruna data storage system. By translating between the Aruna API and the Geo Engine API, the connector allows data from the NFDI4Biodiversity data space to be used in the Geo Engine platform. This makes it easy to integrate the data into new analyses and dashboards.

Copernicus Data Space Ecosystem

To show cross-data-space capabilities and to generate insightful results, the Geo Engine demonstrator is connected to the Copernicus Data Space Ecosystem. It uses the SpatioTemporal Asset Catalog (STAC) API and an S3 endpoint to generate the metadata necessary to access Sentinel data. Users can then access the Sentinel data as proper data sets and time series, rather than just as individual satellite images. This enables temporal analysis and the generation of time series data products.
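
The exact configuration of the demonstrator's connector is not shown here. As an illustration, Sentinel-2 items can be discovered through a STAC API with the pystac-client package; the endpoint URL and collection name below are assumptions based on the public Copernicus Data Space catalog and may differ from what Geo Engine uses:

from pystac_client import Client

# Assumed public STAC endpoint of the Copernicus Data Space Ecosystem.
catalog = Client.open("https://catalogue.dataspace.copernicus.eu/stac")

search = catalog.search(
    collections=["SENTINEL-2"],       # assumed collection name
    bbox=[8.5, 50.0, 9.0, 50.5],      # region of interest (lon/lat)
    datetime="2023-06-01/2023-06-30",
)

for item in search.items():
    print(item.id, item.datetime)     # asset hrefs point to the S3-backed storage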

Machine Learning Pipeline

The demonstrator leverages Geo Engine's machine learning pipeline. It allows streaming data from Geo Engine to Python and training a model. The model can then be uploaded and registered in the backend and used as an operator to predict on new and unseen data.

ECOMETRICS Dashboard

The ECOMETRICS dashboard is a custom app developed for FAIR-DS. It allows the user to visualize and analyze an ecological indicator, like vegetation or land use. The demonstrator consists of two parts. The UI is the user-facing part of the application while the indicator generation happens on the backend.

UI (Dashboard)

Screenshot of the ECOMETRICS app

As visible in the screenshot, the ECOMETRICS app consists of four components. The top left explains how the app works and lets the user select the ecological indicator. Upon selection, the indicator will be visualized on the map on the right. The user can then select a region of interest by drawing on the map and selecting a point in time on a slider. The region will be intersected with the indicator. Finally, the user can review the results in the bottom-most section of the app on a plot. In case of a continuous indicator like vegetation, the plot will show a histogram. In case of a classification indicator like land use, the plot will show a class histogram.

Indicator generation

An indicator for the ECOMETRICS app is a raster time series that is either continuous or a classification. Currently, there are two indicators available: NDVI and land use. Both indicators are built from Sentinel data. The NDVI, short for Normalized Difference Vegetation Index, is computed by aggregating all Sentinel tiles for a given year and by computing (A-B)/(A+B) where A is the near infrared channel and B is the red channel. The necessary data is provided by Geo Engine's Copernicus Data Space Ecosystem Connector.
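
As a minimal illustration of the formula (not the demonstrator's actual implementation), the per-pixel computation looks like this with NumPy arrays standing in for the aggregated bands:

import numpy as np

# NDVI = (A - B) / (A + B), with A = near-infrared band and B = red band.
nir = np.array([[0.60, 0.55], [0.20, 0.30]])  # aggregated near-infrared reflectance
red = np.array([[0.10, 0.12], [0.15, 0.25]])  # aggregated red reflectance

ndvi = (nir - red) / (nir + red)
print(ndvi)  # values near 1 indicate dense vegetation, values near 0 little vegetation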

Machine Learning

The land use indicator is created by training a machine learning model. The training process incorporates the Sentinel data and uses the land use and land cover survey (LUCAS) as ground truth. While the Sentinel data are again retrieved from the Copernicus Data Space, the LUCAS data are used via the Aruna data connector. In principle, any data accessible using the Aruna data connector can be used for machine learning with Geo Engine.

The model is trained in a Jupyter notebook that defines a Geo Engine workflow, feeds the resulting data into a machine learning framework, trains the model, and registers it with the Geo Engine. The workflow applies feature engineering on the data and re-arranges it via timeshift operations. This gives the classifier the temporal development of each training pixel at once and allows it to learn the different land-use types.
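
The notebook itself is not part of this documentation. The following sketch illustrates the training step with scikit-learn on synthetic data, assuming one row per training pixel and one column per time-shifted feature as described above:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X: one row per training pixel, columns are time-shifted Sentinel features
# (here random numbers); y: LUCAS land-use class labels (here random labels).
rng = np.random.default_rng(42)
X = rng.random((1000, 24))            # e.g. 12 time steps x 2 bands per pixel
y = rng.integers(0, 5, size=1000)     # 5 illustrative land-use classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))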

Value Proposition

The demonstrator shows the benefits of FAIR Data Spaces for science and industry. By linking data from different sources and applying ML methods, new insights can be gained and innovative applications developed. It showcases a system that connects data from science and industry across data spaces, applying it through machine learning while adhering to FAIR principles.

Benefits:

  • For industry: The system enriches existing datasets with publicly available data and provides direct access to modern analysis frameworks using machine learning. This allows for direct data use, easy reuse of data and models, and avoids redundant calculations. A concrete example is combining observational and climate data with machine learning for biodiversity reporting, such as property classification.
  • For research: The system showcases the relevance of research data through transparent provenance and workflows. This facilitates funding applications and fosters collaboration between research institutions and companies, which is increasingly required by funding agencies.

Contribution to FAIR Data Spaces

The demonstrator contributes to the advancement of FAIR Data Spaces by scaling FAIR principles to broader data ecosystems and diverse sectors. It aims to create sustainable and efficient data infrastructures that support cross-sectoral data exchange and promote interoperability and the use of data standards through use cases that provide real benefits for science and industry.

ESG Indicator and Virtual Data Trustees

The ESG Indicator demonstrator provides companies with concrete data to assess their environmental impact, streamlining one step of the process of creating quantitative figures in ESG reports. By leveraging research data to develop practical models, it fosters collaboration between academia and industry. The demonstrator outlines incentives for the development of new applications in the fields of artificial intelligence and geospatial data, contributing to technological advancements and sustainable solutions.

Data Trustee

Data Trustee

The diagram illustrates a virtual data trustee model, where the trustee acts as an intermediary between data suppliers and data consumers, in this case research institutions and industry companies. The data supplier provides models and data, while the data consumer accesses and utilizes these resources. The trustee ensures the usability of the data and models for the consumer, logs usage for billing and reproducibility, and facilitates the exchange of data and models between the three data spaces. This model promotes efficient and secure data sharing and utilization.

UI (Dashboard)

Dashboard

The UI provides a user-friendly interface for calculating an ESG (Environmental, Social, and Governance) indicator for specific properties. Users can select a property on a map, specify a date range using a time slider, and initiate the analysis process. The calculated ESG score is displayed visually, and detailed usage logs are available for transparency. This tool empowers users to assess the sustainability performance of properties over time.

ML-based Indicator

ESG Indicator

The biodiversity indicator shown in the image assesses the vegetation within a given area over time. It utilizes a vegetation index (NDVI) to track changes in vegetation cover, with different colors representing various land cover types. For example, built-up areas and streets, that support no vegetation, are depicted in white in the vegetation index and red in the ESG indicator. Grasslands, which exhibit seasonal changes, appear light green in the vegetation index and light blue in the ESG indicator. Forests, with stable vegetation cover, are represented by green in the vegetation index and purple in the ESG indicator. This indicator provides valuable insights into the biodiversity and ecological health of a region, contributing to quantitative ESG reporting.

Value Proposition

This demonstrator establishes a virtual data trustee that connects data from industry (with Deutsche Bahn as a pilot user), the space sector (Copernicus Data Space Ecosystem), and science (NFDI4Biodiversity-RDC). This data trustee acts as a secure intermediary, facilitating access to data from these different domains while upholding FAIR-DS principles.

Benefits

  • Enhanced Collaboration: The demonstrator has the potential to foster collaboration between industry and research in the field of biodiversity analysis. By connecting industrial data spaces to research data infrastructure, companies like Deutsche Bahn can benefit from the data and analytical methods available, while research institutions can improve their models and gain new insights through access to industrial data.
  • Mutual Gains for Research and Industry: Access to a broader range of data leads to improved analyses and models. Both sides can define data access and control the use of their own data.
  • Practical Applications: The data trustee enables the creation of operational services, such as ESG reporting metrics, which can be used by industry. For research, it increases visibility and funding opportunities.

Addressing data confidentiality, particularly for sensitive information like location data, is a key concern. The data trustee allows derived data and products to be shared without revealing the original data. Provenance tracking ensures transparency and proper attribution of data sources.

By using a virtual data trustee, the demonstrator aims to enhance FAIR-DS principles, incentivize the use of data spaces, and address the needs of various stakeholders. It anticipates that the demonstrator can serve as a foundation for developing further applications in ESG reporting and biodiversity analysis, with the data trustee concept enabling the secure and trustworthy processing of sensitive data.

RuQaD

RuQaD Architecture

This document describes the architecture of the RuQaD Demonstrator.

The RuQaD Demonstrator is the main product of the project "RuQaD Batteries - Reuse Quality-assured Data for Batteries". The project is a sub-project of the FAIR Data Spaces project, supported by the German Federal Ministry of Education and Research under the funding code (Förderkennzeichen) FAIRDS09.

Introduction and Goals

Requirements Overview

The aim of the project is to build a demonstrator that connects a research data infrastructure (PoLiS) with an industrial data space (BatCAT) to reuse datasets for new applications.

Technically, this is to be realized as a tool (RuQaD) that exports data from the source system, the research data management system Kadi4Mat. RuQaD carries out quality and metadata checks (involving an existing pipeline) and pushes the data into the target system, the research data management system LinkAhead, which is the basis of the BatCAT data space.

The motivation is to allow a seamless exchange of data between research and industry. It is important to ensure that the FAIR criteria are met; therefore, a dedicated check for metadata and the FAIR criteria is implemented. The FAIR criteria can be summarized as the following practical checks (a sketch follows the list):

  • Is a PID present?
  • Is the domain-specific metadata complete?
  • Is there provenance information?
  • Does the data include license information?
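
As a purely hypothetical illustration of such checks (the field names and the required-field set are invented and do not reflect the project's actual data model):

# Hypothetical FAIRness checks; field names and required fields are invented.
def check_fairness(metadata: dict, required_fields: set) -> dict:
    return {
        "pid_present": bool(metadata.get("pid")),
        "domain_metadata_complete": required_fields <= metadata.keys(),
        "provenance_present": bool(metadata.get("provenance")),
        "license_present": bool(metadata.get("license")),
    }

report = check_fairness(
    {"pid": "doi:10.1234/example", "license": "CC-BY-4.0", "creator": "Jane Doe"},
    required_fields={"pid", "license", "creator"},
)
print(report)  # e.g. provenance_present is False for this record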

The normative description of all requirements is contained in the non-public project proposal.

Quality Goals

RuQaD's main quality goals are, using the terms from ISO 25010 (see glossary for definitions):

  • Operability: As a demonstrator, the main quality goal is that the software can easily be understood and learned. This facilitates building prototypes or production software based on the functionality of the demonstrator.
  • Compatibility: The system connects different software systems and therefore acts as a compatibility component.
  • Transferability: In order to serve as a tool for demonstration in different environments, the system needs to be built at a high level of transferability.
  • Maintainability: The project's limited time frame also scopes the maintainability goal: as a demonstrator, some parts of the system can be considered work in progress and need to be modified, corrected, or adapted over time. Whenever code is contributed to components that already have a foreseeable future in other contexts, maintainability must be taken into consideration. On the other hand, for code that mainly serves the demonstration of the particular scenario (PoLiS/BatCAT), it is acceptable to provide only proof-of-concept implementations that need adjustments for long-term maintenance.

Stakeholders

Roles and their expectations:

  • IndiScale: Quality and FAIRness of entities are successfully annotated.
  • PoLiS / Kadi4Mat: Entities can be exported to other dataspaces.
  • BatCAT Dataspace: Receive valuable dataset offers.
  • FAIR DS Project: Showcase the concept of FAIR Data Spaces with a relevant, plausible, and convincing use case.

System Landscape

The system landscape comprises the PoLiS research data infrastructure, the RuQaD service, and the BatCAT Data Space. The RuQaD Demonstrator is a purely functional component for checking FAIRness, invoking the QA pipeline, and ingesting data into LinkAhead. It pulls battery-related datasets from the PoLiS Kadi4Mat instance (the electronic lab notebook of the PoLiS Cluster of Excellence), invokes the quality assurance pipeline (a Gitlab pipeline based on demonstrator 4.2) on the raw data, and ingests data into LinkAhead using the LinkAhead crawler (a framework for file scanning, LinkAhead entity building, and synchronization). The BatCAT Data Space Node consists of LinkAhead and EDC-based components. A data curator (IndiScale) curates data from PoLiS, offers it for reuse in the BatCAT Data Space, and reviews and controls the offering of datasets; data consumers (R&D departments of the BatCAT consortial partners) browse datasets and request access for reuse.
System Landscape Diagram

Building Blocks

Rationale

The RuQaD demonstrator uses service integration to achieve the goal of connecting dataspaces in a FAIR manner. It configures and combines existing services into multiple stages of FAIRness evaluation and data integration.

Contained Building Blocks
  • Monitor: Checks for new data in a Kadi4Mat instance.
  • Quality checker: Passes new data to the quality-check pipeline that was developed in WP 4.2 of the previous FAIR DS project.
  • RuQaD crawler: Calls the LinkAhead crawler for metadata checking and for insertion into the BatCAT data space node.

Monitor

Purpose / Responsibility

The monitor continuously polls a Kadi4Mat instance (representing the source dataspace) for new data items.

Interface(s)

Each new data item is passed on to the quality checker for evaluation of the data quality. Afterwards the monitor passes the quality check report and the original data to the crawler, which eventually leads to insertion in the BatCAT data space where the items can be checked by data curators and retrieved by data consumers.

Directory/File Location

Source code: src/ruqad/monitor.py
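
The following is a hypothetical sketch of the polling loop, not the contents of src/ruqad/monitor.py; the Kadi4Mat query, the polling interval, and the downstream handler are stand-ins:

# Hypothetical sketch of the monitor loop; not the actual implementation.
import time

POLL_INTERVAL_SECONDS = 60  # assumed polling interval

def fetch_new_items(since: float) -> list:
    """Stand-in for the Kadi4Mat API query returning items changed after `since`."""
    return []  # placeholder; the real monitor talks to the Kadi4Mat REST API

def handle_item(item) -> None:
    """Stand-in for the quality check and crawler ingest of a single data item."""
    print("would check and ingest:", item)

def monitor_loop() -> None:
    last_check = 0.0
    while True:
        for item in fetch_new_items(last_check):
            handle_item(item)
        last_check = time.time()
        time.sleep(POLL_INTERVAL_SECONDS)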

Quality Checker

Purpose / Responsibility

The quality checker executes data quality checks on the data which was retrieved from the input dataspace (a Kadi4Mat instance in this case). It provides a structured summary for other components and also a detailed report for human consumption.

Interface(s)

The quality checker is implemented as a Python class QualityChecker which provides mainly a check(filename, target_dir) method to check individual files. This class is available in the module ruqad.qualitycheck.
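
A usage sketch based on the interface described above; the no-argument construction, the file name, and the shape of the return value are assumptions:

# Usage sketch; constructor arguments, return value, and file paths are assumed.
from tempfile import TemporaryDirectory

from ruqad.qualitycheck import QualityChecker

checker = QualityChecker()                     # assumed no-argument construction
with TemporaryDirectory() as target_dir:
    # check() runs the quality check for one exported file and places its
    # report in target_dir.
    result = checker.check("export.eln", target_dir)
    print("quality check result:", result)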

Quality / Performance Characteristics

The quality checker relies on the demonstrator 4.2 to perform the checks. Thus, RuQaD relies on further maintenance by the demonstrator's development team.

Directory/File Location

Source code: src/ruqad/qualitycheck.py

Open issues
  • The demonstrator 4.2 service currently relies on running as Gitlab pipeline jobs, which introduces a certain administrative overhead for production deployment.
  • It is possible and may be desirable to parallelize the quality check for multiple files by distributing the load on a number of service workers, instead of checking files sequentially.

RuQaD Crawler

Purpose/Responsibility

The RuQaD Crawler executes metadata checks and bundles data and metadata for insertion into the BatCAT dataspace.

Interface(s)

The crawler is implemented as a Python module with a function trigger_crawler(...) which looks for data and quality check files to evaluate and insert into the BatCAT dataspace. It uses LinkAhead's crawler framework for metadata checks, object creation and interaction with the target dataspace.

Directory / File Location

Source code: src/ruqad/crawler.py

Level 2

White Box RuQaD Crawler

The RuQaD Crawler container consists of a crawler wrapper, a CFood declaration, an identifiables declaration, a custom converter, and a custom transformer. It is invoked by the RuQaD monitor and uses the LinkAhead crawler framework for file scanning, LinkAhead entity building, and synchronization.
Component Diagram

Motivation

The Crawler reuses functionality of the LinkAhead crawler:

  • Declarative creation of data objects from structured input data.
  • Idempotent and context sensitive scan-create-and-insert-or-update procedures.

This functionality is extended by custom converters and data transformers.

Contained Building Blocks
  • Crawler wrapper: Calls the LinkAhead crawler on the files given by the RuQaD monitor, with the correct settings.
  • CFood declaration: Specification of how entities in BatCAT shall be constructed from input data.
  • Identifiables declaration: Specification of the identifying properties of entities in BatCAT.
  • Converter: Custom conversion plugin to create resulting (sub) entities.
  • Transformer: Custom conversion plugin to transform input data into properties.

Crawler wrapper

Purpose / Responsibility

The crawler wrapper scans files in specific directories of the file system and synchronizes them with the LinkAhead instance. Before inserting or updating Records in LinkAhead, a metadata check is carried out to verify whether the metadata that was exported from Kadi4Mat is compatible with the target data model (in LinkAhead and the EDC). Validation failures lead to specific validation error messages and prevent insertions or updates of the scan result. The software component also checks the FAIRness of the data exported from Kadi4Mat (in ELN format).

Interface(s)

The crawler uses:

  • A cfood (file in YAML format) which specifies the mapping from data found on the file system to Records in LinkAhead.
  • A definition of the identifiables (file in YAML format) which defines the properties that are needed to uniquely identify Records of the data model in LinkAhead.
  • The data model definition (file in YAML format). This is needed by the crawler to do the metadata check.
  • Crawler extensions specific to the project (custom converters and custom transformers). These are Python modules containing functions and classes that can be referenced within the cfood.

The interface is a Python function implemented as a module within the RuQaD demonstrator. The function calls the scanner and the main crawler functions of the LinkAhead crawler software.

Directory / File Location

Source code:

  • Main interface: ruqad/src/ruqad/crawler.py
  • Crawler extensions:
    • Custom converters (currently not used): ruqad/src/ruqad/crawler-extensions/converters.py
    • Custom transformers: ruqad/src/ruqad/crawler-extensions/transformers.py

Fulfilled Requirements
  • Data ingest from exported ELN file into LinkAhead.
  • Data ingest from quality check into LinkAhead.
  • Check of FAIRness of data from ELN file.
  • Metadata check

Runtime View

The runtime view shows the ingestion sequence: (1) the RuQaD monitor polls Polis Kadi4Mat for new items and (2) receives each new item; (3) it asks the quality checker to check the quality of the data item, which (4) triggers the quality assurance 4.2 pipeline and (5, 6) returns the quality report; (7) the monitor then triggers the RuQaD crawler, which (8) invokes the LinkAhead crawler with the data item and quality report, (9) returns the generated items with their FAIRness level, and (10) inserts and/or updates the generated entities in the BatCAT Data Space Node.
Data item ingestion

  • The monitor continually polls the Kadi4Mat server for new items.
  • Each new item is sequentially passed to the Demonstrator 4.2 quality checker for data quality checks and the crawler component for metadata FAIRness evaluation before being inserted into the target dataspace at BatCAT.

Deployment View

The deployment runs on a Kubernetes cluster. The RuQaD Demonstrator runs in its own pod, the Polis Kadi4Mat instance runs in a Kadi pod, and the BatCAT Data Space Node together with the other software systems is grouped into several pods for simplicity (the testbed contains more than 10 pods for the data space components alone). The Gitlab instance that hosts the quality assurance 4.2 pipeline cannot be integrated into the testbed and runs outside the cluster. RuQaD pulls battery-related datasets from Kadi4Mat, invokes the quality assurance pipeline on the raw data from PoLiS, and ingests data into LinkAhead.
Deployment View