fairds

FAIR Dataspaces is a BMBF-funded project that aims to bridge the gap between research and industry-related data infrastructures. This is primarily done by combining data infrastructures from the German research data infrastructure (NFDI) with industry-driven standards established by Gaia-X.

FAIR Dataspaces is made up of legal and technical experts from academia and industry, and aims to identify ways in which existing data infrastructures can be linked to create added value for the scientific and industrial communities.

Getting started

FAIR-DS uses a few core building blocks to enable data exchange:

  1. A federated catalog to register all available connectors and assets. This is done via the W3C-recommended DCAT catalog vocabulary.
  2. Negotiation of data exchange via the Dataspace Protocol, which allows for sovereign data exchange between independent dataspace participants.

Goals of the project

  • Connect communities from NFDI, EOSC, and Gaia-X.
  • Explore the ethical and legal frameworks needed for data exchange between research and industry.
  • Demonstrate the use and benefits of Gaia-X technologies across different sectors of academia and industry.

Demonstrators

Demonstrators have been developed to show the possible interactions between different data infrastructures. They are described in the following chapters.

FAQ

What is a dataspace?

A data space is a distributed data system based on common guidelines and rules. The users of such a data space can access the data in a secure, transparent, trustworthy, simple and standardized manner. Data owners have control over who has access to their data and for what purpose and under what conditions it can be used. These data spaces enable digital sovereignty, as the data remains in the possession of the company or research institute until it is retrieved.

What technologies are used?

FAIR Dataspaces relies primarily on cloud-focused technologies. For more information, see the technologies section of this book.

License

This work is licensed under a Creative Commons Attribution-NoDerivs 4.0 International License.

Technologies in FAIR Dataspaces

FAIR DS relies heavily on modern cloud-based data infrastructures. Many deployments are powered by the de.NBI cloud, one of Germany's largest community clouds.

Most deployments run as containers in a Kubernetes environment.

Example messages for the EDC Connector

Data exchange with the EDC connector follows the Dataspace Protocol specification.

Reading data from a data source is a stepwise process that initially requires the provider of the data to make it discoverable to consumers of the data space.

Once consumers have discovered the published asset, they can negotiate to fetch the data.

  1. Create asset [PROVIDER]
  2. Create policy [PROVIDER]
  3. Create contract [PROVIDER]
  4. Fetch catalog [CONSUMER]
  5. Negotiate contract [CONSUMER]
  6. Get contract agreement ID [CONSUMER]
  7. Start the data transfer [CONSUMER]
  8. Get the data [CONSUMER]
    1. Get the data address [CONSUMER]
    2. Get the actual data content [CONSUMER]

The workflow above is called Consumer-PULL in the language of the Eclipse Dataspace Components (EDC). It is the consumer who initiates the data transfer and finally fetches (pulls) the data content in the last step.

The examples below were extracted from the documentation about the EDC sample applications.

The prerequisites documentation describes how to build and run the EDC connector as consumer or provider. The startup configuration determines whether an instance acts as consumer or provider; both share the same source code.

Once a consumer and a provider are running, you can establish a data exchange via HTTP with a REST endpoint as data source using the following steps.

In the examples below, consumer and provider run on the same machine. The consumer listens on ports 2919x and the provider listens on ports 1919x.

1. Create asset [PROVIDER]

curl -d @path/to/create-asset.json -H 'content-type: application/json' http://localhost:19193/management/v3/assets -s | jq

create-asset.json

{
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/"
  },
  "@id": "myAsset",
  "properties": {
    "name": "REST endpoint",
    "contenttype": "application/json"
  },
  "dataAddress": {
    "type": "HttpData",
    "name": "myAssetAddress",
    "baseUrl": "https://path.of.rest/endpoint",
    "proxyPath": "true",
    "proxyMethod": "true",
    "proxyBody": "true",
    "proxyQueryParams": "true",
    "authKey": "ApiKey",
    "authCode": "{{APIKEY}}",
    "method": "POST"
  }
}

baseUrl is the URL of the REST endpoint. It is the data source from which the data will finally be fetched.

proxyPath, proxyMethod, proxyBody, proxyQueryParams are flags that allow a provider to behave as a proxy that forwards requests to the actual data source.

In the example above, the provider will act as a proxy for the consumer. The provider will fetch the data from the data source. The REST call from the provider to the data source will use the value of authKey as part of its header, together with the value of authCode as the token.

The secret token is known to the provider but not to the consumer.
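Conceptually, when the consumer later pulls the data, the provider forwards the request to the configured baseUrl and attaches the secret header. The sketch below is a rough Python equivalent of that forwarding step; the real logic is internal to the connector and also honors the proxy flags for path, query parameters, and body.

import requests

# Resolved by the provider from the asset's dataAddress; the secret token
# ({{APIKEY}} above) never leaves the provider.
base_url = "https://path.of.rest/endpoint"
auth_key = "ApiKey"           # header name, taken from authKey
auth_code = "<secret token>"  # header value, taken from authCode

# Because proxyPath, proxyQueryParams, and proxyBody are enabled, the path,
# query parameters, and body of the consumer's request would be forwarded too.
response = requests.post(base_url, headers={auth_key: auth_code})
print(response.status_code)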

Example response

{
  "@type": "IdResponse",
  "@id": "myAsset",
  "createdAt": 1727429239982,
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

2. Create policy [PROVIDER]

curl -d @path/to/create-policy.json -H 'content-type: application/json' http://localhost:19193/management/v3/policydefinitions -s | jq

create-policy.json

{
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  },
  "@id": "myPolicy",
  "policy": {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "permission": [],
    "prohibition": [],
    "obligation": []
  }
}

The policy above does not specify special permissions, prohibitions, or obligations. By default, all access is allowed.

Example response

{
  "@type": "IdResponse",
  "@id": "myPolicy",
  "createdAt": 1727429399794,
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

3. Create contract [PROVIDER]

curl -d @path/to/create-contract-definition.json -H 'content-type: application/json' http://localhost:19193/management/v3/contractdefinitions -s | jq

create-contract-definition.json

{
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/"
  },
  "@id": "1",
  "accessPolicyId": "myPolicy",
  "contractPolicyId": "myPolicy",
  "assetsSelector": []
}

The contract definition above does not specify any assets to which the policy is connected. By default, it applies to all assets.

Example response

{
  "@type": "IdResponse",
  "@id": "myPolicy",
  "createdAt": 1727429399794,
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

4. Fetch catalog [CONSUMER]

curl -X POST "http://localhost:29193/management/v3/catalog/request" -H 'Content-Type: application/json' -d @path/to/fetch-catalog.json -s | jq

Example response

{
  "@id": "773be931-40ff-410c-b6cb-346e3f6eaa6f",
  "@type": "dcat:Catalog",
  "dcat:dataset": {
    "@id": "myAsset",
    "@type": "dcat:Dataset",
    "odrl:hasPolicy": {
      "@id": "MQ==:YXNzZXRJZA==:ODJjMTc3NWItYmQ4YS00M2UxLThkNzItZmFmNGUyMTA0NGUw",
      "@type": "odrl:Offer",
      "odrl:permission": [],
      "odrl:prohibition": [],
      "odrl:obligation": []
    },
    "dcat:distribution": [
      {
        "@type": "dcat:Distribution",
        "dct:format": {
          "@id": "HttpData-PULL"
        },
        "dcat:accessService": {
          "@id": "17b36140-2446-4c23-a1a6-0d22f49f3042",
          "@type": "dcat:DataService",
          "dcat:endpointDescription": "dspace:connector",
          "dcat:endpointUrl": "http://localhost:19194/protocol"
        }
      },
      {
        "@type": "dcat:Distribution",
        "dct:format": {
          "@id": "HttpData-PUSH"
        },
        "dcat:accessService": {
          "@id": "17b36140-2446-4c23-a1a6-0d22f49f3042",
          "@type": "dcat:DataService",
          "dcat:endpointDescription": "dspace:connector",
          "dcat:endpointUrl": "http://localhost:19194/protocol"
        }
      }
    ],
    "name": "product description",
    "id": "myAsset",
    "contenttype": "application/json"
  },
  "dcat:distribution": [],
  "dcat:service": {
    "@id": "17b36140-2446-4c23-a1a6-0d22f49f3042",
    "@type": "dcat:DataService",
    "dcat:endpointDescription": "dspace:connector",
    "dcat:endpointUrl": "http://localhost:19194/protocol"
  },
  "dspace:participantId": "provider",
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "odrl": "http://www.w3.org/ns/odrl/2/",
    "dspace": "https://w3id.org/dspace/v0.8/"
  }
}

The consumer finds available assets in the catalog response. In the example above, the relevant identifier for the next steps of the data exchange is the offer id at the JSON path: ./"dcat:dataset"."odrl:hasPolicy"."@id" (here: MQ==:YXNz...NGUw).

5. Negotiate contract [CONSUMER]

curl -d @path/to/negotiate-contract.json -X POST -H 'content-type: application/json' http://localhost:29193/management/v3/contractnegotiations -s | jq

negotiate-contract.json

{
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/"
  },
  "@type": "ContractRequest",
  "counterPartyAddress": "http://localhost:19194/protocol",
  "protocol": "dataspace-protocol-http",
  "policy": {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@id": "MQ==:YXNzZXRJZA==:ODJjMTc3NWItYmQ4YS00M2UxLThkNzItZmFmNGUyMTA0NGUw",
    "@type": "Offer",
    "assigner": "provider",
    "target": "myAsset"
  }
}

The request for negotiating a contract about the upcoming data exchange contains the offer id from the previous catalog response.

Example response

{
  "@type": "IdResponse",
  "@id": "96e58145-b39f-422d-b3f4-e9581c841678",
  "createdAt": 1727429786939,
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

6. Get contract agreement ID [CONSUMER]

curl -X GET "http://localhost:29193/management/v3/contractnegotiations/96e58145-b39f-422d-b3f4-e9581c841678" --header 'Content-Type: application/json' -s | jq

The negotiation of a contract happens asynchronously. At the end of the negotiation process, the consumer can get the agreement ID by calling the contractnegotiations API with the negotiation ID from the previous response (here: 96e58145-b39f-422d-b3f4-e9581c841678).
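Because the state changes asynchronously, this endpoint is typically polled until the negotiation reaches FINALIZED. A minimal polling sketch in Python, using the endpoint and negotiation ID from the examples above (the one-second interval is an arbitrary choice):

import time
import requests

NEGOTIATION_ID = "96e58145-b39f-422d-b3f4-e9581c841678"  # from the previous response
URL = f"http://localhost:29193/management/v3/contractnegotiations/{NEGOTIATION_ID}"

# Poll the consumer's management API until the negotiation is finalized.
while True:
    negotiation = requests.get(URL).json()
    if negotiation.get("state") == "FINALIZED":
        print("agreement id:", negotiation["contractAgreementId"])
        break
    time.sleep(1)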

Example response

{
  "@type": "ContractNegotiation",
  "@id": "96e58145-b39f-422d-b3f4-e9581c841678",
  "type": "CONSUMER",
  "protocol": "dataspace-protocol-http",
  "state": "FINALIZED",
  "counterPartyId": "provider",
  "counterPartyAddress": "http://localhost:19194/protocol",
  "callbackAddresses": [],
  "createdAt": 1727429786939,
  "contractAgreementId": "441353d9-9e86-4011-9828-63b2552d1bdd",
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

7. Start data transfer [CONSUMER]

curl -X POST "http://localhost:29193/management/v3/transferprocesses" -H "Content-Type: application/json" -d @path/to/start-transfer.json -s | jq

start-transfer.json

{
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/"
  },
  "@type": "TransferRequestDto",
  "connectorId": "provider",
  "counterPartyAddress": "http://localhost:19194/protocol",
  "contractId": "441353d9-9e86-4011-9828-63b2552d1bdd",
  "assetId": "myAsset",
  "protocol": "dataspace-protocol-http",
  "transferType": "HttpData-PULL"
}

The consumer initiates the data transfer process with a POST message containing the agreement ID from the previous response (here: 441353d9-9e86-4011-9828-63b2552d1bdd), the asset identifier of the wanted data asset, and the transfer type.

Example response

{
  "@type": "IdResponse",
  "@id": "84eb10cc-1192-408c-a378-9ef407c156a0",
  "createdAt": 1727430075515,
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

8. Get the data [CONSUMER]

8.1. Get the data address [CONSUMER]

curl http://localhost:29193/management/v3/edrs/84eb10cc-1192-408c-a378-9ef407c156a0/dataaddress | jq

The result of starting a transfer process is a DataAddress that the consumer can get by calling the endpoint data reference API (edrs) with the identifier of the started transfer process (here: 84eb10cc-1192-408c-a378-9ef407c156a0).

Example response

{
  "@type": "DataAddress",
  "type": "https://w3id.org/idsa/v4.1/HTTP",
  "endpoint": "http://localhost:19291/public",
  "authType": "bearer",
  "endpointType": "https://w3id.org/idsa/v4.1/HTTP",
  "authorization": "eyJraWQiOiJwdW...7Slukn1OMJw",
  "@context": {
    "@vocab": "https://w3id.org/edc/v0.0.1/ns/",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  }
}

8.2. Get the data content [CONSUMER]

curl --location --request GET 'http://localhost:19291/public/' --header 'Authorization: eyJraWQiOiJwdWJsaW...7Slukn1OMJw'

The consumer can fetch the data about the wanted data asset by calling the endpoint that is included in the previous data address response (here: http://localhost:19291/public). The previous response also includes the required access token (here: eyJraWQiOiJwdWJsaW...7Slukn1OMJw shortened).

The response of this request is the actual data content from the data source.

The consumer calls the public API endpoint that was shared by the provider. The request is authorized with the token that was shared as well. The provider now fetches the data from the data source based on the configuration of the data asset from the first step, and forwards the response to the consumer.

Example script

The Python script connector.py executes the data exchange process as described above step by step.

It makes use of the JSON files in the folder ./assets/connector/.
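For orientation, the sketch below condenses such a script into its essential calls. It assumes the ports, endpoints, and JSON file names used in the examples above; the actual connector.py may differ in details such as error handling and how the offer and agreement IDs are injected into the prepared messages.

import json
import time
from pathlib import Path

import requests

PROVIDER_MGMT = "http://localhost:19193/management/v3"
CONSUMER_MGMT = "http://localhost:29193/management/v3"
ASSETS = Path("./assets/connector")  # folder with the JSON messages shown above


def post(url: str, filename: str) -> dict:
    """Send one of the prepared JSON messages and return the parsed response."""
    body = json.loads((ASSETS / filename).read_text())
    response = requests.post(url, json=body)
    response.raise_for_status()
    return response.json()


# Provider side: register asset, policy, and contract definition.
post(f"{PROVIDER_MGMT}/assets", "create-asset.json")
post(f"{PROVIDER_MGMT}/policydefinitions", "create-policy.json")
post(f"{PROVIDER_MGMT}/contractdefinitions", "create-contract-definition.json")

# Consumer side: fetch the catalog and read the offer id.
catalog = post(f"{CONSUMER_MGMT}/catalog/request", "fetch-catalog.json")
offer_id = catalog["dcat:dataset"]["odrl:hasPolicy"]["@id"]

# Negotiate a contract, patching the offer id into the prepared message.
negotiation_body = json.loads((ASSETS / "negotiate-contract.json").read_text())
negotiation_body["policy"]["@id"] = offer_id
negotiation = requests.post(f"{CONSUMER_MGMT}/contractnegotiations", json=negotiation_body).json()

# Wait for the asynchronous negotiation to finalize, then read the agreement id.
while True:
    state = requests.get(f"{CONSUMER_MGMT}/contractnegotiations/{negotiation['@id']}").json()
    if state.get("state") == "FINALIZED":
        agreement_id = state["contractAgreementId"]
        break
    time.sleep(1)

# Start the transfer with the agreement id and fetch the data address (EDR).
transfer_body = json.loads((ASSETS / "start-transfer.json").read_text())
transfer_body["contractId"] = agreement_id
transfer = requests.post(f"{CONSUMER_MGMT}/transferprocesses", json=transfer_body).json()
time.sleep(2)  # give the transfer process time to provision the EDR
edr = requests.get(f"{CONSUMER_MGMT}/edrs/{transfer['@id']}/dataaddress").json()

# Finally pull the actual data content from the provider's public endpoint.
data = requests.get(edr["endpoint"], headers={"Authorization": edr["authorization"]})
print(data.text)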

ACCURIDS Demonstrator

The demonstrator makes use of the ACCURIDS FAIR data registry.

We demonstrate data exchange between industry and other organizations, for example the exchange of information related to clinical studies between a hospital and a pharmaceutical company.

The diagram below represents the use case of the demonstrator with two ACCURIDS instances, both acting as consumers and providers of information.

ACCURIDS Demonstrator

The ACCURIDS platform

ACCURIDS logo
www.accurids.com

ACCURIDS is a platform for centralized data governance with a FAIR data registry.

It enables users to uniquely identify their data objects across systems and processes at enterprise and industry scale.

ACCURIDS offers six core features:

Ontology

  • Reusing standard industry ontologies through ACCURIDS Hub
  • Creation of specific ontologies for business needs

Reference and Master Data

  • Managing reference and master data aligned to ontology definitions

Data Registry

  • Governance of data objects managed in many different systems through a centralized registration process
  • Management of data object identities

Graph

  • Connect data from different sources in knowledge graphs to enable question answering and analytics

Quality

  • Define and monitor data quality rules for different types of data objects

AI

  • Ask business questions in natural language and get answers based on facts from the knowledge graph

ExpandAI demonstrator

ExpandAI logo
www.expandai.de

Demonstrator overview

The goal of this demonstrator is a secure and legally compliant method to collect and manage data from various wearable sensor sources (e.g. smartphone and smartwatch), which can then be used to train AI algorithms. The prototype focuses on integrating AI-based digital health applications (DiGAs) into the healthcare ecosystem by adhering to the FAIR (Findability, Accessibility, Interoperability, and Reusability) and Gaia-X principles. These guidelines ensure that data management is transparent, accessible, and fair. The ability to gather and analyze data about health conditions and influencing factors is crucial for developing new therapies and holistic care strategies.

ExpandAI Demonstrator

At the core of this project is a data integration center, which gathers and standardizes Parkinson patient data from a smartphone (iPhone 12 Pro Max) and smartwatch (Apple Watch Series 8). AI analytics within the center enable near real-time data insights, supporting Parkinson healthcare professionals and researchers in decision-making and advancing medical research. Our prototype proposes a framework for secure data exchange between scientific institutions and industry partners. This collaboration can empower healthcare providers to monitor Parkinson patient health using a simple traffic light system to indicate their condition: green for well, yellow for unwell but not critical, and red for emergencies. This system will be accessible via a web platform for doctors, providing valuable insights while minimizing data sharing. Moreover, limited access to these databases for companies developing DiGAs could enhance AI model training and product development, ultimately advancing personalized healthcare and improving patient outcomes.

Dashboard

The dashboard provides a clear, organized interface for displaying both results and finger-tap data, offering insights into patient motor performance. The use of visual elements like charts and traffic lights effectively communicates the data trends and potential areas of concern. The chart highlights key metrics, thumb and digit movements over time, allowing users to track progress or deterioration. The traffic light system adds a simple yet powerful tool for quick assessments, signaling performance status with an intuitive red-yellow-green scheme.

ExpandAI Demonstrator Dashboard

ExpandAI Demonstrator Dashboard

Technologies used

The FAIR data space enables information exchange between research and industry organizations.

  • Java
  • Spring Boot
  • React
  • Gaia-X standards
  • Docker containers
  • GitHub
  • REST and OpenAPI

Geo Engine Demonstrator

The Geo Engine FAIR-DS demonstrator realizes multiple use cases that show the capabilities of the NFDI data spaces and address important societal issues like the loss of biodiversity and climate change. Within the FAIR-DS project, the Geo Engine software is extended with FAIR data services and capabilities for data spaces. New dashboards and visualizations are developed to demonstrate added value from the FAIR-DS integration.

Overview of the Geo Engine Platform

Geo Engine Architecture

Geo Engine is a cloud-ready geospatial data processing platform. It consists of a backend and several frontends. The backend is subdivided into three subcomponents: services, operators, and data types. Data types specify primitives like feature collections for vector data or gridded raster data. The Operators block contains the processing engine and operators. The Services block contains data management and APIs. It also contains external data connectors that allow accessing data from external sources; here, the connection to the data spaces is established.

Frontends for Geo Engine include geoengine-ui, for building web applications on top of Geo Engine, and geoengine-python, a Python library that can be used in Jupyter Notebooks. Third-party applications like QGIS can access Geo Engine via its OGC interfaces.

All components of Geo Engine are fully containerized and Docker-ready. A Gaia-X-compatible self-description of the service is available. Geo Engine builds upon several technologies, including GDAL, Apache Arrow, Angular, and OpenLayers.

Aruna Data Connector

The ArunaDataProvider implements the Geo Engine data provider interface. It uses the Aruna RPC API to browse, find, and access data from the NFDI4Biodiversity data space via the Aruna data storage system. By translating between the Aruna API and the Geo Engine API, the connector allows data from the NFDI4Biodiversity data space to be used in the Geo Engine platform. This makes it possible to integrate the data easily into new analyses and dashboards.

Copernicus Data Space Ecosystem

To show cross-data-space capabilities and to generate insightful results, the Geo Engine demonstrator is connected to the Copernicus Data Space. It uses the SpatioTemporal Asset Catalog (STAC) API and an S3 endpoint to generate the metadata necessary to access Sentinel data. Users can then access the Sentinel data as proper data sets and time series, rather than just as individual satellite images. This enables temporal analysis and the generation of time series data products.
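As an illustration of the kind of metadata lookup involved, and not of the Geo Engine implementation itself, a STAC search against the Copernicus Data Space could look roughly like the sketch below. The catalogue URL and collection name are assumptions and should be checked against the current Copernicus Data Space documentation.

from pystac_client import Client

# Assumed public STAC endpoint of the Copernicus Data Space Ecosystem.
STAC_URL = "https://catalogue.dataspace.copernicus.eu/stac"

catalog = Client.open(STAC_URL)

# Search Sentinel-2 items for a bounding box and time range (names assumed).
search = catalog.search(
    collections=["SENTINEL-2"],
    bbox=[8.5, 50.0, 9.0, 50.5],
    datetime="2023-06-01/2023-06-30",
)

for item in search.items():
    # Each item carries the metadata needed to resolve the actual S3 assets.
    print(item.id, item.datetime, list(item.assets))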

Machine Learning Pipeline

The demonstrator leverages Geo Engine's machine learning pipeline. It allows streaming data from Geo Engine to Python and training a model there. The model can then be uploaded and registered in the backend and used as an operator to predict new and unseen data.

ECOMETRICS Dashboard

The ECOMETRICS dashboard is a custom app developed for FAIR-DS. It allows the user to visualize and analyze an ecological indicator, like vegetation or land use. The demonstrator consists of two parts. The UI is the user-facing part of the application while the indicator generation happens on the backend.

UI (Dashboard)

Screenshot of the ECOMETRICS app

As visible in the screenshot, the ECOMETRICS app consists of four components. The top left explains how the app works and lets the user select the ecological indicator. Upon selection, the indicator will be visualized on the map on the right. The user can then select a region of interest by drawing on the map and selecting a point in time on a slider. The region will be intersected with the indicator. Finally, the user can review the results in the bottom-most section of the app on a plot. In case of a continuous indicator like vegetation, the plot will show a histogram. In case of a classification indicator like land use, the plot will show a class histogram.

Indicator generation

An indicator for the ECOMETRICS app is a raster time series that is either continuous or a classification. Currently, there are two indicators available: NDVI and land use. Both indicators are built from Sentinel data. The NDVI, short for Normalized Difference Vegetation Index, is computed by aggregating all Sentinel tiles for a given year and by computing (A-B)/(A+B) where A is the near infrared channel and B is the red channel. The necessary data is provided by Geo Engine's Copernicus Data Space Ecosystem Connector.
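The aggregation itself happens inside Geo Engine; purely as an illustration of the index formula, the per-pixel computation corresponds to a sketch like this:

import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (A - B) / (A + B),
    where A is the near infrared channel and B is the red channel."""
    nir = nir.astype("float64")
    red = red.astype("float64")
    denominator = nir + red
    # Avoid division by zero where both channels are zero.
    return np.divide(nir - red, denominator, out=np.zeros_like(denominator), where=denominator != 0)

# Tiny synthetic tiles (values are made up).
nir_tile = np.array([[0.8, 0.6], [0.4, 0.1]])
red_tile = np.array([[0.1, 0.2], [0.3, 0.1]])
print(ndvi(nir_tile, red_tile))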

Machine Learning

The land use indicator is created by training a machine learning model. The training process incorporates the Sentinel data and uses the land use and land cover survey (LUCAS) as ground truth. While the Sentinel data are again retrieved from the Copernicus Data Space, the LUCAS data are used via the Aruna data connector. In principle, any data accessible using the Aruna data connector can be used for machine learning with Geo Engine.

The model is trained in a Jupyter notebook that defines a Geo Engine workflow, feeds the resulting data into a machine learning framework, trains the model, and registers it with the Geo Engine. The workflow applies feature engineering to the data and re-arranges it via timeshift operations. This gives the classifier the temporal development of each training pixel at once and allows it to learn the different land-use types.
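The concrete Geo Engine workflow and its ML API are not reproduced here. Purely as an illustration of the training step, and assuming the time-shifted Sentinel features and LUCAS labels have already been exported as plain arrays, the classifier training could look like the following sketch (all shapes and class counts are made up):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one row per training pixel, columns are the time-shifted
# Sentinel features of that pixel; labels are LUCAS land-use classes.
rng = np.random.default_rng(0)
features = rng.random((1000, 24))        # e.g. 12 time steps x 2 bands (made up)
labels = rng.integers(0, 5, size=1000)   # e.g. 5 land-use classes (made up)

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))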

ESG Indicator and Virtual Data Trustees

The ESG Indicator demonstrator provides companies with concrete data to assess their environmental impact, streamlining one step of the process of creating quantitative figures in ESG reports. By leveraging research data to develop practical models, it fosters collaboration between academia and industry. The demonstrator outlines incentives for the development of new applications in the fields of artificial intelligence and geospatial data, contributing to technological advancements and sustainable solutions.

Data Trustee

Data Trustee

The diagram illustrates a virtual data trustee model, where the trustee acts as an intermediary between data suppliers and data consumers, in this case research institutions and industry companies. The data supplier provides models and data, while the data consumer accesses and utilizes these resources. The trustee ensures the usability of the data and models for the consumer, logs usage for billing and reproducibility, and facilitates the exchange of data and models between the three data spaces. This model promotes efficient and secure data sharing and utilization.

UI (Dashboard)

Dashboard

The UI provides a user-friendly interface for calculating an ESG (Environmental, Social, and Governance) indicator for specific properties. Users can select a property on a map, specify a date range using a time slider, and initiate the analysis process. The calculated ESG score is displayed visually, and detailed usage logs are available for transparency. This tool empowers users to assess the sustainability performance of properties over time.

ML-based Indicator

ESG Indicator

The biodiversity indicator shown in the image assesses the vegetation within a given area over time. It utilizes a vegetation index (NDVI) to track changes in vegetation cover, with different colors representing various land cover types. For example, built-up areas and streets, which support no vegetation, are depicted in white in the vegetation index and red in the ESG indicator. Grasslands, which exhibit seasonal changes, appear light green in the vegetation index and light blue in the ESG indicator. Forests, with stable vegetation cover, are represented by green in the vegetation index and purple in the ESG indicator. This indicator provides valuable insights into the biodiversity and ecological health of a region, contributing to quantitative ESG reporting.

RuQaD Batteries

Like a rocket, but for batteries

The RuQaD Batteries demonstrator (Reuse Quality-assured Data for Batteries) shows the bridging from an academic research infrastructure to an international industrial dataspace for the efficient reuse of quality-assured research data in the context of battery cell manufacturing.

Distributed Analytics Platform Demonstrator

Overview of the PADME Ecosystem

This section provides a top-level view of PADME and its context.
Further information can be found here.

Introduction

In healthcare environments, such as hospitals or medical centers, a large amount of data is collected describing symptoms, diagnoses, and various aspects of the patient’s treatment process. This data is typically reused for reviewing and comparing the patient’s condition during subsequent visits. Sometimes, selected data is shared for continued patient care when the patient moves to another hospital.

Healthcare data is also a fundamental source for medical research. The results of studies often depend on the amount of available patient data. Typically, the more data available for analysis, the more reliable the results. However, the reuse of patient data across different institutions is limited due to ethical, legal, and privacy regulations. Such restrictions are governed by laws like the General Data Protection Regulation (GDPR) in Europe, the Health Insurance Portability and Accountability Act (HIPAA) in the U.S., or the Data Protection Act (DPA) in the U.K.

These limitations have prompted the development of distributed analytics (DA) solutions that bring algorithms to the data, ensuring sensitive data never leaves its source and that data owners maintain control. This paradigm shift helps comply with privacy laws and regulations.

Our DA infrastructure is based on the Personal Health Train (PHT) paradigm, with privacy-enhancing features that ensure transparency and preserve privacy.

Background

In recent years, several emerging technologies have enabled privacy-preserving data analysis. The two main paradigms of DA are parallel and non-parallel approaches.

  • Parallel DA sends analysis replicas to data providers and uses protocols like Federated Learning (FL) to train data models.
  • Non-parallel DA transmits intermediate results between data providers, incrementally updating the results before returning them to the requester.

Our infrastructure leverages the PHT paradigm and incorporates novel features for containerized analysis, flexible execution environments, and transparent monitoring.

There are several methodologies based on these abstract principles:

  • Federated Learning (FL): FL is an approach where a central server aggregates results from multiple distributed data providers. Instead of collecting data in a central location, FL allows models to be trained across decentralized datasets. This is particularly useful in scenarios where data privacy and security are critical.
  • Incremental Learning: This approach is used when data needs to be processed in a specific sequence or when the output of one data provider becomes the input for the next. In machine learning, this involves updating a model incrementally as new data becomes available while the analysis moves from one data provider to another. This means that the model is trained or updated at each data provider along the route.

These concepts offer varying degrees of flexibility and complexity, and our PHT-based DA infrastructure focuses on ease of integration, compliance with privacy regulations, and extendibility.
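To make the difference between the two patterns concrete, the toy sketch below contrasts a federated averaging step with an incremental, station-by-station update. It is illustrative only and does not reflect the PADME implementation; the blending rule is a made-up example.

import numpy as np

# Toy "model": a parameter vector; each station computes its own local update.
station_updates = [np.array([1.0, 2.0]), np.array([3.0, 2.0]), np.array([2.0, 5.0])]

# Parallel DA / Federated Learning: a central server averages the local updates.
federated_model = np.mean(station_updates, axis=0)
print("federated average:", federated_model)

# Non-parallel DA / incremental learning: the intermediate result travels from
# station to station, and each station updates it before passing it on.
model = np.zeros(2)
for update in station_updates:
    model = 0.5 * model + 0.5 * update  # hypothetical blending rule
print("incrementally updated model:", model)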

Architectural Overview

The PHT originates from an analogy to a railway system, where Trains represent analysis tasks, Stations host confidential data, and a Central Service (CS) coordinates the tasks.

  • Trains encapsulate analytic tasks, moving between stations to consume data. The tasks are executed in containers, making them self-sufficient and independent of the underlying environment.
  • Stations hold confidential data and execute analytic tasks securely, under the control of station administrators.
  • Central Services (CS) manage train orchestration, results storage, and security policies.

Trains

Trains represent individual analytic tasks, moving between stations to collect data. They are containerized in Docker images, ensuring independence from the environment. The results are processed incrementally and can include various types of data, such as aggregated cohort sizes or statistical model updates.

Trains are executed in Docker-in-Docker (DinD) containers, and their lifecycle is managed via a clear state chart. Researchers can inspect the train’s progress and outcomes at different stages.

Stations

Stations are data nodes that hold sensitive data. They are autonomous, and administrators control which analytic tasks are accepted and executed. The station software executes containerized algorithms and ensures only privacy-compliant results are transmitted back to the requester.

Central Services

The CS handles train management, repository storage, and user access. It stores analytic algorithms and distributes them to stations. The CS also acts as a semi-trusted entity, handling intermediate results and ensuring security during train routing.

Monitoring Components

In addition to the core operational functionality of the infrastructure (e.g., train management), we address the growing challenge of transparency as the number of participating stations increases. Without visibility into the processes, stakeholders may lose trust in the system.

To enhance transparency, we developed a novel metadata schema based on RDF(S), which enriches every digital asset with detailed semantics, such as metadata about station owners, datasets, and dynamic train execution information (e.g., current state, CPU usage).
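As a rough illustration of the idea, a station could describe a running train with RDF as in the sketch below, built with rdflib. The namespace and property names are made up for this example; the actual PHT metadata schema defines its own vocabulary.

from rdflib import Graph, Literal, Namespace, RDF

# Hypothetical namespace standing in for the PHT metadata schema.
PHT = Namespace("https://example.org/pht#")

g = Graph()
g.bind("pht", PHT)

train = PHT["train42"]
station = PHT["stationAachen"]

g.add((train, RDF.type, PHT.TrainExecution))
g.add((train, PHT.executedAt, station))
g.add((train, PHT.currentState, Literal("running")))
g.add((train, PHT.cpuUsage, Literal(0.42)))

# The serialized metadata would be sent to the central metadata store in the CS.
print(g.serialize(format="turtle"))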

The Metadata Processing Unit at each station collects and transmits this information to a central metadata store in the Central Service (CS). This metadata is then visualized for the train requester via a Grafana frontend (WIP), offering insights into the train's progress.

To maintain the autonomy of stations, a customizable filter allows administrators to control what metadata is shared. This feature ensures that legal requirements are respected while providing essential transparency to users. The metadata-driven transparency enhances trust, allowing both station administrators and train requesters to monitor the status and results of analysis tasks without exposing sensitive data.

EDC Integration Overview

The integration of Eclipse Dataspace Components (EDC) within the PADME framework has been explored in this project. This integration aims to facilitate efficient and compliant data sharing and management. EDC provides a framework that supports secure and controlled data access.

What is Eclipse Dataspace Components (EDC)?

EDC is an open-source framework designed to simplify data integration, management, and sharing in distributed environments. It employs connectors that enable seamless interaction with various data sources such as databases, APIs, and file systems. EDC aims to establish a standard for interoperability among data providers and consumers, ensuring that data can be accessed, shared, and analyzed while maintaining strict adherence to privacy regulations.

Key Features of EDC Integration

  • Flexible Data Access: The integration supports two primary data transfer modes: Provider Push and Consumer Pull. This flexibility allows data providers to choose the most appropriate method for sharing their data based on their operational capabilities and the needs of the data consumers.

  • Interoperability: EDC connectors enable communication between different systems, enhancing collaboration among various research institutions and organizations. This interoperability allows for creating larger datasets for analysis and improving research outcomes.

Data Transfer Modes

EDC facilitates two distinct modes for data transfer, each catering to different use cases and operational requirements:

  1. Provider Push:

    • In this mode, the EDC connector fetches data from its backend systems and pushes it to the consumer's designated data sink. This method is particularly useful for batch processing scenarios where data can be transmitted in one go.
    • Benefits:
      • Ensures timely data delivery from the provider's backend.
      • Reduces the complexity of data access for consumers, who receive the data directly without needing to manage retrieval protocols.
      • Ideal for use cases where large volumes of data need to be shared at once, such as periodic reporting or large-scale data exports.
  2. Consumer Pull:

    • This mode allows the data consumer to actively request and pull data from the provider. Upon receiving a transfer request, the provider exposes the necessary credentials for accessing the data.
    • Benefits:
      • Empowers consumers with control over when and what data they receive, making it suitable for dynamic querying and on-demand data access.
      • Enhances efficiency by allowing consumers to access only the datasets they require, reducing unnecessary data transmission.
      • Facilitates scenarios where consumers need specific datasets at regular intervals, such as ongoing research projects.

Use Cases of EDC Integration

The integration of EDC within the PADME framework is exemplified through two primary use cases:

  1. PADME as a Data Provider:

    • In this scenario, researchers utilize the PADME platform to share analysis results. After conducting their analyses, they can register the results within the PADME Provider Connector. The results, enriched with essential metadata, become available through the EDC data catalog.
    • Process:
      • Researchers complete their analysis using the PADME framework.
      • Essential metadata about the analysis results is collected and registered within the PADME Provider Connector.
      • Results are made discoverable through the data catalog, allowing other researchers and institutions to access valuable insights for their studies.
  2. PADME as a Data Consumer:

    • In this use case, the PADME platform consumes data from various providers within the data space using EDC connectors. Before execution, the analysis algorithm requires credentials to access the data.
    • Process:
      • The PADME Consumer Connector queries available data catalogs from different organizations within the data space.
      • Individual contracts for data access are negotiated, and the necessary credentials are provided to the PADME platform.
      • The platform can then execute its algorithms using the consumed data, leading to meaningful analytical outcomes.

Technical Implementation

The technical implementation of EDC integration with the PADME framework involves several components and processes:

The PADME framework employs the Core EDC Connector and Sovity's Community Edition EDC Connector, which extends the functionality of the core EDC connector.

Aim of EDC Integration

The integration of EDC within the PADME framework presents several potential benefits:

  • Enhanced Collaboration: By facilitating data sharing and integration, EDC may encourage collaboration among researchers and institutions, leading to richer datasets for analysis.

  • Compliance with Privacy Regulations: The architecture is designed to adhere to privacy regulations, ensuring that sensitive data is protected while allowing for valuable analysis to occur.

Deployment in FAIRDS de.NBI Cluster

This section documents the deployment of PADME components in the de.NBI Kubernetes (K8s) cluster. We utilize Flux CD for continuous delivery and GitLab as our source repository platform.

Namespaces

Our deployments are spread across four key namespaces, each serving a specific function:

  • CentralService: Hosts the central service components.
  • Harbor: Manages and secures train images.
  • StationRegistry: Registers and onboards stations in the network.
  • StationSoftware: Hosts edge components for station-specific tasks and use case demonstrations.

Deployment Strategy

Source Repository - GitLab

We store our Kubernetes deployment files in GitLab.
GitLab Repo: Continuous delivery for the PHT demonstrator project in FAIRDS

Continuous Delivery - Flux CD

Flux CD continuously and automatically ensures that the state of the K8s cluster matches the configurations stored in GitLab. Any changes to the deployment files in GitLab trigger Flux CD to apply the updated configurations to the respective namespaces in Kubernetes.

Deployments and logs can be monitored for audit and troubleshooting purposes, allowing for proactive management of the cluster's health and performance.

The architecture of each namespace is structured around several core Kubernetes resource types:

  • Pods: The fundamental deployable units in Kubernetes; each consists of one or more containers.
  • Services: An abstraction layer that provides a stable endpoint for communicating with the dynamic pods.
  • ConfigMaps & Secrets: Mechanisms to inject configuration data or sensitive information into the pods that use them.
  • Ingress & Network Policies: They manage inbound and outbound traffic to and from the pods, respectively.
  • Dependencies: Arrows or lines connecting components in the diagrams signify their relationships and dependencies. For instance, a pod that relies on a database is connected to a database pod.

Deployment Details

At the foundation of each namespace are pods (pod) that represent the running instances of applications. These pods are managed by ReplicaSets (rs), which ensure that the desired number of pod replicas is maintained. The creation and management of these ReplicaSets is handled by deployments (deploy).

For storage needs, certain pods have associations with Persistent Volume Claims (pvc), signifying that they rely on persistent storage. In terms of network communication, some pods are exposed via services (svc), which act as a gateway for external or internal access. Furthermore, a few of these services are linked to ingress (ing), implying they are accessible from outside the cluster or have specific routing rules applied.

CentralService Namespace

centralservice.png

Harbor Namespace

For the Harbor deployment, we utilize Helm charts. Helm charts are packages for Kubernetes applications that streamline deployment and management. They offer consistency across deployments and support versioning, allowing for simplified and reproducible setups. The Helm charts are sourced from the official Harbor repositories and documentation.

harbor.png

StationRegistry Namespace

stationregistry.png

StationSoftware Namespace

stationsoftware.png