RuQaD Architecture

This document describes the architecture of the RuQaD Demonstrator.

The RuQaD Demonstrator is the main product of the project "RuQaD Batteries - Reuse Quality-assured Data for Batteries". The project is a sub-project of the FAIR Data Spaces project, supported by the German Federal Ministry of Education and Research under the grant number (Förderkennzeichen) FAIRDS09.

Introduction and Goals

Requirements Overview

The aim of the project is to build a demonstrator that connects a research data infrastructure (PoLiS) with an industrial data space (BatCAT) to reuse datasets for new applications.

Technically, this is to be realized as a tool (RuQaD) that exports data from the source system, the research data management system Kadi4Mat. RuQaD carries out quality and metadata checks (involving an existing pipeline) and pushes the data into the target system, the research data management system LinkAhead, which forms the basis of the BatCAT data space.

The motivation is to allow a seamless exchange of data between research and industry. Since it is important to ensure that the FAIR criteria are met, a dedicated check for metadata and the FAIR criteria is implemented. The FAIR criteria can be summarized into the following practical checks (illustrated by the sketch after the list):

  • Is a PID present?
  • Is the domain-specific metadata complete?
  • Is there provenance information?
  • Does the data include license information?
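
The checks above can be pictured with the following minimal sketch. It is purely illustrative: the helper function and the field names (which loosely follow RO-Crate/schema.org conventions) are assumptions, not the actual RuQaD implementation, which lives in the quality checker and crawler components described below.

    # Hypothetical illustration of the practical FAIR checks listed above.
    # The field names loosely follow RO-Crate/schema.org usage; the real checks
    # are implemented in the quality checker and crawler components.
    def practical_fair_checks(metadata: dict) -> dict:
        """Map each practical check to a boolean result."""
        return {
            "pid_present": bool(metadata.get("identifier")),        # Is a PID present?
            "license_present": bool(metadata.get("license")),       # License information?
            "provenance_present": bool(metadata.get("author")),     # Provenance information?
            "domain_metadata_complete": all(                        # Placeholder for the
                key in metadata for key in ("name", "description")  # domain-specific schema.
            ),
        }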

The normative description of all requirements is contained in the non-public project proposal.

Quality Goals

RuQaD's main quality goals are, using the terms from ISO 25010 (see glossary for definitions):

  • Operability: As a demonstrator, the main quality goal of the software is that it can be understood and learned easily. This facilitates building prototypes or production software based on the functionality of the demonstrator.
  • Compatibility: The system connects different software systems and therefore acts as a compatibility component.
  • Transferability: In order to serve as a tool for demonstration in different environments, the system needs to be built at a high level of transferability.
  • Maintainability: The project's limited time frame also scopes the maintainability goal: as a demonstrator, some parts of the system can be considered work-in-progress and need to be modified, corrected or adapted over time. Whenever code is contributed to components which already have a foreseeable future in other contexts, maintainability must be taken into consideration. On the other hand, for code which mainly serves the demonstration of the particular scenario (PoLiS/BatCAT), it is acceptable to provide only PoC implementations which need adjustments for long-term maintenance.

Stakeholders

  • IndiScale: Quality and FAIRness of entities are successfully annotated.
  • PoLiS / Kadi4Mat: Entities can be exported to other dataspaces.
  • BatCAT Dataspace: Receive valuable dataset offers.
  • FAIR DS Project: Showcase the concept of FAIR Data Spaces with a relevant, plausible and convincing use case.

Context and Scope

Business Context

RuQaD connects an industrial data space (the BatCAT data space for the development of a "Battery Cell Assembly Twin") with an external research infrastructure (PoLiS - Post Lithium Storage Cluster of Excellence). In this regard, RuQaD is a Value-Added Service and addresses several DSSC Building Blocks in compliance with the FAIR Guiding Principles.

RuQaD builds on FAIR components (LinkAhead: F1, F2, F3, F4, A1, A2; .eln-File-Format / RO-Crate: F2, F3, I1, I2, I3, Data Models, Provenance & Traceability; EDC: A1, A2, I1, I2, I3). The focus of RuQaD is to promote the Reusability (R) of datasets. A central barrier to reuse is the quality requirements of the reusing party. RuQaD uses the quality check pipeline of the FAIR Data Spaces Demonstrator 4.2 to assess the quality of a given dataset (e.g. missing values).

Missing license statements (R1.1) or provenance information (R1.2), as well as incompatibility with expected standards (R1.3), will be flagged by the RuQaD demonstrator.

Loading the datasets into LinkAhead makes it possible to offer the quality-assured and FAIR-compliant datasets in the BatCAT data space using the data space's EDC-based infrastructure (Data, Services and Offerings Descriptions; Publication and Discovery).

Simple and cost-effective reuse of datasets for new products is a central value proposition of data spaces in general and the BatCAT data space in particular (Business Model Development).

The use-case scenario for the datasets from PoLiS is the enrichment of the characterization data (e.g. porosity of cathode material of sodium-ion cells) which has been collected in the BatCAT project, in order to develop more reliable and more robust ML models and algorithms (Use Case Development; Data Product Development).

While governance and legal Building Blocks will not be addressed directly by RuQaD, the demonstrator supports the protection of personal data and IP rights by checking the presence of license and provenance metadata (Regulatory-Compliance).

Industry partners of the BatCAT data space aim to develop a digital twin for cell assembly and manufacturing of batteries in order to build greener, more sustainable and cost-effective batteries.

To do this, they need both a large amount of data for training ML models and data from different sources to ensure the robustness of their models. They therefore benefit directly from the integration of quality-checked external data sources.

While some legal and ethical issues of data reuse can be addressed by the licensing of datasets as required by the R1.1 principle, others remain, e.g. the liability when datasets have been published with erroneous licenses. RuQaD mitigates these issues by promoting the provenance of datasets and ensuring rich metadata annotations before publishing data into the data space.

Technical Context

System Landscape

The system landscape diagram (below) shows the RuQaD Service operating inside the BatCAT Data Space and connecting it to the PoLiS Research Data Infrastructure. Its elements:

  • BatCAT Data Space Node: LinkAhead and EDC-based components of the BatCAT Data Space.
  • Data Consumer: R&D departments from the consortial partners of the BatCAT Data Space; they browse datasets and request access for reuse.
  • Data Curator: IndiScale curates data from PoLiS and offers them for reuse in the BatCAT Data Space; reviews and controls the offering of datasets to the BatCAT Data Space.
  • Quality assurance 4.2: Gitlab pipeline for quality assurance based on the demonstrator 4.2.
  • LinkAhead Crawler: Framework for file scanning, LinkAhead entity building and synchronization.
  • RuQaD Demonstrator: A purely functional component for checking FAIRness, invoking the QA pipeline and ingesting data to LinkAhead; it pulls battery-related datasets from Kadi4Mat, invokes the quality assurance pipeline on raw data from PoLiS, and ingests data into LinkAhead with the LinkAhead crawler.
  • PoLiS Kadi4Mat: The Kadi4Mat electronic lab notebook instance of the PoLiS Cluster of Excellence; it is monitored by the RuQaD Demonstrator.
System Landscape Diagram

Building Block View

Whitebox Overall System

Building Blocks

Rationale

The RuQaD demonstrator uses service integration to achieve the goal of connecting dataspaces in a FAIR manner. It configures and combines existing services into multiple stages of FAIRness evaluation and data integration.

Contained Building Blocks
  • Monitor: Checks for new data in a Kadi4Mat instance.
  • Quality checker: Passes new data to the quality checker which was developed in WP 4.2 of the previous FAIR DS project.
  • RuQaD crawler: Calls the LinkAhead crawler for metadata checking and for insertion into the BatCAT data space node.

Monitor

Purpose / Responsibility

The monitor continuously polls a Kadi4Mat instance (representing the source dataspace) for new data items.

Interface(s)

Each new data item is passed on to the quality checker for evaluation of the data quality. Afterwards, the monitor passes the quality check report and the original data to the crawler, which eventually leads to insertion into the BatCAT data space, where the items can be checked by data curators and retrieved by data consumers.
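
This handover can be pictured with the rough sketch below; it is not the literal implementation in src/ruqad/monitor.py. The fetch_new_items and export_to_eln callables are placeholders for the Kadi4Mat polling and export logic, while checker.check(filename, target_dir) and trigger_crawler(...) correspond to the interfaces described in the following sections (the directory argument passed to trigger_crawler is an assumption).

    # Rough sketch of the monitor loop, not the literal implementation.
    import time
    from tempfile import TemporaryDirectory

    POLL_INTERVAL = 60  # seconds; an assumed value

    def monitor_loop(fetch_new_items, export_to_eln, checker, trigger_crawler):
        """fetch_new_items() and export_to_eln(item, dir) are placeholders for
        the Kadi4Mat polling and .eln export logic."""
        while True:
            for item in fetch_new_items():
                with TemporaryDirectory() as tmp:
                    eln_file = export_to_eln(item, tmp)  # write the .eln export into tmp
                    checker.check(eln_file, tmp)         # quality report ends up in tmp
                    trigger_crawler(tmp)                 # metadata check and ingest (argument assumed)
            time.sleep(POLL_INTERVAL)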

Directory/File Location

Source code: src/ruqad/monitor.py

Quality Checker

Purpose / Responsibility

The quality checker executes data quality checks on the data which was retrieved from the input dataspace (a Kadi4Mat instance in this case). It provides a structured summary for other components and also a detailed report for human consumption.

Interface(s)

The quality checker is implemented as a Python class QualityChecker which provides mainly a check(filename, target_dir) method to check individual files. This class is available in the module ruqad.qualitycheck.
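
A hedged usage sketch, assuming the class can be instantiated without arguments (constructor parameters such as credentials for the Gitlab-based pipeline may in fact be required) and that check() returns a simple status:

    # Hedged usage sketch for the quality checker interface described above.
    from ruqad.qualitycheck import QualityChecker

    checker = QualityChecker()                                 # constructor arguments omitted
    result = checker.check("sample-record.eln", "/tmp/ruqad-qc")
    print("quality check finished:", result)                   # return semantics assumed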

Quality / Performance Characteristics

The quality checker relies on the demonstrator 4.2 to perform the checks. Thus, RuQaD relies on further maintenance by the demonstrator's development team.

Directory/File Location

Source code: src/ruqad/qualitycheck.py

Open issues
  • The demonstrator 4.2 service currently relies on running as Gitlab pipeline jobs, which introduces a certain administrative overhead for production deployment.
  • It is possible and may be desirable to parallelize the quality check for multiple files by distributing the load over a number of service workers instead of checking files sequentially, as sketched below.
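
A minimal sketch of that parallelization idea, assuming that independent QualityChecker instances (and the underlying Gitlab pipeline) tolerate concurrent invocations, which would need to be verified:

    # Distribute quality checks over a pool of workers instead of checking sequentially.
    from concurrent.futures import ThreadPoolExecutor

    from ruqad.qualitycheck import QualityChecker

    def check_many(filenames, target_dir, max_workers=4):
        """Run the quality check for several files concurrently."""
        def run_one(fname):
            return fname, QualityChecker().check(fname, target_dir)

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return dict(pool.map(run_one, filenames))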

RuQaD Crawler

Purpose/Responsibility

The RuQaD Crawler executes metadata checks and bundles data and metadata for insertion into the BatCAT dataspace.

Interface(s)

The crawler is implemented as a Python module with a function trigger_crawler(...) which looks for data and quality check files to evaluate and insert into the BatCAT dataspace. It uses LinkAhead's crawler framework for metadata checks, object creation and interaction with the target dataspace.
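
A call could look roughly as follows; the exact parameters of trigger_crawler(...) are abbreviated in this document, so the single directory argument shown here is an assumption:

    # Hedged usage sketch; the signature of trigger_crawler(...) is assumed.
    from ruqad.crawler import trigger_crawler

    # Directory containing the exported .eln data and the quality check report:
    trigger_crawler("/tmp/ruqad-item-0042")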

Directory / File Location

Source code: src/ruqad/crawler.py

Level 2

White Box RuQaD Crawler

The component diagram shows the RuQaD Crawler container with its components: the Crawler wrapper, the CFood declaration, the Identifiables declaration, the Converter and the Transformer. The crawler is used by the RuQaD monitor and itself uses the LinkAhead Crawler, the framework for file scanning, LinkAhead entity building and synchronization.
Component Diagram

Motivation

The Crawler reuses functionality of the LinkAhead crawler:

  • Declarative creation of data objects from structured input data.
  • Idempotent and context sensitive scan-create-and-insert-or-update procedures.

This functionality is extended by custom converters and data transformers.

Contained Building Blocks
  • Crawler wrapper: Calls the LinkAhead crawler on the files given by the RuQaD monitor, with the correct settings.
  • CFood declaration: Specification of how entities in BatCAT shall be constructed from input data.
  • Identifiables declaration: Specification of identifying properties of entities in BatCAT.
  • Converter: Custom conversion plugin to create resulting (sub) entities.
  • Transformer: Custom conversion plugin to transform input data into properties.
Crawler wrapper
Purpose / Responsibility

The crawler wrapper scans files in specific directories of the file system and synchronizes them with the LinkAhead instance. Before insertions and updates of Records in LinkAhead, a metadata check is carried out to verify whether the metadata that was exported from Kadi4Mat is compatible with the target data model (in LinkAhead and the EDC). Validation failures lead to specific validation error messages and prevent insertions or updates of the scan result. The software component also checks the FAIRness of the data exported from Kadi4Mat (in ELN format).

Interface(s)

The crawler uses:

  • A cfood (file in YAML format) which specifies the mapping from data found on the file system to Records in LinkAhead.
  • A definition of the identifiables (file in YAML format) which defines the properties that are needed to uniquely identify Records of the data model in LinkAhead.
  • The data model definition (file in YAML format). This is needed by the crawler to do the metadata check.
  • Crawler extensions specific to the project (custom converters and custom transformers). These are Python modules containing functions and classes that can be referenced within the cfood.

The interface is a Python function that is implemented as a module of the RuQaD demonstrator. The function calls the scanner and main crawler functions of the LinkAhead crawler software.
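
As an illustration of such a crawler extension, a custom transformer is essentially a small Python function that maps an input value to a property value and is referenced by name from the cfood. The function below is hypothetical, and the (value, parameters) signature is an assumption about what the LinkAhead crawler framework expects, not taken from its documentation:

    # Hypothetical custom transformer; the signature is an assumption about the
    # LinkAhead crawler framework, and the function is not part of the RuQaD sources.
    def porosity_fraction_to_percent(in_value, in_parameters):
        """Convert a porosity given as a fraction (0..1) into a percentage."""
        return float(in_value) * 100.0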

Directory / File Location

Source code:

  • Main interface: ruqad/src/ruqad/crawler.py
  • Crawler extensions:
    • Custom converters (currently not used): ruqad/src/ruqad/crawler-extensions/converters.py
    • Custom transformers: ruqad/src/ruqad/crawler-extensions/transformers.py
Fulfilled Requirements
  • Data ingest from exported ELN file into LinkAhead.
  • Data ingest from quality check into LinkAhead.
  • Check of FAIRness of data from ELN file.
  • Metadata check.

Runtime View

The RuQaD monitor runs continually and acts on new data items. The handling of one such data item is described here.

Ingestion of a data item

The sequence diagram shows the interaction of the RuQaD monitor, the quality checker and the RuQaD crawler with PoLiS Kadi4Mat, the quality assurance 4.2 pipeline, the LinkAhead crawler and the BatCAT Data Space Node:

  1. The RuQaD monitor polls Kadi4Mat for new items.
  2. Kadi4Mat returns a new item.
  3. The monitor asks the quality checker to check the quality of the data item.
  4. The quality checker triggers the quality check pipeline (quality assurance 4.2).
  5. The pipeline returns the quality report.
  6. The quality checker returns the quality report to the monitor.
  7. The monitor triggers the RuQaD crawler.
  8. The RuQaD crawler invokes the LinkAhead crawler with the data item and the quality report.
  9. The LinkAhead crawler returns the generated items with their FAIRness level.
  10. The generated entities are inserted and/or updated in the BatCAT Data Space Node.
Data item ingestion

  • The monitor continually polls the Kadi4Mat server for new items.
  • Each new item is sequentially passed to the Demonstrator 4.2 quality checker for data quality checks and to the crawler component for metadata FAIRness evaluation, before being inserted into the target dataspace at BatCAT.

Deployment View

The deployment of the demonstrator builds on the BatCAT Testbed. The BatCAT Testbed uses minikube to set up a local Kubernetes cluster on which the core components of the BatCAT Data Space are deployed for testing and development purposes. This includes EDC Connectors for several agents of the data space, identity management, federated catalog services, databases and LinkAhead instances for data storage management.

The RuQaD demonstrator has been integrated into this testbed, as has the Kadi4Mat ELN.

The 4.2 Demonstrator has not been integrated into the testbed. This would entail setting up a full-blown Gitlab instance, setting up a Gitlab Runner and loading the runner configuration into a Git repository, which is too complex for demonstrating a single API call.

The steps for setting up the BatCAT Testbed are documented in the RuQaD clone of the BatCAT Testbed repository.

The deployment diagram shows the BatCAT Testbed as a Kubernetes cluster together with the external Gitlab instance:

  • Several pods: the BatCAT Data Space Node and all other software systems are grouped together in the diagram for simplicity; the number of pods in the testbed which are only concerned with the data space components is greater than 10.
  • Kadi pod: runs PoLiS Kadi4Mat, the Kadi4Mat electronic lab notebook instance of the PoLiS Cluster of Excellence.
  • Ruqad pod: runs the RuQaD Demonstrator, a purely functional component for checking FAIRness, invoking the QA pipeline and ingesting data to LinkAhead; it pulls battery-related datasets from Kadi4Mat, invokes the quality assurance pipeline on raw data from PoLiS, and ingests data into LinkAhead.
  • Gitlab (outside the cluster): hosts the quality assurance 4.2 pipeline; the Gitlab instance cannot be integrated into the testbed.
Deployment View

Cross-cutting Concepts

Using YAML for storing machine readable information

The YAML format is used in several components of the software for storing and exchanging information in a format that is machine-readable and also human-readable at the same time.
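
For illustration, such declarative files can be read with a standard YAML parser. The embedded document below is a guessed, minimal identifiables-style declaration, not a copy of the project's actual files:

    # Illustration of the YAML-as-interface idea (PyYAML required).
    import yaml

    example = """
    Dataset:
      - name
      - external_id
    """
    declaration = yaml.safe_load(example)
    print(declaration)  # {'Dataset': ['name', 'external_id']}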

REST interfaces

Multiple components of the software use REST interfaces for data exchange; the sketch after the list illustrates typical calls:

  • Gitlab-API
  • kadi4mat-export
  • LinkAhead
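
The sketch below shows what such calls typically look like; the endpoint paths, parameters and authentication details are illustrative assumptions and should be checked against the respective Kadi4Mat and Gitlab API documentation:

    # Illustrative REST calls; endpoints and parameters are assumptions.
    import requests

    KADI_URL = "https://kadi.example.org"      # hypothetical Kadi4Mat instance
    GITLAB_URL = "https://gitlab.example.org"  # hypothetical Gitlab instance

    # Poll Kadi4Mat for records (token-based authentication assumed):
    records = requests.get(
        f"{KADI_URL}/api/records",
        headers={"Authorization": "Bearer <personal-access-token>"},
        timeout=30,
    ).json()

    # Trigger a Gitlab pipeline, e.g. for the quality check:
    requests.post(
        f"{GITLAB_URL}/api/v4/projects/<project-id>/trigger/pipeline",
        data={"token": "<trigger-token>", "ref": "main"},
        timeout=30,
    )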

FAIR

The FAIR guiding principles are key requirements for the FAIR Data Spaces project in general and the RuQaD demonstrator in particular. They are designed as a "guideline for those wishing to enhance the reusability of their data holdings. ... the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data ..." (Wilkinson et al., https://doi.org/10.1038/sdata.2016.18)

The RuQaD demonstrator especially addresses the "R" principles:

  • R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
    • R1.1. (Meta)data are released with a clear and accessible data usage license
    • R1.2. (Meta)data are associated with detailed provenance
    • R1.3. (Meta)data meet domain-relevant community standards

Data spaces

A data space is an "Interoperable framework, based on common governance principles, standards, practices and enabling services, that enables trusted data transactions between participants." (DSSC Blueprint, CEN CENELEC workshop agreement on Trusted Data Transactions)

While the BatCAT data space is important as a part of the demonstration and use-case scenario, building and maintaining the data space is not itself in scope of the RuQaD demonstrator. Instead, RuQaD needs to adapt to the API of the BatCAT data space, which builds on the research data management system LinkAhead as its core component for storing research data.
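
For orientation, registering an entity through the LinkAhead Python client looks roughly like the sketch below. The record type and property are invented for this example, and the client calls are assumed from the LinkAhead/CaosDB client library rather than prescribed by this document:

    # Rough sketch: register a quality-assured dataset in LinkAhead.
    # Record type and property are invented; the client API is assumed.
    import linkahead as db

    rec = db.Record(name="PoLiS-dataset-0042")
    rec.add_parent(name="Dataset")                       # invented record type
    rec.add_property(name="license", value="CC-BY-4.0")  # invented property
    rec.insert()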

ELN-File-Format / RO-Crate

The ELN-File-Format is a standard for exchanging information between electronic lab notebooks and other research data management software. It is built on top of RO-Crate (Research Object Crate), which is a standard for self-contained data storage in accordance with the FAIR principles.
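
Since an exported .eln file is essentially a ZIP archive containing an RO-Crate, reading its metadata can be sketched as follows (assuming, per the ELN-File-Format convention, that the archive contains a ro-crate-metadata.json file):

    # Minimal sketch: read the RO-Crate metadata from an exported .eln file.
    import json
    import zipfile

    def read_rocrate_metadata(eln_path: str) -> dict:
        """Return the parsed ro-crate-metadata.json from an .eln archive."""
        with zipfile.ZipFile(eln_path) as zf:
            name = next(n for n in zf.namelist()
                        if n.endswith("ro-crate-metadata.json"))
            return json.loads(zf.read(name))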

ETL (Extract-Transform-Load)

In principle, the whole pipeline can be considered an extract-transform-load process: data is extracted from Kadi4Mat, transformed into a format that can be interpreted by LinkAhead, and afterwards loaded into LinkAhead and connected to the EDC.

Risks and Technical Debts

Software depends on an external Gitlab pipeline

The quality checker pipeline is run in an external instance of Gitlab. As this external system might be subject to changes in software or API, the whole procedure of the demonstrator can become unstable in case of incompatible changes.

The ELN-File-Format is work-in-progress

Parts of the ELN-File-Format are not completely specified, and the software implementations (e.g. in Kadi4Mat) are partly incomplete and contain bugs. Currently the demonstrator implements a few workarounds for known problems. These can be considered technical debt that needs to be removed once the ELN-File-Format and the software implementing it reach a stable version.

Glossary

  • Operability (ISO 25010): "System can be understood, learned, used and is attractive to users."
  • Transferability (ISO 25010): "System can be transferred from one environment to another."
  • Maintainability (ISO 25010): "System can be modified, corrected, adapted or improved due to changes in environment or requirements."
  • Compatibility (ISO 25010): "Two or more systems can exchange information while sharing the same environment."
  • FAIR: Findable, Accessible, Interoperable, Reusable (defined in: https://doi.org/10.1038/sdata.2016.18)
  • ELN: Electronic Lab Notebook
  • DSSC: Data Space Support Center, https://dssc.eu/
  • DSSC Building Blocks: Building blocks of the data space architecture as defined by the DSSC Blueprint, https://dssc.eu/space/bv15e/766061169/Data+Spaces+Blueprint+v1.5+-+Home
  • PoLiS: Post-Lithium Storage Cluster of Excellence, https://www.postlithiumstorage.org
  • BatCAT: Battery Cell Assembly Twin, Horizon Europe Project, https://www.batcat.info/

License

  • This architecture documentation is published under CC-BY-ND 4.0
  • Copyright (C) 2024 IndiScale GmbH mailto:info@indiscale.com
  • Copyright (C) 2024 Timm Fitschen
  • Copyright (C) 2024 Alexander Schlemmer
  • Copyright (C) 2024 Daniel Hornung
  • Copyright (C) 2024 Henrik tom Wörden