This document describes the architecture of the RuQaD Demonstrator.
The RuQaD Demonstrator is the main product of the project "RuQaD Batteries - Reuse Quality-assured Data for Batteries". The project is a sub-project of the FAIR Data Spaces project, supported by the German Federal Ministry of Education and Research under the funding code (Förderkennzeichen) FAIRDS09.
The aim of the project is to build a demonstrator that connects a research data infrastructure (PoLiS) with an industrial data space (BatCAT) to reuse datasets for new applications.
Technically, this is to be realized as a tool (RuQaD) that exports data from the source system, the research data management system Kadi4Mat. RuQaD carries out quality and metadata checks (involving an existing pipeline) and pushes the data into the target system, the research data management system LinkAhead, which is the basis of the BatCAT data space.
The motivation is to allow a seamless exchange of data between research and industry. Because it is important to ensure that the FAIR criteria are met, a dedicated check for metadata and the FAIR criteria is implemented. The FAIR criteria can be summarized as the following practical checks:
The normative description of all requirements is contained in the non-public project proposal.
RuQaD's main quality goals are, using the terms from ISO 25010 (see glossary for definitions):
Role/Name | Expectations |
---|---|
IndiScale | Quality and FAIRness of entities are successfully annotated. |
PoLiS / Kadi4Mat | Entities can be exported to other dataspaces. |
BatCAT Dataspace | Receive valuable dataset offers. |
FAIR DS Project | Showcase the concept of FAIR Data Spaces with a relevant, plausible and convincing use case. |
RuQaD connects an industrial data space (BatCAT data space for the development of a "Battery Cell Assembly Twin") with an external research infrastructure (PoLiS - Post Lithium Storage Cluster of Excellence). In this regard, RuQaD is a Value-Added Service and addresses several DSSC Building Blocks in compliance with the FAIR Guiding Principles.
RuQaD builds on FAIR components (LinkAhead: F1, F2, F3, F4, A1, A2; .eln-File-Format, RO-Crate: F2, F3, I1, I2, I3, Data Models, Provenance & Traceability; EDC: A1, A2, I1, I2, I3). The focus of RuQaD is to promote the Reusability (R) of datasets. A central barrier to reuse is the set of quality requirements of the reusing party. RuQaD uses the quality check pipeline of the FAIR Data Spaces Demonstrator 4.2 to assess the quality of a given dataset (e.g. missing values).
Missing license statements (R1.1) or provenance information (R1.2) will be flagged by the RuQaD demonstrator, as will incompatibility with expected standards (R1.3).
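The three R1 checks can be illustrated with a minimal sketch. It is not the demonstrator's implementation: the flat metadata dictionary, the field names `license`, `author`, `dateCreated` and `conformsTo`, and the set `EXPECTED_STANDARDS` are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ReusabilityReport:
    """Collects one finding per violated R1 sub-principle."""
    findings: dict = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # A dataset passes only if nothing was flagged.
        return not self.findings


# Hypothetical set of accepted metadata standards (assumption).
EXPECTED_STANDARDS = {"https://w3id.org/ro/crate/1.1"}


def check_reusability(metadata: dict) -> ReusabilityReport:
    """Flag missing license (R1.1), provenance (R1.2) and standard issues (R1.3)."""
    report = ReusabilityReport()
    if not metadata.get("license"):
        report.findings["R1.1"] = "no license statement"
    if not metadata.get("author") or not metadata.get("dateCreated"):
        report.findings["R1.2"] = "incomplete provenance (author, dateCreated)"
    if metadata.get("conformsTo") not in EXPECTED_STANDARDS:
        report.findings["R1.3"] = "unexpected or missing metadata standard"
    return report


# Example: a license is present, but the creation date and the standard are missing.
report = check_reusability({"license": "CC-BY-4.0", "author": "A. Researcher"})
print(report.findings)
```

In this sketch a dataset is flagged rather than rejected, so a curator can still inspect the findings.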
Loading the datasets into LinkAhead makes it possible to offer the quality-assured and FAIR-compliant datasets in the BatCAT data space using the data space's EDC-based infrastructure (Data, Services and Offerings Descriptions; Publication and Discovery).
Simple and cost-effective reuse of datasets for new products is a central value proposition of data spaces in general and the BatCAT data space in particular (Business Model Development).
The use case scenario of datasets from PoLiS is the enrichment of characterization data (e.g. porosity of cathode material of sodium-ion cells) which have been collected in the BatCAT project to develop more reliable and more robust ML models and algorithms (Use Case Development; Data Product Development).
While governance and legal Building Blocks will not be addressed directly by RuQaD, the demonstrator supports the protection of personal data and IP rights by checking the presence of licence and provenance meta data (Regulatory-Compliance).
Industry partners of the BatCAT data space are working to develop a digital twin for cell assembly and manufacturing of batteries to build greener, more sustainable and cost-effective batteries.
To do this, they need both a large amount of data for training ML models and data from different sources to ensure the robustness of their models. They therefore benefit directly from the integration of quality-checked external data sources.
While some legal and ethical issues of data reuse can be addressed by the licencing of datasets as required by the R1.1 principle, others remain, e.g. the liability when datasets have been published with erroneous licences. RuQaD mitigates these issues by promoting the provenance of datasets and ensuring rich metadata annotations before publishing data into the data space.
The RuQaD demonstrator uses service integration to achieve the goal of connecting dataspaces in a FAIR manner. It configures and combines existing services into multiple stages of FAIRness evaluation and data integration.
The monitor continuously polls a Kadi4Mat instance (representing the source dataspace) for new data items.
Each new data item is passed on to the quality checker for evaluation of the data quality. Afterwards the monitor passes the quality check report and the original data to the crawler, which eventually leads to insertion in the BatCAT data space where the items can be checked by data curators and retrieved by data consumers.
Source code: src/ruqad/monitor.py
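The control flow described above can be sketched as a simple polling loop. This is not the content of `monitor.py`: the callables `fetch_new_items`, `run_quality_check` and `hand_over_to_crawler` as well as the parameters `poll_interval` and `max_cycles` are hypothetical names, injected here so that the sketch runs without a Kadi4Mat or LinkAhead instance.

```python
import time
from typing import Callable, Iterable, Optional


def run_monitor(
    fetch_new_items: Callable[[], Iterable[str]],
    run_quality_check: Callable[[str], dict],
    hand_over_to_crawler: Callable[[str, dict], None],
    poll_interval: float = 60.0,
    max_cycles: Optional[int] = None,
) -> None:
    """Poll the source for new items and pass each one through the pipeline."""
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for item in fetch_new_items():
            # First the quality check, then the crawler with data and report.
            report = run_quality_check(item)
            hand_over_to_crawler(item, report)
        cycle += 1
        # Wait before the next poll unless this was the last cycle.
        if max_cycles is None or cycle < max_cycles:
            time.sleep(poll_interval)


# Usage with in-memory stand-ins: two new items in the first poll, none in the second.
handled = []
batches = iter([["item-1", "item-2"], []])
run_monitor(
    fetch_new_items=lambda: next(batches),
    run_quality_check=lambda item: {"item": item, "passed": True},
    hand_over_to_crawler=lambda item, report: handled.append((item, report["passed"])),
    poll_interval=0.0,
    max_cycles=2,
)
print(handled)
```

Passing the three stages as callables keeps the loop testable. A real deployment would leave `max_cycles` unset so that the monitor runs continuously.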
The quality checker executes data quality checks on the data which was retrieved from the input dataspace (a Kadi4Mat instance in this case). It provides a structured summary for other components and also a detailed report for human consumption.
The quality checker is implemented as a Python class `QualityChecker`, whose main entry point is the method `check(filename, target_dir)` for checking individual files. The class is available in the module `ruqad.qualitycheck`.
The quality checker relies on the demonstrator 4.2 to perform the checks. Thus, RuQaD relies on further maintenance by the demonstrator's development team.
Source code: src/ruqad/qualitycheck.py
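The following stand-in illustrates the call shape `check(filename, target_dir)` and the two kinds of output mentioned above. It is only a sketch under stated assumptions: the class name `SketchQualityChecker`, the boolean return value, the output file names `summary.json` and `report.txt`, and the local count of missing CSV values are all hypothetical. The actual `QualityChecker` delegates the checks to the demonstrator 4.2 pipeline.

```python
import csv
import json
import tempfile
from pathlib import Path


class SketchQualityChecker:
    """Local stand-in that mimics the call shape check(filename, target_dir)."""

    def check(self, filename: str, target_dir: str) -> bool:
        with open(filename, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
        # Count empty cells per column as a simple "missing values" metric.
        missing = {}
        for row in rows:
            for column, value in row.items():
                if column is None:
                    continue  # surplus cells without a header are ignored
                if value is None or value.strip() == "":
                    missing[column] = missing.get(column, 0) + 1
        passed = not missing
        out = Path(target_dir)
        out.mkdir(parents=True, exist_ok=True)
        name = Path(filename).name
        # Structured summary for other components.
        summary = {"file": name, "rows": len(rows), "missing": missing, "passed": passed}
        (out / "summary.json").write_text(json.dumps(summary, indent=2), encoding="utf-8")
        # Detailed report for human readers.
        lines = [f"Quality report for {name}", f"Rows checked: {len(rows)}"]
        lines += [f"Column '{c}': {n} missing value(s)" for c, n in sorted(missing.items())]
        (out / "report.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
        return passed


# Usage: a small file in which one porosity value is missing.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "porosity.csv"
    data_file.write_text("sample,porosity\nA,0.31\nB,\n", encoding="utf-8")
    result = SketchQualityChecker().check(str(data_file), str(Path(tmp) / "out"))
    summary = json.loads((Path(tmp) / "out" / "summary.json").read_text(encoding="utf-8"))
    print(result, summary["missing"])
```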
The RuQaD Crawler executes metadata checks and bundles data and metadata for insertion into the BatCAT dataspace.
The crawler is implemented as a Python module with a function `trigger_crawler(...)`, which looks for data and quality check files to evaluate and insert into the BatCAT dataspace. It uses LinkAhead's crawler framework for metadata checks, object creation and interaction with the target dataspace.
Source code: src/ruqad/crawler.py
The Crawler reuses functionality of the LinkAhead crawler:
This functionality is extended by custom converters and data transformers.
The crawler wrapper scans files in specific directories of the file system and synchronizes them with the LinkAhead instance. Before `Records` are inserted or updated in LinkAhead, a metadata check verifies whether the metadata exported from Kadi4Mat is compatible with the target data model (in LinkAhead and the EDC). A validation failure produces specific validation error messages and prevents insertions or updates of the scan result. The software component also checks the FAIRness of the data exported from Kadi4Mat (in ELN format).
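The idea of validating against the target data model before any insertion can be sketched as follows. This does not use the LinkAhead crawler API: `DATA_MODEL`, `ValidationError`, `validate` and `insert_if_valid` are hypothetical names, and a plain list stands in for the LinkAhead instance.

```python
class ValidationError(Exception):
    """Raised when a record is incompatible with the target data model."""

    def __init__(self, messages):
        super().__init__("; ".join(messages))
        self.messages = messages


# Hypothetical target data model: property name -> expected Python type.
DATA_MODEL = {"title": str, "license": str, "porosity": float}


def validate(record: dict) -> list:
    """Return one specific message per incompatibility, or an empty list."""
    messages = []
    for name, expected in DATA_MODEL.items():
        if name not in record:
            messages.append(f"missing property '{name}'")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            messages.append(f"property '{name}' should be {expected.__name__}, got {actual}")
    return messages


def insert_if_valid(record: dict, store: list) -> None:
    """Insert the record only if validation succeeds."""
    messages = validate(record)
    if messages:
        # Validation failure prevents the insertion.
        raise ValidationError(messages)
    store.append(record)


store = []
insert_if_valid({"title": "Cathode A", "license": "CC-BY-4.0", "porosity": 0.31}, store)
try:
    insert_if_valid({"title": "Cathode B", "porosity": "high"}, store)
except ValidationError as error:
    errors = error.messages
    print(errors)
```

Collecting all messages before raising gives the data curator the complete list of problems in one pass instead of one error per attempt.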
The crawler uses:
`Records` in LinkAhead.
`Records` of the data model in LinkAhead.
The interface is a Python function that is implemented as a module in the RuQaD demonstrator. The function calls the scanner and main crawler functions of the LinkAhead crawler software.
Source code:
ruqad/src/ruqad/crawler.py
ruqad/src/ruqad/crawler-extensions/converters.py
ruqad/src/ruqad/crawler-extensions/transformers.py
The RuQaD monitor runs continually and acts on new data items. The handling of one such data item is described here.
The deployment of the demonstrator builds on the BatCAT Testbed. The BatCAT Testbed uses minikube to set up a local Kubernetes cluster where the core components of the BatCAT Data Space are deployed for testing and development purposes. This includes EDC Connectors for several agents of the data space, identity management, federated catalog services, databases and LinkAhead instances for data storage management.
The RuQaD demonstrator has been integrated into this testbed as well as the Kadi4Mat ELN.
The 4.2 Demonstrator has not been integrated into the testbed. This would entail setting up a full GitLab instance, setting up a GitLab Runner and loading the runner configuration into a Git repository, which is too complex for demonstrating a single API call.
The steps for setting up the BatCAT Testbed are documented in the RuQaD clone of the BatCAT Testbed repository.
The YAML format is used in several components of the software for storing and exchanging information in a form that is both machine-readable and human-readable.
Multiple components of the software use REST interfaces for data exchange.
The FAIR guiding principles are key requirements for the FAIR Data Spaces project in general and the RuQaD demonstrator in particular. They are designed as a "guideline for those wishing to enhance the reusability of their data holdings. ... the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data ..." (Wilkinson et al. https://doi.org/10.1038/sdata.2016.18).
The RuQaD demonstrator especially addresses the "R" principles:
A data space is an "Interoperable framework, based on common governance principles, standards, practices and enabling services, that enables trusted data transactions between participants." (DSSC Blueprint, CEN CENELEC workshop agreement on Trusted Data Transactions)
While the BatCAT data space is important as a part of the demonstration and use-case scenario, building and maintaining the data space is not itself in scope of the RuQaD demonstrator. Instead, RuQaD needs to adapt to the API of the BatCAT data space, which builds on the research data management system LinkAhead as its core component for storing research data.
The ELN-FileFormat is a standard for exchanging information between electronic lab notebooks and other research data management software. It is built on top of RO-Crate (Research Object Crate), which is a standard for self-contained data storage in accordance with the FAIR principles.
In principle, the whole pipeline can be considered an extract-transform-load process. Data is extracted from Kadi4Mat. It is transformed into a format that can be interpreted by LinkAhead. Afterwards it is loaded into LinkAhead and connected to the EDC.
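The extract-transform-load pattern can be shown with three small functions. This is a generic illustration, not the demonstrator's code: the JSON input, the mapping `FIELD_MAP` and the list used as load target are assumptions that stand in for the Kadi4Mat export, the conversion to the LinkAhead data model and the LinkAhead instance.

```python
import json

# Hypothetical mapping from exported field names to target property names.
FIELD_MAP = {"name": "title", "creator": "author", "licence": "license"}


def extract(raw: str) -> dict:
    """Extract: parse the exported metadata (here: a JSON string)."""
    return json.loads(raw)


def transform(exported: dict) -> dict:
    """Transform: rename fields to the names the target system expects."""
    return {FIELD_MAP.get(key, key): value for key, value in exported.items()}


def load(record: dict, target: list) -> None:
    """Load: hand the transformed record to the target (here: a list)."""
    target.append(record)


target = []
load(transform(extract('{"name": "Cathode A", "creator": "A. Researcher"}')), target)
print(target)
```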
The quality checker pipeline is run in an external instance of GitLab. As this external system might be subject to changes in software or API, the whole procedure of the demonstrator can become unstable in case of incompatible changes.
Parts of the ELN-File-Format are not completely specified, and software implementations (e.g. in Kadi4Mat) are partly incomplete and contain bugs. Currently the demonstrator implements a few workarounds for known problems. These can be considered technical debt that needs to be removed when the ELN-File-Format and the software implementing it reach a stable version.
Term | Definition |
---|---|
Operability (ISO 25010) | "System can be understood, learned, used and is attractive to users." |
Transferability (ISO 25010) | "System can be transferred from one environment to another." |
Maintainability (ISO 25010) | "System can be modified, corrected, adapted or improved due to changes in environment or requirements." |
Compatibility (ISO 25010) | "Two or more systems can exchange information while sharing the same environment." |
FAIR | Findable, Accessible, Interoperable, Reusable (defined in: https://doi.org/10.1038/sdata.2016.18) |
ELN | Electronic Lab Notebook |
DSSC | Data Space Support Center, https://dssc.eu/ |
DSSC Building Blocks | Building blocks of the data space architecture as defined by the DSSC Blueprint https://dssc.eu/space/bv15e/766061169/Data+Spaces+Blueprint+v1.5+-+Home |
PoLiS | Post-Lithium Storage Cluster of Excellence https://www.postlithiumstorage.org |
BatCAT | Battery Cell Assembly Twin, Horizon Europe Project https://www.batcat.info/ |