EOSC Future is an EU-funded Horizon 2020 project that is implementing the European Open Science Cloud (EOSC). EOSC will give European researchers access to a wide web of FAIR data and related services.
This project builds a generic data transfer service that can be used in EOSC to transfer large amounts of data to cloud storage, by just indicating the source and destination. The EOSC Data Transfer Service features a RESTful Application Programming Interface (REST API).
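The API covers three sets of functionalities:
- parsing digital object identifiers (DOIs)
- creating and managing data transfers
- managing storage elements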
This project uses Quarkus, the Supersonic Subatomic Java Framework. It requires Java 17.
All three groups of API endpoints mentioned above support authorization.
The generic data transfer service behind the EOSC Data Transfer API aims to be agnostic with regard to authorization, thus the HTTP header Authorization (if present) will be forwarded as received.
Note that the frontend using this API might have to supply more than one set of credentials: (1) one for the data repository (determined by the DOI used as the source), (2) one for the transfer service that is automatically selected when a destination is chosen, and (3) one for the destination storage system. Only (2) is mandatory.
The API endpoints that parse DOIs usually call APIs that are open access. However, the HTTP header Authorization (if present) will be forwarded as received. This ensures that the EOSC Data Transfer API can be extended with new parsers for data repositories that require authentication.
The API endpoints that create and manage transfers, as well as the ones that manage storage elements, do require authorization, in the form of an access token passed via the HTTP header Authorization. This gets passed to the transfer service registered to handle the destination storage.
The challenge is that some storage systems used as the target of the transfer may need authentication and/or authorization that differs from what the transfer service uses. Thus, an additional set of credentials can be supplied to the endpoints in these groups via the HTTP header Authorization-Storage.
For example, for transfers to dCache, the configured transfer service that handles the transfers is EGI Data Transfer. Both can use the same EGI Check-in access token, thus no additional credentials are needed besides the access token for the transfer service, passed via the Authorization HTTP header.
When used, the HTTP header parameter Authorization-Storage receives a key-value pair, separated by a colon (:) with no leading or trailing whitespace, which is Base-64 encoded.
For example, to pass a username and password to the destination storage, you construct a string like username:password, then Base-64 encode it to dXNlcm5hbWU6cGFzc3dvcmQ=, and finally pass this through the HTTP header Authorization-Storage when calling e.g. the endpoint GET /storage/folder/list.
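The encoding step described above can be sketched in a few lines of Java. This is a minimal illustration, not code from this project: the class StorageAuthHeader and its method encode() are hypothetical names, and the sketch assumes the caller has already removed any leading or trailing whitespace.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the EOSC Data Transfer API code base.
final class StorageAuthHeader {

    // Name of the HTTP header, as documented above.
    static final String HEADER_NAME = "Authorization-Storage";

    // Joins the two parts with a colon and Base-64 encodes the result.
    // The caller is expected to pass values without leading or trailing whitespace.
    static String encode(String key, String value) {
        String pair = key + ":" + value;
        return Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Prints: Authorization-Storage: dXNlcm5hbWU6cGFzc3dvcmQ=
        System.out.println(HEADER_NAME + ": " + encode("username", "password"));
    }
}
```

The resulting value is sent as-is in the Authorization-Storage header, next to the regular Authorization header that carries the access token for the transfer service.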
The API supports parsing digital object identifiers (DOIs) and will return a list of files in the repository indicated by the DOI. It will automatically identify the DOI type and will use the correct parser to retrieve the list of source files.
DOIs are persistent identifiers (PIDs) dedicated to identification of content over digital networks. These are registered by one of the registration agencies of the International DOI Foundation. Although in this documentation we refer to DOIs, the API endpoint that parses DOIs supports any PID registered in the global handle system of the DONA Foundation, provided it points to a data repository for which a parser is configured.
The API supports parsing DOIs to the following data repositories:
- Zenodo
- B2SHARE
- European Synchrotron Radiation Facility
- Any data repository that supports Signposting
The API endpoint GET /parser that parses DOIs is extensible. All you have to do is implement the parser interface for a specific data repository, then register the Java class implementing the interface in the configuration.
Implement the Java interface ParserService in a class of your choice.
public interface ParserService {
    boolean init(ParserConfig config, PortConfig port);
    String getId();
    String getName();
    String sourceId();
    Uni<Tuple2<Boolean, ParserService>> canParseDOI(String auth, String doi, ParserHelper helper);
    Uni<StorageContent> parseDOI(String auth, String doi, int level);
}
Your class must have a constructor that receives a String id, which must be returned by the method getId().
When the API GET /parser is called to parse a DOI, all configured parsers will be tried, by calling the method canParseDOI(), until one is identified that can parse the DOI. If no parser can handle the DOI, the API fails. In case your implementation of the method canParseDOI() cannot determine if your parser can handle a DOI just from the URL, you can use the passed in ParserHelper to check if the URL redirects to the data repository you support.
After a parser is identified, the methods init() and parseDOI() are called in order.
The same ParserHelper is used when trying all parsers for a DOI. This helper caches the redirects, so you should try getRedirectedToUrl() before incurring one or more network calls by calling checkRedirect().
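The selection logic and the cache-first use of the helper can be illustrated with a simplified, synchronous sketch. It does not use the project's reactive types: RedirectCache, Parser and BaseUrlParser are hypothetical stand-ins, the method signatures are assumptions, and the network call is replaced by a function supplied by the caller.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.UnaryOperator;

// Simplified, synchronous stand-ins for the real reactive types.
// Every type and signature in this sketch is hypothetical.
final class ParserSelectionSketch {

    // Remembers resolved redirects so that each URL is resolved at most once.
    static final class RedirectCache {
        private final Map<String, String> redirects = new HashMap<>();
        private final UnaryOperator<String> resolver;
        int lookups = 0; // number of (simulated) network calls

        RedirectCache(UnaryOperator<String> resolver) {
            this.resolver = resolver;
        }

        // Returns the cached redirect target, or null if the URL was not resolved yet.
        String getRedirectedToUrl(String url) {
            return redirects.get(url);
        }

        // Resolves the redirect (a network call in a real helper) and caches the result.
        String checkRedirect(String url) {
            lookups++;
            String target = resolver.apply(url);
            redirects.put(url, target);
            return target;
        }
    }

    interface Parser {
        String name();
        boolean canParse(String doi, RedirectCache helper);
    }

    // Accepts a DOI when it redirects to a URL under the given base URL.
    static final class BaseUrlParser implements Parser {
        private final String name;
        private final String baseUrl;

        BaseUrlParser(String name, String baseUrl) {
            this.name = name;
            this.baseUrl = baseUrl;
        }

        @Override
        public String name() { return name; }

        @Override
        public boolean canParse(String doi, RedirectCache helper) {
            String target = helper.getRedirectedToUrl(doi); // consult the cache first
            if (target == null) {
                target = helper.checkRedirect(doi);
            }
            return target != null && target.startsWith(baseUrl);
        }
    }

    // Tries the parsers in order and returns the first one that accepts the DOI.
    static Optional<Parser> select(List<Parser> parsers, String doi, RedirectCache helper) {
        for (Parser parser : parsers) {
            if (parser.canParse(doi, helper)) {
                return Optional.of(parser);
            }
        }
        return Optional.empty();
    }
}
```

In this sketch the first parser that needs the redirect target triggers the lookup, and every later parser reuses the cached result, which is the reason for checking the cache before resolving the redirect again.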
Add a new entry in the configuration file under eosc/parser for the new parser, with the following settings:
- name is the human-readable name of the data repository.
- class is the canonical Java class name that implements the interface ParserService for the data repository.
- url is the base URL for the REST client that will be used to call the API of this data repository (optional).
- timeout is the maximum timeout in milliseconds for calls to the data repository. If not supplied, the default value 5000 (5 seconds) is used.
The API supports creation of new data transfers (aka jobs), finding data transfers, querying information about data transfers, and canceling data transfers.
Every API endpoint that performs operations on or queries information about data transfers or storage elements in a destination storage has to be passed the destination storage type. This selects the data transfer service that will be used to perform the data transfer, freeing the clients of the API from having to know which data transfer service to pick for each destination. Each destination storage type is mapped to exactly one data transfer service in the configuration.
Note that the API uses the concept of a storage type, instead of the protocol type, to select the transfer service. This makes the API flexible, by allowing multiple destination storages that use the same protocol to be handled by different transfer services, but at the same time it also allows an entire protocol (e.g. FTP, see below) to be handled by a specific transfer service.
If you do not supply the dest query parameter when making an API call to perform a transfer or a storage element related operation or query, the default value dcache will be supplied instead.
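The mapping from destination storage type to transfer service, including the dcache default, can be pictured as a simple lookup. The sketch below is only an illustration under assumptions: the class DestinationRouting, the destination keys other than dcache, and the service key fts are hypothetical and are not taken from the project's configuration.

```java
import java.util.Map;

// Hypothetical illustration of how a destination selects exactly one transfer service.
final class DestinationRouting {

    // Default documented above for a missing dest query parameter.
    static final String DEFAULT_DESTINATION = "dcache";

    // Illustrative mapping; the keys "s3" and "ftp" and the service key "fts" are assumptions.
    static final Map<String, String> SERVICE_BY_DESTINATION = Map.of(
            "dcache", "fts",
            "s3", "fts",
            "ftp", "fts");

    // Returns the key of the transfer service that handles the given destination.
    static String resolveService(String dest) {
        String key = (dest == null || dest.isBlank()) ? DEFAULT_DESTINATION : dest;
        String service = SERVICE_BY_DESTINATION.get(key);
        if (service == null) {
            throw new IllegalArgumentException("Unsupported destination: " + key);
        }
        return service;
    }
}
```

Because the lookup key is the storage type rather than the protocol, two storages that speak the same protocol could be routed to different services simply by giving them different keys.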
Initially, the EGI Data Transfer is integrated into the EOSC Data Transfer API, supporting the following destination storages:
- dCache
- StoRM
- S3-compatible object storages
- FTP servers
Multiple instances of each supported transfer service can be configured. You can then mix and match which protocol(s) and/or storage type(s) each of them will handle.
The API for creating and managing data transfers is extensible. All you have to do is implement the generic data transfer interface to wrap a specific data transfer service, then register your class implementing the interface as the handler for one or more destination storage types.
Implement the Java interface TransferService in a class of your choice.
public interface TransferService {
    boolean initService(TransferServiceConfig config);
    String getServiceName();
    String translateTransferInfoFieldName(String genericFieldName);
    Uni<UserInfo> getUserInfo(String tsAuth);

    // Methods for data transfers
    Uni<TransferInfo> startTransfer(String tsAuth, String storageAuth, Transfer transfer);
    Uni<TransferList> findTransfers(String tsAuth, String fields, int limit,
                                    String timeWindow, String stateIn,
                                    String srcStorageElement, String dstStorageElement,
                                    String delegationId, String voName, String userDN);
    Uni<TransferInfoExtended> getTransferInfo(String tsAuth, String jobId);
    Uni<Response> getTransferInfoField(String tsAuth, String jobId, String fieldName);
    Uni<TransferInfoExtended> cancelTransfer(String tsAuth, String jobId);

    // Methods for storage elements
    Uni<StorageContent> listFolderContent(String tsAuth, String storageAuth, String folderUrl);
    Uni<StorageElement> getStorageElementInfo(String tsAuth, String storageAuth, String seUrl);
    Uni<String> createFolder(String tsAuth, String storageAuth, String folderUrl);
    Uni<String> deleteFolder(String tsAuth, String storageAuth, String folderUrl);
    Uni<String> deleteFile(String tsAuth, String storageAuth, String fileUrl);
    Uni<String> renameStorageElement(String tsAuth, String storageAuth, String seOld, String seNew);
}
Your class must have a constructor with no parameters.
The methods can be split into two groups:
- The methods for handling data transfers must be implemented.
- The methods for storage elements should only be implemented for storage types for which the method canBrowseStorage() returns true.
Add a new entry in the configuration file under eosc/transfer/service for the new transfer service, with the following settings:
- name is the human-readable name of this transfer service.
- class is the canonical Java class name that implements the interface TransferService for this transfer service.
- url is the base URL for the REST client that will be used to call the API of this transfer service.
- timeout is the maximum timeout in milliseconds for calls to the transfer service. If not supplied, the default value 5000 (5 seconds) is used.
- trust-store-file is an optional path to a keystore file containing certificates that should be trusted when connecting to the transfer service. Use it when the CA that issued the certificate(s) of the transfer service is not one of the well-known root CAs. The path is relative to the folder src/main/resources.
- trust-store-password is the optional password to the keystore file.
Add entries in the configuration file under eosc/transfer/destination for each destination storage type you want to support, and map each one to one of the registered transfer services.
The configuration of each storage type consists of:
- service is the key of the transfer service that will handle transfers to this storage type.
- description is the human-readable name of this storage type.
- auth is the type of authentication required by the storage system, one of these values:
  - token means the storage uses the same OIDC auth token as the transfer service
  - password means the storage needs a username and a password for authentication
  - keys means the storage needs an access key and a secret key for authentication
- protocol is the scheme to use in URLs pointing to this storage.
- browse signals whether the storage supports browsing (the endpoints to list and manage storage elements are available).
For storage types that are configured with either password or keys as the authentication type, you will have to supply the HTTP header parameter Authorization-Storage when calling the API endpoints. See the description of the Authorization-Storage header above for details.
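The rule above boils down to a small decision on the configured authentication type. The following sketch shows one way to express it; the enum StorageAuthType and its methods are hypothetical names chosen for illustration, and only the three configuration values token, password and keys come from this document.

```java
import java.util.Locale;

// Hypothetical model of the auth setting of a destination storage type.
enum StorageAuthType {
    TOKEN,     // storage reuses the OIDC token of the transfer service
    PASSWORD,  // storage needs a username and a password
    KEYS;      // storage needs an access key and a secret key

    // Parses the value of the auth setting (token, password or keys).
    static StorageAuthType fromConfig(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // True when clients must also send the Authorization-Storage header.
    boolean needsStorageHeader() {
        return this != TOKEN;
    }
}
```

A client could use such a check to decide whether to prompt the user for storage credentials before calling the transfer or storage endpoints.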
In the enum Transfer.Destination add new values for each of the storage types you added in the previous step. Use the same values as the names of the keys.
This way each entry under the node eosc/transfer/destination in the configuration file becomes one possible value for the destination storage parameter dest of the API endpoints.
A storage element is where a user's data is stored. It is a generic term meant to hide the complexity of different types of storage technologies. It can mean both an element of the storage system's hierarchy (directory, folder, container, bucket, etc.) and the entity that stores the data (file, object, etc.).
The API supports managing storage elements in a destination storage. Each data transfer service that gets integrated can optionally implement this functionality. Moreover, data transfer services that support multiple storage types can selectively implement this functionality for just a subset of the supported storage types (see the method TransferService::canBrowseStorage() above).
Clients can query if this functionality is implemented for a storage type by using the endpoint GET /storage/info.
This functionality covers:
- listing the content of a storage element
- querying information about a storage element
- renaming a storage element (including its path, which means moving it)
- deleting a storage element
- creating a hierarchical storage element (folder/container/bucket)
Storage elements that store data (files/objects) can only be created by a data transfer, not by the API endpoints in this group.
The application configuration file is in src/main/resources/application.yml.
See the sections above for how to configure the data repository parsers used by the API and for how to extend the API with new transfer services and storage types.
The API automatically generates metrics for the calls to the endpoints. You can also enable histogram buckets for these metrics, to support quantiles in the telemetry dashboards (e.g. the 0.95-quantile aka the 95th percentile of a metric). List the quantiles you want to generate under the setting eosc/qos/quantiles. Similarly, you can also enable buckets for service level objectives (SLOs) by listing all SLOs for call duration, expressed in milliseconds, under the setting eosc/qos/slos.
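To make the idea of SLO buckets more concrete, the sketch below counts how many calls finished within each SLO boundary, which is the kind of cumulative count a telemetry dashboard can chart. It only illustrates the concept and is not how the metrics library used by this project computes its buckets; the class SloBucketsSketch and the sample values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual illustration of cumulative SLO buckets for call durations.
final class SloBucketsSketch {

    // For each SLO boundary (in milliseconds), counts the calls that completed within it.
    static Map<Long, Long> countWithin(List<Long> slosMillis, List<Long> durationsMillis) {
        Map<Long, Long> counts = new LinkedHashMap<>();
        for (long slo : slosMillis) {
            long within = durationsMillis.stream().filter(d -> d <= slo).count();
            counts.put(slo, within);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four sample calls measured against SLOs of 100 ms, 500 ms and 1000 ms.
        Map<Long, Long> counts = countWithin(
                List.of(100L, 500L, 1000L),
                List.of(50L, 120L, 700L, 1500L));
        // Prints: {100=1, 500=2, 1000=3}
        System.out.println(counts);
    }
}
```

With such counts, the share of calls that meet a given SLO is the bucket count divided by the total number of calls.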
You can review the metrics generated by the API at http://localhost:8081/metrics
The API needs a certificate for an EGI service account to register and configure S3 storage systems with the wrapped EGI Data Transfer service. Such a certificate is included in the file src/main/resources/fts-keystore.jks, but the password for this certificate is not included. Contact EGI to obtain a service account and a new certificate (together with its password) when deploying this API.
You can run your application in dev mode that enables live coding using:
./mvnw compile quarkus:dev
Then open the Dev UI, which is available in dev mode only, at http://localhost:8081/q/dev/.
The application can be packaged using:
./mvnw package
It produces the quarkus-run.jar file in the target/quarkus-app/ directory.
Be aware that it's not an über-jar as the dependencies are copied into the target/quarkus-app/lib/ directory.
The application is now runnable using java -jar target/quarkus-app/quarkus-run.jar.
If you want to build an über-jar, execute the following command:
./mvnw package -Dquarkus.package.type=uber-jar
The application, packaged as an über-jar, is now runnable using java -jar target/*-runner.jar.
You can use Docker Compose to easily deploy and run the EOSC Data Transfer API. This will run multiple containers:
- This application that implements the REST API and serves it over HTTP
- SSL terminator - decrypts HTTPS traffic and forwards requests to the API
- OpenTelemetry collector - collects, batch processes, and forwards traces/metrics
- Jaeger - receives traces
- Loki - receives logs
- Prometheus - scrapes metrics
- Grafana - visualization of the telemetry dashboard
The architecture and interaction between these containers is illustrated below:
Steps to run the API in a container:
- Copy the file src/main/docker/.env.template to src/main/docker/.env, then:
  - Provide the domain name and port where you will deploy the API in the environment variables SERVICE_DOMAIN and SERVICE_PORT, respectively.
  - Provide an email address in the environment variable SERVICE_EMAIL to be used, together with the domain name, to automatically request an SSL certificate for the SSL terminator.
  - In the environment variable FTS_KEY_STORE_FILE provide a path to a Java keystore file containing a new EGI service account certificate, and in the environment variable FTS_KEY_STORE_PASSWORD provide the password for it.
  - In the environment variable TELEMETRY_PORT provide the port on which to publish the Grafana telemetry dashboard. This will be available on the same domain name as the API itself.
- Run the command build.sh (or build.cmd on Windows) to build and run the containers that implement the EOSC Data Transfer API.
- The SSL terminator will automatically use Let's Encrypt to request an SSL certificate for HTTPS.
After the SSL terminator container is deployed and working properly, connect to it and make sure it is requesting an actual HTTPS certificate. By default, it will use a self-signed certificate and will only do dry runs for requesting a certificate to avoid the rate limits of Let's Encrypt. To do this:
- Run the command sudo docker exec -it data-transfer-ssl /bin/sh
- In the container, change directory with cd /opt
- Edit the file request.sh and remove the certbot parameter --dry-run
In case you remove the containers of the EOSC Data Transfer API, retain the volume certificates, which contains the SSL certificate. This avoids requesting a new certificate for the same domain if you redeploy the API, and so prevents exceeding the Let's Encrypt rate limits.
- REST server implementation: Writing reactive REST services
- REST client implementation: REST client to easily call APIs
- Configuration reference: Configuration reference guide
- YAML Configuration: Use YAML to configure your application
- Introduction to CDI: Contexts and dependency injection guide
- OpenTelemetry support: Adding observability to your application
- Metrics with Micrometer: Sending API metrics to Prometheus
- Swagger UI: User-friendly UI to document and test your API
- Mutiny Guides: Reactive programming with Mutiny
- Optionals: How to use Optional in Java