New node software for large-scale clients with PB-scale data onboarding to Filecoin network
⛔️ DEPRECATION WARNING
The V1 Singularity is deprecated in favor of Singularity V2.
Check how they are different and development progress
- singularity-import-boost - Automatically import deals to boost for Filecoin storage providers
- generate-car - The internal tool used by singularity to generate car files as well as commp
- generate-ipld-car - The internal tool used by singularity to regenerate the CAR that captures the unixfs dag of the dataset.
- singularity-browser - A next.js app for browsing singularity made deals
Looking for a complete end-to-end demonstration? Try Getting Started Guide
# Install nvm (https://github.com/nvm-sh/nvm#install--update-script)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
source ~/.bashrc
# Install node v18
nvm install 18
npm i -g @techgreedy/singularity
singularity -h
git clone https://github.com/tech-greedy/singularity.git
cd singularity
npm ci
npm run build
npm link
singularity -h
By default, npm will pull the pre-built binaries for dependencies. You can choose to build them from source and override the ones pulled by npm.
# Make sure you have go v1.17+ installed
git clone https://github.com/tech-greedy/generate-car.git
cd generate-car
make
Then copy the generated binary to override the existing one from the PATH for your node environment, e.g.
- singularity installed globally
/home/user/.nvm/versions/node/v16.xx.x/lib/node_modules/.bin
- singularity cloned locally
./node_modules/.bin
Note that the path may change depending on the nodejs version.
If you cannot find the folder above, try searching for the generate-car binary first (e.g. find ~/.nvm -name 'generate-car').
To use the tool as a daemon, first initialize the config and the database. To do so, run
singularity init
By default, a repository will be initialized at $HOME_DIR/.singularity. Set the environment variable SINGULARITY_PATH to override this behavior.
# Unix
export SINGULARITY_PATH=/the/path/to/the/repo
# Windows
set SINGULARITY_PATH=/the/path/to/the/repo
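The lookup order described above can be sketched in a script that needs to locate the repo. This is a minimal sketch, not part of the tool: singularity_repo is a hypothetical helper name, and it assumes the default location is $HOME/.singularity on Unix.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with singularity): print the repo
# directory, preferring SINGULARITY_PATH and falling back to the default.
# Assumption: the default location resolves to $HOME/.singularity.
singularity_repo() {
  printf '%s\n' "${SINGULARITY_PATH:-$HOME/.singularity}"
}
```

For example, ls "$(singularity_repo)" would list the config and database files of whichever repo is currently in effect.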
Since the tool is modularized, it can be deployed in different ways and have different components enabled or disabled.
Below are configurations for common scenarios.
This is useful if you only need deal preparation but not deal making.
You can still have deal making enabled, but disabling it will use slightly less system resources.
In default.toml from your repo
- change ipfs.enabled to false
- change deal_tracking_service.enabled to false
- change deal_replication_service.enabled to false
- change deal_replication_worker.enabled to false
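To double-check the settings above after editing, a small script can read the enabled flag of each module back from the file. This is only a sketch: toml_enabled is a hypothetical helper, and it assumes default.toml stores each module as a [section] table with an unindented enabled = ... line, which may not match every version of the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of `enabled` inside one TOML section.
# Assumptions: section headers look like "[name]" with nothing else on the
# line, and the key is written as "enabled = <value>" without indentation.
toml_enabled() {
  local file="$1" section="$2"
  awk -v want="[$section]" '
    /^\[/ { in_section = ($0 == want) }      # entering a new section
    in_section && /^enabled[ \t]*=/ {
      sub(/^[^=]*=[ \t]*/, "")               # keep only the value
      print
      exit
    }
  ' "$file"
}
```

Calling toml_enabled "$SINGULARITY_PATH/default.toml" ipfs should then print false once the change is in place, and the same call can be repeated for the other three sections.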
This is useful if you know MongoDB and you're hitting bottlenecks or issues with the built-in MongoDB.
- Set up your own MongoDB instance
- In default.toml from your repo
  - change database.start_local to false
  - change connection.database to the connection string of your own MongoDB database
- On master server, set deal_preparation_service.enabled, database.start_local to true and disable all other modules
- On worker servers, set deal_preparation_worker.enabled to true and disable all other modules. Change connection.database and connection.deal_preparation_service to the IP address of the master server
$ singularity
Usage: singularity [options] [command]
A tool for large-scale clients with PB-scale data onboarding to Filecoin network
Visit https://github.com/tech-greedy/singularity for more details
Options:
-V, --version output the version number
-h, --help display help for command
Commands:
init Initialize the configuration directory in SINGULARITY_PATH
If unset, it will be initialized at HOME_DIR/.singularity
daemon Start a daemon process for deal preparation and deal making
preparation|prep Manage deal preparation
help [command] display help for command
export SINGULARITY_PATH=/the/path/to/the/repo
singularity daemon
Deal preparation contains two parts
- Scanning Request - an initial effort to scan the directory and make plans of how to assign different files and folders to different chunks
- Generation Request - subsequent work to generate the CAR file and compute the commP
$ singularity prep -h
Usage: singularity preparation|prep [options] [command]
Manage deal preparation
Options:
-h, --help display help for command
Commands:
create [options] <datasetName> <datasetPath> <outDir> Start deal preparation for a local dataset
status [options] <dataset> Check the status of a deal preparation request
list [options] List all deal preparation requests
generation-manifest [options] <generationId> Get the Slingshot v3.x manifest data for a single deal generation request
generation-status [options] <generationId> Check the status of a single deal generation request
pause Pause scanning or generation requests
resume Resume scanning or generation requests
retry Retry scanning or generation requests
remove [options] <dataset> Remove all records from database for a dataset
help [command] display help for command
This will create a scanning request for a dataset. While the dataset is being scanned, it will also produce generation requests to be taken by workers.
$ singularity prep create -h
Usage: singularity preparation create [options] <datasetName> <datasetPath> <outDir>
Start deal preparation for a local dataset
Arguments:
datasetName A unique name of the dataset
datasetPath Directory path to the dataset
outDir The output Directory to save CAR files
Options:
-s, --deal-size <deal_size> Target deal size, i.e. 32GiB (default: "32 GiB")
-t, --tmp-dir <tmp_dir> Optional temporary directory. May be useful when it is at least 2x faster than the dataset source, such as when the dataset is on network mount, and the I/O is the bottleneck
-f, --skip-inaccessible-files Skip inaccessible files. Scanning may take longer to complete.
-m, --min-ratio <min_ratio> Min ratio of deal to sector size, i.e. 0.55
-M, --max-ratio <max_ratio> Max ratio of deal to sector size, i.e. 0.95
-h, --help display help for command
Deal preparation natively supports public S3 buckets. A temporary directory is mandatory when the source is an S3 bucket, e.g.
singularity prep create -t <tmp_dir> <dataset_name> s3://<bucket_name>/<optional_prefix>/ <out_dir>
Each dataset preparation request always starts with a scanning request. Once enough files can be packed into a single deal, it will create a generation request. In other words, each preparation request is a single scanning request and a bunch of generation requests.
You can pause/resume/retry the scanning request or generation requests.
singularity prep pause -h
singularity prep resume -h
singularity prep retry -h
Append a new directory to an existing dataset. This will add all entries under the new directory into the dataset.
Just like the singularity prep create command, the directory will be considered as the root.
The user is responsible for making sure there are no duplicate entries in the dataset; otherwise the file with the same path may be corrupted during retrieval.
singularity preparation append <dataset> <newPath>
Example:
singularity prep create myData /my/data-2020 /my/out
singularity prep append myData /my/data-2021
singularity prep append myData /my/data-2022
The whole data preparation request can be removed from the database. All generated CAR files can also be deleted by specifying the --purge option.
singularity prep remove -h
List all the deal preparation requests, including whether scanning has completed and how many generation requests have completed or hit errors for each of them.
singularity prep list
Check status for a specific deal preparation request, including the status of the initial scanning request and all corresponding generation requests.
singularity prep status -h
Look into a specific generation request, including which files or folders are included in that request and their corresponding size, CID, selector, etc.
singularity prep generation-status -h
singularity prep generation-manifest -h
WEB3_STORAGE_TOKEN="eyJ..." singularity prep upload-manifest -h
singularity monitor
The deal replication module supports both lotus-market and boost based storage providers (later on we might deprecate lotus-market support). Currently, both the lotus and boost CLI binaries are required for this module to work.
Look for default.toml in the initialized repo and verify in the [deal_replication_worker] section that both binaries can be accessed.
If you need to specify an environment variable like FULLNODE_API_INFO, it can also be specified there.
In order to make deals, we recommend setting up a lite node to use with the tool.
Once you have the lite node setup, you can import your wallet key for the verified client address.
If your target SP runs on Boost, the boost executable is also needed to make deals.
Once you have the boost cli initialized, you can import your wallet key for the verified client address.
$ singularity repl start -h
Usage: singularity replication start [options] <datasetid> <storage-providers> <client> [# of replica]
Start deal replication for a prepared local dataset
Arguments:
datasetid Existing ID of dataset prepared.
storage-providers Comma separated storage provider list
client Client address where deals are proposed from
# of replica Number of targeting replica of the dataset (default: 10)
Options:
-u, --url-prefix <urlprefix> URL prefix for car downloading. Must be reachable by provider's boostd node. (default: "http://127.0.0.1/")
-p, --price <maxprice> Maximum price per epoch per GiB in Fil. (default: "0")
-r, --verified <verified> Whether to propose deal as verified. true|false. (default: "true")
-s, --start-delay <startdelay> Deal start delay in days. (StartEpoch) (default: "7")
-d, --duration <duration> Duration in days for deal length. (default: "525")
-o, --offline <offline> Propose as offline deal. (default: "true")
-m, --max-deals <maxdeals> Max number of deals in this replication request per SP, per cron triggered. (default: "0")
-c, --cron-schedule <cronschedule> Optional cron to send deals at interval. Use double quote to wrap the format containing spaces.
-x, --cron-max-deals <cronmaxdeals> When cron schedule specified, limit the total number of deals across entire cron, per SP.
-xp, --cron-max-pending-deals <cronmaxpendingdeals> When cron schedule specified, limit the total number of pending deals determined by dealtracking service, per SP.
-l, --file-list-path <filelistpath> Path to a txt file that will limit to replicate only from the list. Must be visible by deal replication worker.
-n, --notes <notes> Any notes or tag want to store along the replication request, for tracking purpose.
-csv, --output-csv <outputCsv> Print CSV to specified folder after done. Folder must exist on worker.
-f, --force Force resend even if this pieceCID have been proposed / active by the provider. (default: false)
-h, --help display help for command
A simple example to send all car files in one prepared dataset "CommonCrawl" to one storage provider f01234 immediately:
singularity repl start CommonCrawl f01234 f15djc5avdxihgu234231rfrrzbvnnqvzurxe55kja
A more complex example: send up to 10 deals to each of storage providers f01234 and f05678, every hour on the 1st minute, from prepared dataset "CommonCrawl", until all CAR files are dealt.
singularity repl start -m 10 -c "1 * * * *" CommonCrawl f01234,f05678 f15djc5avdxihgu234231rfrrzbvnnqvzurxe55kja
- Storage providers have full control of deal making speed
- Client no longer needs to spend time to pause or adjust deal making speed
$ singularity repl ss create -h
Usage: singularity replication selfservice create [options] <client> <provider> <dataset>
Create a deal making self service policy
Arguments:
client Client address to send deals from
provider Provider address to send deals to
dataset Id or name of the dataset
Options:
--minDelay <minDelay> Minimum delay in days for the deal start epoch (default: "7")
--maxDelay <maxDelay> Maximum delay in days for the deal start epoch (default: "7")
-r, --verified <verified> Whether to propose deal as verified. true|false. (default: "true")
-p, --price <price> Maximum price per epoch per GiB in Fil. (default: "0")
--minDuration <minDuration> Minimum duration in days for the deal (default: "525")
--maxDuration <maxDuration> maxDuration duration in days for the deal (default: "525")
-h, --help display help for command
$ singularity repl ss delete -h
Usage: singularity replication selfservice delete [options] <id>
Delete a deal making self service policy
Arguments:
id Policy id to delete
Options:
-h, --help display help for command
$ singularity repl ss list -h
Usage: singularity replication selfservice list [options]
List all deal making self service policies
Options:
--json Output with JSON format
-h, --help display help for command
curl "http://localhost:7005/pieceCids?provider=f0xxxx&dataset=datasetName"
# Without pieceCid
$ curl "http://localhost:7005/propose?provider=f0xxxx&dataset=datasetName"
# With pieceCid
$ curl "http://localhost:7005/propose?provider=f0xxxx&dataset=datasetName&pieceCid=bafyxxxx"
# All possible options
$ curl "http://localhost:7005/propose?\
> provider=f0xxxx&\
> dataset=datasetName&\
> pieceCid=bafyxxxx&\
> startDays=7&\
> durationDays=525&\
> client=f0xxxxx"
The logic behind the scenes is as follows:
- Try to find all policies that match the provider and dataset
- Filter all applicable policies by options provided, such as client, startDays, durationDays
- Randomly select one of the matching policies (this is possible if multiple client addresses are used for the same dataset)
- If pieceCid is provided, then check if the pieceCid belongs to the dataset and has not been proposed
- Otherwise, find a pieceCid from the dataset that has not yet been proposed to the provider
- Propose the deal and return the proposalId
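On the storage provider side, the requests that trigger this logic can be assembled with a small helper instead of hand-written URLs. The sketch below is illustrative only: propose_url is a hypothetical function, the base URL and parameter names are taken from the curl examples above, and values are assumed to be URL-safe (no encoding is performed).

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a /propose URL for the self-service API.
# Usage: propose_url <base_url> <provider> <dataset> [key=value ...]
# Optional pairs mirror the documented options, e.g. pieceCid=..., startDays=7.
propose_url() {
  local base="$1" provider="$2" dataset="$3"
  shift 3
  local url="${base}/propose?provider=${provider}&dataset=${dataset}"
  local kv
  for kv in "$@"; do
    url+="&${kv}"          # append each optional filter as-is
  done
  printf '%s\n' "$url"
}
```

A provider could then run curl "$(propose_url http://localhost:7005 f0xxxx datasetName startDays=7)" to request a deal that only matches policies allowing a 7-day start delay.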
To only expose the /pieceCids and /propose API to SP, you can configure nginx like below
location /pieceCids {
    proxy_pass http://localhost:7005;
}
location /propose {
    proxy_pass http://localhost:7005;
}
The recommended way for retrieval is via the bitswap protocol. You need the storage provider to run booster-bitswap.
Then you may use ipfs get <RootCid>/sub/path/to/file to retrieve the file or folder. The ipfs version needs to be 0.18.0+.
The RootCid can be found in singularity prep list and will be automatically generated when the dataset is fully prepared.
If you find RootCid missing, or you're using an older version of Singularity (before 3.0.0), you can regenerate the RootCid by running singularity prep dag <dataset>.
This will generate another CAR file that encapsulates the IPLD DAG of the whole dataset.
You will need to get that new CAR file sealed before you can perform bitswap retrieval.
Look for default.toml in the initialized repo.
This sets the MongoDb connection string. The default value corresponds to the built-in MongoDb server shipped with this software. If you choose to use a standalone MongoDb service, set the connection string here.
Sets the API endpoint of deal preparation service.
The software ships with a built-in MongoDB server. For small to medium-sized datasets, this should be sufficient.
For users who are onboarding large-scale datasets, we recommend running your own MongoDB service that fits into your infrastructure by setting this value to false.
To connect to a standalone MongoDB service, set the value of the connection string here.
Note that the MongoDB server may consume as much as 80% of usable memory.
The path of the database files the built-in MongoDb will be using, as well as the IP and port to bind the service to.
Service to manage preparation requests
Whether to enable the service and which IP and port to bind the service to
If the service crashes or is interrupted, there may be incomplete CAR files generated. Enabling this can clean them up.
The default min/max ratio of CAR file size divided by the target deal size. The dataset splitting is performed with the logic below
- Perform a Glob pattern match and get all files in sorted order
- Iterate through all the files and keep accumulating file sizes into a chunk
- Once the size of a chunk is between min and max ratio, pack this chunk to a CAR file and start with a new chunk
- If the size of the file is too large to fit into a chunk, split the file to hit the min ratio
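The splitting steps above can be sketched as a small planner that works on file sizes only. This is a simplified illustration, not the tool's actual implementation: plan_chunks is a hypothetical function, ratios are passed as integer percentages (55 instead of 0.55) to stay within shell arithmetic, CAR encoding overhead is ignored, and the leftover data at the end is emitted as a final smaller chunk.

```shell
#!/usr/bin/env bash
# Hypothetical planner: print one planned chunk size per line.
# Usage: plan_chunks <target_size> <min_pct> <max_pct> <file_size>...
# File sizes are expected in the same unit as target_size, already sorted.
plan_chunks() {
  local target="$1" min_pct="$2" max_pct="$3"
  shift 3
  local min=$(( target * min_pct / 100 ))
  local max=$(( target * max_pct / 100 ))
  local chunk=0 size take
  for size in "$@"; do
    while (( size > 0 )); do
      if (( chunk + size <= max )); then
        take=$size                 # the rest of the file fits in this chunk
      else
        take=$(( min - chunk ))    # split the file so the chunk hits min
      fi
      chunk=$(( chunk + take ))
      size=$(( size - take ))
      if (( chunk >= min )); then  # chunk is within [min, max]: pack it
        echo "$chunk"
        chunk=0
      fi
    done
  done
  if (( chunk > 0 )); then         # remaining files form the last chunk
    echo "$chunk"
  fi
}
```

With a target of 100, ratios of 55 and 95, and file sizes 30 30 200 10, the planner prints 60, 55, 55, 90 and 10: the two small files are packed together, the large file is split twice at the min ratio, and the trailing file ends up in a final partial chunk.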
Worker to scan the dataset, make plan and generate Car file and CIDs
Whether to enable the worker and how many worker instances. As a rule of thumb, use min(cpu_cores / 2, io_MBps / 20)
Each generation worker consumes negligible RAM, 20-50 MiB/s disk I/O and 100-250% of CPU.
Each 32GiB deal takes ~10 minutes to be generated on AMD EPYC CPU with NVME drive.
- When dealing with lots of small files, CPU usage increases while generation speed decreases. Meanwhile, IO may become the bottleneck if not using SSD.
- When using a public S3 bucket as the dataset, the Internet speed may become the bottleneck
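The rule of thumb min(cpu_cores / 2, io_MBps / 20) can be worked out with a few lines of shell. A minimal sketch: recommended_workers is a hypothetical helper, it uses integer division, and clamping the result to at least 1 is an added assumption rather than something the tool prescribes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggest a worker count from CPU cores and disk
# throughput in MB/s, following min(cpu_cores / 2, io_MBps / 20).
recommended_workers() {
  local cpu_cores="$1" io_mbps="$2"
  local by_cpu=$(( cpu_cores / 2 ))
  local by_io=$(( io_mbps / 20 ))
  local n=$(( by_cpu < by_io ? by_cpu : by_io ))
  if (( n < 1 )); then
    n=1                    # assumption: always run at least one worker
  fi
  echo "$n"
}
```

For a 16-core machine with 200 MB/s of disk throughput this gives min(8, 10) = 8 workers, while a 64-core machine limited to 100 MB/s gets min(32, 5) = 5.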
The repo ~/.singularity or the folder specified by SINGULARITY_PATH contains all state of the service.
To back up, simply back up the repo folder.
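One way to take such a backup is to archive the repo folder with tar. This is a sketch under assumptions: backup_repo is a hypothetical helper, the archive name is arbitrary, and it is probably safest to stop the daemon first so the database files are not changing while they are copied.

```shell
#!/usr/bin/env bash
# Hypothetical helper: archive the singularity repo into a .tar.gz file.
# Usage: backup_repo [repo_dir] [archive_path]
# Defaults: the repo from SINGULARITY_PATH (or ~/.singularity) and a
# date-stamped archive in the current directory.
backup_repo() {
  local repo="${1:-${SINGULARITY_PATH:-$HOME/.singularity}}"
  local dest="${2:-singularity-backup-$(date +%Y%m%d).tar.gz}"
  # -C keeps the archive paths relative to the repo's parent directory
  tar -czf "$dest" -C "$(dirname "$repo")" "$(basename "$repo")"
}
```

Restoring is the reverse: extract the archive into the parent directory of the repo with tar -xzf before starting the daemon again.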
Starting with version 2.0.0, anonymous data including error messages, data preparation and deal making statistics will be collected for us to better understand how the software is used and improve the software. To disable this behavior, create and set metrics.enabled to false in default.toml.
docker pull techgreedy/singularity
docker tag techgreedy/singularity singularity
# Initialize the repo config [optional]
docker run \
-v ~/.singularity:/root/.singularity \
singularity init
# Start daemon service in background
# Use ~/.singularity as the repo for config, database and logs
# Use /mnt/storage as the storage
docker run -d \
-v ~/.singularity:/root/.singularity \
-v /mnt/storage:/app/storage \
-p 7001:7001 \
singularity daemon
# Stop daemon service
docker ps | grep singularity | cut -d' ' -f1 | xargs docker kill
# Interact with the daemon with native singularity CLI
singularity prep create --force testData /app/storage/dataset /app/storage/output
# Interact with the daemon with dockerized singularity CLI
docker run -it --rm --network=host \
singularity prep create --force testData /app/storage/dataset /app/storage/output
# Interact with the daemon with HTTP API directly
curl http://localhost:7001/preparations
Use --skip-inaccessible-files when creating the data preparation request with singularity prep create.
For existing generation requests, use singularity prep retry gen --skip-inaccessible-files; however, this currently only works when the tmpDir is used.
This software is not extensively tested on Windows.
In case one CAR contains more files than allowed by the OS, you will need to increase the open file limit with ulimit, or LimitNOFILE if using systemd.
Depending on the version, NodeJS by default has a max heap memory of 2GB. To increase this limit, e.g. to 4G, set the environment variable NODE_OPTIONS="--max-old-space-size=4096".
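The value of --max-old-space-size is given in MiB, so other heap sizes follow by multiplying the GiB figure by 1024. A minimal sketch, where node_heap_options is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the NODE_OPTIONS value for a heap limit in GiB.
node_heap_options() {
  local gib="$1"
  # --max-old-space-size expects MiB
  printf -- '--max-old-space-size=%d\n' "$(( gib * 1024 ))"
}
```

For instance, NODE_OPTIONS="$(node_heap_options 8)" singularity daemon would start the daemon with an 8 GiB heap limit.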
If you are using a network mount such as NFS or Goofys, a temporary network issue may cause the CAR file generation to fail.
If the error rate is less than 10%, you may assume the errors are transient and can be fixed by performing a retry.
If the error is consistent, you will need to dig into the root cause of what has gone wrong. It could be an incorrectly configured permission or DNS resolver, etc. You can find more details in /var/log/syslog.
Avoid using root, or try the fix below
chown -R $(whoami) ~/
npm config set unsafe-perm true
npm config set user 0
Something went wrong while starting MongoDB. Check what has gone wrong with
MONGOMS_DEBUG=1 singularity daemon
If the error shows that libcrypto.so.1.1 cannot be found, try this solution.
Create a bug report or request a feature.