SIP: 5
Title: File Encryption and Erasure Encoding Standard
Author: Braydon Fuller <[email protected]>
Status: Active
Type: Standard
Created: 2017-05-01
This document describes a standard for storing files on the Storj network. It details the specification for encryption, erasure encoding and verification of file integrity and authenticity. Files are encrypted using AES-256-CTR, authenticated using HMAC-SHA512 and encoded with Reed-Solomon error correction codes. File and bucket names are encrypted using AES-256-GCM with built-in authentication.
A seed secret key, the Encryption Key, is used for all encryption of files and meta data, each with a derived sub-key. This secret is based on BIP39. A machine generated entropy of 128 to 256 bits is translated into several human words for the ease of writing down and transferring between devices and applications.
The file is encrypted using AES-256-CTR using these steps to derive an encryption key and counter:
- Generate the seed from the Encryption Key, sometimes referred to as the mnemonic words, see From mnemonic to seed on BIP39. Note that the passphrase is not used when generating in this case.
- The seed data is concatenated with the bucket id. Example:
[64 byte seed|12 byte bucket id]
- That buffer is hashed with sha512, and the first half (32 bytes) of the digest is the bucket key.
- The bucket key is then concatenated with a random 32 byte buffer, the file index. Example:
[32 byte bucket key|32 byte file index]
- That buffer is hashed with sha512, and the first half is the digest is the file key
- The first 16 bytes of the 32 byte random buffer in step 4, is used as the file counter
The use of counter mode provides the ability to decrypt and encrypt parts of the file without having access to the entire data. The counter can be incremented to the byte position of the file to decrypt different portions of the file. It's essential that the same counter is not reused with the same key to encrypt different data. The file index needs to be stored with the file meta data to later derive the file key.
A file name is encrypted using AES-256-GCM using these steps:
- Take the hmac sha512 digest using the bucket key from step 3 of File Encryption above, and update with a constant 32 byte bucket meta index of
42964710327258a0a3239a41a2d5e2d7468a393d3413d2aa26a4a2c856c90251
- The first 32 byte half of the digest is used as the file name key
- Next take the hmac sha512 digest using the same bucket key and this time updated with the bucket id and the file name
- The first 32 byte half is used as the file name iv
- The file name is then encrypted using the file name key and the file name iv
- The cypher text is the concatenated with the GCM digest and the IV. Example:
[16 byte GCM digest|32 byte file name iv|variable-length cypher text]
A bucket name is encrypted using AES-256-GCM using these steps:
- Generate the seed from the Encryption Key, as similar to step 1 in File Encryption
- Concatenate the seed with a constant bucket name index of
398734aab3c4c30c9f22590e83a95f7e43556a45fc2b3060e0c39fde31f50272
- Get the sha512 digest of the concatenated buffer in step 2
- The bucket name key is the first half of the sha512 digest using the buffer from step 3 as the key, and updated with the bucket meta index, as similar to step 1 in File Name Encryption.
- The bucket name iv is the first half of the sha512 digest using the buffer from step 3 as the key, and updated with the bucket name.
- As similar to step 6 of File Name Encryption the cypher text is concatenated with the GCM digest and IV. Example:
[16 byte GCM digest|32 byte file name iv|variable-length cypher text]
Files are encoded using Reed-Solomon (based on Vandermonde matrices) and can use any number of data to parity shard ratios, and is encoded after the encryption step. This provides the ability to reconstruct lost shards from a service that does not have access to the Encryption Key.
A file is broken up into several shards in multiples of 2MiB. There are 256 total possible data shards. The shard size for a file can be determined by selecting 4 hops back from the closest value to 2MiB * 2 ^ n
, example:
2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384
^ ^
shard size file size
All of the parity shards and data shards, except the last data shard, must be the same size. To account for the fact that the last shard will most always not equal the size of the other shards, the parity shards are generated with the assumption that the remaining extra space for the last shard are all zeros. This data can then be truncated off the file during recovery of the last shard. The size of each shard is necessary to reconstruct the data; this can be accomplished by marking which shards are parity and which are data.
Every file has an hmac sha512 generated to verify the integrity of the file as well as the authenticity. This provides a means to verify the file in such a way that only the person with the encryption key is able to verify.
The hmac sha512 is generated using the following steps:
- For each shard, including parity shards (after the previous encryption and encoding steps) take the ripemd160 digest of a sha256 digest of the shard data.
- Next take the hmac sha512 digest using the file key from step 5 of File Encryption from above, and update with each shard hash from the previous step.
The hmac will be in the form:
HMACSHA512([RIPMD160(SHA256(shard_1_data))|RIPMD160(SHA256(shard_2_data))|...])
Generating the hmac after encryption and encoding gives the ability to verify the integrity and authenticity before any shard data transfer, erasure decoding or decryption will take place. This not only verifies that the data has not been tampered with, but also that the correct key is being used to decrypt for usability purposes. The hmac needs to be stored with the file meta data.
https://github.com/storj/libstorj
- https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki
- https://en.wikipedia.org/wiki/Hash-based_message_authentication_code
- https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation
- https://en.wikipedia.org/wiki/Galois/Counter_Mode
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.