title |
---|
Notes on S3 |
- S3 is a global, public service. It is part of the AWS public network.
- S3 buckets are region-specific: each bucket lives in a single region.
- A bucket's resilience is regional resilience.
- Bucket names must be globally unique. While a bucket exists, its name cannot be used by another bucket in any region or any account.
- Bucket soft limit: 100 buckets per account. On request it can be raised to 1,000.
- The object size limit is 5 TB.
- The largest object that can be uploaded in a single PUT is 5 GB, so larger objects (up to 5 TB) must be sent as multiple parts.
- Objects are stored in a flat structure; there are no folders.
- Folders in the UI are represented by prefixes. There are no file names, only keys and values.
- The key `/jedi/anakin.jpg` is shown in the UI as if `jedi` were a folder and `anakin.jpg` a file.
- There is no limit on the number of objects in a bucket.
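
To make the prefix idea concrete, here is a minimal local sketch of how a flat set of keys can be presented as folders. It does not call S3; the function name `list_level` and the sample keys are made up for illustration, and the keys are written without a leading slash to keep the output simple.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Split flat keys under `prefix` into direct objects and folder-like prefixes."""
    objects, folders = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter, so show only the "folder" part.
            folders.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(folders)


# Hypothetical keys; the bucket itself stores them as a flat list.
keys = ["jedi/anakin.jpg", "jedi/yoda.jpg", "sith/vader.jpg", "readme.txt"]
top_objects, top_folders = list_level(keys)
jedi_objects, jedi_folders = list_level(keys, prefix="jedi/")
```

At the top level this yields one object (`readme.txt`) and two folder-like prefixes (`jedi/`, `sith/`), even though no folder was ever created.
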
- Only one bucket policy per bucket; a policy can have multiple statements.
- The combination of IAM policy and bucket policy applies to an authenticated IAM principal, while only the bucket policy applies to anonymous access.
- Versioning is disabled by default and can be changed to enabled.
- Once enabled, it cannot be changed back to disabled, but it can be suspended.
- Suspended versioning can be changed back to enabled.
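
The versioning rules above amount to a small state machine. The sketch below models only the transitions listed in these notes; the names `ALLOWED` and `change_versioning` are illustrative and not part of any AWS API.

```python
# Transitions as described in the notes: Disabled -> Enabled <-> Suspended.
ALLOWED = {
    "Disabled": {"Enabled"},
    "Enabled": {"Suspended"},
    "Suspended": {"Enabled"},
}


def change_versioning(current, target):
    """Return the new state, or raise if the notes do not allow the change."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"cannot change versioning from {current} to {target}")
    return target


state = change_versioning("Disabled", "Enabled")
state = change_versioning(state, "Suspended")
state = change_versioning(state, "Enabled")
```
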
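- For large objects, use multipart upload instead of a single PUT/POST request.
- Multipart upload is recommended once an object reaches about 100 MB. Each part except the last must be at least 5 MB.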
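
As a rough illustration of the arithmetic behind multipart uploads, the sketch below works out how many parts an object needs. The limits are S3's documented ones (parts of 5 MiB to 5 GiB, at most 10,000 parts per upload); the function name `plan_parts` and the 100 MiB default part size are assumptions chosen for the example.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

MIN_PART = 5 * MIB     # every part except the last must be at least this big
MAX_PART = 5 * GIB     # same ceiling as a single PUT
MAX_PARTS = 10_000     # maximum number of parts in one multipart upload


def plan_parts(object_size, part_size=100 * MIB):
    """Return (part_count, last_part_size) for a multipart upload."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if not MIN_PART <= part_size <= MAX_PART:
        raise ValueError("part_size must be between 5 MiB and 5 GiB")
    count = (object_size + part_size - 1) // part_size  # ceiling division
    if count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    last = object_size - (count - 1) * part_size
    return count, last


small = plan_parts(1 * GIB)                     # 100 MiB parts
large = plan_parts(5 * TIB, part_size=1 * GIB)  # needs bigger parts
```

A 5 TiB object cannot be sent in 100 MiB parts because it would exceed 10,000 parts, which is why the second call raises the part size.
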
- Use S3 Transfer Acceleration, which is disabled by default.
- Transfer Acceleration first sends the data to a nearby edge location, which then forwards it to the S3 bucket over the AWS network.
- To use it, the bucket name must be DNS-compatible and cannot contain periods.
- When Transfer Acceleration is enabled, the specific endpoint provided by S3 must be used.
- The Transfer Acceleration state can be switched between Enabled and Suspended. As with object versioning, once enabled it cannot be disabled, only suspended.
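
The naming rule and the dedicated endpoint can be sketched together. The helper below is hypothetical (`accelerate_endpoint` is not an SDK function); it assumes the usual bucket naming rules (3 to 63 characters, lowercase letters, digits, and hyphens) and the `<bucket>.s3-accelerate.amazonaws.com` endpoint form.

```python
import re

# Lowercase letters, digits, and hyphens; 3-63 characters; no periods.
_NAME = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def accelerate_endpoint(bucket):
    """Build the Transfer Acceleration endpoint for a bucket name."""
    if "." in bucket:
        raise ValueError("bucket names with periods cannot use Transfer Acceleration")
    if not _NAME.match(bucket):
        raise ValueError("bucket name is not DNS-compatible")
    return f"https://{bucket}.s3-accelerate.amazonaws.com"


endpoint = accelerate_endpoint("my-photos")  # hypothetical bucket name
```
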
General notes about encryption
- In S3, buckets are not encrypted; encryption is applied only at the object level.
- When a bucket has a default encryption setting and an object requests a different encryption method, the object's method takes precedence.
- Three parties are involved: the client, S3 (the service endpoint), and S3 storage.
- S3 sits in the middle, between the client and S3 storage.
- Data sent from the client to S3 is protected by encryption in transit (typically HTTPS).
- Data written by S3 to S3 storage can be protected by encryption at rest.
- There are two types of encryption at rest: client-side and server-side.
- Client-side encryption: the client encrypts the object before sending it to S3, and S3 stores it in S3 storage as is. S3 does not apply its own encryption when the client has already encrypted the data.
- Server-side encryption: the client sends the object to S3 in plaintext (inside the encrypted connection). S3 encrypts it and stores it in S3 storage, encrypted at rest.
- There are three server-side encryption methods:
- SSE-C (Server-Side Encryption with Customer-provided keys)
- SSE-S3 (Server-Side Encryption with S3-managed keys)
- SSE-KMS (Server-Side Encryption with CMKs stored in KMS) (CMKs and KMS are covered in the KMS document)
- SSE-C: the customer manages the key, while S3 manages the encryption process.
- The customer sends the key and the plaintext object to S3.
- S3 encrypts the data, attaches a hash of the key, and discards the key before writing to S3 storage.
- To decrypt, the client sends the key again. S3 fetches the encrypted data and the stored hash and checks the key against the hash. Once verified, S3 decrypts the data, discards the key, and returns the decrypted data to the client.
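
With SSE-C, the key travels with each request as HTTP headers. The sketch below builds those headers with the standard library only; the header names are the ones S3 documents for SSE-C, while the function name `sse_c_headers` is made up for this example. The MD5 digest here is an integrity check on the key sent with the request.

```python
import base64
import hashlib
import os


def sse_c_headers(key: bytes) -> dict:
    """Build the request headers that carry a customer-provided key."""
    if len(key) != 32:
        raise ValueError("SSE-C expects a 256-bit (32-byte) key")
    key_md5 = hashlib.md5(key).digest()
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        "x-amz-server-side-encryption-customer-key": base64.b64encode(key).decode("ascii"),
        "x-amz-server-side-encryption-customer-key-MD5": base64.b64encode(key_md5).decode("ascii"),
    }


# The same key must be sent again on every later GET of the object.
headers = sse_c_headers(os.urandom(32))
```
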
- SSE-S3: S3 manages the master key.
- Each new object gets its own key.
- The object is encrypted/decrypted with its own key, and that key is encrypted/decrypted with the master key.
- The encrypted object and its encrypted key are stored side by side.
- Role separation is not possible with this method.
- AES-256 is the algorithm used.
- Header: `x-amz-server-side-encryption`, value: `AES256`.
- SSE-KMS: S3 creates or uses a CMK in KMS.
- KMS uses this CMK to generate a DEK (Data Encryption Key) and returns both a plaintext copy and an encrypted copy of the DEK to S3.
- S3 encrypts the data with the plaintext DEK, discards it, and stores the encrypted data together with the encrypted DEK in S3 storage.
- On decryption, the CMK decrypts the DEK, and the decrypted DEK decrypts the object.
- Allows role separation and control over key rotation.
- Uses envelope encryption.
- Header: `x-amz-server-side-encryption`, value: `aws:kms`.
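
The envelope encryption pattern (a per-object key wrapped by a master key) can be sketched locally. This is a toy: the standard library has no AES, so `_toy_cipher` is a SHA-256 keystream XOR used purely as a stand-in and is not secure. All function names are hypothetical, and the master key here is a local value rather than a real CMK.

```python
import hashlib
import os


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Toy stand-in for AES; NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def envelope_encrypt(master_key: bytes, plaintext: bytes) -> dict:
    dek = os.urandom(32)  # fresh data encryption key for this object
    # Only the ciphertext and the wrapped DEK are kept; the plaintext DEK is not.
    return {
        "ciphertext": _toy_cipher(dek, plaintext),
        "encrypted_dek": _toy_cipher(master_key, dek),
    }


def envelope_decrypt(master_key: bytes, stored: dict) -> bytes:
    dek = _toy_cipher(master_key, stored["encrypted_dek"])  # unwrap the DEK
    return _toy_cipher(dek, stored["ciphertext"])           # then the object


master_key = os.urandom(32)
stored = envelope_encrypt(master_key, b"these are not the droids you are looking for")
recovered = envelope_decrypt(master_key, stored)
```

Because the master key only ever touches the small DEK, it can be rotated or access-controlled separately from the object data.
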
Six classes
- S3 Standard (Default)
- S3 Standard IA
- S3 One Zone IA
- S3 Glacier
- S3 Glacier Deep Archive
- S3 Intelligent Tiering
- IA stands for Infrequent Access.
- S3 Standard: the default class.
- Replicated across at least 3 AZs.
- Low latency, high throughput.
- No minimum size, minimum duration, retrieval delay, or retrieval penalty.
- Should be the default choice; use another class only with a strong reason.
- 99.99% availability.
- S3 Standard IA: for data accessed less frequently but needing rapid access.
- Cheaper base storage than S3 Standard.
- 128 KB minimum charge per object. (An object smaller than 128 KB is still charged as 128 KB.)
- 30-day minimum duration charge per object. (An object stored for one day is charged for 30 days.)
- Per-GB data retrieval fee.
- Suited to long-term storage, backups, or disaster recovery data.
- 99.9% availability.
- S3 One Zone IA: same trade-offs as S3 Standard IA.
- Cheaper than S3 Standard IA.
- Data is stored in a single AZ only.
- Good for secondary copies and easily re-creatable, infrequently accessed data.
- 99.5% availability.
- S3 Glacier: designed for archival data.
- 40 KB minimum charge per object. (An object smaller than 40 KB is still charged as 40 KB.)
- 90-day minimum duration charge per object. (An object stored for one day is charged for 90 days.)
- Expedited retrieval: 1-5 minutes
- Standard retrieval: 3-5 hours
- Bulk retrieval: 5-12 hours
- Provisioned capacity: ensures that expedited retrieval capacity is available when needed.
- Each unit of capacity ensures that at least 3 expedited retrievals can be performed every 5 minutes and provides up to 150 MB/s of throughput.
- S3 Glacier Deep Archive: almost the same as Glacier; it is a tape-drive replacement for truly long-term backups.
- 180-day minimum duration charge per object. (An object stored for one day is charged for 180 days.)
- Expedited retrieval: not available
- Standard retrieval: up to 12 hours
- Bulk retrieval: up to 48 hours
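
The minimum-size and minimum-duration charges can be worked through with a small calculation. The table below uses the figures from these notes; the class labels and the `billable` function are illustrative, and the 40 KB minimum for Deep Archive is an assumption taken from the "almost the same as Glacier" remark.

```python
# (minimum billable size in KB, minimum billable duration in days)
MINIMUMS = {
    "STANDARD": (0, 0),
    "STANDARD_IA": (128, 30),
    "ONEZONE_IA": (128, 30),
    "GLACIER": (40, 90),
    "DEEP_ARCHIVE": (40, 180),  # size minimum assumed to match Glacier
}


def billable(storage_class, size_kb, days_stored):
    """Return the (size_kb, days) that the object is charged for."""
    min_kb, min_days = MINIMUMS[storage_class]
    return max(size_kb, min_kb), max(days_stored, min_days)


tiny_ia = billable("STANDARD_IA", 10, 1)         # charged as 128 KB for 30 days
big_glacier = billable("GLACIER", 500, 200)      # already above both minimums
tiny_archive = billable("DEEP_ARCHIVE", 10, 1)   # charged as 40 KB for 180 days
```
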
- S3 Intelligent Tiering: charges a monitoring fee of $0.0025 per 1,000 objects.
- Has frequent access and infrequent access tiers.
- S3 monitors usage; if an object is not accessed for 30 days, it is moved to the infrequent access tier.
- Once an object in the infrequent access tier is accessed, it is moved back to the frequent access tier.
- Lifecycle configuration: a set of rules, each consisting of actions to perform on a bucket or a group of objects.
- Transition action: changes the storage class of the objects.
- Expiration action: expires (deletes) the objects.
- Transitions follow a waterfall model. Lifecycle rules can move an object from a more frequently accessed class to a less frequently accessed one, but not in the reverse direction.
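
The waterfall rule can be expressed as a simple ordering check. The ordering below is an assumption based on the six classes listed in these notes (with Intelligent Tiering placed between Standard IA and One Zone IA); `WATERFALL` and `can_transition` are hypothetical names, not part of any AWS API.

```python
# Assumed order, from most to least frequently accessed.
WATERFALL = [
    "S3 Standard",
    "S3 Standard IA",
    "S3 Intelligent Tiering",
    "S3 One Zone IA",
    "S3 Glacier",
    "S3 Glacier Deep Archive",
]


def can_transition(source, target):
    """A lifecycle transition may only move down the waterfall."""
    return WATERFALL.index(source) < WATERFALL.index(target)


down = can_transition("S3 Standard", "S3 Glacier")
up = can_transition("S3 Glacier", "S3 Standard")
```
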
- Cross-region replication: objects are replicated from a bucket in one region (e.g. ap-southeast-2) to a bucket in another region (e.g. us-east-1).
- Same-region replication: objects are replicated between buckets in the same region. (This is a more recent feature.)
- Replication is not retroactive: objects created before replication was enabled are not replicated to the new bucket.
- Both source and destination buckets need to have versioning on.
- Replication is one-way, from source to destination.
- Objects encrypted with SSE-C (or any other method where AWS does not have access to the key) are not replicated.
- For other encryption methods, extra configuration is required, such as an additional key.
- The owner of the source bucket needs permissions on the objects being replicated.
- System events and objects in Glacier or Glacier Deep Archive are not replicated.
- Deletion is not replicated.
- Presigned URLs: a user can create a URL for any object, even one the user does not have access to.
- All policies still apply, though. A URL generated for an object the user cannot access will not grant access to it.
- When the URL is used, the permissions of the identity that generated it apply.
- The permissions applied are the current ones, not those in effect when the URL was generated.
- Generate presigned URLs with long-term IAM user credentials, not roles. Remember: roles use temporary credentials that must be renewed, and a presigned URL cannot renew them, so it stops working once they expire.
- S3 Select and Glacier Select: by default, S3 and Glacier operations work on the full object.
- So when we use or download an object, bandwidth for the full object size is consumed even if we need only part of it. (E.g. for a 5 GB data dump, we download the full 5 GB first and then use part of it, even though the part we need is under 100 MB.)
- S3/Glacier Select lets us use SQL-like statements.
- Only the result of that SQL-like statement is returned to the user.
- Only certain formats are supported:
- CSV and JSON (optionally compressed with GZIP or BZIP2) or Apache Parquet.