Skip to content

S3 Storage

lyx edited this page Jan 17, 2025 · 1 revision

In the S3Stream repository, S3 storage serves as a core component. The Write-Ahead Log (WAL) is used solely for write acceleration and fault recovery, while S3 is the actual storage location for data. With vast amounts of data migrating to the cloud, object storage has become the de facto storage engine for big data and data lake ecosystems. Today, we see numerous data-intensive software applications transitioning from file APIs to object APIs, and streaming data integration into data lakes, represented by Kafka, is a prevailing trend.

S3Stream leverages the Object API to provide efficient streaming data ingestion and retrieval. By employing a storage-compute separation architecture, it integrates Apache Kafka's storage layer with object storage. This approach fully capitalizes on the technical and cost advantages brought by shared storage:

  • Alibaba Cloud's object storage standard edition with intra-city redundancy is priced at 0.12 RMB/GiB/month, which is more than 8 times cheaper than ESSD PL1's pricing (1 RMB/1 GiB/month). Additionally, object storage inherently offers multi-availability zone availability and durability, eliminating the need for additional data replication. Compared to the traditional 3-replica architecture based on cloud disks, this can reduce costs by 25 times.

  • The shared storage architecture, in contrast to the Shared-Nothing architecture, truly embodies storage-compute separation, where data is not bound to computing nodes. Consequently, AutoMQ can perform partition reassignment without data duplication, achieving truly lossless partition reassignment in seconds. This capability is fundamental to supporting AutoMQ's real-time traffic self-balancing and second-level node horizontal scaling.

S3 Storage Architecture

All data in AutoMQ is stored in object storage through S3Stream, defining two types of Objects within object storage:

  • Stream Set Object: When uploading WAL data, the majority of smaller Streams are consolidated and uploaded as a single Stream Set Object.

  • Stream Object: The Stream Object contains data from a single Stream, facilitating precise data deletion for Streams with different lifecycles.

Data from a single Kafka partition is mapped onto multiple Streams, with two core components:

  • Metadata Stream: This stream stores data indices, Kafka LeaderEpoch snapshots, Producer snapshots, and other metadata.

  • Data Stream: This stream holds the complete Kafka data within the partition.

Metadata information of objects on S3 is stored in KRaft. To reduce the scale of metadata in scenarios involving a large number of partitions, the S3Stream internally provides two compaction mechanisms to merge small objects and thereby reduce the metadata scale.

StreamSet Object Compact

StreamSet Object Compaction is performed periodically in the background on the Broker at 20-minute intervals. Similar to RocksDB SST, StreamSet Object Compaction selects an appropriate list of StreamSet Objects on the current Broker based on certain policies and merges them using a merge sort method:

  • Streams that exceed 16MiB after merging will be split into individual Stream Objects and uploaded separately;

  • The remaining Streams are merged and written into a new StreamSet Object using merge sort.

Merge sorting for StreamSet Object compaction can merge up to 15TiB of StreamSet Objects under the constraints of 500MiB memory usage and a 16MiB read/write range size.

Through StreamSet Object Compact, fragmented small stream data segments are merged into larger data segments, significantly reducing API calls and enhancing read efficiency for cold data catch-up read scenarios.

Stream Object Compact

The core purpose of Stream Object Compact is to minimize the total amount of metadata required to maintain object mapping in the cluster and to improve the aggregation of Stream Object data, thereby reducing API call costs for cold data catch-up read.

Stream Objects involved in Compact are typically already 16MB, meeting the minimum part limit of object storage. Stream Object Compact uses the MultiPartCopy API of object storage to directly perform range copy uploads, avoiding the waste of network bandwidth from reading and then writing back to object storage.

Multi-Bucket Architecture

As one of the most critical storage services offered by various cloud providers, object storage provides twelve nines of data durability and up to four nines of availability. However, software failures can never be completely eliminated, and object storage can still experience significant software failures, rendering AutoMQ services unavailable.

On the other hand, multi-cloud, multi-region, and even hybrid cloud architectures are gradually emerging to meet enterprises' more flexible IT governance needs.

In view of this, the AutoMQ Business Edition innovatively adopts a multi-bucket architecture to further enhance system availability and meet enterprises' more flexible IT governance needs.

AutoMQ Business Edition supports configuring one or multiple buckets for data storage, and S3Stream supports four different write strategies to meet various business scenarios.

Round Robin

Multiple Buckets are treated equally, and data is written to them in a round-robin manner. This method is generally used to bypass the bandwidth limitations imposed by Cloud providers on a single Bucket or account. For instance, if a single Bucket supports only 5 GiB/s of bandwidth, combining two Buckets can achieve 10 GiB/s of bandwidth, supporting ultra-high traffic business scenarios.

Failover

Object storage can still experience failures, and software-level failures can sometimes be more severe than zone-level failures. For business scenarios with extremely high availability requirements, data can be written to two Buckets using a failover approach. The Bucket configuration in a failover scenario might be:

  • One as the primary Bucket, located in the same region as the business, where data is preferentially written.

  • Another as a standby Bucket, created in a different region or even on a different Cloud. Connectivity between the primary and standby regions is established through dedicated lines or the public internet. When the primary region's object storage is unavailable, new data is submitted to the standby Bucket. Although the standby link incurs higher network costs, these costs are relatively controllable since they only occur when the primary Bucket is unavailable.

This write strategy significantly enhances AutoMQ's disaster recovery scenarios, enabling AutoMQ to build multi-region disaster recovery, multi-cloud disaster recovery, and even hybrid cloud disaster recovery architectures at a low cost.

Replication

A failover-based multi-bucket architecture only redirects write traffic during failures, while read operations must wait for delayed reads. If your business cannot tolerate read delays during failures and requires an active-active read/write architecture, you can configure the write strategy to be in a replication mode, where data is synchronously written to multiple buckets.

This approach is very costly and is only suitable for mission-critical applications.

Dynamic Sharding

Data is written to multiple buckets in a dynamic, adjustable ratio, with round-robin being a specific scenario of this strategy. Dynamic sharding write methods are often suitable for multi-cloud and hybrid cloud architectures. In this architecture, you configure buckets across different cloud services or combine Public Cloud and Private Cloud object storage. By dynamically adjusting the traffic ratio, the architecture can always maintain the flexibility for multi-cloud migration or even cloud exit.

AutoMQ Wiki Key Pages

What is automq

Getting started

Architecture

Deployment

Migration

Observability

Integrations

Releases

Benchmarks

Reference

Articles

Clone this wiki locally