Architecture
This page outlines the general architecture and design principles of COMS. It is mainly intended for a technical audience, and for people who want to have a better understanding of how the system works.
The Common Object Management Service is a cloud-native, containerized microservice. It is primarily designed for use within a Kubernetes/OpenShift container ecosystem, and it can leverage horizontal scaling by monitoring and reacting to incoming load and demand. The following diagram provides a general logical overview of the main component relationships. Main network traffic flows are shown with fat arrows, while secondary network traffic relationships are shown with a simple black line.
Figure 1 - The general infrastructure and network topology of COMS
Figure 1 depicts what a fully deployed COMS ecosystem could look like. However, COMS can operate without a database and provide simpler S3 interfacing options. Removing the database reduces deployment lifecycle complexity, which can be a viable deployment pattern depending on your needs.
The COMS API and Database are both designed to be highly available within an OpenShift environment. The Database achieves high availability by leveraging Patroni. COMS is designed to be a scalable and atomic microservice. On the OCP4 platform, there can be between 2 and 16 running replicas of the COMS microservice depending on service load. This allows the service to reliably handle a large variety of request volumes and scale resources appropriately.
In general, all network traffic enters through the API Gateway. A specifically tailored Network Policy rule allows only the network traffic we expect to receive from the API Gateway. When a client connects to the COMS API, they pass through OpenShift's router and load balancer before landing on the API gateway. That connection then gets forwarded to one of the COMS API pod replicas. Figure 1 represents the general network traffic direction with the outlined fat arrows. The direction of those arrows indicates which component initiates the TCP/IP connection.
COMS uses a database connection pool to maintain persistent database connections. Pooling allows the service to avoid the overhead of a repeated TCP/IP 3-way handshake for every new connection. By reusing existing connections in the pool, we can pipeline requests and improve network efficiency. Within our architecture, we pool connections from COMS to Patroni. The OpenShift 4 load balancer follows general default Kubernetes scheduling behavior.
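As a rough illustration of how such a pool is typically configured in a Node.js service, the following sketch uses Knex's built-in connection pooling. The connection details and pool sizes shown here are placeholder assumptions, not COMS's actual configuration.

```js
// knexfile.js - illustrative connection pool configuration (values are examples only)
module.exports = {
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,        // e.g. the Patroni leader service
    port: process.env.DB_PORT || 5432,
    user: process.env.DB_USERNAME,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_DATABASE
  },
  pool: {
    min: 2,   // keep a few warm connections per pod to skip handshake overhead
    max: 10   // cap connections so that many replicas do not exhaust Postgres' limit
  }
};
```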
User clients of COMS will generally need to authenticate with the standard OpenID Connect flow before requesting a specific resource. Figure 2 shows the general step-by-step connection order that occurs between the client, COMS and S3.
Figure 2 - The general network flow for a typical COMS object request
The most anticipated use case for COMS is downloading a specific object. Clients first authenticate to acquire a JWT. Assuming clients are authorized, their request for an object resource yields a 302 temporary redirect pointing directly to the S3 endpoint. The client can then download their object by following the redirect. Because the 302 redirect link is designed to be temporary (the signature of the link eventually expires, usually after 300 seconds), unauthorized direct access to the S3 store is not possible: COMS must generate a signed temporary link on the client's behalf.
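The sketch below shows one way such a temporary signed link could back the 302 redirect, using the AWS SDK v3 presigner. The bucket, key, route parameter, and 300-second expiry are illustrative assumptions rather than code taken from COMS.

```js
// Illustrative Express handler that responds with a short-lived, signed S3 URL.
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
const { getSignedUrl } = require('@aws-sdk/s3-request-presigner');

const s3 = new S3Client({
  endpoint: process.env.S3_ENDPOINT,  // S3-compatible object store endpoint
  region: 'us-east-1',
  forcePathStyle: true
});

async function redirectToObject(req, res) {
  const command = new GetObjectCommand({
    Bucket: process.env.S3_BUCKET,
    Key: req.params.objectPath        // hypothetical route parameter
  });
  // After expiresIn seconds the signature is rejected by S3, so the link cannot be reused indefinitely.
  const url = await getSignedUrl(s3, command, { expiresIn: 300 });
  res.redirect(302, url);
}
```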
COMS supports either the redirect flow illustrated in Figure 2, or serving the object directly through COMS as a proxy. COMS uses the redirect flow by default because it avoids unnecessary network hops. For very large object transfers, redirection also has the added benefit of maximizing COMS microservice availability: since the large transfer does not pass through COMS, the service remains free to handle other client requests.
The Postgres database schema is written and managed via code-first migrations. We generally store tables containing users, objects, buckets, permissions, and how they relate to each other. As COMS is a backend microservice, lines of business can leverage COMS without being tied to a specific framework or language. The following figures depict the database schema structure as of April 2023 for the v0.4.0 release.
Figure 3 - The public schema for a COMS database
Database design focuses on simplicity and succinctness. It effectively tracks the user, the object, the bucket, the permissions, and how they relate to each other. We enforce foreign key integrity by invoking onUpdate and onDelete cascades in Postgres. This ensures that we do not have dangling references when entries are removed from the system. Metadata and tags are represented as many-to-many relationships to maximize reverse search speed.
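As an example of how such cascading foreign keys can be declared in a code-first migration, the sketch below uses Knex. The table and column names are simplified illustrations rather than the exact COMS schema.

```js
// Illustrative migration: a join table whose foreign keys cascade on update and delete.
exports.up = (knex) =>
  knex.schema.createTable('object_permission', (table) => {
    table.uuid('id').primary();
    table.uuid('objectId').notNullable()
      .references('id').inTable('object')
      .onUpdate('CASCADE')
      .onDelete('CASCADE');   // removing an object removes its permission rows
    table.string('userId').notNullable()
      .references('userId').inTable('user')
      .onUpdate('CASCADE')
      .onDelete('CASCADE');   // removing a user removes their grants
    table.string('permCode').notNullable();
  });

exports.down = (knex) => knex.schema.dropTableIfExists('object_permission');
```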
Figure 4 - The audit schema for a COMS database
We use a generic audit schema table to track any update and delete operations performed on the database. This table is only modified by the database itself via table triggers, and is not normally accessible by the COMS application. This should meet most general security, tracking and auditing requirements.
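The sketch below is one way such a trigger-driven audit table could be wired up from a migration, assuming an audit.audit_history table already exists. The function, trigger, and column names are assumptions for illustration and do not reflect the actual COMS audit implementation.

```js
// Illustrative migration: an audit trigger that records UPDATE/DELETE operations.
exports.up = (knex) =>
  knex.raw(`
    CREATE OR REPLACE FUNCTION audit.if_modified_func() RETURNS trigger AS $$
    BEGIN
      INSERT INTO audit.audit_history (table_name, db_user, action, original_data)
      VALUES (TG_TABLE_NAME, session_user, TG_OP, row_to_json(OLD));
      RETURN NULL; -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER audit_object_trigger
    AFTER UPDATE OR DELETE ON public.object
    FOR EACH ROW EXECUTE PROCEDURE audit.if_modified_func();
  `);

exports.down = (knex) =>
  knex.raw(`
    DROP TRIGGER IF EXISTS audit_object_trigger ON public.object;
    DROP FUNCTION IF EXISTS audit.if_modified_func();
  `);
```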
COMS is a relatively small and compact microservice with a very focused approach to handling and managing objects. However, not all design choices are self-evident just from inspecting the codebase. The following section will cover some of the main reasons why the code was designed the way it is.
The code structure in COMS follows a simple, layered design based on best practice recommendations from Express, Node, and ES6 coding styles. The application has the following discrete layers:
| Layer | Purpose |
| --- | --- |
| Controller | Contains Express controller logic for determining what services to invoke and in what order |
| DB | Contains the direct database table model definitions and typical modification queries |
| Middleware | Contains middleware functions for handling authentication, authorization and feature toggles |
| Routes | Contains the defined Express routes that shape the COMS API and invoke controllers |
| Services | Contains logic for interacting with either S3 or the Database for specific tasks |
| Validators | Contains logic which examines and enforces incoming request shapes and patterns |
Each layer is designed to focus on one specific aspect of business logic. Calls between layers are designed to be deliberate, scoped, and contained. This hopefully makes it easier to tell at a glance what each piece of code is doing and what it depends on. For example, the validation layer sits between the routes and controllers. It ensures that incoming network calls are properly formatted before proceeding with execution.
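As a rough illustration of what a validator in that layer might look like, the sketch below checks the shape of an incoming request before the controller runs. The schema library (Joi), the field names, and the 422 response are assumptions for illustration, not the exact COMS validators.

```js
// Illustrative validator middleware: rejects malformed requests before controller logic runs.
const Joi = require('joi');

const createBucketSchema = Joi.object({
  bucketName: Joi.string().max(255).required(),
  endpoint: Joi.string().uri().max(255).required(),
  key: Joi.string().max(255)
});

function validateCreateBucket(req, res, next) {
  const { error } = createBucketSchema.validate(req.body);
  if (error) {
    // Fail fast with a structured validation error instead of invoking the controller.
    return res.status(422).json({ details: error.details.map((d) => d.message) });
  }
  next();
}

module.exports = { validateCreateBucket };
```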
COMS middleware focuses on ensuring that the appropriate business logic filters are applied as early as possible. Concerns such as feature toggles, authentication and authorization are handled here. Express executes middleware in the order it is introduced: each middleware runs and then invokes the next callback as a part of its call stack. Because of this, we must ensure that the order in which we introduce and execute our middleware adheres to the following pattern (illustrated in the sketch after the list):
- Run the `require*` middleware functions first (these generally involve the middleware found in `featureToggle.js`)
- Permission and authorization checks
- Validation and structural checks
- Any remaining middleware hooks before invoking the controller
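The following sketch shows this ordering applied to a single route. The middleware and controller names are representative placeholders rather than the exact COMS exports.

```js
// Illustrative route wiring that applies middleware in the order described above.
const router = require('express').Router();

const { requireDb, requireSomeAuth } = require('../middleware/featureToggle'); // hypothetical names
const { hasPermission } = require('../middleware/authorization');              // hypothetical
const { validateReadObject } = require('../validators/object');                // hypothetical
const { objectController } = require('../controllers');                        // hypothetical

router.get(
  '/:objectId',
  requireDb,              // 1. require* checks: fail fast if this endpoint needs a database
  requireSomeAuth,        //    and if some form of authentication is enabled
  hasPermission('READ'),  // 2. permission and authorization checks
  validateReadObject,     // 3. validation and structural checks
  (req, res, next) => objectController.readObject(req, res, next) // 4. controller
);

module.exports = router;
```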
This execution order must be followed both to maximize security and to fail fast where it makes sense. For example, suppose COMS is running without a database. In this situation, should a client call an endpoint that does not function in this mode, it is more beneficial to immediately let them know than to mask that fact behind authorization.
On the other hand, suppose an unauthorized client performs a malformed request. We should block this request and return an HTTP 403 as soon as possible because the client is not authorized to use the service; there is no logical reason to even check the shape of the request. Executing middleware in the right order addresses both security and accessibility concerns while minimizing the amount of logic that must run before the service can respond to the client.
We introduced network pooling for Patroni connections to mitigate network traffic overhead. As our volume of traffic increased, it became expensive to create and destroy network connections for each transaction. While low volumes of traffic could operate without any notable delay to the user, we started encountering issues when scaling up and improving total transaction flow within COMS.
By reusing connections whenever possible, we were able to avoid the TCP/IP 3-way handshake done on every new connection. Instead we could leverage existing connections to pipeline traffic and improve general efficiency. We observed almost a 3x increase in total transaction volume flow by switching to pooling.
In order to make sure our application can horizontally scale (run many copies of itself), we had to ensure that all processes in the application are self-contained and atomic. Since we have no guarantee of which pod instance will be handling which task at any given moment, we must ensure that every unit of work is clearly defined and atomic, preventing deadlocks and double executions.
While implementing horizontal autoscaling is relatively simple using a Horizontal Pod Autoscaler construct in OpenShift, we can only take advantage of it if the application is able to handle the different types of lifecycles. Based on usage metrics such as CPU and memory load, the HPA can increase or decrease the number of replicas on the platform in order to meet demand.
In our testing, we found that we were able to reliably scale up to around 17 pods before we began to crash out our Patroni database. While we have not been able to reliably isolate the cause of this, we suspect that the underlying Postgres database can only handle up to 100 concurrent connections (and is thus ignoring Patroni's max connection limit of 500), or that the database containers are simply running out of memory before being able to handle more connections. This is why we decided to cap our HPA at a maximum of 16 pods at this time.
Our current limiting factor for scaling higher is the database's ability to support more connections. If we get into a situation where we need to scale past 16 pods, we will need to consider more managed solutions for pooling database connections, such as PgBouncer.