Architecture
This page outlines the general architecture and design principles of COMS. It is mainly intended for a technical audience and for people who want a better understanding of how the system works.
The Common Object Management Service is designed from the ground up to be a cloud-native containerized microservice. It is intended to be operated within a Kubernetes/OpenShift container ecosystem, where it can dynamically scale based on incoming load and demand. The following diagram provides a general logical overview of how the main components relate to one another, and how network traffic generally flows between the components as of April 2022.
Figure 1 - The general infrastructure and network topology of COMS
Figure 1 depicts what a fully deployed COMS ecosystem could look like. However, COMS can operate without a database and provide simpler S3 interfacing options. Depending on your needs, you may choose to forgo a database, which simplifies the deployment lifecycle.
The COMS API and Database are both designed to be highly available within an OpenShift environment. The Database achieves high availability by leveraging Patroni. COMS is designed to be a scalable and atomic microservice. On the OCP4 platform, depending on service load, there can be between 2 to 16 running replicas of the COMS microservice. This allows the service to reliably handle a wide range of request volumes and scale resources accordingly.
In general, all network traffic follows the standard OpenShift Route to Service pattern. When a client connects to the COMS API, they will go through OpenShift's router and load balancer, which will then forward that connection to one of the COMS API pod replicas. Figure 1 represents the general direction of network traffic with the outlined fat arrows, where the direction of each arrow indicates which component initiates the TCP/IP connection.
Since this service depends on persistent database connections, we have them configured to leverage a network pool. This allows the service to avoid the overhead of the TCP/IP 3-way handshake on every new connection and instead reuse existing connections to pipeline traffic and improve general efficiency. We pool connections from COMS to Patroni within our architecture. The OpenShift 4 Route and load balancer follow the general default scheduling behavior as defined by Kubernetes.
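As a rough illustration, the sketch below shows how a persistent connection pool might be configured with a query builder such as Knex; the connection details and pool sizes are placeholders rather than the actual COMS configuration.

```js
// knexfile.js (illustrative) - pooled Postgres connections instead of a new
// TCP connection (and 3-way handshake) per transaction.
module.exports = {
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,       // points at the Patroni service in OpenShift
    port: 5432,
    user: process.env.DB_USERNAME,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_DATABASE
  },
  pool: {
    min: 2,   // keep a couple of connections warm at all times
    max: 10   // cap per-replica connections to protect the database
  }
};
```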
User clients using COMS will generally follow the standard OpenID Connect flow to authenticate themselves before requesting a specific resource. Figure 2 shows the general step-by-step connection order that occurs between the client, COMS, and S3.
Figure 2 - The general network flow for a typical COMS object request
In most cases where the user is requesting to download a specific object, they will connect to COMS after authenticating. COMS will generate a 302 temporary redirect link pointing directly to the S3 endpoint, and the client can download the object after following the redirect. As the 302 redirect link is designed to be temporary (the signature of the link will eventually expire, usually after 300 seconds), unauthorized direct access to the S3 store is not possible, as you require COMS to generate a signed temporary link on your behalf.
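A minimal sketch of this redirect flow is shown below, using the AWS SDK's request presigner inside an Express handler. The bucket, key lookup, and endpoint values are illustrative assumptions, not the exact COMS implementation.

```js
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
const { getSignedUrl } = require('@aws-sdk/s3-request-presigner');

// Placeholder S3 configuration; COMS derives these values from its own config.
const s3 = new S3Client({
  region: 'us-east-1',
  endpoint: process.env.S3_ENDPOINT,
  forcePathStyle: true
});

// Express handler: after authentication (middleware not shown), hand the
// client a short-lived signed link and redirect them straight to S3.
async function readObject(req, res, next) {
  try {
    const command = new GetObjectCommand({
      Bucket: process.env.S3_BUCKET,
      Key: req.params.objectId
    });
    // The signature expires after 300 seconds, so the link cannot be reused indefinitely.
    const url = await getSignedUrl(s3, command, { expiresIn: 300 });
    res.redirect(302, url);
  } catch (err) {
    next(err);
  }
}
```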
While COMS can support direct file downloads in addition to the redirect flow shown in Figure 2, COMS defaults user clients to the redirect flow because it minimizes unnecessary network hops between disparate infrastructure endpoints. When objects are very large, redirection has the added benefit of ensuring that the COMS microservice does the minimum amount of work required to get users where they need to go, maximizing the availability of the service.
The Postgres database is written and handled via managed, code-first migrations as specified in the repository. We generally store tables regarding users, objects, permissions, and how they relate to each other. Since COMS itself is a backend-focused microservice, any kind of frontend line-of-business application can leverage COMS to extend its features and capabilities without being tied to a specific framework. The following figures depict the database schema structure as of April 2022 for the v0.1.0 release.
Figure 3 - The public schema for a COMS database
For the most part, the database design is very simple and succinct, focusing only on the user, the object, the permissions, and how they relate to each other. We enforce foreign key integrity by invoking onUpdate and onDelete cascades in Postgres, ensuring that when entries are removed from the system, we do not have dangling references.
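As a hedged illustration of how these cascades might look in a code-first migration, the sketch below defines a join table with cascading foreign keys using Knex. The table and column names are placeholders and do not necessarily match the exact COMS schema.

```js
// Illustrative migration: a permission join table whose foreign keys cascade
// on update and delete, so removing a user or object removes the related rows.
exports.up = (knex) =>
  knex.schema.createTable('object_permission', (table) => {
    table.uuid('id').primary();
    table.uuid('objectId').notNullable()
      .references('id').inTable('object')
      .onUpdate('CASCADE').onDelete('CASCADE');
    table.string('userId').notNullable()
      .references('userId').inTable('user')
      .onUpdate('CASCADE').onDelete('CASCADE');
    table.string('permCode').notNullable();
    table.timestamps(true, true);
  });

exports.down = (knex) => knex.schema.dropTableIfExists('object_permission');
```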
Figure 4 - The audit schema for a COMS database
In order to satisfy general security, tracking and auditing requirements, we use a generic audit schema table to track any update and delete operations done on the database. This table is only written to directly by the database via table triggers, and is not normally accessible by the COMS application itself.
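The sketch below shows one way such a database-side trigger could be set up from a migration using raw SQL; the function, schema, and table names are illustrative placeholders rather than the actual COMS audit implementation.

```js
// Illustrative migration: an AFTER UPDATE OR DELETE trigger that writes the
// previous row state into an audit table, entirely inside the database.
exports.up = (knex) => knex.raw(`
  CREATE OR REPLACE FUNCTION audit.if_modified() RETURNS trigger AS $$
  BEGIN
    INSERT INTO audit.audit_history (table_name, db_user, action, original_data)
    VALUES (TG_TABLE_NAME, session_user, TG_OP, row_to_json(OLD));
    RETURN NULL; -- result is ignored for AFTER triggers
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER audit_object_trigger
    AFTER UPDATE OR DELETE ON public.object
    FOR EACH ROW EXECUTE PROCEDURE audit.if_modified();
`);

exports.down = (knex) => knex.raw(`
  DROP TRIGGER IF EXISTS audit_object_trigger ON public.object;
  DROP FUNCTION IF EXISTS audit.if_modified();
`);
```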
While COMS itself is a relatively small and compact microservice with a very focused approach to handling and managing objects, not all design choices are self-evident just from inspecting the codebase. The following section will cover some of the main reasons why the code was designed the way it is.
The code structure in COMS follows a simple, layered structure following best practice recommendations from Express, Node, and ES6 coding styles. The application has the following discrete layers:
Layer | Purpose |
---|---|
Controller | Contains Express controller logic for determining which services to invoke and in what order |
DB | Contains the direct database table model definitions and typical modification queries |
Middleware | Contains middleware functions for handling authentication, authorization and feature toggles |
Routes | Contains the Express route definitions that shape the COMS API and invoke the controllers |
Services | Contains logic for interacting with either S3 or the Database for specific tasks |
Each layer is designed to focus on one specific aspect of business logic; calls between layers are deliberate, scoped, and contained so that it is easy to tell at a glance what each piece of code does and what it depends on. Down the line, we will also introduce a validation layer between the routes and controllers to ensure that incoming network calls are properly formatted before execution proceeds.
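To make the layering concrete, here is a heavily condensed, hypothetical sketch of how a request might pass from a route to a controller to the services; in the actual codebase these layers live in separate directories, and the service implementations talk to the database and S3 rather than being stubbed.

```js
const express = require('express');

// Services layer (stubbed): interacts with the database or S3 for specific tasks.
const objectService = {
  read: async (objectId) => ({ id: objectId, path: `coms/${objectId}` })
};
const storageService = {
  presignUrl: async (path) => `https://s3.example.com/${path}?signature=...`
};

// Controller layer: determines which services to invoke and in what order.
const controller = {
  readObject: async (req, res, next) => {
    try {
      const record = await objectService.read(req.params.objectId);
      const url = await storageService.presignUrl(record.path);
      res.redirect(302, url);
    } catch (err) {
      next(err); // defer to error-handling middleware
    }
  }
};

// Routes layer: defines the API shape and delegates to the controller.
const router = express.Router();
router.get('/object/:objectId', controller.readObject);

module.exports = router;
```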
We introduced network pooling for Patroni connections because we noticed that as our volume of traffic increased, creating and destroying a network connection for each transaction was extremely time consuming and costly. While low volumes of traffic could be served without any apparent delay to the user, we started encountering issues scaling up and improving total transaction flow within COMS.
By reusing connections whenever possible, we avoid the TCP/IP 3-way handshake that has to be done on every new connection and instead leverage already established connections to pipeline traffic and improve general efficiency. While this may not seem significant, in our testing we observed an almost 3x increase in total transaction throughput after switching to pooling.
In order to make sure our application can horizontally scale (run many copies of itself), we had to ensure that all processes in the application are self-contained and atomic. Since we have no guarantee of which pod instance will be handling which task at any specific moment, the only thing we can do is ensure that every unit of work is clearly defined and atomic, preventing situations such as deadlocks or double executions.
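As a hedged illustration of what an atomic unit of work can look like at the database level, the sketch below wraps related writes in a single transaction so that any replica handling the request either commits all of the work or none of it. The table and column names are placeholders.

```js
// Illustrative atomic unit of work: the object row and its permissions are
// written inside one transaction, so a crash or conflict rolls back everything.
async function createObjectWithPermissions(knex, objectData, permissions) {
  return knex.transaction(async (trx) => {
    const [objectRow] = await trx('object').insert(objectData).returning('*');
    await trx('object_permission').insert(
      permissions.map((p) => ({ ...p, objectId: objectRow.id }))
    );
    return objectRow;
  });
}
```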
While implementing horizontal autoscaling is relatively simple using a Horizontal Pod Autoscaler construct in OpenShift, we can only take advantage of it if the application is able to handle the different pod lifecycles. Based on usage metrics such as CPU and memory load, the HPA can increase or decrease the number of replicas on the platform in order to meet demand.
In our testing, we found that we were able to reliably scale up to around 17 pods before we began to crash our Patroni database. While we haven't been able to reliably isolate the cause, we suspect that either the underlying Postgres database can only handle up to 100 concurrent connections (and is thus ignoring Patroni's max connection limit of 500), or that the database containers are simply running out of memory before being able to handle more connections. This is why we decided to cap our HPA at a maximum of 16 pods at this time.
Our current limiting factor for scaling higher is, for one reason or another, the database's ability to support more connections. If we ever need to scale past 16 pods, we will need to consider more managed solutions for pooling database connections, such as PgBouncer.