From 9aa69a342ca8f21fe9e72dba3c0b9e2175fb23f0 Mon Sep 17 00:00:00 2001
From: Fabrizio Sestito
Date: Fri, 25 Oct 2024 08:09:06 +0200
Subject: [PATCH] docs: add architecture and design RFC

Signed-off-by: Fabrizio Sestito
---
 docs/minikube-walkthrough.md                   | 110 -----
 ...rfc/0001_scanner_architecture_and_design.md | 384 ++++++++++++++++++
 2 files changed, 384 insertions(+), 110 deletions(-)
 delete mode 100644 docs/minikube-walkthrough.md
 create mode 100644 docs/rfc/0001_scanner_architecture_and_design.md

diff --git a/docs/minikube-walkthrough.md b/docs/minikube-walkthrough.md
deleted file mode 100644
index 2fb0d39..0000000
--- a/docs/minikube-walkthrough.md
+++ /dev/null
@@ -1,110 +0,0 @@
-# Minikube walkthrough
-
-This document will take you through setting up and trying the sample apiserver on a local minikube from a fresh clone of this repo.
-
-## Pre requisites
-
-- Go 1.7.x or later installed and setup. More information can be found at [go installation](https://go.dev/doc/install)
-- Dockerhub account to push the image to [Dockerhub](https://hub.docker.com/)
-
-## Install Minikube
-
-Minikube is a single node Kubernetes cluster that runs on your local machine. The Minikube docs have installation instructions for your OS.
-- [minikube installation](https://github.com/kubernetes/minikube#installation)
-
-
-## Clone the repository
-
-In order to build the sample apiserver image we will need to build the apiserver binary.
-
-```
-cd $GOPATH/src
-mkdir -p k8s.io
-cd k8s.io
-git clone https://github.com/kubernetes/sample-apiserver.git
-```
-
-## Build the binary
-
-Next we will want to create a new binary to both test we can build the server and to use for the container image.
-
-From the root of this repo, where ```main.go``` is located, run the following command:
-```
-export GOOS=linux; go build .
-```
-if everything went well, you should have a binary called ```sample-apiserver``` present in your current directory.
-
-## Build the container image
-
-Using the binary we just built, we will now create a Docker image and push it to our Dockerhub registry so that we deploy it to our cluster.
-There is a sample ```Dockerfile``` located in ```artifacts/simple-image``` we will use this to build our own image.
-
-Again from the root of this repo run the following commands:
-```
-cp ./sample-apiserver ./artifacts/simple-image/kube-sample-apiserver
-docker build -t /kube-sample-apiserver:latest ./artifacts/simple-image
-docker push /kube-sample-apiserver
-```
-
-## Modify the replication controller
-
-You need to modify the [artifacts/example/deployment.yaml](/artifacts/example/deployment.yaml) file to change the ```imagePullPolicy``` to ```Always``` or ```IfNotPresent```.
-
-You also need to change the image from ```kube-sample-apiserver:latest``` to ```/kube-sample-apiserver:latest```. For example:
-
-```yaml
-...
-  containers:
-  - name: wardle-server
-    image: /kube-sample-apiserver:latest
-    imagePullPolicy: Always
-...
-```
-
-Save this file and we are then ready to deploy and try out the sample apiserver.
-
-## Deploy to Minikube
-
-We will need to create several objects in order to setup the sample apiserver so you will need to ensure you have the ```kubectl``` tool installed. [Install kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/).
-
-```
-# create the namespace to run the apiserver in
-kubectl create ns wardle
-
-# create the service account used to run the server
-kubectl create -f artifacts/example/sa.yaml -n wardle
-
-# create the rolebindings that allow the service account user to delegate authz back to the kubernetes master for incoming requests to the apiserver
-kubectl create -f artifacts/example/auth-delegator.yaml -n kube-system
-kubectl create -f artifacts/example/auth-reader.yaml -n kube-system
-
-# create rbac roles and clusterrolebinding that allow the service account user to use admission webhooks
-kubectl create -f artifacts/example/rbac.yaml
-kubectl create -f artifacts/example/rbac-bind.yaml
-
-# create the service and replication controller
-kubectl create -f artifacts/example/deployment.yaml -n wardle
-kubectl create -f artifacts/example/service.yaml -n wardle
-
-# create the apiservice object that tells kubernetes about your api extension and where in the cluster the server is located
-kubectl create -f artifacts/example/apiservice.yaml
-```
-
-## Test that your setup has worked
-
-You should now be able to create the resource type ```Flunder``` which is the resource type registered by the sample apiserver.
-
-```
-kubectl create -f artifacts/flunders/01-flunder.yaml
-# outputs flunder "my-first-flunder" created
-```
-
-You can then get this resource by running:
-
-```
-kubectl get flunder my-first-flunder
-
-#outputs
-# NAME               KIND
-# my-first-flunder   Flunder.v1alpha1.wardle.example.com
-```
diff --git a/docs/rfc/0001_scanner_architecture_and_design.md b/docs/rfc/0001_scanner_architecture_and_design.md
new file mode 100644
index 0000000..13209ab
--- /dev/null
+++ b/docs/rfc/0001_scanner_architecture_and_design.md
@@ -0,0 +1,384 @@
+| | |
+| :----------- | :------------------------------ |
+| Feature Name | Scanner architecture and design |
+| Start Date   | Oct 24th, 2024                  |
+| Category     | Architecture                    |
+| RFC PR       |                                 |
+| State        | **ACCEPTED**                    |
+
+# Summary
+
+[summary]: #summary
+
+Create an SBOM-centric registry vulnerability scanner that integrates well with Rancher.
+
+# Motivation
+
+[motivation]: #motivation
+
+The purpose of this RFC is to define a vulnerability scanner that scans container images and artifacts in a registry,
+generates a Software Bill of Materials (SBOM), and provides vulnerability reports that include discovered CVEs and other security issues.
+
+Another goal is to create a scanner that integrates seamlessly with Rancher,
+offering an easy way to access scan results through the Rancher UI and to connect with other Rancher components, such as Kubewarden and SUSE Observability.
+
+## Examples / User Stories
+
+[examples]: #examples
+
+- As a user, I want to scan all the images in my registry/repository for vulnerabilities.
+- As a user, I want to see the vulnerabilities found in my images in the Rancher UI.
+- As a user, I want to know which layers of my images are affected by the vulnerabilities.
+
+Examples of user stories enabled by the integration with other Rancher components:
+
+- As a user, I want to deploy/write Kubewarden policies based on the vulnerabilities found in my images.
+- As a user, I want to see the vulnerabilities found in my container images in the SUSE Observability dashboard.
+
+# Detailed design
+
+[design]: #detailed-design
+
+## Processes
+
+We define two main operations:
+
+- the "discovery operation": cataloging all the images in a registry/repository, retrieving the image metadata and layers, and generating an SBOM.
+- the "scan operation": scanning the images for vulnerabilities and generating a vulnerability report that contains the discovered CVEs and other security issues.
+
+Discovery and scan operations can be triggered by the user or by a schedule.
+
+## CRD
+
+The following CRDs will be added to the cluster. Please note that the detailed definition of the CRDs is outside the scope of this RFC.
+
+### Registry
+
+`Registry` represents a registry to be scanned. It contains the registry URL, the name of the secret containing the auth credentials, the repositories to be scanned, and the discovery and scan schedules.
+
+```yaml
+apiVersion: scanner.rancher.io/v1alpha1
+kind: Registry
+metadata:
+  name: registry-example
+  namespace: default
+spec:
+  url: "https://registry-1.docker.io"
+  type: "docker" # registry type, e.g., docker, GCR, ECR, etc.
+  auth:
+    secretName: "registry-secret" # secret name used for authentication
+  discoveryPeriod: "1h" # discover new images every hour
+  scanPeriod: "1d" # scan images every day
+  repositories: # optional, if not specified, scan all repositories
+    - "repo1"
+    - "repo2"
+```
+
+### Image
+
+`Image` represents an image to be scanned. It contains the layers of the image.
+Labels are used to select the image by registry, repository, and tag.
+
+```yaml
+apiVersion: scanner.rancher.io/v1alpha1
+kind: Image
+metadata:
+  name: "uuid"
+  namespace: default
+  labels:
+    "scanner.rancher.io/image": "nginx:v1.19.0" # tag of the image
+    "scanner.rancher.io/digest": "sha256:example" # digest of the image
+    "scanner.rancher.io/registry": "registry-example" # registry name
+    "scanner.rancher.io/registry-namespace": "default" # registry namespace
+    "scanner.rancher.io/repository": "prod" # repository name
+spec:
+  layers:
+    - ...
+    # list of the image layers
+```
+
+### SBOM
+
+`SBOM` represents the Software Bill of Materials of an image.
+
+```yaml
+apiVersion: scanner.rancher.io/v1alpha1
+kind: SBOM
+metadata:
+  name: "uuid" # uuid of the image
+  namespace: default
+  labels:
+    "scanner.rancher.io/image": "nginx:v1.19.0" # tag of the image
+    "scanner.rancher.io/digest": "sha256:example" # digest of the image
+    "scanner.rancher.io/registry": "registry-example" # registry name
+    "scanner.rancher.io/registry-namespace": "default" # registry namespace
+    "scanner.rancher.io/repository": "prod" # repository name
+spec:
+  sbom:
+    # the SBOM content in JSON SPDX format
+```
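+
+Since every resource derived from an image carries the same label set, consumers can look up scan artifacts with a plain label selector. The sketch below is an illustration rather than part of the design: it assumes the `SBOM` CRD is served under the plural resource name `sboms` and uses the dynamic client to list the SBOMs of a repository.
+
+```go
+package scanner
+
+import (
+	"context"
+	"fmt"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
+	"k8s.io/apimachinery/pkg/runtime/schema"
+	"k8s.io/client-go/dynamic"
+)
+
+// sbomGVR identifies the SBOM custom resource; the "sboms" plural is an assumption.
+var sbomGVR = schema.GroupVersionResource{
+	Group:    "scanner.rancher.io",
+	Version:  "v1alpha1",
+	Resource: "sboms",
+}
+
+// sbomsForRepository lists the SBOM resources labeled with the given registry and repository.
+func sbomsForRepository(ctx context.Context, c dynamic.Interface, namespace, registry, repository string) (*unstructured.UnstructuredList, error) {
+	selector := fmt.Sprintf("scanner.rancher.io/registry=%s,scanner.rancher.io/repository=%s", registry, repository)
+	return c.Resource(sbomGVR).Namespace(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
+}
+```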
+
+### VulnerabilityReport
+
+The `VulnerabilityReport` CRD represents the vulnerabilities found in an image.
+The content of the report is in the [SARIF](https://sarifweb.azurewebsites.net/) format, a standard format for the output of static analysis tools
+that is approved as an [OASIS](https://www.oasis-open.org/) standard.
+
+```yaml
+apiVersion: scanner.rancher.io/v1alpha1
+kind: VulnerabilityReport
+metadata:
+  name: "uuid" # uuid of the image
+  namespace: default
+  labels:
+    "scanner.rancher.io/image": "nginx:v1.19.0" # tag of the image
+    "scanner.rancher.io/digest": "sha256:example" # digest of the image
+    "scanner.rancher.io/registry": "registry-example" # registry name
+    "scanner.rancher.io/registry-namespace": "default" # registry namespace
+    "scanner.rancher.io/repository": "prod" # repository name
+spec:
+  report:
+    # vulnerabilities found in the image in SARIF format
+```
+
+### DiscoveryJob
+
+A `DiscoveryJob` represents a discovery operation that can be triggered by the user or by a schedule.
+It tracks the status conditions of the discovery operation.
+
+```yaml
+apiVersion: scanner.rancher.io/v1alpha1
+kind: DiscoveryJob
+metadata:
+  name: discovery-job-example
+  namespace: default
+spec:
+  registry: registry-example # registry name
+status:
+  conditions:
+    - type: "Progressing"
+      status: "True"
+      reason: "Finished"
+      ...
+    - type: "Completed"
+      status: "True"
+      reason: "Succeeded"
+      ...
+```
+
+### ScanJob
+
+A `ScanJob` represents a scan operation that can be triggered by the user or by a schedule.
+It tracks the status conditions of the scan operation.
+
+```yaml
+apiVersion: scanner.rancher.io/v1alpha1
+kind: ScanJob
+metadata:
+  name: scan-job-example
+  namespace: default
+spec:
+  registry: registry-example # registry name
+status:
+  conditions:
+    - type: "Progressing"
+      status: "True"
+      reason: "Finished"
+      ...
+    - type: "Completed"
+      status: "True"
+      reason: "Succeeded"
+      ...
+```
+
+## Components
+
+The scanner is composed of three main components:
+
+- Controller: a Kubernetes controller that reconciles the scanner's CRDs and schedules discovery and scan operations.
+- Worker: one or more workers that are responsible for cataloging the registry, generating the SBOMs, and scanning the images.
+- Storage: an API server extension that stores SBOMs and vulnerability reports in a database, to avoid using etcd as a storage backend.
+
+### Controller
+
+The controller is a Kubernetes controller responsible for reconciling the scanner's CRDs and initiating the discovery of the registry, as well as scanning the images.
+It will manage the scheduling of recurring discovery and scan operations as specified in the `Registry` resource, along with handling user-initiated scan and discovery requests.
+
+The controller will use a worker queue to communicate with the workers.
+We decided to use [NATS](https://nats.io/) as the message broker.
+NATS can be embedded in the controller, making it easier to deploy and manage; as a consequence, the controller should run with at least three replicas to ensure high availability.
+The user will have the option to use an external NATS server if needed.
+This setup enables the creation of atomic jobs, such as generating the SBOM for a single image or scanning an individual image, while allowing workers to scale independently of the controller.
+An alternative to NATS would be Kubernetes Jobs; however, running one Pod per job can be costly and inefficient.
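+
+The snippet below is a minimal sketch of this setup, not a finalized design: it embeds a NATS server in the controller process and publishes a scan job, while workers consume from the same subject through a queue group so that each job is delivered to exactly one worker. The `jobs.scan` subject and the message payload are assumptions made for illustration.
+
+```go
+package main
+
+import (
+	"log"
+	"time"
+
+	natsserver "github.com/nats-io/nats-server/v2/server"
+	"github.com/nats-io/nats.go"
+)
+
+func main() {
+	// Embed the NATS server in the controller process.
+	ns, err := natsserver.NewServer(&natsserver.Options{})
+	if err != nil {
+		log.Fatal(err)
+	}
+	go ns.Start()
+	if !ns.ReadyForConnections(10 * time.Second) {
+		log.Fatal("embedded NATS server did not start in time")
+	}
+
+	nc, err := nats.Connect(ns.ClientURL())
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer nc.Drain()
+
+	// Worker side: a queue subscription delivers each job to a single member
+	// of the "workers" queue group, so adding replicas scales consumption.
+	if _, err := nc.QueueSubscribe("jobs.scan", "workers", func(m *nats.Msg) {
+		log.Printf("scanning image %s", string(m.Data))
+	}); err != nil {
+		log.Fatal(err)
+	}
+
+	// Controller side: enqueue an atomic scan job for a single image.
+	if err := nc.Publish("jobs.scan", []byte("sha256:example")); err != nil {
+		log.Fatal(err)
+	}
+
+	time.Sleep(time.Second) // give the handler a moment before exiting
+}
+```
+
+In the real deployment, the publishing and subscribing sides would live in the controller and worker binaries respectively; they are combined here only to keep the example self-contained.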
+
+### Worker
+
+The worker is responsible for cataloging the registry, generating the SBOMs, and scanning the images.
+Multiple workers can be deployed to scale the discovery and scan operations.
+This allows Kubernetes to automatically scale the worker pool to match demand, using mechanisms such as [Horizontal Pod Autoscaling](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/).
+
+Workers pull jobs from the NATS queue and process them.
+After completing a job, the worker creates or updates the related CRD with the results.
+
+The cataloging process will be implemented using the [go-containerregistry](https://github.com/google/go-containerregistry) library,
+a Go library that provides a high-level API to interact with container registries (a sketch follows below).
+The image metadata and layers will be stored in the `Image` CRD.
+
+The SBOM generation and the scan process will be implemented using the adapter pattern, allowing the worker to be configured to use different scanners.
+For the first implementation, we will use [Trivy](https://github.com/aquasecurity/trivy) as both the SBOM generator and the scanner.
+Trivy can be used as a library directly in the worker, avoiding the need to spawn a new process for each scan.
+As a future improvement, we can implement other adapters to use different scanners, such as [Grype](https://github.com/anchore/grype) and [Clair](https://github.com/quay/clair).
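+
+As an illustration of the cataloging step, the sketch below lists the tags of a repository and resolves the digest of each image with go-containerregistry; error handling is kept minimal and the repository name is a placeholder.
+
+```go
+package main
+
+import (
+	"fmt"
+	"log"
+
+	"github.com/google/go-containerregistry/pkg/authn"
+	"github.com/google/go-containerregistry/pkg/name"
+	"github.com/google/go-containerregistry/pkg/v1/remote"
+)
+
+func main() {
+	repo, err := name.NewRepository("registry-1.docker.io/library/nginx")
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	// List all tags in the repository.
+	tags, err := remote.List(repo, remote.WithAuthFromKeychain(authn.DefaultKeychain))
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	for _, tag := range tags {
+		// Resolve each tag to its manifest digest; this is the value that
+		// would end up in the scanner.rancher.io/digest label.
+		desc, err := remote.Get(repo.Tag(tag), remote.WithAuthFromKeychain(authn.DefaultKeychain))
+		if err != nil {
+			log.Fatal(err)
+		}
+		fmt.Printf("%s -> %s\n", tag, desc.Digest)
+	}
+}
+```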
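+
+The adapter seam itself could be as small as the following interface — a hypothetical sketch, not a committed API — which the Trivy, Grype, and Clair adapters would each implement:
+
+```go
+package worker
+
+import "context"
+
+// ScannerAdapter abstracts the SBOM generator and vulnerability scanner so
+// that implementations (Trivy first, Grype or Clair later) can be swapped
+// without touching the worker's job-handling logic.
+type ScannerAdapter interface {
+	// GenerateSBOM produces an SPDX JSON document for the given image reference.
+	GenerateSBOM(ctx context.Context, imageRef string) ([]byte, error)
+	// Scan analyzes a previously generated SBOM and returns a SARIF report.
+	Scan(ctx context.Context, sbom []byte) ([]byte, error)
+}
+```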
+
+We rely on SBOMs as the primary source of truth for the CVE scanner, as they enable caching of the image inventory,
+eliminating the need to retrieve the image from the registry each time a scan is initiated.
+This approach also allows us to deduplicate images with identical SHA256 hashes but different tags.
+A different scanner, such as a secret scanner, may need to pull the image from the registry;
+however, this analysis is performed only once, when the image is created.
+
+Another optimization is to verify whether the CVE database contains relevant updates for the vulnerabilities associated with the image's dependencies,
+as determined from the SBOM analysis, before proceeding with the scan.
+For instance, when the vulnerability database is updated with new vulnerabilities for Alpine Linux, the scanner will target only the SBOMs of images that are based on Alpine Linux.
+
+### Storage
+
+The storage component extends the Kubernetes API server to store SBOMs and vulnerability reports in a database.
+This is needed to avoid using etcd as a storage backend, as etcd is not designed to store large amounts of data.
+Please refer to the [Kubernetes extension API server](https://kubernetes.io/docs/tasks/extend-kubernetes/setup-extension-api-server/) documentation for more information.
+
+The storage will expose the `SBOM` and `VulnerabilityReport` CRDs to the other components and the user.
+Using CRDs will facilitate integration with other Rancher components.
+For instance, the Rancher UI can transparently retrieve the SBOM and vulnerability reports from storage, since they behave like standard CRDs.
+In Kubewarden, policies that utilize [context-aware](https://docs.kubewarden.io/explanations/context-aware-policies) calls will have the ability to access the vulnerabilities associated with an image.
+
+It will be possible to use different database adapters such as MySQL, PostgreSQL, or SQLite.
+The user will be able to configure the database connection in the storage Helm chart.
+
+The storage will also serve as a deduplication layer, sharing the same underlying data for SBOMs, vulnerability reports, and image data associated with the same image SHA256 hash.
+Users will still see distinct resources: the deduplication is managed internally, so consumers of the data remain unaware of it.
+
+### Helm Chart
+
+A Helm chart will be provided to deploy the scanner components.
+
+## Operational Flow
+
+1. The user creates a `Registry` resource to define the registry designated for scanning.
+2. The controller submits a request for a discovery job to the worker queue.
+3. A worker pulls the discovery job and initiates the discovery process.
+4. The worker generates `Image` resources for each image identified in the registry.
+5. The `Image` reconciler receives the `Image` resources and issues an SBOM generation request to the worker queue.
+6. A worker pulls the SBOM generation job and starts the generation process.
+7. The worker produces the SBOM and stores it in the `SBOM` resource.
+8. The `SBOM` reconciler receives the `SBOM` resources and issues a scan job to the worker queue.
+9. A worker pulls the scan job and begins the scanning process.
+10. The worker generates the vulnerability report and stores it in the `VulnerabilityReport` resource.
+
+The discovery and scan processes are scheduled by the controller based on the `discoveryPeriod` and `scanPeriod` fields specified in the `Registry` resource.
+Moreover, users have the option to manually trigger discovery and scan operations by creating a `DiscoveryJob` or a `ScanJob` resource.
+
+### Multi-tenancy
+
+All the CRDs will be namespaced to enable multi-tenancy support.
+The controller will manage multiple registries across various namespaces.
+The scan results will be stored within the same namespace as the associated registry.
+
+# Architectural Diagram
+
+```mermaid
+graph LR
+    Workers -->|Pulls Jobs| NATS
+    Workers -->|Creates| Image
+    Workers -->|Scans| SBOM
+    Workers -->|Generates| VulnerabilityReport
+    Workers -->|Discovers| Registry
+
+    subgraph Workers
+        subgraph W[" "]
+            Worker1["Worker 1"]
+            Worker2["Worker 2"]
+            ...
+            WorkerN["Worker N"]
+        end
+        W --> RegistryDiscovery["RegistryDiscovery"]@{shape: flag}
+        W --> ScannerAdapter["Scanner Adapter"]@{shape: flag}
+    end
+
+    subgraph Storage
+        subgraph StorageCRD["CRD"]
+            Image@{ shape: doc }
+            SBOM@{ shape: doc }
+            VulnerabilityReport@{ shape: doc }
+        end
+
+        DB[(Database)]
+        Storage1["API Server Extension"]
+        Storage1 --> DB
+
+        Image --> Storage1
+        SBOM --> Storage1
+        VulnerabilityReport --> Storage1
+    end
+
+    subgraph Controller
+        subgraph CRD
+            Registry@{ shape: doc }
+            ScanJob@{ shape: doc }
+            DiscoveryJob@{ shape: doc }
+        end
+
+        NATS["NATS Server"]@{ shape: das }
+        Controller1["Controller Replica 1 (Leader)"]
+        Controller2["Controller Replica 2"]
+        Controller3["Controller Replica 3"]
+        Controller1 --> NATS
+        Controller2 --> NATS
+        Controller3 --> NATS
+    end
+
+    Controller1 <--> |Reconciles| CRD
+    Controller1 <--> |Reconciles| StorageCRD
+```
+
+# Drawbacks
+
+[drawbacks]: #drawbacks
+
+# Alternatives
+
+[alternatives]: #alternatives
+
+Currently, there is no registry vulnerability scanner tailored to the Rancher ecosystem.
+A few options exist, but they focus on the whole cluster, including node vulnerabilities and compliance, and they don't provide
+registry discovery capabilities.
+Also, most of the existing scanners double as policy enforcers, overlapping with [Kubewarden](https://kubewarden.io/).
+
+A few notable alternatives are:
+
+- [Trivy Operator](https://github.com/aquasecurity/trivy-operator)
+- [Kubescape](https://github.com/kubescape/kubescape)
+- [Harbor](https://github.com/goharbor/harbor)
+
+# Unresolved questions
+
+[unresolved]: #unresolved-questions