Releases: litmuschaos/litmus
2.0.0-Beta5
Minor SA fix in eventtracker (namespace) (#2760)
2.0.0-Beta4
Major Updates
- Fixes the inability to successfully register agents/targets when the litmus portal server is brought up with a LoadBalancer/NodePort service type
- Makes the MyHub source configurable by branch so that the latest stable versions of experiments are pulled for custom & predefined workflows
- Updates the chaos operator dependencies on the subscriber to make use of the latest api changes for chaos resources
- Updates the chaos operator, runner & exporter image tunables/ENVs in the subscriber so that the latest stable versions are installed on the targets
- Updates Okteto dev setup instructions to reflect latest image versions and changes in specification (env) as well as instructions
- Updates the chaosengine CRD validation schema for annotation injection in the manifests maintained & installed by the subscriber
Minor Updates
- Improves the icons for revert chaos and workflow scheduling
- Optimizes the teaming code to remove redundant conditions
- Adopts improved styling & background from litmus-ui
2.0.0-Beta3
Litmus 2.0.0-Beta3
Major Updates
- Support for policy-based control of the event tracker: users can define their own policy using a JMESPath query, based on which the event tracker reacts to application changes (see the sketch after this list).
- Enhanced UI for workflow scheduling, giving users the ability to tune annotations, target application details (application namespace, labels, and kind), and probe data from the user interface.
- New UI for workflow visualization that presents workflow and node information more clearly.
- Simplified the onboarding process for users through the new UI.
- Enhanced the homepage to show information like recent workflow runs, agent details, and project details.
- Shifted project switching from a Redux-based technique to a URL-based technique to avoid caching problems.
- Migrated CircleCI to GitHub workflow and enhanced the continuous integration of the project.
- Enhanced the analytics module in terms of UI and computation
- Enhanced the browse workflows table to show the resilience score and the total number of experiments passed for the listed workflows.
- Support role-based access control in the backend for handling authorization for all requests.
- Support for storing scheduled workflow templates and added some new podtato-head predefined workflow templates.
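To convey the shape of such a policy, here is a rough EventTrackerPolicy sketch. The API group, field names (condition_type, conditions), and operator spellings are assumptions that should be checked against the event-tracker docs; the keys themselves are standard JMESPath expressions evaluated against the watched workload:
apiVersion: eventtracker.litmuschaos.io/v1
kind: EventTrackerPolicy
metadata:
  name: image-change-policy
spec:
  # Assumed schema: combine the conditions with a logical AND
  condition_type: and
  conditions:
    # JMESPath key pointing at the first container image of the watched Deployment
    - key: spec.template.spec.containers[0].image
      operator: NotEqualTo
      value: "nginx:1.19"
    # JMESPath key pointing at the replica count
    - key: spec.replicas
      operator: GreaterThan
      value: "1"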
Minor Updates
- Increment in the Better Code Hub(BCH) score
- Optimized the frontend by shifting the resiliency score calculation to the backend.
- Restructured the directory structure for settings in the frontend to modularise the code.
- Support for a reinstall of litmus agents by making the litmus-portal-config configmap independent of the subscriber.
- Support for Ingress and LoadBalancer network types for connecting external agents with the Litmus Portal. Based on the server service type, it generates the endpoint for the external agent.
2.0.0-Beta2
Added beta2 fixes for auth and teaming (#2612)
2.0.0-Beta1
Major Updates
- Support for in-built analytics, where users can connect their data sources and generate dashboard panels.
- Support for Git as a single source of truth for workflow artifacts. This enables users to have their workflows synced between the portal and Git source.
- Introduces the event-tracker microservice to trigger chaos workflows automatically upon change to application images. This feature works in tandem with GitOps frameworks that rollout changes to applications upon manual changes in the Git source or upon image push to registries.
- Support for re-running an existing chaos workflow from the litmus portal.
- Adding a command-line tool called litmusctl to manage litmus portal services. The key role of litmusctl is to connect an external cluster with the litmus server and install the external agents.
- Redesigning the teaming user interface and adding some significant features such as leave project and decline invitation.
- Recreating litmus docs for litmus 2.0.x. For more information, visit https://litmusdocs-beta.netlify.app/
- Integration of Litmus-UI with litmus portal components
- Major directory restructuring of litmus portal’s server for database handlers
Minor updates
- Changing MongoDB kind from deployment to statefulsets
- Adding chaos-exporter as default external cluster agents for litmusportal
- Refactoring authentication server to accommodate new teaming integration
- Removing some unnecessary inputs from the welcome modal and predefined chaos workflow
2.0.0-Beta0
Fixed default error state for password fields and fixed modal padding…
1.13.8
New Features & Enhancements
- Introduces upgraded pod-cpu-hog & pod-memory-hog experiments that inject stress-ng based chaos stressors into the target container's pid namespace (non-exec model).
- Supports multi-arch images for the chaos-scheduler controller.
- Supports CIDR ranges, apart from destination IPs/hostnames, in the network chaos experiments (see the sketch after this list).
- Refactors the litmus-python repository structure to match the litmus-go & litmus-ansible repos. Introduces a sample python-based pod-delete experiment with the same flow/constructs as its go equivalent to help establish a common flow for future additions. Also adds a BYOC folder/category to hold non-litmus-native experiment patterns.
- Refactors the litmus-ansible repo to remove the stale experiments (which have been migrated and improved in litmus-go). Retains (and improves) samples to help establish a common flow for future additions.
- Adds GCP chaos experiments (GCP VM stop, GPD detach) in technical-preview mode.
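A minimal sketch of supplying a CIDR to a network chaos experiment, assuming a pod-network-loss ChaosEngine against a hypothetical nginx deployment (the experiment name, app details, and values are illustrative; DESTINATION_IPS/DESTINATION_HOSTS are the tunables of interest):
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-network-chaos
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=nginx
    appkind: deployment
  chaosServiceAccount: pod-network-loss-sa
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            # A CIDR block and/or specific IPs, comma-separated
            - name: DESTINATION_IPS
              value: "10.0.0.0/24,192.168.5.10"
            # Hostnames may be supplied as well
            - name: DESTINATION_HOSTS
              value: "example.com"
            - name: TOTAL_CHAOS_DURATION
              value: "60"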
Major Bug Fixes
- Fixes erroneous logs in the chaos-operator seen while attempting to remove the finalizer on the chaosengine.
- Fixes a condition where the chaos revert information is present in both the annotations and the status of the chaosresult CR (the inject/revert status is typically maintained/updated as an annotation on the chaosresult before it is updated into the status and cleared/removed from the annotations).
- Removes the hardcoded experiment job entrypoint; it is now picked from the ChaosExperiment CR's .spec.definition.command (see the sketch after this list).
- Fixes a scheduler bug that interprets a minChaosInterval mentioned in hours (ex: 1h) as minutes.
- Improves the scheduler reconcile to stop flooding logs on every reconcile irrespective of the minChaosInterval.
- Enables the scheduler to start the chaos injection immediately upon application of the ChaosSchedule CR without waiting for the first installment of the minChaosInterval period, in repeat mode with only the minChaosInterval specified.
- Handles edge/boundary conditions where the chaos StartTime is behind the CreationTimeStamp of the ChaosSchedule OR the next iteration of chaos as per minChaosInterval is beyond the EndTime.
- Adds a check to ignore chaos pods (operator, runner, experiment/helper/probe pods) and blacklist them from being chaos candidates (especially needed when appinfo.applabel is configured with exclusion patterns such as !keys OR <key> notin <value>).
- Removes hostIPC and hostNetwork permissions for the pod stress chaos experiments.
- Fixes an incorrect env key for TOTAL_CHAOS_DURATION in the pod-dns experiments.
- Fixes a regression introduced in 1.13.6 wherein the experiment expected the parent workloads (deployment, statefulset et al.) to carry the labels specified in appinfo.applabel, apart from just the pods, even when .spec.annotationCheck was set to false in the ChaosEngine. Prior to this, the parent workloads needed to have the label only when .spec.annotationCheck was set to true. This has been re-corrected as per earlier expectations.
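To show where the experiment entrypoint is now read from, here is a minimal ChaosExperiment sketch (the experiment name, image tag, and args are illustrative; the point is the .spec.definition.command/.spec.definition.args placement):
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    image: "litmuschaos/go-runner:1.13.8"
    imagePullPolicy: Always
    # The experiment job entrypoint is taken from command/args below instead of being hardcoded
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name pod-delete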
Limitations
- Chaos abort (via .spec.engineState set to stop OR via chaosengine deletion) is known to have an issue with the namespace-scoped chaos-operator in 1.13.8, i.e., an operator running with the WATCH_NAMESPACE env set to a specific value and using Role permissions. In such cases, the finalizer on the ChaosEngine needs to be removed manually and the resource deleted to ensure the operator continues to function properly (see the sketch below). This is not necessary for cluster-scoped operators (the default mode of usage, where the WATCH_NAMESPACE env is set to an empty string to cover all namespaces and ClusterRole permissions are used). The fix for correcting the behavior of namespace-scoped operators will be added in the next patch.
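A minimal sketch of the manual workaround, assuming a ChaosEngine named engine-nginx in the litmus namespace (names are illustrative):
kubectl patch chaosengine engine-nginx -n litmus --type merge -p '{"metadata":{"finalizers":null}}'
kubectl delete chaosengine engine-nginx -n litmus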
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.8.yaml
Verify your installation
-
Verify if the chaos operator is running
kubectl get pods -n litmus
-
Verify if chaos CRDs are installed
kubectl get crds | grep chaos
For more details refer to the documentation at Docs
1.13.6
New Features & Enhancements
- Supports automated rollback/abort of chaos depending upon predefined conditions (defined in the probes). The probes can now be configured with a StopOnFailure property set to true or false to control the execution flow of the experiment (see the sketch after this list).
- Enhances the ChaosResult status schema to provide details of (a) the target resource impacted and (b) the success of the chaos revert operation.
- Introduces additional labels for the "interleaved" chaos metrics (litmus_awaited_experiments & litmus_experiment_verdict) to indicate the workflow name & chaos injection timestamp. This is expected to help in the construction of more meaningful dashboards to track app behavior under chaos.
- Adds the golang chaoslib and experiment logic for docker-service-kill (ported from ansible).
- Introduces the tech-preview of a new category (aws-ssm) of chaos experiments that can inject common resource and network chaos in EC2 instances (either part of a Kubernetes cluster or standalone/vanilla instances).
- Introduces the tech-preview of refactored pod-cpu-hog & pod-memory-hog chaos experiments that can inject resource chaos on target apps externally (non-exec mode) via cgroup operations.
- Improves/dockerizes the build process for most components (removes vendored packages stored in the repo and migrates to GitHub workflows).
- Reduces the size of the experiment (go-runner) image by creating a single chaos helper component that takes specific chaos operations as flags.
- Extends the StatusCheckTimeout property to the helper pods (earlier releases had this only for pre/post-chaos checks), thereby helping the flexible evaluation of application availability/readiness during the chaos.
- Adds a new event for "Abort" on the ChaosResult.
- Increases coverage in the commit-based e2e runs on the litmus-go repo with the addition of node chaos tests.
- Adds a new helm chart for kube-aws (chaos experiment bundle) in the litmus-helm repository.
- Enhances the litmus-sdk to (a) create a highly generic experiment scaffolding that can trigger and kill chaos via shell commands passed as environment variables (a change from the earlier pod-delete sample) and (b) push all non-code files (CR yamls) into a dedicated directory that can be directly copied/committed to the chaos-charts repo.
- Cuts the first tagged release on the test-tools repository and sets up downloadable artifacts for the dependent chaos utils (nsutil, pauseutil, promql, dns-interceptor).
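A minimal sketch of the probe tunable, as it would appear under .spec.experiments[].spec.probe in a ChaosEngine. The probe type, URL, and run properties are illustrative, and the exact casing of the stopOnFailure key should be checked against the probe docs for the installed version:
probe:
  - name: check-frontend-availability
    type: httpProbe
    httpProbe/inputs:
      url: http://frontend-svc.default.svc.cluster.local:80
      method:
        get:
          criteria: ==
          responseCode: "200"
    mode: Continuous
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
      # Abort the experiment and trigger chaos revert as soon as this probe fails
      stopOnFailure: true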
Major Bug Fixes
- Adds missing environment variables for the kill sequence and pod affected percentage in the kafka-broker-pod-failure experiment.
- Fixes the missing environment variable for defining the spoof map within the dns-spoof experiment.
- Fixes the ChaosScheduler to work with the latest versions of the chaos-operator and updates documentation with missing mandatory properties in the .spec.engineTemplate (see the sketch after this list).
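A rough ChaosSchedule sketch to illustrate the scheduler properties referenced above (and the minChaosInterval fixes in 1.13.8). The target details are illustrative, and the exact schema, including whether the engine template key is engineTemplate or engineTemplateSpec and the shape of the repeat properties, should be verified against the ChaosScheduler docs for the installed version:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      properties:
        # Minimum gap between two chaos iterations; minute and hour units (e.g. 10m, 1h) are supported
        minChaosInterval: "10m"
  engineTemplateSpec:
    engineState: active
    appinfo:
      appns: default
      applabel: app=nginx
      appkind: deployment
    chaosServiceAccount: pod-delete-sa
    experiments:
      - name: pod-delete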
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.6.yaml
Verify your installation
-
Verify if the chaos operator is running
kubectl get pods -n litmus
-
Verify if chaos CRDs are installed
kubectl get crds | grep chaos
For more details refer to the documentation at Docs
1.13.5
New Features & Enhancements
- Introduces a category for VMware chaos with a VM power-off experiment (supported for vCenter 6.x).
- Adds chaos experiments for simulating DNS errors (inability to resolve hosts) and redirection to incorrect/faulty services (using a spoof map that can redirect specific requests).
- Makes the chaos annotationCheck against applications "false" by default, making it simpler for users to get started with chaos without any instrumentation step for the application targets (see the sketch after this list).
- Updates the CRD version to v1; the minimum supported Kubernetes version moves to 1.15.
- Enhances the disk fill experiment with a tunable to specify the write block size for quicker capacity use and fs-aligned writes.
- Supports label-based selection of node targets for (node-level) chaos injection.
- Adds chaos abort routines for the AWS chaos experiments.
- Adds the ability to target EBS volumes by tag, with sequential and parallel injection of chaos, and with support for both simple as well as EKS persistent volumes.
- Places non-litmus core images (dependencies such as argo and MongoDB for portal-driven chaos) into the litmuschaos image registry, while maintaining image names and release tags, to simplify the user experience for those who need to set up local mirrors or are in air-gapped environments.
- Adds support for OpenShift Route in the litmus helm charts.
- Refactors and optimizes chaos libraries for code reuse and a simplified flow. Updates the litmus-sdk to generate refactored experiment templates.
- Adds a GitHub Actions based workflow/pipeline for node-level chaos experiments in the e2e suite.
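A minimal sketch of the relevant ChaosEngine fields, assuming a hypothetical nginx target and a pod-delete experiment (app details and names are illustrative). With annotationCheck defaulting to "false", targets no longer need to be annotated before chaos can run:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  # "false" is now the default: no chaos annotation is required on the target workload
  annotationCheck: "false"
  engineState: active
  appinfo:
    appns: default
    applabel: app=nginx
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete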
Major Bug Fixes
- Fixes the inability to define certain attributes within the ChaosEngine, for which the OpenAPI validation was missing (due to the migration of the CRD version to v1), using the "preserve-unknown-fields" option. Also adds validations for a number of properties/attributes.
- Fixes a panic encountered in the chaos-runner upon the inability to access the ChaosEngine resource.
- Fixes the node restart experiment to perform the right verification checks on the helper pods executing the chaos.
- Fixes behavior where helper pods that complete quickly (run for short durations) were treated as failed, by verifying for the "succeeded" state.
- Removes ambiguity in filtering/accessing helper pods by assigning a standard label format.
- Fixes an erroneous decision in the pod-cpu & pod-memory hog experiments which considered a non-zero response (137) upon chaos process kill (SIGKILL) as a failure to revert/rollback.
- Adds a check to verify the status of the application's target containers before attempting an exec operation to perform the desired chaos action.
- Fixes the ec2-terminate-by-tag experiment to consider only the running instances for stop/termination.
- Adds the missing PORTAL_ENDPOINT environment variable to facilitate the namespaced mode of execution of the litmus-portal.
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.5.yaml
Verify your installation
-
Verify if the chaos operator is running
kubectl get pods -n litmus
-
Verify if chaos CRDs are installed
kubectl get crds | grep chaos
For more details refer to the documentation at Docs
1.13.3
New Features & Enhancements
- For updates on the 2.0.0-Beta releases, refer to the notes for Litmus 2.0.0-Beta3.
- Enhances the EC2 termination experiments to filter targets by tags (apart from IDs), along with support for list and percentage-based selection of instances and serial and parallel failure modes.
- Supports collection of chaos metrics for all ChaosEngine resources by default instead of selective monitoring controlled via spec attributes.
- Supports the definition of 'context' (metadata) for an experiment via a Kubernetes label on the ChaosEngine that translates to a metric label value on Prometheus. This can be used to group experiment results via context/reason or derive useful insights from metrics.
- Introduces a new chaos metric litmuschaos_experiment_verdict that provides an instance-specific run result (instead of cumulative result stats) and can be used alongside litmuschaos_awaited_experiments to obtain improved chaos-interleaved dashboards.
- Adds documentation around the supported chaos metrics and their utility.
- Allows users to specify the terminationGracePeriodSeconds for the chaos experiment and helper pods to allow abort routines to go through (useful in clusters with high API traffic or under group chaos execution on multiple apps at once). See the sketch after this list.
- Provides new environment variables (translating to stress-ng flags) for the node resource chaos experiments to ensure granular definition of the load/stress profile.
- Adds abort routines for the infra/node and autoscaler experiments and optimizes the same for the pod experiments in which they are already defined.
- Introduces a randomness factor in the pod-delete experiment to ensure that the delete operations occur at random intervals (the random periods being picked within a time range defined by lower/upper bounds).
- Enhances the pumba chaoslib for stress experiments by providing an additional ENV var for defining the stress image (pulled at runtime on the target pod's node to inject the stressor). This is useful for folks running experiments with images from their private registries.
- Introduces a tech-preview of a DNS-chaos experiment (available in the litmuschaos/go-runner:ci image) that can cause DNS errors/failures in target containers.
- Updates the Chaos GitHub Actions used in the PR/commit-based e2e suite on the litmus-go repository.
- Improves the e2e dashboard to represent the experiment e2e coverage more clearly.
- Begins the migration of specific e2e pipelines from GitLab to GitHub Actions to aid the definition of multiple component/feature-based workflows from within a single branch.
- Adds a new utility (nsutil) to execute commands in the target container's namespace, with potential usage in multiple pod-level chaos experiments.
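A ChaosEngine spec fragment showing the new tunable (the value and experiment name are illustrative); the grace period gives the abort/revert routines time to complete before the chaos pods are force-terminated:
spec:
  # Grace period (in seconds) applied to the experiment & helper pods so abort routines can finish
  terminationGracePeriodSeconds: 100
  experiments:
    - name: pod-network-loss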
Major Bug Fixes
- Fixes repeated scheduling of experiment pods upon helper failure/ungraceful exits (error state); the pods now enter the completed state upon the first error.
- Appends the missing CRD validation schema for the image pull policy for experiments.
- Upgrades all litmus artifacts containing CRD specs to use version v1 instead of v1beta1 to support newer Kubernetes platforms.
- Adds checks to validate the definition of app labels when annotation checks are set to false on the ChaosEngine (and fail fast with an appropriate error).
- Fixes the behavior where multiple "downstream" probes defined in the same phase (pre/post/on chaos) fail if the first probe evaluates to failure.
- Fixes an issue seen when running chaos on multiple application replicas/targets at once, where chaos injection against the last replica/target alone was considered for the success of the experiment.
- Adds retries to factor in the pending status of helper pods in populated/dense clusters where it takes time for the pods to be scheduled.
- Adds logs to the Kafka liveness/load pod launched during the Kafka broker failure experiments to verify service discovery & topic creation success/failure.
Major Known Issues & Limitations
Issue:
The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don't want to mount the runtime's socket files on their pods) using the default lib can fail, in spite of chaos being injected successfully, due to the unavailability of certain default utilities in the target's image that are used for detecting the chaos processes and killing them/reverting chaos at the end of the chaos duration.
Workaround:
Users can identify the commands needed to find and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND (see the sketch below).
Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.
Fix:
This is being actively worked on (a native litmus chaoslib that can inject stress processes without the exec requirement, for docker/containerd/crio) and should be available in a subsequent release.
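A rough sketch of passing the kill command for pod-cpu-hog, as a fragment of the ChaosEngine experiment spec. The command shown assumes the default md5sum-based CPU stressor and the presence of ps/grep/awk in the target image; adjust it to whatever stress process and utilities apply in your environment:
experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
          # Illustrative only: lists the md5sum stressor PIDs and kills them
          - name: CHAOS_KILL_COMMAND
            value: "kill -9 $(ps afx | grep md5sum | grep -v grep | awk '{print $1}')"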
Installation
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.3.yaml
Verify your installation
-
Verify if the chaos operator is running
kubectl get pods -n litmus
-
Verify if chaos CRDs are installed
kubectl get crds | grep chaos
For more details refer to the documentation at Docs