From 7c31cb1826558be5cb4dd833e8ff1b2bacc87ccf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?R=C3=A9my=20Coutable?= Date: Wed, 10 Apr 2024 07:26:48 +0000 Subject: [PATCH] Document flaky tests management process --- .../engineering-productivity/_index.md | 239 ++---------------- .../flaky-tests-management-and-processes.md | 74 ++++++ .../engineering-productivity/flaky-tests.md | 55 ---- .../project-management.md | 155 ++++++++++-- .../test-intelligence.md | 60 +++++ 5 files changed, 290 insertions(+), 293 deletions(-) create mode 100644 content/handbook/engineering/infrastructure/engineering-productivity/flaky-tests-management-and-processes.md delete mode 100644 content/handbook/engineering/infrastructure/engineering-productivity/flaky-tests.md create mode 100644 content/handbook/engineering/infrastructure/engineering-productivity/test-intelligence.md diff --git a/content/handbook/engineering/infrastructure/engineering-productivity/_index.md b/content/handbook/engineering/infrastructure/engineering-productivity/_index.md index bfd212fa75..d5d0d33921 100644 --- a/content/handbook/engineering/infrastructure/engineering-productivity/_index.md +++ b/content/handbook/engineering/infrastructure/engineering-productivity/_index.md @@ -3,20 +3,6 @@ title: "Engineering Productivity team" description: "The Engineering Productivity team increases productivity of GitLab team members and contributors by shortening feedback loops and improving workflow efficiency for GitLab projects." --- -## Child Pages - -[Issue triage](/handbook/engineering/infrastructure/engineering-productivity/issue-triage/) -{.h4} - -[Wider Community Merge Request triage](/handbook/engineering/infrastructure/engineering-productivity/merge-request-triage/) -{.h4} - -[Project Management](/handbook/engineering/infrastructure/engineering-productivity/project-management/) -{.h4} - -[Triage Operations](/handbook/engineering/infrastructure/engineering-productivity/triage-operations/) -{.h4} - ## Mission - Constantly improve efficiency for our entire engineering team, to ultimately increase value for our customer. @@ -38,7 +24,6 @@ description: "The Engineering Productivity team increases productivity of GitLab | {{< member-by-name "Peter Leitzen" >}} | Staff Backend Engineer, Engineering Productivity | | {{< member-by-name "Rémy Coutable" >}} | Principal Engineer, Infrastructure | - ### Stable Counterpart | Person | Role | @@ -109,8 +94,9 @@ graph LR * **Do it for wider community**: Increase efficiency for wider GitLab Community contributions. * **Dogfood build**: Enhance and add new features to the GitLab product to improve engineer productivity. +## Metrics -## KPIs +### KPIs [Infrastructure Performance Indicators](/handbook/engineering/infrastructure/performance-indicators/) are our single source of truth - [Master Pipeline Stability](/handbook/engineering/infrastructure/performance-indicators/#master-pipeline-stability) @@ -128,13 +114,33 @@ graph LR - [Quality Department Promotion Rate](/handbook/engineering/infrastructure/performance-indicators/#quality-department-promotion-rate) - [Quality Department Discretionary Bonus Rate](/handbook/engineering/infrastructure/performance-indicators/#quality-department-discretionary-bonus-rate) +### Dashboards + +The Engineering Productivity team creates metrics in the following sources to aid in operational reporting. 
+ +- [Engineering Productivity Collection](https://10az.online.tableau.com/#/site/gitlab/collections/fc447e0e-d368-4bc2-a8c6-ac782318ab96) +- [Broken Master Pipeline Root Cause Analysis](https://10az.online.tableau.com/#/site/gitlab/workbooks/2296993/views) +- [Time to First Failure](https://10az.online.tableau.com/#/site/gitlab/workbooks/2300061/views) +- [Flaky test issues](https://10az.online.tableau.com/#/site/gitlab/workbooks/2283052/views) +- [Test Intelligence Accuracy](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTTestIntelligenceAccuracy/TestIntelligenceAccuracy) +- [Engineering Productivity Pipeline Durations](https://10az.online.tableau.com/#/site/gitlab/workbooks/2312755/views) +- [Engineering Productivity Jobs Durations](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTEP-JobsDurations/EP-JobsDurations) +- Engineering Productivity Package And QA Durations (to be replaced in Tableau) +- GDK - Jobs Durations ([to be replaced in Tableau](https://gitlab.com/gitlab-data/tableau/-/issues/253#note_1730258820)) +- [Issue Types Detail](https://10az.online.tableau.com/#/site/gitlab/workbooks/2203014/views) +- [GitLab-Org Native Insights](https://gitlab.com/groups/gitlab-org/-/insights) +- [Review Apps monitoring dashboard](https://app.google.stackdriver.com/dashboards/6798952013815386466?project=gitlab-review-apps) +- Triage Reactive monitoring dashboards + - [Overview dashboard](https://console.cloud.google.com/monitoring/dashboards/builder/e3e9d8fc-54cd-4a98-b4a3-e81f01d37e26?project=gitlab-qa-resources&dashboardBuilderState=%257B%2522editModeEnabled%2522:false%257D&timeDomain=1w) + - [Processors dashboard](https://console.cloud.google.com/monitoring/dashboards/builder/3338d66b-649c-4ea9-aec9-14ffba96c25f?project=gitlab-qa-resources&dashboardBuilderState=%257B%2522editModeEnabled%2522:false%257D&timeDomain=1w) + ## OKRs Objectives and Key Results (OKRs) help align our sub-department towards what really matters. These happen quarterly and are based on company OKRs. We follow the OKR process defined [here](/handbook/company/okrs/#okr-process-at-gitlab). Here is an [overview](https://gitlab.com/gitlab-com/gitlab-OKRs/-/issues/?sort=created_date&state=opened&type%5B%5D=objective&label_name%5B%5D=Engineering%20Productivity&first_page_size=100) of our current OKRs. -### Communication +## Communication | Description | Link | | --- | --- | @@ -154,82 +160,6 @@ Engineering Productivity has [weekly team meeting ](https://docs.google.com/docu - Part 1 is Tuesdays 11:00 UTC, 04:00 PST - Part 2 is Tuesdays 22:00 UTC, 15:00 PST -### Work prioritization - -The Engineering Productivity team has diverse responsibilities and reactive work. Work is categorized as planned and reactive. - -### Guiding principles - -- We focus on OKRs, corrective actions and preventative work. -- We adhere to the general release milestones like %x.y. -- We are ambitious with our targeted planned work per milestone. These targets are not reflective of a commitment. Reactive work load will ebb and flow and we do not expected to accomplish everything planned for the current milestone. -- [Priority labels](/handbook/engineering/infrastructure/engineering-productivity/issue-triage/#priority) are used to indicate relative priority for a milestone. - -### Weighting - -We follow the [department weighting guidelines](/handbook/engineering/infrastructure/test-platform/#weights) to relatively weight issues over time to understand a milestone velocity and increase predictability. 
- -When weighting, think about knowns and complexity related to recently completed work. The goal with weighting is to allow for some estimation ambiguity that allows for a consistent predictable flow of work each milestone. - -### Prioritization activities - -| When | Activity | DRI | -| --- | --- | --- | -| Weekly | Assign `~priority::1`, `~priority::2` issues to a milestone | Engineering Productivity Engineering Manager | -| Weekly | Weight issues identified with `~"needs weight"` | Engineering Productivity Backend Engineer | -| Weekly | Prioritize all `~"Engineering Productivity"` issues | Engineering Productivity Engineering Manager | -| 2 weeks prior to milestone start | Milestone planned work is identified and scheduled | Engineering Productivity Engineering Manager | -| 2 weeks prior to milestone start | Provide feedback on planned work | Engineering Productivity team | -| 1 week prior to milestone start | Transition any work that is not in progress for current milestone to upcoming milestone | Engineering Productivity Engineering Manager | -| 1 week prior to milestone start | Adjust planned work for upcoming milestone | Engineering Productivity Engineering Manager | -| 1 week prior to milestone start | Final adjustments to planned scope | Engineering Productivity team | -| During milestone | Adjust priorities and scope based on newly identified issues and reactive workload | Engineering Productivity Engineering Manager | - -### Projects - -The Engineering Productivity team recently reviewed (2023-05-19) all our projects and discussed relative priority. Aligning this with our business goals and priorities is very important. The list below is ordered based on aligned priorities and includes primary domain experts for communication as well as a documentation reference for self-service. 
- -| Project | Domain Knowledge | Documentation | -| ------- | ------------------------------------------ | ----- | -| GitLab CI Pipeline configuration optimization and stability | Jen-Shin, David, Nao | [Pipelines for the GitLab project](https://docs.gitlab.com/ee/development/pipelines/index.html) | -| Triaging master-broken | Jenn, Nao | [Broken Master](https://about.gitlab.com/handbook/engineering/workflow/#broken-master) | -| GitLab Development Kit (GDK) continued development | Nao, Peter | [GitLab Development Kit](https://gitlab.com/gitlab-org/gitlab-development-kit/) | -| Triage operations for issues, merge requests, community contributions | Jenn, Alina | [triage-ops](https://gitlab.com/gitlab-org/quality/triage-ops/) | -| Review Apps | David, Rémy | [Using review apps in the development of GitLab](https://docs.gitlab.com/ee/development/testing_guide/review_apps.html) | -| Triage engine, used by GitLab triage operations | Jen-Shin, Rémy | [GitLab Triage](https://gitlab.com/gitlab-org/ruby/gems/gitlab-triage/) | -| Danger & Dangerfiles (includes Reviewer roulette) for shared Danger rules and plugins | Rémy, Jen-Shin, Peter | [`gitLab-dangerfiles` Ruby gem](https://gitlab.com/gitlab-org/ruby/gems/gitlab-dangerfiles) for shared [Danger](https://docs.gitlab.com/ee/development/dangerbot.html#danger-bot) rules and plugins | -| JiHu | Jen-Shin | [JiHu Support](https://about.gitlab.com/handbook/ceo/office-of-the-ceo/jihu-support/) | -| Development department metrics for measurements of Quality and Productivity | Jenn, Rémy | [Development Department Performance Indicators](https://about.gitlab.com/handbook/engineering/development/performance-indicators/) | -| RSpec Profiling Statistics for profiling information on RSpec tests in CI | Peter | [rspec_profiling_stats](https://gitlab.com/gitlab-org/rspec_profiling_stats) | -| RuboCop & shared RuboCop cops | Peter | [`gitLab-styles` Ruby gem](https://gitlab.com/gitlab-org/ruby/gems/gitlab-styles) for shared [RuboCop cops](https://docs.gitlab.com/ee/development/contributing/style_guides.html#ruby-rails-rspec) | -| Feature flag alert for reporting on GitLab feature flags | Rémy | [Gitlab feature flag alert](https://gitlab.com/gitlab-org/gitlab-feature-flag-alert) | -| Chatops (especially for feature flags toggling) | Rémy | [Chatops scripts for managing GitLab.com from Slack](https://gitlab.com/gitlab-com/chatops) | -| CI/CD variables, Triage ops, and Internal workspaces infrastructure | David, Rémy | [Engineering Productivity infrastructure](https://gitlab.com/gitlab-org/quality/engineering-productivity-infrastructure) | -| Tokens management | Rémy | ["Rotating credentials" runbook](https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/blob/main/runbooks/rotating-credentials.md) | -| Gems management | Rémy | [Rubygems committee project](https://gitlab.com/gitlab-dependency-committees/rubygems-committee) | -| Shared CI/CD config & components | David, Rémy | [`gitlab-org/quality/pipeline-common`](https://gitlab.com/gitlab-org/quality/pipeline-common) and [`gitlab-org/components`](https://gitlab.com/gitlab-org/components) | -| Dependency management (Gems, Ruby, Vue, etc.) | Jen-Shin, Peter | [Renovate GitLab bot](https://gitlab.com/gitlab-org/frontend/renovate-gitlab-bot) | - -### Metrics - -The Engineering Productivity team creates metrics in the following sources to aid in operational reporting. 
- -- [Engineering Productivity Collection](https://10az.online.tableau.com/#/site/gitlab/collections/fc447e0e-d368-4bc2-a8c6-ac782318ab96) -- [Broken Master Pipeline Root Cause Analysis](https://10az.online.tableau.com/#/site/gitlab/workbooks/2296993/views) -- [Time to First Failure](https://10az.online.tableau.com/#/site/gitlab/workbooks/2300061/views) -- [Flaky test issues](https://10az.online.tableau.com/#/site/gitlab/workbooks/2283052/views) -- [Test Intelligence Accuracy](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTTestIntelligenceAccuracy/TestIntelligenceAccuracy) -- [Engineering Productivity Pipeline Durations](https://10az.online.tableau.com/#/site/gitlab/workbooks/2312755/views) -- [Engineering Productivity Jobs Durations](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTEP-JobsDurations/EP-JobsDurations) -- Engineering Productivity Package And QA Durations (to be replaced in Tableau) -- GDK - Jobs Durations ([to be replaced in Tableau](https://gitlab.com/gitlab-data/tableau/-/issues/253#note_1730258820)) -- [Issue Types Detail](https://10az.online.tableau.com/#/site/gitlab/workbooks/2203014/views) -- [GitLab-Org Native Insights](https://gitlab.com/groups/gitlab-org/-/insights) -- [Review Apps monitoring dashboard](https://app.google.stackdriver.com/dashboards/6798952013815386466?project=gitlab-review-apps) -- Triage Reactive monitoring dashboards - - [Overview dashboard](https://console.cloud.google.com/monitoring/dashboards/builder/e3e9d8fc-54cd-4a98-b4a3-e81f01d37e26?project=gitlab-qa-resources&dashboardBuilderState=%257B%2522editModeEnabled%2522:false%257D&timeDomain=1w) - - [Processors dashboard](https://console.cloud.google.com/monitoring/dashboards/builder/3338d66b-649c-4ea9-aec9-14ffba96c25f?project=gitlab-qa-resources&dashboardBuilderState=%257B%2522editModeEnabled%2522:false%257D&timeDomain=1w) - ### Communication guidelines The Engineering Productivity team will make changes which can create notification spikes or new behavior for @@ -265,129 +195,6 @@ Be sure to give a heads-up to `#development`,`#eng-managers`,`#product`, `#ux` S and the Engineering week in review when an automation is expected to triage more than 50 notifications or change policies that a large stakeholder group use (e.g. team-triage report). -### Asynchronous Issue Updates - -Communicating progress is important but status doesn't belong in one on ones as it can be more appropriately communicated with a broader audience using other methods. The "standup" model used by a lot of organizations practicing scrum assumes a certain time of day for those to happen. In the context of a timezone distributed team, there is no "9am" that the team shares. Additionally, the act of losing and gaining context after completing work for the day only to gain it again to share a status update is context switching. The intended audience of the standup model assumes that it's just the team but in GitLab's model, that means folks need to be aware of where this is being communicated (slack, issues, other). Since this information isn't available to the intended audience, the information needs to be duplicated which at worst means there's no single source of truth and at a minimum means copy pasting information. - -The proposal is to trial using an Asynchronous Issue Update model, similar to [what the Package Group uses](/handbook/engineering/development/ops/package/#async-issue-updates). This process would replace the existing daily standup update we post in Slack with `Geekbot`. 
The time period for the trial would be a milestone or two, depending on feedback cycles. - -The async daily update communicates the progress and confidence using an issue comment and the milestone health status using the Health Status field in the issue. A daily update may be skipped if there was no progress. Merge requests that do not have a related issue should be updated directly. It's preferable to update the issue rather than the related merge requests, as those do not provide a view of the overall progress. Where there are blockers or you need support, Slack is the preferred space to ask for that. Being blocked or needing support are more urgent than email notifications allow. - -When communicating the health status, the options are: -- `on track` - when the issue is progressing as planned -- `needs attention` - when the issue requires attention or intervention to keep it on schedule -- `at risk` - when there is a risk the issue will not be completed according to schedule - -The async update comment should include: -- what percentage complete the work is, in other words, how much work is done to put all the required MRs in review -- the confidence of the person that their estimate is correct -- notes on what was done and/or if review has started -- it could be good to specify the relevant dependencies in the update, if there are multiple people working on it - -Example: -``` -**Status**: 20% complete, 75% confident - -Expecting to go into review tomorrow. -``` - -Include one entry for each associated MR - -Example: -``` -**Issue status**: 20% complete, 75% confident - -Expecting to go into review tomorrow. - -**MR statuses**: - -- !11111+ - 80% complete, 99% confident - docs update - need to add one more section -- !21212+ - 10% complete, 70% confident - api update - database migrations created, working on creating the rest of the functionality next -``` - -##### How to measure confidence? - -Ask yourself, how confident am I that my % of completeness is correct?. - -For things like bugs or issues with many unknowns, the confidence can help communicate the level of unknowns. For example, if you start a bug with a lot of unknowns on the first day of the milestone you might have low confidence that you understand what your level of progress is. -Your confidence in the work may go down for whatever reason, it's acceptable to downgrade your confidence. Consideration should be given to retrospecting on why that happened. -#### Weekly Epic updates - -A weekly update should be added to epics you're assigned to and/or are actively working on. The update should provide an overview of the progress across the feature. Consider adding an update if epic is blocked, if there are unexpected competing priorities, and even when not in progress, what is the confidence level to deliver by the expected delivery date. A weekly update may then be skipped until the situation changes. Anyone working on issues assigned to an epic can post weekly updates. - -The epic updates communicate a high level view of progress and status for quarterly goals using an epic comment. It does not need to have issue or MR level granularity because that is part of each issue updates. - -The weekly update comment should include: -- Status: ok, so-so, bad? Is there something blocked in the general effort? -- How much of the total work is done? How much is remaining? Do we have an ETA? -- What's your confidence level on the completion percentage? -- What is next? -- Is there something that needs help/support? 
(tag specific individuals so they know ahead of time) - -##### Examples - -Some good examples of epic updates that cover the above aspects: -- -- - - -## Test Intelligence - -As the owner of [pipeline configuration](https://docs.gitlab.com/ee/development/pipelines/index.html) for the [GitLab project](https://gitlab.com/gitlab-org/gitlab), the Engineering Productivity team has adopted several test intelligence strategies aimed to improve pipeline efficiency with the following benefits: -- Shortened feedback loop by prioritizing tests that are most likely to fail -- Faster pipelines to scale better when Merge Train is enabled - -These strategies include: -- Predictive test jobs via test mapping -- Fail-fast job -- Re-run previously failed tests early -- Selective jobs via pipeline rules -- Selective jobs via labels - -#### Predictive test jobs via test mapping - -Tests that provide coverage to the code changes in each merge request are most likely to fail. As a result, merge request pipelines for the [GitLab project](https://gitlab.com/gitlab-org/gitlab) run only the predictive set of tests by default. These include: -- [RSpec predictive jobs](https://docs.gitlab.com/ee/development/pipelines/#rspec-predictive-jobs) which runs relevant RSpec tests that are mapped to the code changes -- [Jest predictive jobs](https://docs.gitlab.com/ee/development/pipelines/#jest-predictive-jobs) which runs relevant Jest tests that are mapped to the code changes - -See for more information. - -#### Fail-fast job - -There is a [fail-fast job](https://docs.gitlab.com/ee/development/pipelines/#fail-fast-job-in-merge-request-pipelines) in each merge request pipeline aimed to run all the RSpec tests that provide coverage for the code changes, hence are most likely to fail. It uses the same [test_file_finder](https://gitlab.com/gitlab-org/ruby/gems/test_file_finder) gem for test mapping. The job provides faster feedback by running early and stops the rest of the pipeline right away if any of the fail-fast job tests fail. -Take a look at this [youtube video](https://www.youtube.com/watch?v=FCCbxZky5Nk) for details on how [GitLab](https://gitlab.com/gitlab-org/gitlab) implements the fail-fast job with test_file_finder. -Note that the current design only works with low-impacting merge requests which are only mapped to a small set of tests. If there is a large number of tests that are likely to fail for a merge request, putting them in a single job is not feasible and could result in a long-running bottleneck which defeats its purpose. - -See for more information. - -Premium GitLab customers, who wish to incorporate the `Fail-Fast job` into their Ruby projects, can set it up with our [Verify/Failfast](https://docs.gitlab.com/ee/ci/testing/fail_fast_testing.html) template. - -#### Re-run previously failed tests early - -Tests that previously failed in a merge request are likely to fail again, so they provide the most urgent feedback in the next run. -To grant these tests the highest priority, the [GitLab](https://gitlab.com/gitlab-org/gitlab) pipeline [prioritizes previously failed tests by re-running them early](https://docs.gitlab.com/ee/development/pipelines/#re-run-previously-failed-tests-in-merge-request-pipelines) in a dedicated job, so it will be one of the first jobs to fail if attention is needed. - -See for more information. - -#### Selective jobs via pipeline rules - -The GitLab pipeline consists of hundreds of jobs, but not all are necessary for each merge request. 
For example, a merge request with only changes to documenation files do not need to run any backend tests, so we can exclude all backend test jobs from the pipeline. -See [specify-when-jobs-run-with-rules](https://docs.gitlab.com/ee/ci/jobs/job_control.html#specify-when-jobs-run-with-rules) for how to include/exclude CI jobs based on file changes. -Most of the pipeline rules for the [GitLab project](https://gitlab.com/gitlab-org/gitlab) can be found in . - -#### Selective jobs via labels - -Developers can add labels to run jobs in addition to the ones selected by the pipeline rules. Those labels start with `pipeline:` and multiple can be applied. A few examples that people commonly use: - -- `~"pipeline:run-all-rspec"` -- `~"pipeline:run-all-jest"` -- `~"pipeline:run-as-if-foss"` -- `~"pipeline:run-as-if-jh"` -- `~"pipeline:run-praefect-with-db"` -- `~"pipeline:run-single-db"` - -See [docs](https://docs.gitlab.com/ee/development/pipelines/) for when to use these pipeline labels. - ## Experiments This is a list of Engineering Productivity experiments where we identify an opportunity, form a hypothesis and experiment to test the hypothesis. diff --git a/content/handbook/engineering/infrastructure/engineering-productivity/flaky-tests-management-and-processes.md b/content/handbook/engineering/infrastructure/engineering-productivity/flaky-tests-management-and-processes.md new file mode 100644 index 0000000000..26c583ba57 --- /dev/null +++ b/content/handbook/engineering/infrastructure/engineering-productivity/flaky-tests-management-and-processes.md @@ -0,0 +1,74 @@ +--- + +title: "Flaky tests management and processes" +--- + +## Introduction + +A flaky test is an unreliable test that occasionally fails but passes eventually if you retry it enough times. +In a test suite, flaky tests are inevitable, so our goal should be to limit their negative impact as soon as possible. + +Out of all the factors that affects master pipeline stability, flaky tests contribute to at least 30% of master pipeline failures each month. + +## Current state and assumptions + +| Current state | Assumptions | +| ------------- | ----------- | +| `master` success rate [was at 89% for March 2024](https://handbook.gitlab.com/handbook/engineering/infrastructure/performance-indicators/#master-pipeline-stability) | We don't know exactly what would be the success rate without any flaky tests, but we assume we could attain 99% | +| [5200+ `~"failure::flaky-test"` issues](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTFlakytestissues/FlakyTests?:iid=1) out of a total of [260,040 tests as of 2024-03-01](https://gitlab-org.gitlab.io/rspec_profiling_stats/#overall_time) | It means [we identified 1.99% of tests as being flaky](https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html#automatic-retries-and-flaky-tests-detection). [GitHub identified that 25% of their tests were flaky at some point](https://github.blog/2020-12-16-reducing-flaky-builds-by-18x/#how-far-weve-come), our reality is probably in between. | +| [Coverage is currently at 98.42%](https://gitlab-org.gitlab.io/gitlab/coverage-ruby/#_AllFiles) | Even if we'd removed the 5200 flaky tests, we don't expect the coverage to go down meaningfully. 
|
+| ["Average Retry Count"](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTFlakytestissues/FlakyTests?:iid=1) per pipeline is currently at 0.015; given [RSpec jobs' current average duration of 23 minutes](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTEP-JobsDurations/EP-JobsDurations?:iid=2), this results in an additional `0.015 * 23 = 0.345` minutes on average per pipeline, not including the idle time between the job failing and the time it is retried. [Explanation provided by Albert](https://gitlab.com/gitlab-org/quality/team-tasks/-/issues/874#note_575599680). | Given we have approximately [91k pipelines per month](https://gitlab.com/gitlab-org/gitlab/-/pipelines/charts), that means flaky tests are wasting 31,395 CI minutes per month. Given our private runners cost us $0.0845 / minute, this means flaky tests are wasting at minimum $2,653 per month of CI minutes. This doesn't take into account the engineers' wasted time. |
+
+### Manual flow to detect flaky tests
+
+When a flaky test fails in an MR, the author might follow this flow:
+
+```mermaid
+graph LR
+    A[Test fails in a MR] --> C{Does the failure look related to the MR?}
+    C -->|Yes| D[Try to reproduce and fix the test locally]
+    C -->|No| E{Does a flaky test issue exist?}
+    E -->|Yes| F[Retry the job and hope that it will pass this time]
+    E -->|No| G[Wonder if this is flaky and retry the job]
+```
+
+## Why is flaky tests management important?
+
+Flaky tests negatively impact several teams and areas:
+
+| Impacted department/team | Impacted area | Impact description | Impact quantification |
+| --------------- | ------------- | ------------------ | --------------------- |
+| Development department | MR & deployment cycle time | Wasted time (by forcing people to look at the failures and retry them manually if needed) | A lot of wasted time for all our engineers |
+| Infrastructure department | CI compute resources | Wasted money | At least $2,653 worth of wasted CI compute time per month |
+| Delivery team & Quality department | Deployment cycle time | Distraction from actual CI failures & regressions, leading to slower detection of those | TBD |
+
+## Flaky tests management process
+
+We started an experiment to [automatically open merge requests for very flaky tests](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/147137) to improve overall pipeline stability and duration.
+To ensure that our product quality is not negatively affected by test coverage reduction, the following process should be followed:
+
+1. Groups are responsible for reviewing their [test-quarantining merge requests](https://gitlab.com/gitlab-org/gitlab/-/merge_requests?label_name=quarantine).
+   These merge requests are meant to start a discussion on whether a test is useful or not.
+   In case a test is impacting `master`'s stability heavily, the Engineering Productivity team can merge these merge requests even without a review from their responsible group.
+   The group should still review the merge request and start a discussion about the quarantined test's next step.
+2. Once a test is quarantined, its associated issue will be reported in [weekly group reports](https://gitlab.com/gitlab-org/quality/triage-reports/-/issues/?sort=updated_desc&state=opened&label_name%5B%5D=triage%20report&in=TITLE&search=triage%20report%20for&first_page_size=20).
+ Groups can also list all of their [flaky tests](https://gitlab.com/gitlab-org/gitlab/-/issues/?state=opened&label_name%5B%5D=failure%3A%3Aflaky-test&label_name%5B%5D=group%3A%3Axxx) and their [quarantined tests](https://gitlab.com/gitlab-org/gitlab/-/issues/?state=opened&label_name%5B%5D=group%3A%3Axxx&label_name%5B%5D=quarantine) (replace `group::xxx` in the issues list). +3. The number of quarantined test cases per group is also available as [a dashboard](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTFlakytestissues/FlakyTestIssues?:iid=2). +4. Groups are responsible for ensuring stability and coverage of their own tests, by removing unstable tests or getting them back to running. + +You can leave any feedback about this process in the [dedicated issue](https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/issues/447). + +### Goals + +- Increase `master` stability to a solid 95% success rate without manual action +- Improve productivity - MR merge time - [lower "Average Retry Count"](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTFlakytestissues/FlakyTests?:iid=1) +- Remove doubts on whether `master` is broken or not +- Reduce the need to retry a failing job by default +- Define acceptable thresholds for action like quarantining/focus on refactoring +- Step towards unlocking [Merge train](https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/195) + +## Additional resources + +- [Flaky tests technical documentation](https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html) +- [Measure and act on flaky specs](https://gitlab.com/groups/gitlab-org/-/epics/8789) +- [Flaky tests dashboard](https://10az.online.tableau.com/#/site/gitlab/workbooks/2283052/views) diff --git a/content/handbook/engineering/infrastructure/engineering-productivity/flaky-tests.md b/content/handbook/engineering/infrastructure/engineering-productivity/flaky-tests.md deleted file mode 100644 index 0a31f717d9..0000000000 --- a/content/handbook/engineering/infrastructure/engineering-productivity/flaky-tests.md +++ /dev/null @@ -1,55 +0,0 @@ ---- - -title: "Flaky tests Primer" ---- - -**Last reviewed**: 2021-10-28 - -- [Flaky tests technical documentation](https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html) -- [Measure and act on flaky specs](https://gitlab.com/groups/gitlab-org/-/epics/8789) -- [Flaky tests Sisense dashboard](https://10az.online.tableau.com/#/site/gitlab/workbooks/2283052/views) - -### Introduction - -A flaky test is an unreliable test that occasionally fails but passes eventually if you retry it enough times. - -In a test suite, flaky tests are inevitable, so our goal should be to limit their negative impact as soon as possible. 
- -### Current state and assumptions - -| Current state | Assumptions | -| ------------- | ----------- | -| `master` success rate (with manual retrying of flaky tests) [is between 88% and 92% for August/September/October 2021](https://10az.online.tableau.com/#/site/gitlab/workbooks/2312755/views) | We don't know exactly what would be the success rate if we'd stop retrying flaky tests, but based on this exploratory chart, it could go down by approximately 7% | -| [175 programmatically identified flaky tests](https://10az.online.tableau.com/#/site/gitlab/workbooks/2283052/views) and [211 `~"failure::flaky-test" issues](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTFlakytestissues/FlakyTests?:iid=1) out of a total of 159,590 tests | It means [we identified 0.1% of tests as being flaky](https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html#automatic-retries-and-flaky-tests-detection). This is in line with the ["RSpec Job Flaky Failure Probability"](https://10az.online.tableau.com/#/site/gitlab/views/SlowRSpecTestsIssues/SlowRSpecTestsIssuesDashboard?:iid=1). [GitHub identified that 25% of their tests were flaky at some point](https://github.blog/2020-12-16-reducing-flaky-builds-by-18x/#how-far-weve-come), our reality is probably in between. | -| [Coverage is currently at 97.86%](https://gitlab-org.gitlab.io/gitlab/coverage-ruby/#_AllFiles) | Even if we'd removed the 175 flaky tests, we don't expect the coverage to go down meaningfully. | -| ["Average Retry Count"](https://10az.online.tableau.com/#/site/gitlab/workbooks/2283052/views) per pipeline is currently at 0.08, it means given [RSpec jobs' current average duration of 23 minutes](https://10az.online.tableau.com/#/site/gitlab/views/DRAFTEP-JobsDurations/EP-JobsDurations?:iid=2), this results in an additional `0.08 * 23 = 1.84` minutes on average per pipeline , not including the idle time between the job failing and the time it is retried. [Explanation provided by Albert](https://gitlab.com/gitlab-org/quality/team-tasks/-/issues/874#note_575599680). | Given we have approximately [11k MR pipelines per month](https://10az.online.tableau.com/#/site/gitlab/workbooks/2312755/views), that means flaky tests are wasting 20,240 minutes per month = **337 engineer hours** = 14 days. Given our private runners cost us $0.0845 / minute, this means flaky tests are wasting $1,710 per month. | - -When a flaky test fails in an MR, following is the workflow the author might follow: - -```mermaid -graph LR - A[Test fails in a MR] --> C{Does the failure looks related to the MR?} - C -->|Yes| D[Try to reproduce and fix the test locally] - C -->|No| E{Does a flaky test issue exists?} - E -->|Yes| F[Retry the job and hope that it will pass this time] - E -->|No| G[Wonder if this is flaky and retry the job] -``` - -### Why is this important? 
- -Flaky tests negatively impact several teams and areas: - -| Impacted department/team | Impacted area | Impact description | Impact quantification | -| --------------- | ------------- | ------------------ | --------------------- | -| Development department | MR & deployment cycle time | Wasted time (by forcing people to look at the failure and retry them manually) | ~$26,000 wasted time per month based on 337 engineer hours and using $77 hourly rate for an Engineer | -| Infrastructure department | CI compute resources | Wasted money | At least $1,710 worth of wasted CI compute time per month | -| Delivery team & Quality department | Deployment cycle time | Distraction from actual CI failures & regressions, leading to slower detection of those | TBD | - -### Goal - -- Increase `master` stability to a solid 95% success rate without manual action -- Improve productivity - MR merge time - [lower "Average Retry Count"](https://10az.online.tableau.com/#/site/gitlab/workbooks/2283052/views) -- Removes doubts on whether `master` is broken or not and default action of retry -- Defining acceptable thresholds for action like quarantining/focus on refactoring -- Step towards unlocking merge train - diff --git a/content/handbook/engineering/infrastructure/engineering-productivity/project-management.md b/content/handbook/engineering/infrastructure/engineering-productivity/project-management.md index 355d8aa786..61bafe61b1 100644 --- a/content/handbook/engineering/infrastructure/engineering-productivity/project-management.md +++ b/content/handbook/engineering/infrastructure/engineering-productivity/project-management.md @@ -1,29 +1,145 @@ --- -title: "Engineering productivity Project Management" +title: "Engineering productivity project management" description: "Guidelines for project management for the Engineering Productivity team at GitLab" --- +## Work prioritization + +The Engineering Productivity team has diverse responsibilities and reactive work. Work is categorized as planned and reactive. + +## Guiding principles + +- We focus on OKRs, corrective actions and preventative work. +- We adhere to the general release milestones like %x.y. +- We are ambitious with our targeted planned work per milestone. These targets are not reflective of a commitment. Reactive work load will ebb and flow and we do not expect to accomplish everything planned for the current milestone. +- [Priority labels](/handbook/engineering/infrastructure/engineering-productivity/issue-triage/#priority) are used to indicate relative priority for a milestone. + +## Weighting + +We follow the [department weighting guidelines](/handbook/engineering/infrastructure/test-platform/#weights) to relatively weight issues over time to understand a milestone velocity and increase predictability. + +When weighting, think about knowns and complexity related to recently completed work. The goal with weighting is to allow for some estimation ambiguity that allows for a consistent predictable flow of work each milestone. 
+ +## Prioritization activities + +| When | Activity | DRI | +| --- | --- | --- | +| Weekly | Assign `~priority::1`, `~priority::2` issues to a milestone | Engineering Productivity Engineering Manager | +| Weekly | Weight issues identified with `~"needs weight"` | Engineering Productivity Backend Engineer | +| Weekly | Prioritize all `~"Engineering Productivity"` issues | Engineering Productivity Engineering Manager | +| 2 weeks prior to milestone start | Milestone planned work is identified and scheduled | Engineering Productivity Engineering Manager | +| 2 weeks prior to milestone start | Provide feedback on planned work | Engineering Productivity team | +| 1 week prior to milestone start | Transition any work that is not in progress for current milestone to upcoming milestone | Engineering Productivity Engineering Manager | +| 1 week prior to milestone start | Adjust planned work for upcoming milestone | Engineering Productivity Engineering Manager | +| 1 week prior to milestone start | Final adjustments to planned scope | Engineering Productivity team | +| During milestone | Adjust priorities and scope based on newly identified issues and reactive workload | Engineering Productivity Engineering Manager | + ## Projects -The Quality team currently works cross-functionally and our task ownership spans multiple projects. +The Engineering productivity team currently works cross-functionally and our task ownership spans multiple projects. + +The list below is ordered based on aligned priorities and includes primary domain experts for communication as well as a documentation reference for self-service. + +| Project | Domain Knowledge | Documentation | +| ------- | ------------------------------------------ | ----- | +| GitLab CI Pipeline configuration optimization and stability | Jen-Shin, David, Jenn | [Pipelines for the GitLab project](https://docs.gitlab.com/ee/development/pipelines/index.html) | +| Triaging master-broken | Jenn, Nao | [Broken Master](https://about.gitlab.com/handbook/engineering/workflow/#broken-master) | +| GitLab Development Kit (GDK) continued development | Nao, Peter | [GitLab Development Kit](https://gitlab.com/gitlab-org/gitlab-development-kit/) | +| Triage operations for issues, merge requests, community contributions | Jenn, Alina | [triage-ops](https://gitlab.com/gitlab-org/quality/triage-ops/) | +| Review Apps | David, Rémy | [Using review apps in the development of GitLab](https://docs.gitlab.com/ee/development/testing_guide/review_apps.html) | +| Triage engine, used by GitLab triage operations | Jen-Shin, Rémy | [GitLab Triage](https://gitlab.com/gitlab-org/ruby/gems/gitlab-triage/) | +| Danger & Dangerfiles (includes Reviewer roulette) for shared Danger rules and plugins | Rémy, Jen-Shin, Peter | [`gitLab-dangerfiles` Ruby gem](https://gitlab.com/gitlab-org/ruby/gems/gitlab-dangerfiles) for shared [Danger](https://docs.gitlab.com/ee/development/dangerbot.html#danger-bot) rules and plugins | +| JiHu | Jen-Shin | [JiHu Support](https://about.gitlab.com/handbook/ceo/office-of-the-ceo/jihu-support/) | +| Development department metrics for measurements of Quality and Productivity | Jenn, Rémy | [Development Department Performance Indicators](https://about.gitlab.com/handbook/engineering/development/performance-indicators/) | +| RSpec Profiling Statistics for profiling information on RSpec tests in CI | Peter | [rspec_profiling_stats](https://gitlab.com/gitlab-org/rspec_profiling_stats) | +| RuboCop & shared RuboCop cops | Peter | [`gitLab-styles` Ruby 
gem](https://gitlab.com/gitlab-org/ruby/gems/gitlab-styles) for shared [RuboCop cops](https://docs.gitlab.com/ee/development/contributing/style_guides.html#ruby-rails-rspec) | +| Feature flag alert for reporting on GitLab feature flags | Rémy | [Gitlab feature flag alert](https://gitlab.com/gitlab-org/gitlab-feature-flag-alert) | +| Chatops (especially for feature flags toggling) | Rémy | [Chatops scripts for managing GitLab.com from Slack](https://gitlab.com/gitlab-com/chatops) | +| CI/CD variables, Triage ops, and Internal workspaces infrastructure | David, Rémy | [Engineering Productivity infrastructure](https://gitlab.com/gitlab-org/quality/engineering-productivity-infrastructure) | +| Tokens management | Rémy | ["Rotating credentials" runbook](https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/blob/main/runbooks/rotating-credentials.md) | +| Gems management | Rémy | [Rubygems committee project](https://gitlab.com/gitlab-dependency-committees/rubygems-committee) | +| Shared CI/CD config & components | David, Rémy | [`gitlab-org/quality/pipeline-common`](https://gitlab.com/gitlab-org/quality/pipeline-common) and [`gitlab-org/components`](https://gitlab.com/gitlab-org/components) | +| Dependency management (Gems, Ruby, Vue, etc.) | Jen-Shin, Peter | [Renovate GitLab bot](https://gitlab.com/gitlab-org/frontend/renovate-gitlab-bot) | +| Quality toolbox | David, Rémy | [Quality toolbox](https://gitlab.com/gitlab-org/quality/toolbox) | + +## Issues + +### Issues currently worked on + +Our team's [Quality: Engineering Productivity board](https://gitlab.com/groups/gitlab-org/-/boards/978615?label_name[]=Engineering%20Productivity) shows the current ownership of workload / issues maintained by team members in Engineering Productivity team. + +### Asynchronous issue updates + +Communicating progress is important but status doesn't belong in one on ones as it can be more appropriately communicated with a broader audience using other methods. The "standup" model used by a lot of organizations practicing scrum assumes a certain time of day for those to happen. In the context of a timezone distributed team, there is no "9am" that the team shares. Additionally, the act of losing and gaining context after completing work for the day only to gain it again to share a status update is context switching. The intended audience of the standup model assumes that it's just the team but in GitLab's model, that means folks need to be aware of where this is being communicated (slack, issues, other). Since this information isn't available to the intended audience, the information needs to be duplicated which at worst means there's no single source of truth and at a minimum means copy pasting information. + +The proposal is to trial using an Asynchronous Issue Update model, similar to [what the Package Group uses](/handbook/engineering/development/ops/package/#async-issue-updates). This process would replace the existing daily standup update we post in Slack with `Geekbot`. The time period for the trial would be a milestone or two, depending on feedback cycles. + +The async daily update communicates the progress and confidence using an issue comment and the milestone health status using the Health Status field in the issue. A daily update may be skipped if there was no progress. Merge requests that do not have a related issue should be updated directly. It's preferable to update the issue rather than the related merge requests, as those do not provide a view of the overall progress. 
Where there are blockers or you need support, Slack is the preferred space to ask for that. Being blocked or needing support is more urgent than email notifications allow.
+
+When communicating the health status, the options are:
+- `on track` - when the issue is progressing as planned
+- `needs attention` - when the issue requires attention or intervention to keep it on schedule
+- `at risk` - when there is a risk the issue will not be completed according to schedule
+
+The async update comment should include:
+- what percentage complete the work is, in other words, how much work is done to put all the required MRs in review
+- the confidence of the person that their estimate is correct
+- notes on what was done and/or if review has started
+- it could be good to specify the relevant dependencies in the update, if there are multiple people working on it
+
+Example:
+```
+**Status**: 20% complete, 75% confident
+
+Expecting to go into review tomorrow.
+```
+
+Include one entry for each associated MR
+
+Example:
+```
+**Issue status**: 20% complete, 75% confident
+
+Expecting to go into review tomorrow.
+
+**MR statuses**:
+
+- !11111+ - 80% complete, 99% confident - docs update - need to add one more section
+- !21212+ - 10% complete, 70% confident - api update - database migrations created, working on creating the rest of the functionality next
+```
+
+#### How to measure confidence?
+
+Ask yourself: how confident am I that my % of completeness is correct?
+
+For things like bugs or issues with many unknowns, the confidence can help communicate the level of unknowns. For example, if you start a bug with a lot of unknowns on the first day of the milestone, you might have low confidence that you understand what your level of progress is.
+Your confidence in the work may go down for whatever reason; it's acceptable to downgrade your confidence. Consideration should be given to retrospecting on why that happened.
+
+## Epics
+
+### Weekly epic updates
+
+A weekly update should be added to epics you're assigned to and/or are actively working on. The update should provide an overview of the progress across the feature. Consider adding an update if the epic is blocked, if there are unexpected competing priorities, or, even when work is not in progress, to share the confidence level of delivering by the expected delivery date. A weekly update may then be skipped until the situation changes. Anyone working on issues assigned to an epic can post weekly updates.
+ +The epic updates communicate a high level view of progress and status for quarterly goals using an epic comment. It does not need to have issue or MR level granularity because that is part of each issue updates. + +The weekly update comment should include: +- Status: ok, so-so, bad? Is there something blocked in the general effort? +- How much of the total work is done? How much is remaining? Do we have an ETA? +- What's your confidence level on the completion percentage? +- What is next? +- Is there something that needs help/support? (tag specific individuals so they know ahead of time) + +#### Examples + +Some good examples of epic updates that cover the above aspects: +- +- + +## Reviewers and maintainers + +Upon joining the Engineering productivity team, team members are granted either developer, maintainer, or owner access to a variety of core projects. For projects where only developer access is initially granted, there are some criteria that should be met before maintainer access is granted. - [GitLab Tooling and Pipeline configuration](https://gitlab.com/gitlab-org/gitlab/-/blob/35789a64a6519ee764c8cb3b98f9287915e96e9d/.gitlab/CODEOWNERS#L82-117) - GitLab Tooling and Pipeline configuration consists of scripts and config files used for both local development and for CI pipelines. Changes made to these files have wide impact to developer experience at GitLab. @@ -55,7 +171,7 @@ Upon joining the Quality department, team members are granted either developer, - [Triage Ops](https://gitlab.com/gitlab-org/quality/triage-ops) - Authored or reviewed 10 MRs in total. -#### Becoming a maintainer +### Becoming a maintainer The following guidelines will help you to become a maintainer. Remember that there is no specific timeline on this, and that you should work together with your manager and current maintainers. @@ -73,8 +189,3 @@ Your approval means you think it is ready to merge. It is your responsibility to set up any necessary meetings to discuss your progress with current maintainers, as well as your manager. These can be at any frequency that is right for you. - -## Project Management - -Our team's [Quality: Engineering Productivity board](https://gitlab.com/groups/gitlab-org/-/boards/978615?label_name[]=Engineering%20Productivity) shows the current ownership of workload / issues maintained by team members in Engineering Productivity team. 
-
diff --git a/content/handbook/engineering/infrastructure/engineering-productivity/test-intelligence.md b/content/handbook/engineering/infrastructure/engineering-productivity/test-intelligence.md
new file mode 100644
index 0000000000..4360dd3a93
--- /dev/null
+++ b/content/handbook/engineering/infrastructure/engineering-productivity/test-intelligence.md
@@ -0,0 +1,60 @@
+---
+title: "Test Intelligence"
+---
+
+## Introduction
+
+As the owner of [pipeline configuration](https://docs.gitlab.com/ee/development/pipelines/index.html) for the [GitLab project](https://gitlab.com/gitlab-org/gitlab), the Engineering Productivity team has adopted several test intelligence strategies aimed at improving pipeline efficiency, with the following benefits:
+- Shortened feedback loop by prioritizing tests that are most likely to fail
+- Faster pipelines to scale better when Merge Train is enabled
+
+These strategies include:
+- Predictive test jobs via test mapping
+- Fail-fast job
+- Re-run previously failed tests early
+- Selective jobs via pipeline rules
+- Selective jobs via labels
+
+## Predictive test jobs via test mapping
+
+Tests that provide coverage for the code changes in each merge request are most likely to fail. As a result, merge request pipelines for the [GitLab project](https://gitlab.com/gitlab-org/gitlab) run only the predictive set of tests by default. These include:
+- [RSpec predictive jobs](https://docs.gitlab.com/ee/development/pipelines/#rspec-predictive-jobs) which run relevant RSpec tests that are mapped to the code changes
+- [Jest predictive jobs](https://docs.gitlab.com/ee/development/pipelines/#jest-predictive-jobs) which run relevant Jest tests that are mapped to the code changes
+
+See for more information.
+
+## Fail-fast job
+
+There is a [fail-fast job](https://docs.gitlab.com/ee/development/pipelines/#fail-fast-job-in-merge-request-pipelines) in each merge request pipeline aimed at running all the RSpec tests that provide coverage for the code changes, and hence are most likely to fail. It uses the same [test_file_finder](https://gitlab.com/gitlab-org/ruby/gems/test_file_finder) gem for test mapping. The job provides faster feedback by running early and stops the rest of the pipeline right away if any of the fail-fast job tests fail.
+Take a look at this [YouTube video](https://www.youtube.com/watch?v=FCCbxZky5Nk) for details on how [GitLab](https://gitlab.com/gitlab-org/gitlab) implements the fail-fast job with test_file_finder.
+Note that the current design only works with low-impacting merge requests, which are only mapped to a small set of tests. If there is a large number of tests that are likely to fail for a merge request, putting them in a single job is not feasible and could result in a long-running bottleneck, which defeats its purpose.
+
+See for more information.
+
+Premium GitLab customers who wish to incorporate the `Fail-Fast job` into their Ruby projects can set it up with our [Verify/Failfast](https://docs.gitlab.com/ee/ci/testing/fail_fast_testing.html) template.
+
+## Re-run previously failed tests early
+
+Tests that previously failed in a merge request are likely to fail again, so they provide the most urgent feedback in the next run.
+To grant these tests the highest priority, the [GitLab](https://gitlab.com/gitlab-org/gitlab) pipeline [prioritizes previously failed tests by re-running them early](https://docs.gitlab.com/ee/development/pipelines/#re-run-previously-failed-tests-in-merge-request-pipelines) in a dedicated job, so it will be one of the first jobs to fail if attention is needed.
+
+See for more information.
+
+## Selective jobs via pipeline rules
+
+The GitLab pipeline consists of hundreds of jobs, but not all are necessary for each merge request. For example, a merge request with only changes to documentation files does not need to run any backend tests, so we can exclude all backend test jobs from the pipeline.
+See [specify-when-jobs-run-with-rules](https://docs.gitlab.com/ee/ci/jobs/job_control.html#specify-when-jobs-run-with-rules) for how to include/exclude CI jobs based on file changes.
+Most of the pipeline rules for the [GitLab project](https://gitlab.com/gitlab-org/gitlab) can be found in .
+
+## Selective jobs via labels
+
+Developers can add labels to run jobs in addition to the ones selected by the pipeline rules. Those labels start with `pipeline:` and multiple can be applied. A few examples that people commonly use:
+
+- `~"pipeline:run-all-rspec"`
+- `~"pipeline:run-all-jest"`
+- `~"pipeline:run-as-if-foss"`
+- `~"pipeline:run-as-if-jh"`
+- `~"pipeline:run-praefect-with-db"`
+- `~"pipeline:run-single-db"`
+
+See [docs](https://docs.gitlab.com/ee/development/pipelines/) for when to use these pipeline labels.
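+
+For illustration, the sketch below shows how a single job could combine both mechanisms, running either when relevant files change or when a `pipeline:` label is applied, using GitLab CI `rules` together with the predefined `CI_MERGE_REQUEST_LABELS` variable. It is not the actual configuration of the [GitLab project](https://gitlab.com/gitlab-org/gitlab); the job name, script, and file paths are made up for the example:
+
+```yaml
+# Hypothetical job for illustration only; not taken from the real pipeline definition.
+rspec-backend:
+  script:
+    - bundle exec rspec
+  rules:
+    # Run when backend code or specs change...
+    - changes:
+        - "app/**/*.rb"
+        - "lib/**/*.rb"
+        - "spec/**/*"
+    # ...or when the `pipeline:run-all-rspec` label is set on the merge request.
+    - if: '$CI_MERGE_REQUEST_LABELS =~ /pipeline:run-all-rspec/'
+```
+
+In practice, conditions like these are usually centralized and reused across jobs (for example via YAML anchors or `!reference` tags) rather than repeated in every job definition.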