Getting started

Repair Agent

ecChronos is an agent that simplifies and automates repairs. Each instance of ecChronos is designed to keep a single Cassandra node repaired. It first calculates all the ranges that need to be repaired per table and then groups nodes by the ranges they have in common.
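The grouping step can be pictured with a small sketch. This is only an illustration, not the ecChronos implementation: the function name `group_ranges_by_replicas`, the node names, and the token ranges are all hypothetical, and ranges are modelled as simple `(start, end)` tuples.

```python
from collections import defaultdict

def group_ranges_by_replicas(ranges_by_node):
    """Group token ranges by the set of nodes that hold them (illustrative only)."""
    # Invert node -> ranges into range -> set of nodes holding that range.
    replicas_by_range = defaultdict(set)
    for node, ranges in ranges_by_node.items():
        for token_range in ranges:
            replicas_by_range[token_range].add(node)

    # Ranges that share exactly the same set of nodes end up in one group.
    groups = defaultdict(list)
    for token_range, nodes in replicas_by_range.items():
        groups[frozenset(nodes)].append(token_range)
    return {nodes: sorted(ranges) for nodes, ranges in groups.items()}

# Hypothetical three-node ring where each range is held by two nodes.
ranges_by_node = {
    "node1": [(0, 100), (200, 300)],
    "node2": [(0, 100), (100, 200)],
    "node3": [(100, 200), (200, 300)],
}
groups = group_ranges_by_replicas(ranges_by_node)
for nodes, ranges in groups.items():
    print(sorted(nodes), ranges)
```

Each resulting group lists the ranges that a particular set of nodes has in common, which is the unit a repair would then be run against.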

Each range is repaired separately against a Cassandra node, as if running repair with the -st and -et flags.

A lock table within Cassandra ensures that nodes do not run repairs at the same time. How the locking mechanism behaves can be configured in ecc.yml.

ecChronos uses the repair history saved in Cassandra to know what is and isn't repaired. By keeping track of the latest repaired time, it ensures that the same range does not need to be repaired multiple times per interval. If the repair of a range fails, the range returns to the scheduler to be run at a later time.

Every range that needs to be repaired is given a priority. The priority is based on the number of hours since the range was last repaired, which ensures that the oldest ranges are repaired first. Failing repairs will therefore gain higher and higher priority with time.
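As a rough sketch of this ordering, the snippet below treats the priority as the number of whole hours since the last repair and sorts ranges so the oldest comes first. The function name `repair_priority`, the range labels, and the timestamps are assumptions made for illustration; the exact formula ecChronos uses may differ.

```python
from datetime import datetime, timedelta

def repair_priority(last_repaired_at, now):
    """Whole hours since the last repair; a higher value means more urgent (assumed formula)."""
    elapsed = now - last_repaired_at
    return max(0, int(elapsed.total_seconds() // 3600))

now = datetime(2023, 8, 25, 10, 0, 0)

# Hypothetical ranges and their last repair times.
last_repaired = {
    "range_a": now - timedelta(days=7, hours=3),
    "range_b": now - timedelta(days=9),
    "range_c": now - timedelta(hours=30),
}

# Highest priority (oldest repair) first.
ordered = sorted(
    last_repaired,
    key=lambda name: repair_priority(last_repaired[name], now),
    reverse=True,
)
print(ordered)
```

A range whose repair keeps failing never gets a newer "last repaired" time, so its value keeps growing and it moves toward the front of the order.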

Configuration

The default settings assume a gc_grace_seconds setting of 10 days. Repair is run once on every table every interval, which defaults to 7 days. If 8 days have passed on a range without a repair, a warning is logged. If 10 days have passed, an error is logged.

Tune these parameters to fit your use case in ecc.yml.

Starting ecChronos

Installation information can be found in SETUP.md.

When ecChronos is started, it will read repair history to see the status of the repairs. If repair history exists for a table, ecChronos will use the last time repair was run and add the repair interval to calculate the next repair time. On the other hand, if there's no repair history present for a table, ecChronos will assume the table has been repaired in the past, but an initial delay (defaults to one day) will then offset the first repair. If the repair interval is shorter than the configured initial delay, initial delay will be set to the same value as the repair interval.

The assumed repair times are calculated as follows:

  • Initial delay = min(initial delay, repair interval)
  • Completed at = (start time - repair interval + initial delay)
  • Next repair = (completed at + repair interval)

Given the formulas above and start time = 2023-08-25 10:00:00, repair interval = 7 days, initial delay = 1 day (default value), the calculation looks like this:

  • Completed at = 2023-08-19 10:00:00
  • Next repair = 2023-08-26 10:00:00
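The same calculation can be reproduced with a few lines of Python. This is a direct transcription of the formulas above, not code from ecChronos; the function name `assumed_repair_times` is made up for this example.

```python
from datetime import datetime, timedelta

def assumed_repair_times(start_time, repair_interval, initial_delay):
    """Apply the formulas for a table that has no repair history."""
    # The initial delay is capped at the repair interval.
    initial_delay = min(initial_delay, repair_interval)
    completed_at = start_time - repair_interval + initial_delay
    next_repair = completed_at + repair_interval
    return completed_at, next_repair

completed_at, next_repair = assumed_repair_times(
    start_time=datetime(2023, 8, 25, 10, 0, 0),
    repair_interval=timedelta(days=7),
    initial_delay=timedelta(days=1),
)
print("Completed at =", completed_at)  # 2023-08-19 10:00:00
print("Next repair  =", next_repair)   # 2023-08-26 10:00:00
```

If the repair interval is shorter than the initial delay, for example 12 hours against the default of one day, the cap makes the assumed completion time equal to the start time and the first repair falls one interval after startup.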

For more fine-grained control over which tables to repair and when, please refer to schedule.yml.

Ecctool

For more information on the different commands, see ECCTOOL.md.

Schedules

Running ecctool schedules gives an overview of the current schedules that keep tables repaired.

------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id                                   | Keyspace  | Table                   | Status    | Repaired(%) | Completed at        | Next repair         | Repair type |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 10f0ea60-3585-11ed-86d2-4fb526f4f28a | repair    | rep486                  | ON_TIME   | 24.22       | 2022-09-20 10:00:35 | 2022-09-20 12:00:33 | VNODE       |
| 1313a350-3585-11ed-a474-87c1392dbca4 | repair    | rep487                  | ON_TIME   | 0.00        | 2022-09-20 10:00:35 | 2022-09-20 12:00:35 | VNODE       |
| 15089580-3585-11ed-8bcb-f1d1bbd61061 | repair    | rep488                  | ON_TIME   | 0.00        | 2022-09-20 10:00:35 | 2022-09-20 12:00:35 | VNODE       |
| 16cf72d0-3585-11ed-b921-350e95eb41ad | repair    | rep489                  | ON_TIME   | 0.00        | 2022-09-20 10:00:35 | 2022-09-20 12:00:35 | VNODE       |
| a5348710-34ed-11ed-8794-0d445a53d0c8 | repair    | rep49                   | ON_TIME   | 0.00        | 2022-09-20 10:00:35 | 2022-09-20 12:00:35 | VNODE       |
| 18e5a8a0-3585-11ed-aaa4-e74f24a4aded | repair    | rep490                  | ON_TIME   | 0.00        | 2022-09-20 10:00:35 | 2022-09-20 12:00:35 | VNODE       |
| 1a9be420-3585-11ed-984f-cde131d276a2 | repair    | rep491                  | ON_TIME   | 0.00        | 2022-09-20 10:00:35 | 2022-09-20 12:00:35 | VNODE       |

The status shows COMPLETED when the repair has completed within the interval. If not all ranges are repaired within the interval, the status shows ON_TIME instead. LATE and OVERDUE are shown when ranges have not been repaired for 8 and 10 days respectively. These late and overdue times can be tuned in ecc.yml and schedule.yml to fit your use case.

Repaired(%) shows the percentage of ranges that are repaired. Note that this value can go up and down as ranges become unrepaired since the last interval.
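One way to read these rules is sketched below. It is an interpretation of the description above with the default thresholds of 7, 8 and 10 days, not the actual ecChronos logic; the function name `schedule_status`, its parameters, and the handling of exact boundary values are assumptions.

```python
from datetime import datetime, timedelta

def schedule_status(
    last_repaired_times,
    now,
    interval=timedelta(days=7),
    late_after=timedelta(days=8),
    overdue_after=timedelta(days=10),
):
    """Derive a status from the oldest 'last repaired' time among all ranges (assumed rules)."""
    oldest_age = now - min(last_repaired_times)
    if oldest_age >= overdue_after:
        return "OVERDUE"
    if oldest_age >= late_after:
        return "LATE"
    if oldest_age <= interval:
        # Every range was repaired within the interval.
        return "COMPLETED"
    # At least one range is outside the interval but not yet late.
    return "ON_TIME"

now = datetime(2023, 8, 25, 10, 0, 0)
print(schedule_status([now - timedelta(days=2), now - timedelta(days=6)], now))
print(schedule_status([now - timedelta(days=2), now - timedelta(days=7, hours=12)], now))
print(schedule_status([now - timedelta(days=9)], now))
print(schedule_status([now - timedelta(days=11)], now))
```

With the sample timestamps, the four calls fall into the COMPLETED, ON_TIME, LATE and OVERDUE branches in that order.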

Completed at shows the time when all ranges are repaired. ecChronos assumes a range is repaired if there is no repair history.

Repairs

ecctool repairs shows an overview of all manually triggered repairs.

RepairInfo

ecctool repair-info gives you an idea of what has been repaired. Specifying a time interval helps you determine how many of your ranges were repaired during that time.

Web Server

By default, ecChronos starts a web server that can be configured in application.yml. The server is based on Spring Boot, and most Spring Boot features are exposed here. The REST API is described in REST.md.

Security

If your use case requires security, the three interfaces ecChronos provides can be secured. For CQL and JMX, the security options can be found in security.yml. For the web server these options can be found in the application.yml.

| Feature        | JMX | CQL | Web |
|----------------|-----|-----|-----|
| TLS            | Yes | Yes | Yes |
| Authentication | Yes | Yes | No  |
| Authorization  | Yes | Yes | No  |

Statistics

Statistics are enabled by default but can be disabled in ecc.yml. More information on what metrics to expect can be found in METRICS.md. Individual statistics can also be excluded for more fine-grained control over what is saved. Note that something like logrotate is needed to archive or delete old files.