title: Troubleshoot TiCDC
summary: Learn how to troubleshoot issues you might encounter when you use TiCDC.
aliases: /docs/dev/ticdc/troubleshoot-ticdc/

Troubleshoot TiCDC

This document introduces the common errors you might encounter when using TiCDC, and the corresponding maintenance and troubleshooting methods.

Note:

In this document, the server address specified in cdc cli commands is server=http://127.0.0.1:8300. When you use the command, replace the address with your actual TiCDC Server address.

TiCDC replication interruptions

How do I know whether a TiCDC replication task is interrupted?

  • Check the changefeed checkpoint monitoring metric of the replication task (choose the right changefeed id) in the Grafana dashboard. If the metric value stays unchanged, or the checkpoint lag metric keeps increasing, the replication task might be interrupted.
  • Check the exit error count monitoring metric. If the metric value is greater than 0, an error has occurred in the replication task.
  • Execute cdc cli changefeed list and cdc cli changefeed query to check the status of the replication task. stopped means the task has stopped, and the error item provides the detailed error message. After the error occurs, you can search for error on running processor in the TiCDC server log to find the error stack for troubleshooting.
  • In some extreme cases, the TiCDC service is restarted. You can search for FATAL-level entries in the TiCDC server log for troubleshooting.
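
The two log checks in the list above can be sketched as a small shell helper. The log file path is an assumption (it depends on your deployment), and the `[FATAL]` pattern assumes the bracketed log-level format used by TiDB-family components. The phrase error on running processor is the one named in this document.

```shell
# Sketch: count log lines that hint at an interrupted changefeed.
# Usage: scan_cdc_log /path/to/cdc.log   (the path is deployment-specific)
scan_cdc_log() {
  local log_file="$1"
  local processor_errors fatal_lines
  # grep -c prints the number of matching lines. It exits non-zero when
  # nothing matches, so "|| true" keeps the helper usable under "set -e".
  processor_errors=$(grep -c 'error on running processor' "$log_file" || true)
  # -F treats the brackets literally instead of as a character class.
  fatal_lines=$(grep -c -F '[FATAL]' "$log_file" || true)
  echo "processor_errors=${processor_errors} fatal=${fatal_lines}"
}
```

A non-zero count is only a pointer to where to look. The matching log lines and their stack traces still need to be read.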

How do I know whether the replication task is stopped manually?

To check whether a replication task was stopped manually, run cdc cli changefeed query. For example:

{{< copyable "shell-regular" >}}

cdc cli changefeed query --server=http://127.0.0.1:8300 --changefeed-id 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f

In the output of the above command, admin-job-type shows the state of this replication task. For more information about each state and its meaning, see Changefeed states.
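
If the full query output is long, a filter such as the following can isolate the relevant field. This is a minimal sketch that assumes the output is text in which the field name admin-job-type appears on its own line. The SERVER variable and the function name are illustrative, not part of TiCDC.

```shell
# Sketch: print only the admin-job-type line(s) from the query output.
# SERVER and the changefeed ID are placeholders for your own values.
SERVER="${SERVER:-http://127.0.0.1:8300}"

show_admin_job_type() {
  local changefeed_id="$1"
  cdc cli changefeed query --server="$SERVER" --changefeed-id "$changefeed_id" \
    | grep 'admin-job-type'
}
```

For example, `show_admin_job_type 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f` queries the changefeed used in the command above.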

How do I handle replication interruptions?

A replication task might be interrupted in the following known scenarios:

  • The downstream remains abnormal, and TiCDC still fails after many retries.

    • In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of gc-ttl.

    • Handling method: You can resume the replication task by executing cdc cli changefeed resume after the downstream is back to normal.

  • Replication cannot continue because of incompatible SQL statement(s) in the downstream.

    • In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of gc-ttl.
    • Handling procedures:
      1. Query the status information of the replication task using the cdc cli changefeed query command and record the value of checkpoint-ts.
      2. Use the new task configuration file and add the ignore-txn-start-ts parameter to skip the transaction corresponding to the specified start-ts.
      3. Pause the replication task by executing cdc cli changefeed pause -c <changefeed-id>.
      4. Specify the new task configuration file by executing cdc cli changefeed update -c <changefeed-id> --config <config-file-path>.
      5. Resume the replication task by executing cdc cli changefeed resume -c <changefeed-id>.
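
Steps 2 to 5 of the procedure above can be sketched as one shell function. The function name, the SERVER variable, and the config file path are illustrative. The sketch also assumes that ignore-txn-start-ts lives under the [filter] section of the changefeed configuration file, so check the configuration reference for your TiCDC version before relying on it.

```shell
# Sketch of steps 2-5: write a config that skips one transaction, then
# pause, update, and resume the changefeed.
SERVER="${SERVER:-http://127.0.0.1:8300}"

skip_txn() {
  local changefeed_id="$1" start_ts="$2" config_file="$3"

  # Step 2: a new config file with ignore-txn-start-ts (assumed to belong
  # to the [filter] section). In practice, merge this with your existing
  # changefeed settings instead of replacing them.
  cat > "$config_file" <<EOF
[filter]
ignore-txn-start-ts = [${start_ts}]
EOF

  # Steps 3-5: chain with && so that a failed update never leads to a
  # resume with the old configuration.
  cdc cli changefeed pause  --server="$SERVER" -c "$changefeed_id" &&
  cdc cli changefeed update --server="$SERVER" -c "$changefeed_id" --config "$config_file" &&
  cdc cli changefeed resume --server="$SERVER" -c "$changefeed_id"
}
```

A call would look like `skip_txn <changefeed-id> <start-ts> ./changefeed.toml`, where the start-ts is that of the transaction to skip.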

What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption?

  • Update your TiDB cluster and TiCDC cluster to the latest versions. The OOM problem has already been resolved in v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the latest versions.

How do I handle the Error 1298: Unknown or incorrect time zone: 'UTC' error when creating the replication task or replicating data to MySQL?

This error is returned when the downstream MySQL does not load the time zone. You can load the time zone by running mysql_tzinfo_to_sql. After loading the time zone, you can create tasks and replicate data normally.

{{< copyable "shell-regular" >}}

mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql -p

If the output of the command above is similar to the following one, the import is successful:

Enter password:
Warning: Unable to load '/usr/share/zoneinfo/iso3166.tab' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/leap-seconds.list' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/zone.tab' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/zone1970.tab' as time zone. Skipping it.
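
As an additional check that is not part of the original procedure, you can confirm that the time zone tables are populated. The sketch below counts rows in mysql.time_zone_name, one of the tables that mysql_tzinfo_to_sql fills. The function name is illustrative, and the connection arguments are whatever your environment needs.

```shell
# Sketch: succeed only if the downstream MySQL has named time zones loaded.
# Extra arguments (for example -h, -P, -u, -p) are passed through to mysql.
tz_tables_loaded() {
  local count
  # -N drops the column header and -B selects plain batch output,
  # so the result is just the number.
  count=$(mysql -N -B "$@" -e 'SELECT COUNT(*) FROM mysql.time_zone_name;') || return 1
  [ "${count:-0}" -gt 0 ]
}
```

For example, `tz_tables_loaded -u root -p && echo "time zones loaded"` prints the message only when at least one named time zone exists.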

If the downstream is a special MySQL environment (a public cloud RDS or some MySQL derivatives) and importing the time zone using the preceding method fails, you can use the default time zone of the downstream by setting time-zone to an empty value, such as time-zone="".

When using time zones in TiCDC, it is recommended to explicitly specify the time zone, such as time-zone="Asia/Shanghai". Also, make sure that the tz specified in TiCDC server configurations and the time-zone specified in Sink URI are consistent with the time zone configuration of the downstream database. This prevents data inconsistency caused by inconsistent time zones.

How do I handle the incompatibility issue of configuration files caused by TiCDC upgrade?

Refer to Notes for compatibility.

The start-ts timestamp of the TiCDC task is quite different from the current time. During the execution of this task, replication is interrupted and an error [CDC:ErrBufferReachLimit] occurs. What should I do?

Since v4.0.9, you can try to enable the unified sorter feature in your replication task, or use the BR tool for an incremental backup and restore, and then start the TiCDC replication task from a new time.
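
To judge how far a start-ts is from the current time, it helps to convert the timestamp to wall-clock time. The sketch below assumes the usual TSO layout, in which the low 18 bits are a logical counter and the remaining high bits are Unix time in milliseconds. That layout is not stated in this document, and the function names are illustrative.

```shell
# Sketch: convert a TSO (such as start-ts or checkpoint-ts) to Unix
# milliseconds, assuming the low 18 bits hold the logical counter.
tso_to_ms() {
  echo $(( $1 >> 18 ))
}

# Lag in whole seconds between a TSO and a reference time in Unix seconds.
# The second argument defaults to the current time.
tso_lag_seconds() {
  local tso="$1" now_s="${2:-$(date +%s)}"
  echo $(( now_s - $(tso_to_ms "$tso") / 1000 ))
}
```

For example, `tso_to_ms 415241823337054209` prints 1584021848057, which corresponds to 2020-03-12 14:04:08 UTC. A lag of many hours or days suggests that the task has a large backlog to process.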

When the downstream of a changefeed is a database similar to MySQL and TiCDC executes a time-consuming DDL statement, all other changefeeds are blocked. What should I do?

  1. Pause the execution of the changefeed that contains the time-consuming DDL statement. Then you can see that other changefeeds are no longer blocked.
  2. Search for the apply job field in the TiCDC log and confirm the start-ts of the time-consuming DDL statement.
  3. Manually execute the DDL statement in the downstream. After the execution finishes, go on performing the following operations.
  4. Modify the changefeed configuration and add the above start-ts to the ignore-txn-start-ts configuration item.
  5. Resume the paused changefeed.

The [tikv:9006]GC life time is shorter than transaction duration, transaction starts at xx, GC safe point is yy error is reported when I use TiCDC to create a changefeed. What should I do?

Run the pd-ctl service-gc-safepoint --pd <pd-addrs> command to query the current GC safepoint and service GC safepoint. If the GC safepoint is smaller than the start-ts of the TiCDC replication task (changefeed), you can add the --disable-gc-check option to the cdc cli changefeed create command to create the changefeed.

If the result of pd-ctl service-gc-safepoint --pd <pd-addrs> does not have gc_worker service_id:

  • If your PD version is v4.0.8 or earlier, refer to PD issue #3128 for details.
  • If your PD is upgraded from v4.0.8 or an earlier version to a later version, refer to PD issue #3366 for details.
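
The check for a missing gc_worker entry can be scripted. This is a minimal sketch that assumes the command output lists each service with a "service_id" field. The function name and the PD_ADDRS variable are illustrative.

```shell
# Sketch: report whether the safepoint listing on stdin mentions gc_worker.
# Assumes the output names each service in a "service_id" field.
has_gc_worker() {
  grep -q '"service_id": *"gc_worker"'
}

# Usage (PD_ADDRS is a placeholder for your PD endpoints):
#   pd-ctl service-gc-safepoint --pd "$PD_ADDRS" | has_gc_worker \
#     || echo "gc_worker is missing; see the PD issues listed above"
```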

When I use TiCDC to replicate messages to Kafka, Kafka returns the Message was too large error. Why?

For TiCDC v4.0.8 or earlier versions, you cannot effectively control the size of the message output to Kafka only by configuring the max-message-bytes setting for Kafka in the Sink URI. To control the message size, you also need to increase the limit on the bytes of messages to be received by Kafka. To raise this limit, add the following configuration to the Kafka server configuration.

# The maximum byte number of a message that the broker receives
message.max.bytes=2147483648
# The maximum byte number of a message that the broker copies
replica.fetch.max.bytes=2147483648
# The maximum message byte number that the consumer side reads
fetch.message.max.bytes=2147483648
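
Before restarting the broker, you can compare the configured limit with the max-message-bytes value used in the Sink URI. The helper below is a sketch only. It reads a properties file whose path you supply, expects lines of the form key=value without spaces around the equals sign, and uses illustrative function names.

```shell
# Sketch: read one setting from a Kafka properties file.
# Prints the value of the last uncommented "key=value" line for the key.
kafka_prop() {
  local file="$1" key="$2"
  awk -F= -v k="$key" '$1 == k { v = $2 } END { if (v != "") print v }' "$file"
}

# Succeed if the broker accepts messages at least as large as the
# max-message-bytes value you pass in the TiCDC sink URI.
broker_accepts() {
  local file="$1" max_message_bytes="$2" limit
  limit=$(kafka_prop "$file" message.max.bytes)
  [ -n "$limit" ] && [ "$limit" -ge "$max_message_bytes" ]
}
```

For example, `broker_accepts /path/to/server.properties 1048576` succeeds only if message.max.bytes is set in that file and is at least 1 MiB.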

How can I find out whether a DDL statement fails to execute in the downstream during TiCDC replication? How do I resume the replication?

If a DDL statement fails to execute, the replication task (changefeed) automatically stops. The checkpoint-ts is the DDL statement's finish-ts. If you want TiCDC to retry executing this statement in the downstream, use cdc cli changefeed resume to resume the replication task. For example:

{{< copyable "shell-regular" >}}

cdc cli changefeed resume -c test-cf --server=http://127.0.0.1:8300

To skip the DDL statement that goes wrong, configure the ignore-txn-start-ts parameter so that TiCDC skips the transaction corresponding to the specified start-ts. For example:

  1. Search the TiCDC log for the apply job field, and identify the start-ts of the DDL statement that failed.
  2. Modify the changefeed configuration. Add that start-ts to the ignore-txn-start-ts configuration item.
  3. Resume the suspended changefeed.

Note:

You can also skip the DDL statement by setting the changefeed start-ts to the checkpoint-ts (the timestamp at which the DDL statement goes wrong) plus one and then recreating the task with cdc cli changefeed create. However, this can cause TiCDC to lose the DML data changes at checkpoint-ts plus one. Therefore, this operation is strictly prohibited in production environments. For example, if the checkpoint-ts at which the DDL statement goes wrong is 415241823337054209, the commands for this approach are:

cdc cli changefeed remove --server=http://127.0.0.1:8300 --changefeed-id simple-replication-task
cdc cli changefeed create --server=http://127.0.0.1:8300 --sink-uri="mysql://root:<password>@<downstream-host>:3306/" --changefeed-id="simple-replication-task" --sort-engine="unified" --start-ts 415241823337054210

The Kafka: client has run out of available brokers to talk to: EOF error is reported when I use TiCDC to replicate messages to Kafka. What should I do?

This error is typically caused by the connection failure between TiCDC and the Kafka cluster. To troubleshoot, you can check the Kafka logs and network status. One possible reason is that you did not specify the correct kafka-version parameter when creating the replication task, causing the Kafka client inside TiCDC to use the wrong Kafka API version when accessing the Kafka server. You can fix this issue by specifying the correct kafka-version parameter in the --sink-uri configuration. For example:

cdc cli changefeed create --server=http://127.0.0.1:8300 --sink-uri "kafka://127.0.0.1:9092/test?topic=test&protocol=open-protocol&kafka-version=2.4.0"