TiDB Troubleshooting Map
This document summarizes common issues in TiDB and other components. You can use this map to diagnose and solve issues when you encounter related problems.
1. Service Unavailable
1.1 The client reports Region is Unavailable error
1.1.1 The `Region is Unavailable` error is usually reported because a Region is not available for a period of time. You might encounter `TiKV server is busy`, the request to TiKV fails due to `not leader` or `epoch not match`, or the request to TiKV times out. In such cases, TiDB performs a `backoff` retry mechanism. When the `backoff` exceeds a threshold (20s by default), the error is sent to the client. Within the `backoff` threshold, this error is not visible to the client.

1.1.2 Multiple TiKV instances are OOM at the same time, which causes no Leader during the OOM period. See case-991 in Chinese.
1.1.3 TiKV reports `TiKV server is busy`, and the request exceeds the `backoff` time. For more details, refer to 4.3. `TiKV server is busy` is a result of the internal flow control mechanism and should not be counted in the `backoff` time. This issue will be fixed.

1.1.4 Multiple TiKV instances fail to start, which causes no Leader in a Region. When multiple TiKV instances are deployed on a single physical machine, the failure of that machine can leave a Region without a Leader if the label is not properly configured. See case-228 in Chinese.
1.1.5 When a Follower's apply is lagging in a previous epoch and the Follower then becomes the Leader, it rejects the request with `epoch not match`. See case-958 in Chinese (TiKV needs to optimize its mechanism).
1.2 PD errors cause service unavailable
Refer to 5 PD issues.
2. Latency increases significantly
2.1 Transient increase
- 2.1.1 Wrong TiDB execution plan causes latency increase. Refer to 3.3.
- 2.1.2 PD Leader election issue or OOM. Refer to 5.2 and 5.3.
- 2.1.3 A significant number of Leader drops in some TiKV instances. Refer to 4.4.
2.2 Persistent and significant increase
2.2.1 TiKV single thread bottleneck
Too many Regions in a TiKV instance cause a single gRPC thread to be the bottleneck (check the Grafana -> TiKV-details -> Thread CPU/gRPC CPU Per Thread metric). In v3.x or later versions, you can enable `Hibernate Region` to resolve the issue. See case-612 in Chinese.

For versions earlier than v3.0, when the raftstore thread or the apply thread becomes the bottleneck (the Grafana -> TiKV-details -> Thread CPU/raft store CPU and Async apply CPU metrics exceed 80%), you can scale out TiKV (v2.x) instances or upgrade to v3.x with multi-threading.
2.2.2 CPU load increases.
2.2.3 TiKV slow write. Refer to 4.5.
2.2.4 TiDB wrong execution plan. Refer to 3.3.
3. TiDB issues
3.1 DDL
3.1.1 The error `ERROR 1105 (HY000): unsupported modify decimal column precision` is reported when you modify the length of a `decimal` field, because TiDB does not support changing the length of the `decimal` field.

3.1.2 A TiDB DDL job hangs or executes slowly (use `admin show ddl jobs` to check DDL progress).

Cause 1: Network issue with other components (PD/TiKV).
Cause 2: Early versions of TiDB (earlier than v3.0.8) have heavy internal load because of a large number of goroutines at high concurrency.
Cause 3: In early versions (v2.1.15 & versions < v3.0.0-rc1), PD instances fail to delete TiDB keys, which causes every DDL change to wait for two leases.
For other unknown causes, report a bug.
Solution:
- For cause 1, check the network connection between TiDB and TiKV/PD.
- For cause 2 and 3, the issues are already fixed in later versions. You can upgrade TiDB to a later version.
- For other causes, use the following solution to migrate the DDL owner.
DDL owner migration:

- If you can connect to the TiDB server, execute the owner election command again: `curl -X POST http://{TiDBIP}:10080/ddl/owner/resign`
- If you cannot connect to the TiDB server, use `tidb-ctl` to delete the DDL owner from the etcd of the PD cluster to trigger re-election: `tidb-ctl etcd delowner [LeaseID] [flags] + ownerKey`
3.1.3 TiDB reports the `information schema is changed` error in the log.

For the detailed causes and solution, see Why the `Information schema is changed` error is reported.

Background: The number of `schema version` increases is consistent with the number of `schema state` changes of each DDL operation. For example, the `create table` operation has 1 version change, and the `add column` operation has 4 version changes. Therefore, too many column change operations might cause the `schema version` to increase fast. For details, refer to online schema change.
3.1.4 TiDB reports the `information schema is out of date` error in the log.

Cause 1: The TiDB server that is executing the DML statement is stopped by `graceful kill` and prepares to exit. The execution time of the transaction that contains the DML statement exceeds one DDL lease, and an error is reported when the transaction is committed.

Cause 2: The TiDB server cannot connect to PD or TiKV while executing the DML statement. As a result, the TiDB server does not load the new schema within one DDL lease (`45s` by default), or the TiDB server disconnects from PD because of the `keep alive` setting.

Cause 3: TiKV has high load or the network times out. Check the node loads in Grafana -> TiDB and TiKV.
Solution:
- For cause 1, retry the DML operation when TiDB is started.
- For cause 2, check the network between the TiDB server and PD/TiKV.
- For cause 3, investigate why TiKV is busy. Refer to 4 TiKV issues.
3.2 OOM issues
3.2.1 Symptom
Client: The client reports the error `ERROR 2013 (HY000): Lost connection to MySQL server during query`.

Check the log:

- Execute `dmesg -T | grep tidb-server`. The result shows the OOM-killer log around the time point when the error occurs.
- Grep the "Welcome to TiDB" log in `tidb.log` around the time point after the error occurs (namely, the time when tidb-server restarts).
- Grep `fatal error: runtime: out of memory` or `cannot allocate memory` in `tidb_stderr.log`.
- In v2.1.8 or earlier versions, you can grep `fatal error: stack overflow` in `tidb_stderr.log`.
Monitor: The memory usage of tidb-server instances increases sharply in a short period of time.
3.2.2 Locate the SQL statement that causes OOM. (Currently, no version of TiDB can locate the SQL statement precisely; after you locate one, you still need to analyze whether the OOM is caused by that SQL statement.)

- For versions >= v3.0.0, grep "expensive_query" in `tidb.log`. That log message records SQL queries that time out or exceed the memory quota.
- For versions < v3.0.0, grep "memory exceeds quota" in `tidb.log` to locate SQL queries that exceed the memory quota.
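A quick shell sketch of the two checks above, assuming the log file is named `tidb.log` in the current directory (adjust the path to your deployment):

```shell
# v3.0.0 and later: expensive queries (queries that time out or exceed the memory quota)
grep "expensive_query" tidb.log

# earlier than v3.0.0: queries that exceed the memory quota
grep "memory exceeds quota" tidb.log
```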
3.2.3 Mitigate OOM issues
- By enabling `SWAP`, you can mitigate the OOM issue caused by the overuse of memory by large queries. When memory is insufficient, this method can impact the performance of large queries due to the I/O overhead. The degree to which the performance is affected depends on the remaining memory space and the disk I/O speed.
3.2.4 Typical reasons for OOM
- The SQL query has `join`. If you view the SQL statement by using `explain`, you can find that the `join` operation selects the `HashJoin` algorithm and the `inner` table is large.
- The data volume of a single `UPDATE`/`DELETE` query is too large. See case-882 in Chinese.
- The SQL statement contains multiple sub-queries connected by `Union`. See case-1828 in Chinese.
For more information about troubleshooting OOM, see Troubleshoot TiDB OOM Issues.
3.3 Wrong execution plan
3.3.1 Symptom
SQL query execution time is much longer compared with that of previous executions, or the execution plan suddenly changes. If the execution plan is logged in the slow log, you can directly compare the execution plans.

SQL query execution time is much longer compared with that of other databases such as MySQL. Compare the execution plan with other databases to see the differences, such as `Join Order`.

In the slow log, the SQL execution time is long and the number of `Scan Keys` is large.
3.3.2 Investigate the execution plan
- `explain analyze {SQL}`. When the execution time is acceptable, compare `count` in the result of `explain analyze` and the number of `row` in `execution info`. If a large difference is found in the `TableScan`/`IndexScan` row, it is likely that the statistics are incorrect. If a large difference is found in other rows, the problem might not be in the statistics.
- `select count(*)`. When the execution plan contains a `join` operation, `explain analyze` might take a long time. You can check whether the problem is in the statistics by executing `select count(*)` for the conditions on `TableScan`/`IndexScan` and comparing the `row count` information in the `explain` result.
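A minimal SQL sketch of the comparison described above; the table `t`, column `c1`, and the filter value are hypothetical:

```sql
-- Compare the estimated row count with the actual count for each operator.
EXPLAIN ANALYZE SELECT * FROM t WHERE c1 = 100;

-- If EXPLAIN ANALYZE would take too long (for example, the plan contains a join),
-- run the scan condition alone and compare the result with the row count that
-- EXPLAIN estimates for the TableScan/IndexScan operator.
SELECT COUNT(*) FROM t WHERE c1 = 100;
EXPLAIN SELECT * FROM t WHERE c1 = 100;
```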
3.3.3 Mitigation
- For v3.0 and later versions, use the `SQL Bind` feature to bind the execution plan (see the SQL sketch after this list).
- Update the statistics. If you are roughly sure that the problem is caused by the statistics, dump the statistics. If the cause is outdated statistics, such as the `modify count`/`row count` in `show stats_meta` being greater than a certain ratio (for example, 0.3), or the table having an index on a time column, you can try recovering by using `analyze table`. If `auto analyze` is configured, check whether the `tidb_auto_analyze_ratio` system variable is too large (for example, greater than 0.3), and whether the current time is between `tidb_auto_analyze_start_time` and `tidb_auto_analyze_end_time`.
- For other situations, report a bug.
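A SQL sketch of the two mitigations above; the table `t`, the index `idx_c1`, and the optimizer hint are placeholders for illustration:

```sql
-- v3.0 and later: bind a known-good plan so the optimizer keeps using it.
CREATE GLOBAL BINDING FOR
    SELECT * FROM t WHERE c1 = 100
USING
    SELECT /*+ USE_INDEX(t, idx_c1) */ * FROM t WHERE c1 = 100;

-- Check how stale the statistics are, then refresh them.
SHOW STATS_META WHERE table_name = 't';
ANALYZE TABLE t;
```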
3.4 SQL execution error
3.4.1 The client reports the `ERROR 1265(01000) Data Truncated` error. This is because the way TiDB internally calculates the precision of the `Decimal` type is incompatible with that of MySQL. This issue has been fixed in v3.0.10 (#14438).

Cause:

- In MySQL, if two large-precision `Decimal` values are divided and the result exceeds the maximum decimal precision (`30`), only `30` digits are reserved and no error is reported.
- In TiDB, the calculation result is the same as in MySQL, but inside the data structure that represents `Decimal`, a field for decimal precision still retains the actual precision.
- Take `(0.1^30) / 10` as an example. The results in TiDB and MySQL are both `0`, because the precision is `30` at most. However, in TiDB, the field for decimal precision is still `31`.
- After multiple `Decimal` divisions, even though the result is correct, this precision field can grow larger and larger, eventually exceed the threshold in TiDB (`72`), and the `Data Truncated` error is reported.
- The multiplication of `Decimal` does not have this issue, because the out-of-bounds case is bypassed, and the precision is set to the maximum precision limit.

Solution: You can bypass this issue by manually adding `Cast(xx as decimal(a, b))`, in which `a` and `b` are the target precisions.
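A SQL sketch of the `Cast` workaround above; the column names and the target precision `decimal(30, 10)` are placeholders:

```sql
-- Reset the internal precision after repeated Decimal divisions.
SELECT CAST(col_a / col_b AS DECIMAL(30, 10)) AS ratio
FROM t;
```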
4. TiKV issues
4.1 TiKV panics and fails to start
4.1.1 `sync-log = false`. The `unexpected raft log index: last_index X < applied_index Y` error is returned after the machine is powered off.

This issue is expected. You can restore the Region using `tikv-ctl`.

4.1.2 If TiKV is deployed on a virtual machine, when the virtual machine is killed or the physical machine is powered off, the `entries[X, Y] is unavailable from storage` error is reported.

This issue is expected. The `fsync` of virtual machines is not reliable, so you need to restore the Region using `tikv-ctl`.

4.1.3 For other unexpected causes, report a bug.
4.2 TiKV OOM
4.2.1 If the `block-cache` configuration is too large, it might cause OOM.

To verify the cause of the problem, check the `block cache size` of RocksDB by selecting the corresponding instance in the monitor Grafana -> TiKV-details.

Meanwhile, check whether the `[storage.block-cache] capacity = "1GB"` parameter is set properly. By default, TiKV's `block-cache` is set to 45% of the total memory of the machine. You need to explicitly specify this parameter when you deploy TiKV in a container, because TiKV obtains the memory of the physical machine, which might exceed the memory limit of the container.
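A minimal TiKV config sketch for setting the block cache capacity explicitly; the `16GB` value is only an illustration and should be sized for your container or machine:

```toml
# tikv.toml
[storage.block-cache]
# Defaults to about 45% of the machine's total memory.
# Set it explicitly when TiKV runs in a container.
capacity = "16GB"
```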
4.2.2 The coprocessor receives many large queries and returns a large volume of data. gRPC fails to send data as quickly as the coprocessor returns data, which results in OOM.
To verify the cause, you can check whether the `response size` exceeds the `network outbound` traffic by viewing the monitor Grafana -> TiKV-details -> coprocessor overview.

4.2.3 Other components occupy too much memory.
This issue is unexpected. You can report a bug.
4.3 The client reports the server is busy error
Check the specific cause for busy by viewing the monitor Grafana -> TiKV -> errors. `server is busy` is caused by the flow control mechanism of TiKV, which informs `tidb/ti-client` that TiKV is currently under too much pressure and that the request should be retried later.
4.3.1 TiKV RocksDB encounters `write stall`.

A TiKV instance has two RocksDB instances, one in `data/raft` to save the Raft log, another in `data/db` to save the real data. You can check the specific cause of a stall by running `grep "Stalling" RocksDB` in the log. The RocksDB log is a file starting with `LOG`, and `LOG` is the current log.

- Too many `level0 sst` files cause a stall. You can add the `[rocksdb] max-sub-compactions = 2` (or 3) parameter to speed up `level0 sst` compaction. The compaction task from level0 to level1 is divided into several subtasks (the max number of subtasks is the value of `max-sub-compactions`) to be executed concurrently. See case-815 in Chinese.

- Too many `pending compaction bytes` cause a stall. The disk I/O fails to keep up with the write operations in business peaks. You can mitigate this problem by increasing the `soft-pending-compaction-bytes-limit` and `hard-pending-compaction-bytes-limit` of the corresponding CF.

    - The default value of `[rocksdb.defaultcf] soft-pending-compaction-bytes-limit` is `64GB`. If the pending compaction bytes reach this threshold, RocksDB slows down the write speed. You can set `[rocksdb.defaultcf] soft-pending-compaction-bytes-limit` to `128GB`.
    - The default value of `hard-pending-compaction-bytes-limit` is `256GB`. If the pending compaction bytes reach this threshold (which is not likely to happen, because RocksDB slows down the write after the pending compaction bytes reach `soft-pending-compaction-bytes-limit`), RocksDB stops the write operation. You can set `hard-pending-compaction-bytes-limit` to `512GB`.
    - If the disk I/O capacity fails to keep up with the write for a long time, it is recommended to scale up your disk. If the disk throughput reaches the upper limit and causes write stall (for example, a SATA SSD is much slower than an NVMe SSD) while the CPU resources are sufficient, you can apply a compression algorithm with a higher compression ratio. This way, CPU resources are traded for disk resources, and the pressure on the disk is eased.
    - If the default CF compaction is under high pressure, change the `[rocksdb.defaultcf] compression-per-level` parameter from `["no", "no", "lz4", "lz4", "lz4", "zstd", "zstd"]` to `["no", "no", "zstd", "zstd", "zstd", "zstd", "zstd"]`.
- Too many memtables cause a stall. This usually occurs when the amount of instant writes is large and the memtables flush to the disk slowly. If the disk write speed cannot be improved, and this issue only occurs during business peaks, you can mitigate it by increasing the `max-write-buffer-number` of the corresponding CF. For example, set `[rocksdb.defaultcf] max-write-buffer-number` to `8` (`5` by default). Note that this might cause more memory usage at the peak, because more memtables might be in memory. (A combined config sketch for these parameters follows this list.)
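A TiKV config sketch that gathers the write-stall mitigations above in one place; the values are the examples quoted in this section, not universal recommendations:

```toml
# tikv.toml
[rocksdb]
# Split level0 -> level1 compaction into concurrent subtasks.
max-sub-compactions = 2

[rocksdb.defaultcf]
# Raise the pending-compaction-bytes thresholds when disk I/O cannot keep up at peaks.
soft-pending-compaction-bytes-limit = "128GB"
hard-pending-compaction-bytes-limit = "512GB"
# Trade CPU for disk by using a higher-ratio compression algorithm on more levels.
compression-per-level = ["no", "no", "zstd", "zstd", "zstd", "zstd", "zstd"]
# Allow more memtables in memory when flushes are slow during business peaks.
max-write-buffer-number = 8
```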
4.3.2 `scheduler too busy`

- Serious write conflict: `latch wait duration` is high. You can view `latch wait duration` in the monitor Grafana -> TiKV-details -> scheduler prewrite/scheduler commit. When the write tasks pile up in the scheduler, the pending write tasks exceed the threshold set in `[storage] scheduler-pending-write-threshold` (100MB). You can verify the cause by viewing the metric corresponding to `MVCC_CONFLICT_COUNTER`.
- Slow write causes write tasks to pile up. The data being written to TiKV exceeds the threshold set by `[storage] scheduler-pending-write-threshold` (100MB). Refer to 4.5.
4.3.3 `raftstore is busy`. The processing of messages is slower than the receiving of messages. The short-term `channel full` status does not affect the service, but if the error persists for a long time, it might cause Leader switch.

- `append log` encounters a stall. Refer to 4.3.1.
- `append log duration` is high, which causes slow processing of messages. You can refer to 4.5 to analyze why `append log duration` is high.
- raftstore receives a large batch of messages in an instant (check the TiKV Raft messages dashboard) and fails to process them. Usually the short-term `channel full` status does not affect the service.
4.3.4 TiKV coprocessor is in a queue. The number of piled-up tasks exceeds `coprocessor threads * readpool.coprocessor.max-tasks-per-worker-[normal|low|high]`. Too many large queries lead to tasks piling up in the coprocessor. You need to check whether an execution plan change causes a large number of table scan operations. Refer to 3.3.
4.4 Some TiKV nodes drop Leader frequently
4.4.1 Re-election because TiKV is restarted
After TiKV panics, it is pulled up by systemd and runs normally. You can check whether panic has occurred by viewing the TiKV log. Because this issue is unexpected, report a bug if it happens.
TiKV is stopped or killed by a third party and then pulled up by systemd. Check the cause by viewing `dmesg` and the TiKV log.

TiKV is OOM, which causes a restart. Refer to 4.2.

TiKV is hung because of dynamically adjusting `THP` (Transparent Hugepage). See case-500 in Chinese.
4.4.2 TiKV RocksDB encounters write stall, which results in re-election. You can check whether the monitor Grafana -> TiKV-details -> errors shows `server is busy`. Refer to 4.3.1.

4.4.3 Re-election because of network isolation.
4.5 TiKV write is slow
4.5.1 Check whether the TiKV write is slow by viewing the `prewrite`/`commit`/`raw-put` (raw-put is only for RawKV clusters) duration of TiKV gRPC. Generally, you can locate the slow phase according to the performance map. Some common situations are listed as follows.

4.5.2 The scheduler CPU is busy (only for transaction kv).
The `scheduler command duration` of prewrite/commit is longer than the sum of `scheduler latch wait duration` and `storage async write duration`. The scheduler worker has a high CPU demand, such as over 80% of `scheduler-worker-pool-size` * 100%, or the CPU resources of the entire machine are relatively limited. If the write workload is large, check whether `[storage] scheduler-worker-pool-size` is set too small.

For other situations, report a bug.
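A config sketch for the scheduler worker pool mentioned above; the value `8` is only an illustration and should match your CPU budget and write workload:

```toml
# tikv.toml
[storage]
# Number of scheduler worker threads. Consider increasing it when the write
# workload is large and the scheduler CPU stays near its limit.
scheduler-worker-pool-size = 8
```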
4.5.3 Append log is slow.
The Raft IO/`append log duration` in TiKV Grafana is high, usually because the disk write operation is slow. You can verify the cause by checking the `WAL Sync Duration max` value of RocksDB - raft.

For other situations, report a bug.
4.5.4 The raftstore thread is busy.
The Raft Propose/`propose wait duration` is significantly larger than the `append log duration` in TiKV Grafana. Take the following measures:

- Check whether the `[raftstore] store-pool-size` configuration value is too small. It is recommended to set the value between `1` and `5` and not too large.
- Check whether the CPU resources on the machine are insufficient.
4.5.5 Apply is slow.
The Raft IO/`apply log duration` in TiKV Grafana is high, which usually comes with a high Raft Propose/`apply wait duration`. The possible causes are as follows:

- `[raftstore] apply-pool-size` is too small (it is recommended to set the value between `1` and `5` and not too large), and the Thread CPU/`apply CPU` is large.
- The CPU resources on the machine are insufficient.
- Region write hot spot. A single apply thread has high CPU usage. Currently, the hot spot problem on a single Region cannot be properly addressed, which is being improved. To view the CPU usage of each thread, modify the Grafana expression and add `by (instance, name)`.
- RocksDB write is slow. RocksDB kv/`max write duration` is high. A single Raft log might contain multiple KVs. When writing into RocksDB, 128 KVs are written into RocksDB in one write batch. Therefore, an apply log might be associated with multiple writes in RocksDB.
- For other situations, report a bug.
4.5.6 Raft commit log is slow.
The Raft IO/`commit log duration` in TiKV Grafana is high (this metric is only supported in Grafana in v4.x and later). Every Region corresponds to an independent Raft group. Raft has a flow control mechanism, similar to the sliding window mechanism of TCP. You can control the size of the sliding window by configuring the `[raftstore] raft-max-inflight-msgs = 256` parameter. If there is a write hot spot and the `commit log duration` is high, you can adjust the parameter, for example, increase it to `1024`.
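A config sketch for widening the Raft flow-control window described above; `1024` is the example value from this section:

```toml
# tikv.toml
[raftstore]
# Per-Region sliding window of in-flight Raft messages (256 by default).
# Consider raising it when a write hot spot keeps commit log duration high.
raft-max-inflight-msgs = 1024
```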
4.5.7 For other situations, refer to the write path on the performance map and analyze the cause.
5. PD issues
5.1 PD scheduling
5.1.1 Merge
- Empty Regions across tables cannot be merged. You need to modify the `[coprocessor] split-region-on-table` parameter in TiKV, which is set to `false` in v4.x by default. See case-896 in Chinese.
- Region merge is slow. You can check whether the merge operator is generated by accessing the monitor dashboard in Grafana -> PD -> operator. To accelerate the merge, increase the value of `merge-schedule-limit` (see the pd-ctl sketch after this list).
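For the second item, a pd-ctl sketch for raising the merge quota; the PD address and the value `16` are assumptions for illustration:

```shell
pd-ctl -u http://127.0.0.1:2379 config set merge-schedule-limit 16
```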
5.1.2 Add replicas or take replicas online/offline
The TiKV disk uses 80% of the capacity, and PD does not add replicas. In this situation, the number of miss peers increases, so TiKV needs to be scaled out. See case-801 in Chinese.
When a TiKV node is taken offline, some Regions cannot be migrated to other nodes. This issue has been fixed in v3.0.4 (#5526). See case-870 in Chinese.
5.1.3 Balance
- The Leader/Region count is not evenly distributed. See case-394 and case-759 in Chinese. The major cause is that the balance performs scheduling based on the size of Region/Leader, which might result in an uneven distribution of the count. In TiDB 4.0, the `leader-schedule-policy` parameter is introduced, which enables you to set the scheduling policy of Leader to be `count`-based or `size`-based.
5.2 PD election
5.2.1 PD switches Leader.
Cause 1: Disk. The disk where the PD node is located has full I/O load. Investigate whether PD is deployed with other components with high I/O demand and the health of the disk. You can verify the cause by viewing the monitor metrics in Grafana -> disk performance -> latency/load. You can also use the FIO tool to run a check on the disk if necessary. See case-292 in Chinese.
Cause 2: Network. The PD log shows `lost the TCP streaming connection`. You need to check whether there is a problem with the network between PD nodes and verify the cause by viewing `round trip` in the monitor Grafana -> PD -> etcd. See case-177 in Chinese.

Cause 3: High system load. The log shows `server is likely overloaded`. See case-214 in Chinese.
5.2.2 PD cannot elect a Leader or the election is slow.
PD cannot elect a Leader: The PD log shows `lease is not expired`. This issue has been fixed in v3.0.x and v2.1.19. See case-875 in Chinese.

The election is slow: The Region loading duration is long. You can check this issue by running `grep "regions cost"` in the PD log. If the result is in seconds, such as `load 460927 regions cost 11.77099s`, it means the Region loading is slow. You can enable the `region storage` feature in v3.0 by setting `use-region-storage` to `true`, which significantly reduces the Region loading duration. See case-429 in Chinese.
5.2.3 PD timed out when TiDB executes SQL statements.
PD doesn't have a Leader or switches Leader. Refer to 5.2.1 and 5.2.2.
Network issue. Check whether the network from TiDB to PD Leader is running normally by accessing the monitor Grafana -> blackbox_exporter -> ping latency.
PD panics. Report a bug.
PD is OOM. Refer to 5.3.
If the issue has other causes, get the goroutine information by running `curl http://127.0.0.1:2379/debug/pprof/goroutine?debug=2` and report a bug.
5.2.4 Other issues
PD reports the `FATAL` error, and the log shows `range failed to find revision pair`. This issue has been fixed in v3.0.8 (#2040). For details, see case-947 in Chinese.

For other situations, report a bug.
5.3 PD OOM
5.3.1 When the `/api/v1/regions` interface is used, too many Regions might cause PD OOM. This issue has been fixed in v3.0.8 (#1986).

5.3.2 PD OOM during the rolling upgrade. The size of gRPC messages is not limited, and the monitor shows that TCP InSegs is relatively large. This issue has been fixed in v3.0.6 (#1952).
5.4 Grafana display
- 5.4.1 The monitor in Grafana -> PD -> cluster -> role displays follower. The Grafana expression issue has been fixed in v3.0.8.
6. Ecosystem tools
6.1 TiDB Binlog
6.1.1 TiDB Binlog is a tool that collects changes from TiDB and provides backup and replication to downstream TiDB or MySQL platforms. For details, see TiDB Binlog on GitHub.
6.1.2 The `Update Time` in Pump/Drainer Status is updated normally, and no anomaly shows in the log, but no data is written to the downstream.

- Binlog is not enabled in the TiDB configuration. Modify the `[binlog]` configuration in TiDB.
6.1.3 `sarama` in Drainer reports the `EOF` error.

- The Kafka client version in Drainer is inconsistent with the version of Kafka. You need to modify the `[syncer.to] kafka-version` configuration.
6.1.4 Drainer fails to write to Kafka and panics, and Kafka reports the `Message was too large` error.

The binlog data is too large, so the single message written to Kafka is too large. You need to modify the following configuration of Kafka:

- `message.max.bytes=1073741824`
- `replica.fetch.max.bytes=1073741824`
- `fetch.message.max.bytes=1073741824`

For details, see case-789 in Chinese.
6.1.5 Inconsistent data in upstream and downstream
Some TiDB nodes do not enable binlog. For v3.0.6 or later versions, you can check the binlog status of all the nodes by accessing the http://127.0.0.1:10080/info/all interface. For versions earlier than v3.0.6, you can check the binlog status by viewing the configuration file.
Some TiDB nodes go into the `ignore binlog` status. For v3.0.6 or later versions, you can check the binlog status of all the nodes by accessing the http://127.0.0.1:10080/info/all interface. For versions earlier than v3.0.6, check the TiDB log to see whether it contains the `ignore binlog` keyword.

The value of the timestamp column is inconsistent in upstream and downstream.

- This is caused by different time zones. You need to ensure that Drainer is in the same time zone as the upstream and downstream databases. Drainer obtains its time zone from `/etc/localtime` and does not support the `TZ` environment variable. See case-826 in Chinese.
- In TiDB, the default value of timestamp is `null`, but the same default value in MySQL 5.7 (not including MySQL 8) is the current time. Therefore, when the timestamp in upstream TiDB is `null` and the downstream is MySQL 5.7, the data in the timestamp column is inconsistent. You need to run `set @@global.explicit_defaults_for_timestamp=on;` in the upstream before enabling binlog.
For other situations, report a bug.
6.1.6 Slow replication
The downstream is TiDB/MySQL, and the upstream performs frequent DDL operations. See case-1023 in Chinese.
The downstream is TiDB/MySQL, and the table to be replicated has no primary key and no unique index, which causes reduced performance in binlog. It is recommended to add the primary key or unique index.
If the downstream outputs to files, check whether the output disk or network disk is slow.
For other situations, report a bug.
6.1.7 Pump cannot write binlog and reports the `no space left on device` error.

- The local disk space is insufficient for Pump to write binlog data normally. You need to clean up the disk space and then restart Pump.
6.1.8 Pump reports the `fail to notify all living drainer` error when it is started.

Cause: When Pump is started, it notifies all Drainer nodes that are in the `online` state. If it fails to notify Drainer, this error log is printed.

Solution: Use the binlogctl tool to check whether each Drainer node is normal or not, to ensure that all Drainer nodes in the `online` state are working normally. If the state of a Drainer node is not consistent with its actual working status, use the binlogctl tool to change its state and then restart Pump. See the case fail-to-notify-all-living-drainer.
6.1.9 Drainer reports the `gen update sqls failed: table xxx: row data is corruption []` error.

- Trigger: The upstream performs DML operations on this table while performing the `DROP COLUMN` DDL. This issue has been fixed in v3.0.6. See case-820 in Chinese.
6.1.10 Drainer replication is hung. The process remains active but the checkpoint is not updated.
- This issue has been fixed in v3.0.4. See case-741 in Chinese.
6.1.11 Any component panics.
6.2 Data Migration
6.2.1 TiDB Data Migration (DM) is a migration tool that supports data migration from MySQL/MariaDB into TiDB. For details, see DM on GitHub.
6.2.2 `Access denied for user 'root'@'172.31.43.27' (using password: YES)` shows when you run `query status` or check the log.

- The database-related passwords in all the DM configuration files should be encrypted by `dmctl`. If a database password is empty, it is unnecessary to encrypt the password. Cleartext passwords can be used since v1.0.6.
- During DM operation, the user of the upstream and downstream databases must have the corresponding read and write privileges. Data Migration also prechecks the corresponding privileges automatically while starting the data replication task.
- To deploy different versions of DM-worker/DM-master/dmctl in a DM cluster, see the case study on AskTUG in Chinese.
6.2.3 A replication task is interrupted with the `driver: bad connection` error returned.

The `driver: bad connection` error indicates that an anomaly has occurred in the connection between DM and the downstream TiDB database (such as network failure, TiDB restart, and so on), and that the data of the current request has not yet been sent to TiDB.

- For versions earlier than DM 1.0.0 GA, stop the task by running `stop-task` and then restart the task by running `start-task`.
- For DM 1.0.0 GA or later versions, an automatic retry mechanism for this type of error is added. See #265.
6.2.4 A replication task is interrupted with the `invalid connection` error.

The `invalid connection` error indicates that an anomaly has occurred in the connection between DM and the downstream TiDB database (such as network failure, TiDB restart, TiKV busy, and so on), and that a part of the data for the current request has been sent to TiDB. Because DM has the feature of concurrently replicating data to the downstream in replication tasks, several errors might occur when a task is interrupted. You can check these errors by running `query-status` or `query-error`.

- If only the `invalid connection` error occurs during the incremental replication process, DM retries the task automatically.
- If DM does not retry or fails to retry automatically because of version problems (automatic retry is introduced in v1.0.0-rc.1), use `stop-task` to stop the task and then use `start-task` to restart the task.
6.2.5 The relay unit reports the error `event from * in * diff from passed-in event *`, or a replication task is interrupted with an error that fails to get or parse binlog, such as `get binlog error ERROR 1236 (HY000)` and `binlog checksum mismatch, data may be corrupted`.

During the process in which DM pulls the relay log or performs incremental replication, these two errors might occur if the size of the upstream binlog file exceeds 4 GB.
Cause: When writing relay logs, DM needs to perform event verification based on binlog positions and the binlog file size, and store the replicated binlog positions as checkpoints. However, the official MySQL uses uint32 to store binlog positions, which means the binlog position for a binlog file over 4 GB overflows, and then the errors above occur.
Solution:
- For relay processing units, manually recover replication.
- For binlog replication processing units, manually recover replication.
6.2.6 The DM replication is interrupted, and the log returns `ERROR 1236 (HY000) The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.`

- Check whether the master binlog is purged.
- Check the position information recorded in `relay.meta`.
    - `relay.meta` has recorded the empty GTID information. DM-worker saves the GTID information in memory to `relay.meta` when it exits or every 30s. When DM-worker does not obtain the upstream GTID information, it saves the empty GTID information to `relay.meta`. See case-772 in Chinese.
    - The binlog event recorded in `relay.meta` triggers the incomplete recover process and records the wrong GTID information. This issue is fixed in v1.0.2, and might occur in earlier versions.
6.2.7 The DM replication process returns the error `Error 1366: incorrect utf8 value eda0bdedb29d(\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd)`.

- This value cannot be successfully written into MySQL 8.0 or TiDB, but can be written into MySQL 5.7. You can skip the data format check by enabling the `tidb_skip_utf8_check` parameter.
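A SQL sketch of the workaround above, run on the downstream TiDB; note that this relaxes UTF-8 validity checks globally, so weigh it against your data-quality requirements:

```sql
SET GLOBAL tidb_skip_utf8_check = 1;
```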
6.3 TiDB Lightning
6.3.1 TiDB Lightning is a tool for fast full import of large amounts of data into a TiDB cluster. See TiDB Lightning on GitHub.
6.3.2 Import speed is too slow.
`region-concurrency` is set too high, which causes thread contention and reduces performance. Three ways to troubleshoot:

- The setting can be found at the start of the log by searching `region-concurrency`.
- If TiDB Lightning shares a server with other services (for example, Importer), you must manually set `region-concurrency` to 75% of the total number of CPU cores on that server (see the config sketch after this list).
- If there is a quota on CPU (for example, limited by Kubernetes settings), TiDB Lightning might not be able to read this out. In this case, `region-concurrency` must also be manually reduced.
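A TiDB Lightning config sketch for pinning the concurrency manually; the section placement and the value `12` (75% of a 16-core host) are assumptions for illustration:

```toml
# tidb-lightning.toml
[lightning]
# Roughly 75% of the CPU cores of the server that Lightning shares with other services.
region-concurrency = 12
```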
Every additional index introduces a new KV pair for each row. If there are N indices, the actual size to be imported would be approximately (N+1) times the size of the Mydumper output. If the indices are negligible, you may first remove them from the schema, and add them back via `CREATE INDEX` after the import is complete.

The version of TiDB Lightning is old. Try the latest version, which might improve the import speed.
6.3.3 `checksum failed: checksum mismatched remote vs local`

Cause 1: The table might already have data. This old data can affect the final checksum.

Cause 2: If the checksum of the target database is 0, which means nothing is imported, it is possible that the cluster is too hot and fails to take in any data.

Cause 3: If the data source is generated by the machine and not backed up by Mydumper, ensure it respects the constraints of the table. For example:

- `AUTO_INCREMENT` columns need to be positive, and do not contain the value "0".
- UNIQUE and PRIMARY KEYs must not have duplicate entries.
Solution: See Troubleshooting Solution.
6.3.4 `Checkpoint for … has invalid status: (error code)`

Cause: Checkpoint is enabled, and Lightning/Importer has previously exited abnormally. To prevent accidental data corruption, TiDB Lightning will not start until the error is addressed. The error code is an integer less than 25, with possible values of 0, 3, 6, 9, 12, 14, 15, 17, 18, 20, and 21. The integer indicates the step where the unexpected exit occurs in the import process. The larger the integer is, the later the exit occurs.

Solution: See Troubleshooting Solution.
6.3.5 `ResourceTemporarilyUnavailable("Too many open engines …: 8")`

- Cause: The number of concurrent engine files exceeds the limit specified by tikv-importer. This could be caused by misconfiguration. In addition, even when the configuration is correct, if tidb-lightning has exited abnormally before, an engine file might be left in a dangling open state, which could cause this error as well.
- Solution: See Troubleshooting Solution.
6.3.6 `cannot guess encoding for input file, please convert to UTF-8 manually`

Cause: TiDB Lightning only supports the UTF-8 and GB-18030 encodings. This error means the file is not in any of these encodings. It is also possible that the file has mixed encoding, such as containing a string in UTF-8 and another string in GB-18030, due to historical ALTER TABLE executions.

Solution: See Troubleshooting Solution.
Solution: See Troubleshooting Solution.
6.3.7 `[sql2kv] sql encode error = [types:1292]invalid time format: '{1970 1 1 0 45 0 0}'`

Cause: A timestamp type entry has a time value that does not exist. This is either because of DST changes or because the time value has exceeded the supported range (from Jan 1, 1970 to Jan 19, 2038).
Solution: See Troubleshooting Solution.
7. Common log analysis
7.1 TiDB
7.1.1 `GC life time is shorter than transaction duration`

The transaction duration exceeds the GC lifetime (10 minutes by default).

You can increase the GC lifetime by modifying the `tidb_gc_life_time` system variable. Generally, it is not recommended to modify this parameter, because changing it might cause many old versions to pile up if the transaction has a large number of `UPDATE` and `DELETE` statements.
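A SQL sketch for raising the GC lifetime temporarily; `30m` is only an illustration, and the value should be restored afterwards:

```sql
-- Check the current value, then raise it for the long-running transaction.
SHOW VARIABLES LIKE 'tidb_gc_life_time';
SET GLOBAL tidb_gc_life_time = '30m';
```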
7.1.2 `txn takes too much time`

This error is returned when you commit a transaction that has not been committed for a long time (over 590 seconds).
If your application needs to execute a transaction that takes such a long time, you can increase the `[tikv-client] max-txn-time-use = 590` parameter and the GC lifetime to avoid this issue. It is recommended to check whether your application really needs such a long transaction time.

7.1.3 `coprocessor.go` reports `request outdated`.
This error is returned when the coprocessor request sent to TiKV waits in a queue at TiKV for over 60 seconds.
You need to investigate why the TiKV coprocessor is in a long queue.
7.1.4 `region_cache.go` reports a large number of `switch region peer to next due to send request fail`, and the error message is `context deadline exceeded`.

The request to TiKV timed out and triggers the region cache to switch the request to other nodes. You can continue to run the `grep "<addr> cancelled"` command on the `addr` field in the log and take the following steps according to the `grep` results:

- `send request is cancelled`: The request timed out during the sending phase. You can investigate the monitoring Grafana -> TiDB -> Batch Client/`Pending Request Count by TiKV` and see whether the Pending Request Count is greater than 128:
    - If the value is greater than 128, the sending goes beyond the processing capacity of KV, so the sending piles up.
    - If the value is not greater than 128, check the log to see whether the report is caused by the operation and maintenance changes of the corresponding KV; otherwise, this error is unexpected, and you need to report a bug.
- `wait response is cancelled`: The request timed out after it was sent to TiKV. You need to check the response time of the corresponding TiKV address and the Region logs in PD and KV at that time.
7.1.5 `distsql.go` reports `inconsistent index`.

The data index seems to be inconsistent. Run the `admin check table <TableName>` command on the table where the reported index is. If the check fails, disable garbage collection by running the following command, and report a bug: `SET GLOBAL tidb_gc_enable = 0;`
7.2 TiKV
7.2.1 `key is locked`

The read and write have a conflict. The read request encounters data that has not been committed and needs to wait until the data is committed.

A small number of these errors has no impact on the business, but a large number of these errors indicates that the read-write conflict is severe in your business.
7.2.2 `write conflict`

This is the write-write conflict in optimistic transactions. If multiple transactions modify the same key, only one transaction succeeds, and the other transactions automatically obtain the timestamp again and retry the operation, with no impact on the business.
If the conflict is severe, it might cause transaction failure after multiple retries. In this case, it is recommended to use the pessimistic lock.
7.2.3 `TxnLockNotFound`

The transaction commit is too slow, causing it to be rolled back by other transactions after Time To Live (TTL). This transaction automatically retries, so the business is usually not affected. For a transaction with a size of 0.25 MB or smaller, the default TTL is 3 seconds.
7.2.4 `PessimisticLockNotFound`

Similar to `TxnLockNotFound`. The pessimistic transaction commit is too slow and is thus rolled back by other transactions.

7.2.5 `stale_epoch`
The request epoch is outdated, so TiDB re-sends the request after updating the routing. The business is not affected. The epoch changes when a Region has a split/merge operation or a replica is migrated.
7.2.6 `peer is not leader`

The request is sent to a replica that is not the Leader. If the error response indicates which replica is the latest Leader, TiDB updates the local routing according to the error and sends a new request to the latest Leader. Usually, the business is not affected.

In v3.0 and later versions, TiDB tries other peers if the request to the previous Leader fails, which might lead to frequent `peer is not leader` messages in the TiKV log. You can check the `switch region peer to next due to send request fail` log of the corresponding Region in TiDB to determine the root cause of the sending failure. For details, refer to 7.1.4.

This error might also be returned if a Region has no Leader due to other reasons. For details, see 4.4.