
[5/15 Bodies] No block bodies to write in this log period block number=27910849 #51

Closed
broyeztony opened this issue May 4, 2023 · 15 comments

@broyeztony

broyeztony commented May 4, 2023

System information

OS: linux (amd64)
OS Image: Amazon Linux 2
Kernel version: 5.4.226-129.415.amzn2.x86_64
Container runtime: docker://20.10.17

Erigon version: 2.40.0-dev-3da15fcb

Erigon Command (with flags/config):

erigon --chain=bsc --datadir=/data/erigon --config=/home/erigon/config.toml --prune=hrtc --db.pagesize=16k --log.console.verbosity=info --nat=extip:$(POD_IP) --maxpeers=1000 --p2p.protocol=66 --torrent.download.slots=20 --torrent.download.rate=512mb --bodies.cache=214748364800 --batchSize=4096M

Consensus Layer:

Consensus Layer Command (with flags/config):

Chain/Network: bsc / 56

Expected behaviour

The node syncs to head and remains synced consistently.

Actual behaviour

The node stopped syncing (here, stuck at the Bodies stage) and drifted away from the chain's head without any configuration changes.
Please note that we already use the config flags mentioned in #41.
The solutions proposed there using the integration command do not work either.

Steps to reproduce the behaviour

Launch the client; syncing gets stuck at a fixed block number.

Backtrace

[INFO] [05-04|12:41:20.217] [5/15 Bodies] No block bodies to write in this log period block number=27910849
[INFO] [05-04|12:41:40.217] [5/15 Bodies] Downloading block bodies   block_num=27910849 delivery/sec=334.0KB wasted/sec=0B remaining=4935 delivered=264 cache=4.8GB alloc=9.2GB sys=16.2GB
[INFO] [05-04|12:42:00.217] [5/15 Bodies] No block bodies to write in this log period block number=27910849
[INFO] [05-04|12:42:00.221] [5/15 Bodies] DONE                       in=1m40.005033072s
[INFO] [05-04|12:42:03.978] Commit cycle                             in=3.756665787s
[INFO] [05-04|12:42:03.978] Timings (slower than 50ms)               Headers=6.616s Bodies=1m40.005s
[INFO] [05-04|12:42:03.978] Tables                                   PlainState=233.4GB AccountChangeSet=1.7GB StorageChangeSet=3.6GB BlockTransaction=5.1GB TransactionLog=9.1GB FreeList=45.2MB ReclaimableSpace=181.0GB
[INFO] [05-04|12:42:03.980] RPC Daemon notified of new headers       from=27914761 to=27915785 hash=0x684047a81b5a66d11e06e1fc4d5f432d1de4b8a194f1e20c20c9241fe5c98fc0 header sending=1.239498ms log sending=324ns
[INFO] [05-04|12:42:03.981] [2/15 Headers] Waiting for headers...    from=27915785
[INFO] [05-04|12:42:06.711] [parlia] snapshots build, gather headers block=27900000
[INFO] [05-04|12:42:06.713] [parlia] snapshots build, recover from headers block=27900000
[INFO] [05-04|12:42:07.850] [parlia] snapshots build, gather headers block=27900000
[INFO] [05-04|12:42:07.852] [parlia] snapshots build, recover from headers block=27900000
[INFO] [05-04|12:42:09.759] [parlia] snapshots build, gather headers block=27900000
[INFO] [05-04|12:42:09.769] [parlia] snapshots build, recover from headers block=27900000
[INFO] [05-04|12:42:09.866] [2/15 Headers] Processed                 highest inserted=27915821 age=7s
[INFO] [05-04|12:42:09.885] [5/15 Bodies] Processing bodies...       from=27910849 to=27915821
[INFO] [05-04|12:42:10.248] [txpool] stat                            pending=10000 baseFee=268 queued=30000 alloc=6.1GB sys=16.2GB
[INFO] [05-04|12:42:29.886] [5/15 Bodies] Downloading block bodies   block_num=27910849 delivery/sec=313.3KB wasted/sec=102.5KB remaining=4971 delivered=103 cache=4.8GB alloc=6.3GB sys=16.2GB
[INFO] [05-04|12:42:49.886] [5/15 Bodies] Downloading block bodies   block_num=27910849 delivery/sec=417.2KB wasted/sec=0B remaining=4971 delivered=249 cache=4.8GB alloc=5.8GB sys=16.2GB
[INFO] [05-04|12:43:07.902] [p2p] GoodPeers                          eth66=708
[INFO] [05-04|12:43:09.885] [5/15 Bodies] No block bodies to write in this log period block number=27910849
@hitchhooker

Our workaround has been to detect when the node is stuck and restart the service, with something like this:

#!/bin/bash

# Variables
LOG_FILE="bsc.log"
PATTERN="No block bodies to write"
MIN_OCCURRENCES=5
NODE_RESTART_CMD="systemctl restart your-node.service"

# Function to count consecutive occurrences of the pattern with the same block number
count_occurrences() {
  local log_file="$1"
  local pattern="$2"
  local count=0
  local prev_block_number=""

  # Read via process substitution: piping grep into the loop would run the
  # loop in a subshell, so `count` and the early `return 0` would never
  # reach the caller and the function would always fall through to `return 1`.
  while read -r line; do
    block_number=$(echo "$line" | grep -oP 'block number=\K\d+')
    if [[ "$block_number" == "$prev_block_number" ]]; then
      ((count++))
    else
      count=1
    fi

    if (( count >= MIN_OCCURRENCES )); then
      return 0
    fi

    prev_block_number="$block_number"
  done < <(grep -F "$pattern" "$log_file")

  return 1
}

# Main loop
while true; do
  if count_occurrences "$LOG_FILE" "$PATTERN"; then
    echo "Restarting node due to consecutive occurrences of '$PATTERN' with the same block number in $LOG_FILE"
    $NODE_RESTART_CMD
  fi
  sleep 60
done
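The extraction in the script above relies on GNU grep's `-oP` (PCRE) support. On systems whose grep lacks PCRE, a portable sed equivalent does the same job; a quick check on a sample line copied from the logs in this issue:

```shell
#!/bin/sh
# Pull the block number out of an Erigon "No block bodies to write" log line
# without GNU grep's -oP, using plain POSIX sed instead.
line='[INFO] [05-04|12:41:20.217] [5/15 Bodies] No block bodies to write in this log period block number=27910849'
block=$(printf '%s\n' "$line" | sed -n 's/.*block number=\([0-9][0-9]*\).*/\1/p')
echo "$block"    # prints 27910849
```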

@calmbeing

We suggest using the latest mdbx snapshot for archive nodes: https://github.com/bnb-chain/bsc-snapshots

@broyeztony
Author

Following our call today, here are the 'ETL' logs we see in the client container:

[INFO] [05-11|09:11:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=ae5f14a9
[INFO] [05-11|09:12:16.634] [] ETL [2/2] Loading                     into=PlainState current_prefix=b41b9ac5
[INFO] [05-11|09:12:28.010] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.3GB sys=9.7GB
[INFO] [05-11|09:12:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=bb56a527
[INFO] [05-11|09:13:16.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=c4bfd229
[INFO] [05-11|09:13:27.226] [p2p] GoodPeers                          eth66=992
[INFO] [05-11|09:13:28.027] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.0GB sys=9.7GB
[INFO] [05-11|09:13:49.650] [] ETL [2/2] Loading                     into=PlainState current_prefix=cca4e84c
[INFO] [05-11|09:14:17.557] [] ETL [2/2] Loading                     into=PlainState current_prefix=d72ffbc2
[INFO] [05-11|09:14:28.013] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.2GB sys=9.7GB
[INFO] [05-11|09:14:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=ddb59a10
[INFO] [05-11|09:15:16.634] [] ETL [2/2] Loading                     into=PlainState current_prefix=e7f08bbf
[INFO] [05-11|09:15:27.225] [p2p] GoodPeers                          eth66=992
[INFO] [05-11|09:15:28.024] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.4GB sys=9.7GB
[INFO] [05-11|09:15:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=ea0d2132
[INFO] [05-11|09:16:16.634] [] ETL [2/2] Loading                     into=PlainState current_prefix=f387ef35
[INFO] [05-11|09:16:28.010] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.5GB sys=9.7GB
[INFO] [05-11|09:16:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=fd36e2c2
[INFO] [05-11|09:17:27.225] [p2p] GoodPeers                          eth66=992
[INFO] [05-11|09:17:28.032] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.3GB sys=9.7GB
[INFO] [05-11|09:18:14.291] [7/15 Execution] Completed on            block=28111500
[INFO] [05-11|09:18:14.291] [7/15 Execution] DONE                    in=47m50.839811187s
[INFO] [05-11|09:18:14.292] [8/15 HashState] Promoting plain state   from=28101756 to=28111500
[INFO] [05-11|09:18:14.292] [8/15 HashState] Incremental promotion   from=28101756 to=28111500 codes=true csbucket=AccountChangeSet
[INFO] [05-11|09:18:28.009] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.0GB sys=9.7GB
[INFO] [05-11|09:18:44.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acce32
[INFO] [05-11|09:19:14.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acd406
[INFO] [05-11|09:19:27.225] [p2p] GoodPeers                          eth66=994
[INFO] [05-11|09:19:28.024] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=4.6GB sys=9.7GB
[INFO] [05-11|09:19:44.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acd957
[INFO] [05-11|09:20:14.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acde06
[INFO] [05-11|09:20:28.011] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=3.7GB sys=9.7GB
[INFO] [05-11|09:20:44.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001ace3b5
[INFO] [05-11|09:21:14.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acea5a
[INFO] [05-11|09:21:27.225] [p2p] GoodPeers                          eth66=998
[INFO] [05-11|09:21:28.027] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.8GB sys=9.7GB
[INFO] [05-11|09:22:14.187] [8/15 HashState] ETL [2/2] Loading       into=HashedCodeHash current_prefix=4a728024
[INFO] [05-11|09:22:28.010] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.0GB sys=9.7GB
[INFO] [05-11|09:22:44.187] [8/15 HashState] ETL [2/2] Loading       into=HashedCodeHash current_prefix=95b338d1
[INFO] [05-11|09:23:14.187] [8/15 HashState] ETL [2/2] Loading       into=HashedCodeHash current_prefix=dfe51ad1

We are not sure these logs are relevant to the issue, but we also run other networks on Erigon without problems, and we don't see these 'ETL' logs there.

Also, could you share the configuration you use for your own archive and full nodes?
Thank you very much.

@setunapo
Collaborator

got it, we are checking on it.

@calmbeing

@broyeztony For the short term, you could use something like the following script to detect the "No block headers to write..." message and restart automatically. For the long term, we are carefully analyzing the p2p sentry code and will optimize it soon.

#!/bin/bash
set -x
# Define the number of log lines to check
LOG_LINES=50

# Define the number of occurrences before triggering a restart
OCCURRENCES=10

# Continuously check the logs
while true; do
    # Get the last $LOG_LINES log lines
    LOGS=$(journalctl -u erigon-bsc --lines=$LOG_LINES)

    # Count the number of occurrences of the error message
    COUNT=$(echo "$LOGS" | grep -o "No block headers to write in this log period block number" | wc -l)

    # Check if the error occurred more than $OCCURRENCES times
    if [ "$COUNT" -gt "$OCCURRENCES" ]; then
        # Restart the erigon-bsc service
        echo "Error occurred more than $OCCURRENCES times. Restarting service..."
        systemctl restart erigon-bsc
    fi

    # Sleep for 60 seconds before checking the logs again
    sleep 60
done

@calmbeing

> (quoted @hitchhooker's restart script from the comment above)

Also, you can refer to this script.

@koraykoska

A restart does not work at all for us; the error keeps appearing and the node does not sync. We used the latest snapshot from the official page. Is there any other workaround?

@koraykoska

After every restart:

[INFO] [05-16|16:55:51.333] [txpool] Started
[INFO] [05-16|16:56:10.518] [2/15 Headers] Wrote block headers       number=28265891 blk/second=5.000 alloc=2.9GB sys=3.1GB
[INFO] [05-16|16:56:30.518] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:56:50.492] [txpool] stat                            pending=1 baseFee=0 queued=4 alloc=3.1GB sys=3.3GB
[INFO] [05-16|16:56:50.518] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:57:10.518] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:57:30.519] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:57:49.054] [p2p] GoodPeers                          eth66=73 eth67=4 eth68=1
[INFO] [05-16|16:57:50.494] [txpool] stat                            pending=3641 baseFee=0 queued=2182 alloc=2.4GB sys=3.6GB
[INFO] [05-16|16:57:50.519] [2/15 Headers] Wrote block headers       number=28265933 blk/second=2.100 alloc=2.4GB sys=3.6GB
[INFO] [05-16|16:58:10.518] [2/15 Headers] No block headers to write in this log period block number=28265933

And it never syncs; the logs show only No block headers to write in this log period.

@calmbeing

calmbeing commented May 19, 2023

@broyeztony @fylsan3 After careful debugging, the root cause comes down to two issues:

  1. The request packet parameters differ slightly from bsc-geth's.
  2. The node keeps connecting to useless peers even when they return empty responses.

  • The first issue has been fixed in the source code.

  • For the second one, please specify the flag --sentry.drop-useless-peers, which drops useless peers that return empty responses.

Please let us know if any issue remains.

@broyeztony
Author

@calmbeing thank you for the update.

Despite running with the suggested parameters on v1.0.7, we are still facing issues with syncing.
We observe the no block headers/bodies to write... error to be less frequent, but overall syncing seems slower than before.
Please find attached the entire log of one of our nodes, which was restarted by the cronjob 9h ago.
bsc-mainnet-erigon-archive.log

Also, these are the startup command parameters for reference:

- '--chain=bsc'
- '--datadir=/data/erigon'
- '--log.console.verbosity=info'
- '--nat=extip:$(POD_IP)'
- '--maxpeers=1000'
- '--p2p.protocol=66,67'
- '--torrent.download.slots=20'
- '--torrent.download.rate=512mb'
- '--bodies.cache=214748364800'
- '--batchSize=4096M'
- '--db.pagesize=16k'
- '--sentry.drop-useless-peers'

@yusben

yusben commented May 20, 2023

> (quoted @broyeztony's previous comment)
I assume you should use v1.0.8 with the latest fixes. I deployed it yesterday with the flag recommended by calmbeing, and so far Erigon has been working flawlessly for 24+ hours without any no block headers messages in the logs.

@broyeztony
Author

broyeztony commented May 23, 2023

@yusben Thank you for your inputs.
Would you mind sharing the setup you are using to run your node? Not so much the startup command but rather:

  • are you running a full (pruned) node or an archive node?
  • disk / storage type: is it local to the machine that runs Erigon, or a scalable volume like AWS EBS? What filesystem are you using?

Using v1.0.8 and ... --sentry.drop-useless-peers --p2p.protocol=66 ... we are not encountering the no block headers/bodies error logs either (except on 1 testnet node, with the same issue as reported there), but our syncing loop on mainnet archive is just too slow (our mainnet full nodes, however, perform better and remain relatively 'close' to head).

On average, our best nodes sync ~400 blocks in ~20 minutes. Pasting an example below:

[INFO] [05-23|06:07:00.025] RPC Daemon notified of new headers       from=28453505 to=28453885 hash=0xbcc966a30ad2e90786e2c585868671bdcf813fd2b2c49f4de0c42121df41324e header sending=371.552µs log sending=316ns
... <stages 1 to 15>
[INFO] [05-23|06:26:40.867] Timings (slower than 50ms)               Headers=1.112s BlockHashes=546ms Bodies=1.251s Senders=288ms Execution=2m54.614s HashState=3m19.178s IntermediateHashes=3m2.516s CallTraces=2m13.753s AccountHistoryIndex=1m9.658s StorageHistoryIndex=2m10.962s LogIndex=1m34.543s TxLookup=1m48.088s

As a consequence, our nodes always fall behind head by > 400 blocks.
@harveyff @calmbeing FYI.

@node-real node-real deleted a comment from fylsan3 May 23, 2023
@calmbeing

@broyeztony, good to hear. Your slow-sync problem should be resolved by the latest commit cc23b96; we'll draft a release later.

@setunapo
Collaborator

Closed; the sync seems much smoother now. We can reopen if it reoccurs.

@yusben

yusben commented May 29, 2023

> (quoted @broyeztony's previous comment)

I am running a pruned node on a bare-metal server with 3 NVMe SSDs in RAID 0 (ext4). It has now been working for 10 days without the No block bodies to write in this log period error and is healthy 99%+ of the time.

command: | --chain=bsc --datadir=/home/erigon/erigon-data --db.pagesize=16k --prune=hrtc --ws --http --http.addr 0.0.0.0 --http.port 8545 --http.vhosts=* --maxpeers 100 --http.api=eth,erigon,web3,net,debug,trace,txpool,admin --port 30303 --p2p.allowed-ports=30303,30304 --p2p.protocol=66 --log.dir.path=/home/erigon/erigon-data/logs --metrics --metrics.addr 0.0.0.0
