
[5/15 Bodies] No block bodies to write in this log period block number=27910849 #51

Closed
broyeztony opened this issue May 4, 2023 · 15 comments

@broyeztony

broyeztony commented May 4, 2023

System information

OS: linux (amd64)
OS Image: Amazon Linux 2
Kernel version: 5.4.226-129.415.amzn2.x86_64
Container runtime: docker://20.10.17

Erigon version: 2.40.0-dev-3da15fcb

Erigon Command (with flags/config):

erigon --chain=bsc --datadir=/data/erigon --config=/home/erigon/config.toml --prune=hrtc --db.pagesize=16k --log.console.verbosity=info --nat=extip:$(POD_IP) --maxpeers=1000 --p2p.protocol=66 --torrent.download.slots=20 --torrent.download.rate=512mb --bodies.cache=214748364800 --batchSize=4096M

Consensus Layer:

Consensus Layer Command (with flags/config):

Chain/Network: bsc / 56

Expected behaviour

The node syncs to head and remains synced consistently.

Actual behaviour

The node stopped syncing (here, stuck at the Bodies stage) and drifted away from the chain's head without any configuration changes.
Please note that we already use the config flags mentioned in #41.
The solutions proposed there using the integration command do not work either.

Steps to reproduce the behaviour

Launch the client; syncing gets stuck at a fixed block number.

Backtrace

[INFO] [05-04|12:41:20.217] [5/15 Bodies] No block bodies to write in this log period block number=27910849
[INFO] [05-04|12:41:40.217] [5/15 Bodies] Downloading block bodies   block_num=27910849 delivery/sec=334.0KB wasted/sec=0B remaining=4935 delivered=264 cache=4.8GB alloc=9.2GB sys=16.2GB
[INFO] [05-04|12:42:00.217] [5/15 Bodies] No block bodies to write in this log period block number=27910849
[INFO] [05-04|12:42:00.221] [5/15 Bodies] DONE                       in=1m40.005033072s
[INFO] [05-04|12:42:03.978] Commit cycle                             in=3.756665787s
[INFO] [05-04|12:42:03.978] Timings (slower than 50ms)               Headers=6.616s Bodies=1m40.005s
[INFO] [05-04|12:42:03.978] Tables                                   PlainState=233.4GB AccountChangeSet=1.7GB StorageChangeSet=3.6GB BlockTransaction=5.1GB TransactionLog=9.1GB FreeList=45.2MB ReclaimableSpace=181.0GB
[INFO] [05-04|12:42:03.980] RPC Daemon notified of new headers       from=27914761 to=27915785 hash=0x684047a81b5a66d11e06e1fc4d5f432d1de4b8a194f1e20c20c9241fe5c98fc0 header sending=1.239498ms log sending=324ns
[INFO] [05-04|12:42:03.981] [2/15 Headers] Waiting for headers...    from=27915785
[INFO] [05-04|12:42:06.711] [parlia] snapshots build, gather headers block=27900000
[INFO] [05-04|12:42:06.713] [parlia] snapshots build, recover from headers block=27900000
[INFO] [05-04|12:42:07.850] [parlia] snapshots build, gather headers block=27900000
[INFO] [05-04|12:42:07.852] [parlia] snapshots build, recover from headers block=27900000
[INFO] [05-04|12:42:09.759] [parlia] snapshots build, gather headers block=27900000
[INFO] [05-04|12:42:09.769] [parlia] snapshots build, recover from headers block=27900000
[INFO] [05-04|12:42:09.866] [2/15 Headers] Processed                 highest inserted=27915821 age=7s
[INFO] [05-04|12:42:09.885] [5/15 Bodies] Processing bodies...       from=27910849 to=27915821
[INFO] [05-04|12:42:10.248] [txpool] stat                            pending=10000 baseFee=268 queued=30000 alloc=6.1GB sys=16.2GB
[INFO] [05-04|12:42:29.886] [5/15 Bodies] Downloading block bodies   block_num=27910849 delivery/sec=313.3KB wasted/sec=102.5KB remaining=4971 delivered=103 cache=4.8GB alloc=6.3GB sys=16.2GB
[INFO] [05-04|12:42:49.886] [5/15 Bodies] Downloading block bodies   block_num=27910849 delivery/sec=417.2KB wasted/sec=0B remaining=4971 delivered=249 cache=4.8GB alloc=5.8GB sys=16.2GB
[INFO] [05-04|12:43:07.902] [p2p] GoodPeers                          eth66=708
[INFO] [05-04|12:43:09.885] [5/15 Bodies] No block bodies to write in this log period block number=27910849
@hitchhooker

Our workaround has been to detect when the node is stuck and restart the service, with something like this:

#!/bin/bash

# Variables
LOG_FILE="bsc.log"
PATTERN="No block bodies to write"
MIN_OCCURRENCES=5
NODE_RESTART_CMD="systemctl restart your-node.service"

# Function to count consecutive occurrences of the pattern with the same block number
count_occurrences() {
  local log_file="$1"
  local pattern="$2"
  local count=0
  local prev_block_number=""

  # Read via process substitution: piping grep into the loop would run the
  # loop in a subshell, so `count` and the early `return 0` would never
  # reach the caller and the function would always fall through to `return 1`.
  while read -r line; do
    block_number=$(echo "$line" | grep -oP 'block number=\K\d+')
    if [[ "$block_number" == "$prev_block_number" ]]; then
      ((count++))
    else
      count=1
    fi

    if (( count >= MIN_OCCURRENCES )); then
      return 0
    fi

    prev_block_number="$block_number"
  done < <(grep -F "$pattern" "$log_file")

  return 1
}

# Main loop
while true; do
  if count_occurrences "$LOG_FILE" "$PATTERN"; then
    echo "Restarting node due to consecutive occurrences of '$PATTERN' with the same block number in $LOG_FILE"
    $NODE_RESTART_CMD
  fi
  sleep 60
done
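The extraction in the script above relies on GNU grep's `-oP` (PCRE) support. On systems whose grep lacks PCRE, a portable sed equivalent does the same job; a quick check on a sample line copied from the logs in this issue:

```shell
#!/bin/sh
# Pull the block number out of an Erigon "No block bodies to write" log line
# without GNU grep's -oP, using plain POSIX sed instead.
line='[INFO] [05-04|12:41:20.217] [5/15 Bodies] No block bodies to write in this log period block number=27910849'
block=$(printf '%s\n' "$line" | sed -n 's/.*block number=\([0-9][0-9]*\).*/\1/p')
echo "$block"    # prints 27910849
```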

@calmbeing

We suggest using the latest mdbx snapshot for archive nodes: https://github.com/bnb-chain/bsc-snapshots

@broyeztony
Author

Following our call today, here are the 'ETL' logs we see in the client container:

[INFO] [05-11|09:11:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=ae5f14a9
[INFO] [05-11|09:12:16.634] [] ETL [2/2] Loading                     into=PlainState current_prefix=b41b9ac5
[INFO] [05-11|09:12:28.010] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.3GB sys=9.7GB
[INFO] [05-11|09:12:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=bb56a527
[INFO] [05-11|09:13:16.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=c4bfd229
[INFO] [05-11|09:13:27.226] [p2p] GoodPeers                          eth66=992
[INFO] [05-11|09:13:28.027] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.0GB sys=9.7GB
[INFO] [05-11|09:13:49.650] [] ETL [2/2] Loading                     into=PlainState current_prefix=cca4e84c
[INFO] [05-11|09:14:17.557] [] ETL [2/2] Loading                     into=PlainState current_prefix=d72ffbc2
[INFO] [05-11|09:14:28.013] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.2GB sys=9.7GB
[INFO] [05-11|09:14:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=ddb59a10
[INFO] [05-11|09:15:16.634] [] ETL [2/2] Loading                     into=PlainState current_prefix=e7f08bbf
[INFO] [05-11|09:15:27.225] [p2p] GoodPeers                          eth66=992
[INFO] [05-11|09:15:28.024] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.4GB sys=9.7GB
[INFO] [05-11|09:15:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=ea0d2132
[INFO] [05-11|09:16:16.634] [] ETL [2/2] Loading                     into=PlainState current_prefix=f387ef35
[INFO] [05-11|09:16:28.010] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.5GB sys=9.7GB
[INFO] [05-11|09:16:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=fd36e2c2
[INFO] [05-11|09:17:27.225] [p2p] GoodPeers                          eth66=992
[INFO] [05-11|09:17:28.032] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.3GB sys=9.7GB
[INFO] [05-11|09:18:14.291] [7/15 Execution] Completed on            block=28111500
[INFO] [05-11|09:18:14.291] [7/15 Execution] DONE                    in=47m50.839811187s
[INFO] [05-11|09:18:14.292] [8/15 HashState] Promoting plain state   from=28101756 to=28111500
[INFO] [05-11|09:18:14.292] [8/15 HashState] Incremental promotion   from=28101756 to=28111500 codes=true csbucket=AccountChangeSet
[INFO] [05-11|09:18:28.009] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.0GB sys=9.7GB
[INFO] [05-11|09:18:44.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acce32
[INFO] [05-11|09:19:14.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acd406
[INFO] [05-11|09:19:27.225] [p2p] GoodPeers                          eth66=994
[INFO] [05-11|09:19:28.024] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=4.6GB sys=9.7GB
[INFO] [05-11|09:19:44.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acd957
[INFO] [05-11|09:20:14.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acde06
[INFO] [05-11|09:20:28.011] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=3.7GB sys=9.7GB
[INFO] [05-11|09:20:44.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001ace3b5
[INFO] [05-11|09:21:14.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acea5a
[INFO] [05-11|09:21:27.225] [p2p] GoodPeers                          eth66=998
[INFO] [05-11|09:21:28.027] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.8GB sys=9.7GB
[INFO] [05-11|09:22:14.187] [8/15 HashState] ETL [2/2] Loading       into=HashedCodeHash current_prefix=4a728024
[INFO] [05-11|09:22:28.010] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.0GB sys=9.7GB
[INFO] [05-11|09:22:44.187] [8/15 HashState] ETL [2/2] Loading       into=HashedCodeHash current_prefix=95b338d1
[INFO] [05-11|09:23:14.187] [8/15 HashState] ETL [2/2] Loading       into=HashedCodeHash current_prefix=dfe51ad1

We are not sure these logs are relevant to the issue, but we also run other networks on Erigon without problems, and we don't see these 'ETL' logs there.

Also, could you share the configuration you use for your own archive and full nodes?
Thank you very much.

@setunapo
Collaborator

got it, we are checking on it.

@calmbeing

@broyeztony For the short term, you could use something like the following script to detect the "No block headers to write..." message and restart automatically. For the long term, we are carefully analyzing the p2p sentry code and will optimize it soon.

#!/bin/bash
set -x
# Define the number of log lines to check
LOG_LINES=50

# Define the number of occurrences before triggering a restart
OCCURRENCES=10

# Continuously check the logs
while true; do
    # Get the last $LOG_LINES log lines
    LOGS=$(journalctl -u erigon-bsc --lines=$LOG_LINES)

    # Count the number of occurrences of the error message
    COUNT=$(echo "$LOGS" | grep -o "No block headers to write in this log period block number" | wc -l)

    # Check if the error occurred more than $OCCURRENCES times
    if [ "$COUNT" -gt "$OCCURRENCES" ]; then
        # Restart the erigon-bsc service
        echo "Error occurred more than $OCCURRENCES times. Restarting service..."
        systemctl restart erigon-bsc
    fi

    # Sleep for 60 seconds before checking the logs again
    sleep 60
done

@calmbeing

> (quoted @hitchhooker's restart script from the comment above)

Also, you can refer to this script.

@koraykoska

A restart does not work at all for us; the error keeps appearing and the node does not sync. We used the latest snapshot from the official page. Is there any other workaround?

@koraykoska

After every restart:

[INFO] [05-16|16:55:51.333] [txpool] Started
[INFO] [05-16|16:56:10.518] [2/15 Headers] Wrote block headers       number=28265891 blk/second=5.000 alloc=2.9GB sys=3.1GB
[INFO] [05-16|16:56:30.518] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:56:50.492] [txpool] stat                            pending=1 baseFee=0 queued=4 alloc=3.1GB sys=3.3GB
[INFO] [05-16|16:56:50.518] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:57:10.518] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:57:30.519] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:57:49.054] [p2p] GoodPeers                          eth66=73 eth67=4 eth68=1
[INFO] [05-16|16:57:50.494] [txpool] stat                            pending=3641 baseFee=0 queued=2182 alloc=2.4GB sys=3.6GB
[INFO] [05-16|16:57:50.519] [2/15 Headers] Wrote block headers       number=28265933 blk/second=2.100 alloc=2.4GB sys=3.6GB
[INFO] [05-16|16:58:10.518] [2/15 Headers] No block headers to write in this log period block number=28265933

And it never syncs; the logs show only No block headers to write in this log period.

@calmbeing

calmbeing commented May 19, 2023

@broyeztony @fylsan3 After careful debugging, the root cause comes down to two issues:

  1. The request packet parameters differ slightly from bsc-geth's.
  2. The node keeps connecting to useless peers even when they return empty responses.

  • The first issue has been fixed in the source code.

  • For the second one, please specify the flag --sentry.drop-useless-peers, which drops useless peers that return empty responses.

Please let us know if any issue remains.

@broyeztony
Author

@calmbeing thank you for the update.

Despite running with the suggested parameters on v1.0.7, we are still facing issues with syncing.
We observe the no block headers/bodies to write... error to be less frequent, but overall syncing seems slower than before.
Please find attached the entire log of one of our nodes, which was restarted by the cronjob 9h ago.
bsc-mainnet-erigon-archive.log

Also, these are the startup command parameters for reference:

- '--chain=bsc'
- '--datadir=/data/erigon'
- '--log.console.verbosity=info'
- '--nat=extip:$(POD_IP)'
- '--maxpeers=1000'
- '--p2p.protocol=66,67'
- '--torrent.download.slots=20'
- '--torrent.download.rate=512mb'
- '--bodies.cache=214748364800'
- '--batchSize=4096M'
- '--db.pagesize=16k'
- '--sentry.drop-useless-peers'

@yusben

yusben commented May 20, 2023

> (quoted @broyeztony's previous comment)
I assume you should use v1.0.8 with the latest fixes. I deployed it yesterday with the flag recommended by calmbeing, and so far Erigon has been working flawlessly for 24+ hours without any no block headers messages in the logs.

@broyeztony
Author

broyeztony commented May 23, 2023

@yusben Thank you for your inputs.
Would you mind sharing the setup you are using to run your node? Not so much the startup command but rather:

  • are you running a full (pruned) node or an archive node?
  • disk / storage type: is it local to the machine that runs Erigon, or a scalable volume like AWS EBS? What filesystem are you using?

Using v1.0.8 and ... --sentry.drop-useless-peers --p2p.protocol=66 ... we are not encountering the no block headers/bodies error logs either (except on 1 testnet node, with the same issue as reported there), but our syncing loop on mainnet archive is just too slow (our mainnet full nodes, however, perform better and remain relatively 'close' to head).

On average, our best nodes sync ~400 blocks in ~20 minutes. Pasting an example below:

[INFO] [05-23|06:07:00.025] RPC Daemon notified of new headers       from=28453505 to=28453885 hash=0xbcc966a30ad2e90786e2c585868671bdcf813fd2b2c49f4de0c42121df41324e header sending=371.552µs log sending=316ns
... <stages 1 to 15>
[INFO] [05-23|06:26:40.867] Timings (slower than 50ms)               Headers=1.112s BlockHashes=546ms Bodies=1.251s Senders=288ms Execution=2m54.614s HashState=3m19.178s IntermediateHashes=3m2.516s CallTraces=2m13.753s AccountHistoryIndex=1m9.658s StorageHistoryIndex=2m10.962s LogIndex=1m34.543s TxLookup=1m48.088s

As a consequence, our nodes always fall behind head by > 400 blocks.
@harveyff @calmbeing FYI.

@node-real node-real deleted a comment from fylsan3 May 23, 2023
@calmbeing

@broyeztony, good to hear. Your slow-sync problem should be resolved by the latest commit cc23b96; we'll draft a release later.

@setunapo
Collaborator

Closed; the sync seems much smoother now. We can reopen if it reoccurs.

@yusben

yusben commented May 29, 2023

> (quoted @broyeztony's previous comment)

I am running a pruned node on a bare-metal server with 3 NVMe SSDs in RAID 0 (ext4). It has now been working for 10 days without the No block bodies to write in this log period error and is healthy 99%+ of the time.

command: | --chain=bsc --datadir=/home/erigon/erigon-data --db.pagesize=16k --prune=hrtc --ws --http --http.addr 0.0.0.0 --http.port 8545 --http.vhosts=* --maxpeers 100 --http.api=eth,erigon,web3,net,debug,trace,txpool,admin --port 30303 --p2p.allowed-ports=30303,30304 --p2p.protocol=66 --log.dir.path=/home/erigon/erigon-data/logs --metrics --metrics.addr 0.0.0.0
