Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v2.1.0-rc1 boostd stops accepting deals and performing retrievals, stops displaying piece-doctor UI page #1771

Open
7 of 12 tasks
Shekelme opened this issue Oct 24, 2023 · 10 comments
Labels

Comments

@Shekelme
Copy link

Shekelme commented Oct 24, 2023

Checklist

  • This is not a question or a support request. If you have any boost related questions, please ask in the discussion forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file a improvement suggestion instead.
  • I have searched on the issue tracker and the discussion forum, and there is no existing related issue or discussion.
  • I am running the Latest release, or the most recent RC(release canadiate) for the upcoming release or the dev branch(master), or have an issue updating to any of these.
  • I did not make any code changes to boost.

Boost component

  • boost daemon - storage providers
  • boost client
  • boost UI
  • boost data-transfer
  • boost index-provider
  • Other

Boost Version

boostd version 2.1.0-rc1+mainnet+git.ba2b53c

Describe the Bug

On two large miners of our team, the following problem manifests itself with different frequency. At some point, boostd stops accepting deals and performing retrievals. At such moments, all UI pages can be opened, except for the piece-doctor (http://192.168.x.y:8080/piece-doctor).
Today I can say I was lucky, and the problem appeared within the same hour on both miners. But on lenovo91 the symptoms are as different from lenovo 92 miner. Please see the "Logging Information" section.
I have tried different versions of this RC, including compiling from the master branch. I am currently using release v2.1.0-rc1 plus git cherry-pick 52f8bc7.
I don't know how to diagnose this problem, I will be happy to provide any necessary information.
If this is informative, then the YugabyteDB Version is 2.18.2.1
When I restart processes to fix the problem, the boostd-data process may not be restarted - everything will work anyway.

Logging Information

LENOVO91
admfc@lenovo91:~$ journalctl -f -u boostd-data
Oct 19 19:43:42 lenovo91 systemd[1]: Started boostd-data.
Oct 19 19:43:42 lenovo91 boostd-data[2077503]: 2023/10/19 19:43:42 goose: no migrations to run. current version: 20230828111523
Oct 19 19:43:42 lenovo91 boostd-data[2077503]: 2023-10-19T19:43:42.907+0300        INFO        boostd-data        cmd/run.go:152        Started boostd-data yugabyte service on address 0.0.0.0:8044
Oct 24 03:52:58 lenovo91 boostd-data[2077503]: 2023-10-24T03:52:58.107+0300        ERROR        rpc        [email protected]/websocket.go:122        handle me:write tcp 192.168.11.91:8044->192.168.11.91:48918: write: connection reset by peer
Oct 24 03:52:58 lenovo91 boostd-data[2077503]: 2023-10-24T03:52:58.109+0300        ERROR        rpc        [email protected]/handler.go:369        error and res returned        {"request": {"jsonrpc":"2.0","id":749442,"method":"boostddata.GetIndex","params":[{"/":"baga6ea4seaqaomgacplk5546unk74iv2yalb3wtyawm532cmya5beeh6vwveeaq"}],"meta":{"SpanContext":"AAAhmfZeEpQ5/jnu9swfjcYKAco6iaute5XlAgA="}}, "r.err": "connection closing", "resError": "json: unsupported type: <-chan types.IndexRecord"}
Oct 24 03:52:58 lenovo91 boostd-data[2077503]: 2023-10-24T03:52:58.109+0300        ERROR        rpc        [email protected]/handler.go:369        error and res returned        {"request": {"jsonrpc":"2.0","id":749441,"method":"boostddata.GetIndex","params":[{"/":"baga6ea4seaqaomgacplk5546unk74iv2yalb3wtyawm532cmya5beeh6vwveeaq"}],"meta":{"SpanContext":"AACc2fiG8+DOJxxD+so385T4AWHeu+lCC7wOAgA="}}, "r.err": "connection closing", "resError": "json: unsupported type: <-chan types.IndexRecord"}
Oct 24 03:52:58 lenovo91 boostd-data[2077503]: 2023-10-24T03:52:58.109+0300        ERROR        rpc        [email protected]/websocket.go:122        handle me:write tcp 192.168.11.91:8044->192.168.11.91:48918: write: connection reset by peer
Oct 24 03:52:58 lenovo91 boostd-data[2077503]: 2023-10-24T03:52:58.109+0300        ERROR        rpc        [email protected]/handler.go:369        error and res returned        {"request": {"jsonrpc":"2.0","id":749440,"method":"boostddata.GetIndex","params":[{"/":"baga6ea4seaqaomgacplk5546unk74iv2yalb3wtyawm532cmya5beeh6vwveeaq"}],"meta":{"SpanContext":"AACHvDbU2qS1Uo0wHy+cR7VRAVGTps2S8xU/AgA="}}, "r.err": "connection closing", "resError": "json: unsupported type: <-chan types.IndexRecord"}
Oct 24 03:52:58 lenovo91 boostd-data[2077503]: 2023-10-24T03:52:58.109+0300        ERROR        rpc        [email protected]/websocket.go:122        handle me:write tcp 192.168.11.91:8044->192.168.11.91:48918: write: connection reset by peer
Oct 24 03:52:58 lenovo91 boostd-data[2077503]: 2023-10-24T03:52:58.109+0300        ERROR        rpc        [email protected]/websocket.go:122        handle me:write tcp 192.168.11.91:8044->192.168.11.91:48918: write: connection reset by peer


LENOVO92
2023-10-24T03:36:59.634+0300^IDEBUG^Ipiecedirectory^Ipiecedirectory/doctor.go:111^IcheckPiece processing^I{"took": 0.075624085}$
2023-10-24T03:36:59.699+0300^IDEBUG^Ipiecedoc^Ipiecedirectory/doctor.go:177^Ichecking state for market deal^I{"piece": "baga6ea4seaqfmypuryk5f2gmptxnc5qgi7vzal4w6dlebvugh25gbjvo7h4a2ai", "deal": 45107371}$
2023-10-24T03:36:59.701+0300^IDEBUG^Ipiecedoc^Ipiecedirectory/doctor.go:238^Iunflagging piece^I{"piece": "baga6ea4seaqfmypuryk5f2gmptxnc5qgi7vzal4w6dlebvugh25gbjvo7h4a2ai"}$
2023-10-24T03:36:59.702+0300^IDEBUG^Ipiecedirectory^Ipiecedirectory/doctor.go:111^IcheckPiece processing^I{"took": 0.068215793}$
2023-10-24T03:36:59.766+0300^IDEBUG^Ipiecedoc^Ipiecedirectory/doctor.go:177^Ichecking state for market deal^I{"piece": "baga6ea4seaqfn36i6abvwjxumsa5io7zxblxfbq4jz4ej5es5er6inci5jhxqia", "deal": 30902550}$
2023-10-24T03:36:59.769+0300^IDEBUG^Ipiecedoc^Ipiecedirectory/doctor.go:238^Iunflagging piece^I{"piece": "baga6ea4seaqfn36i6abvwjxumsa5io7zxblxfbq4jz4ej5es5er6inci5jhxqia"}$
2023-10-24T03:36:59.770+0300^IDEBUG^Ipiecedirectory^Ipiecedirectory/doctor.go:111^IcheckPiece processing^I{"took": 0.067854893}$
2023-10-24T03:36:59.799+0300^IINFO^Iboost-provider^Istoragemarket/contract_deal_monitor.go:86^Icontract deal monitor context canceled, exiting...$
2023-10-24T03:36:59.799+0300^IINFO^Imodules^Imodules/storageminer.go:458^Icontract deals monitor started$
2023-10-24T03:36:59.864+0300^IDEBUG^Ipiecedoc^Ipiecedirectory/doctor.go:177^Ichecking state for market deal^I{"piece": "baga6ea4seaqfn7dccfxobshs6c5dhyx7g7aow7hloeu7qvcguirborldiulhcla", "deal": 7133975}$
---SILENCE---
2023-10-24T04:32:16.316+0300^IDEBUG^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:139^Ichecking for sector state updates$
2023-10-24T04:32:16.316+0300^IINFO^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:166^Iredeclaring storage$
2023-10-24T04:32:20.155+0300^IDEBUG^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:277^Isector present in all sector states, but not
active^I{"number": {"Miner":1222595,"Number":24199}}$
2023-10-24T04:32:20.155+0300^IDEBUG^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:277^Isector present in all sector states, but not
active^I{"number": {"Miner":1222595,"Number":24189}}$
2023-10-24T04:32:20.155+0300^IDEBUG^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:277^Isector present in all sector states, but not
active^I{"number": {"Miner":1222595,"Number":24419}}$
2023-10-24T04:32:20.155+0300^IDEBUG^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:277^Isector present in all sector states, but not
active^I{"number": {"Miner":1222595,"Number":24413}}$
2023-10-24T04:32:20.155+0300^IDEBUG^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:277^Isector present in all sector states, but not
active^I{"number": {"Miner":1222595,"Number":24190}}$
---LOG LINES---
2023-10-24T04:32:21.200+0300^IDEBUG^Iindex-provider-wrapper^Iindexprovider/wrapper.go:161^Isector {1222595 20947} has 0 deals, seal status Removed$
2023-10-24T04:32:21.202+0300^IDEBUG^Iindex-provider-wrapper^Iindexprovider/wrapper.go:161^Isector {1222595 20815} has 0 deals, seal status Removed$
2023-10-24T04:32:21.204+0300^IDEBUG^Iindex-provider-wrapper^Iindexprovider/wrapper.go:161^Isector {1222595 17329} has 0 deals, seal status Removed$
2023-10-24T04:32:27.536+0300^IINFO^Iboost-storage-deal^Ilogs/log.go:95^ICleaning logs older than 30 days from logsDB $
---SILENCE---
2023-10-24T05:32:16.316+0300^IDEBUG^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:139^Ichecking for sector state updates$
2023-10-24T05:32:16.316+0300^IINFO^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:166^Iredeclaring storage$
2023-10-24T05:32:20.022+0300^IDEBUG^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:277^Isector present in all sector states, but not active^I{"number": {"Miner":1222595,"Number":24294}}$
2023-10-24T05:32:20.022+0300^IDEBUG^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:277^Isector present in all sector states, but not active^I{"number": {"Miner":1222595,"Number":24199}}$
2023-10-24T05:32:20.022+0300^IDEBUG^Isectorstatemgr^Isectorstatemgr/sectorstatemgr.go:277^Isector present in all sector states, but not active^I{"number": {"Miner":1222595,"Number":24424}}$
---LOG LINES---

Repo Steps

  1. Run '...'
  2. Do '...'
  3. See error '...'
    ...
@Shekelme
Copy link
Author

Everything has been working fine for several days on both miners after deactivating the

[ContractDeals]
  # Whether to enable chain monitoring in order to accept contract deals
  #
  # type: bool
  Enabled = false

parameter.

@nonsense nonsense removed this from the Boost v2 stability milestone Oct 30, 2023
@distora-w3
Copy link

distora-w3 commented Nov 23, 2023

I have something similar going on with boostd-data. Not sure how to debug/log further

  • start yb-master, then yb-tserver, then boostd-data, then boost
  • runs fine for a few hours or a day (I've not detected a pattern) boostd-data collapses or still runs (but retreivals fail, cant inspect lid items)
  • this is a single box with daemon, miner, yuga boostd-data and boost (replication_factor=1 for yuga
  • I have set CID-Gravity to reject deals, so that the set up is not mining
  • Not running HTTP and Bitswap retreivals to keep activity low
  • bootd-data and boosted logs are empty apart from start up messages (see below)
nohup boostd-data --vv run yugabyte \
  --hosts 192.168.1.54 \
  --connect-string="postgresql://yugabyte:[email protected]:5433?sslmode=disable" \
  --addr 0.0.0.0:8044 \
 > /filecoinpool/worker/logs/boostd-data.log 2>&1 &
 
$ cat /filecoinpool/worker/logs/boostd-data.log
nohup: ignoring input
2023/11/23 08:58:45 goose: no migrations to run. current version: 20230828111523

nohup boostd --vv run > /filecoinpool/worker/logs/boost.log 2>&1 &
cat /filecoinpool/worker/logs/boost.log
nohup: ignoring input
Worker: 29251; Base: 0; Target: 0 (diff: 0)
State: idle; Current Epoch: 0; Todo: 0

Done!
2023/11/22 09:58:22 goose: no migrations to run. current version: 20230816133342
Graphql server listening on 192.168.1.54:8080
2023/11/22 09:58:24 failed to sufficiently increase send buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details.

quic-go:

sysctl net.core.rmem_max
net.core.rmem_max = 2500000

Versions:

$ boostd --version
boostd version 2.1.0-rc3+mainnet+git.1622d0c.dirty

$ boostd-data --version
boostd-data version 1.5.0+mainnet+git.1622d0c

$ lotus-miner version
Daemon:  1.24.0-rc5+mainnet+git.ef070faca+api1.5.0
Local: lotus-miner version 1.24.0-rc5+mainnet+git.ef070faca

$lotus --version
lotus version 1.24.0-rc5+mainnet+git.ef070faca

$go version
go version go1.20.3 linux/amd64

boost config.toml

[ContractDeals]
  # Whether to enable chain monitoring in order to accept contract deals
  #
  # type: bool
  #Enabled = false

@nonsense
Copy link
Member

@distora-w3 @Shekelme could you upgrade to the latest version from main (or wait until tomorrow for v2.1.1) and let me know if you are still experiencing these issues?

@Shekelme
Copy link
Author

I will wait for v2.1.1

@distora-w3
Copy link

distora-w3 commented Nov 29, 2023

Still have the issue: I am now on
boostd --version
boostd version 2.1.0+mainnet+git.3233c51

boostd-data version 1.5.1+mainnet+git.3233c51

Quick question: I am using --vv for both boostd and boost-data, but see nothing in the logs. (Do I need set something?)

@LexLuthr
Copy link
Collaborator

@distora-w3 There are 2 more fixes after that commit. Can you please deploy the latest main or wait for v2.1.1 (due tomorrow)?

@distora-w3
Copy link

distora-w3 commented Dec 4, 2023

Still lose boostd-data with these:

boostd --version
boostd version 2.1.1+mainnet+git.e9d18ac

boostd-data --version
boostd-data version 1.5.1+mainnet+git.e9d18ac
go version
go version go1.20.7 linux/amd64

lotus version
Daemon:  1.24.0+mainnet+git.7c093485c+api1.5.0
Local: lotus version 1.24.0+mainnet+git.7c093485c

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

@Shekelme
Copy link
Author

v2.1.1 did not help: #1846

@distora-w3
Copy link

distora-w3 commented Dec 19, 2023

@Shekelme things have improved for me. I have gone a week of boostd-data not crashing.

Previously, I was starting with the following from a laptop:

nohup boostd-data --vv run yugabyte \
  --hosts 192.168.1.54 \
  --connect-string="postgresql://yugabyte:[email protected]:5433?sslmode=disable" \
  --addr 0.0.0.0:8044 \
 > /filecoinpool/worker/logs/boostd-data.log 2>&1 &

If I run the same from a terminal window that is always active, then I have gone from a few hours to over a week without outage.

LOGS
I still have not been able to generate logs. (I only get the start-up messages of 3 or 4 lines)

I have found the following on slack, but they do not seem to work for me. I have tried them in various combinations

$ make debug install
$ sudo make install
$ boostd log set-level boost-net=debug boost-prop=debug boost-storage-deal=debug
$ boostd log set-level *=debug

@Shekelme
Copy link
Author

The same trick was with booster-bitswap.
But I use systemd unit file to run boostd-data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: No status
Development

No branches or pull requests

4 participants