Txn latency between websockets and http #3074
Comments
It's not really surprising; the only question is why you need to refetch the hash afterwards. If you subscribe to tx events you should already have all the info.
@ceciEstErmat can you please elaborate on why it's not surprising? I'd really appreciate that, as I'm not well versed in what's going on under the hood in this scenario. As for why: it's useful for sniping bots in different cases, and it doesn't have to be a bot scenario; just an example from the field.
@akegaviar - In your example, are the subscription and the subsequent request hitting the same node? I'm trying to understand how there might be a race condition between nodes' chain state. Going from finalized to confirmed commitment level should provide a really wide window, allowing all nodes to observe the relevant chain state.
@akegaviar To me it is not surprising, as most RPC clusters are "eventually consistent" - you might be using an RPC provider and be assigned to machine A for the event subscription but machine B for the request. Machine B might not yet be at the same state as A. Distributed systems are for the most part "eventually consistent"; this is why it did not surprise me, and coding such a thing I would expect this behavior. Unsure if this clarifies what I meant.
This is a critical question before going any further. The two previous comments stitch together a potential problem scenario:
Suppose that node A is caught up with the tip of the cluster and node B is ~50 slots behind the tip. You could get a notification for a transaction from node A and then a null from node B, which has not observed that transaction yet.

Depending on configuration, RPC nodes might be doing a lot of extra work and be under heavier load than "regular" nodes. We have definitely seen scenarios where heavy RPC traffic (requests) can cause an RPC node to fall behind. Rate limiting of nodes was left as something to perform outside of the validator client.
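The lag scenario above can be checked for explicitly: before trusting a null result, a client can compare the slot of the notification against the slot the queried node reports (e.g. via a getSlot call). A minimal sketch of that comparison; the helper name and slot values are illustrative, not part of any library:

```python
def is_node_caught_up(notification_slot: int, node_slot: int) -> bool:
    """True if the queried node has reached the slot carried by the
    subscription notification, so it can be expected to know the tx."""
    return node_slot >= notification_slot

# Node A emitted the notification at slot 250_000_050; node B, ~50 slots
# behind the tip, reports 250_000_000 and cannot serve the tx yet.
assert not is_node_caught_up(250_000_050, 250_000_000)
assert is_node_caught_up(250_000_050, 250_000_051)
```

If the node is behind, the client can retry or route the request elsewhere instead of treating the null as authoritative.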
@bw-solana We were able to reproduce the issue when both the websocket and HTTP calls are served by the same node.
Wondering the same when we are hitting the same node.
Makes me think the Transaction Status Service must be backed up. I believe if you request tx details before TSS has processed the updates, the RPC server will return null or an error.
I still cannot understand why it would necessitate introducing a sleep to fix this issue when it is not two different nodes serving the websocket and HTTP RPC.
I agree. The sleep is more of a workaround. We need to figure out what's going on. My best guess is TSS being slow, but we'll need to confirm.
I have root-caused the issue. The problem is caused by the asynchronous nature of the logsSubscribe notification and the persistence of TransactionStatusMeta by the TransactionStatusService in the ReplayStage.

When a transaction is executed, the transaction logs are collected synchronously, and the transaction status batch is sent to the TransactionStatusService, which does the persistence asynchronously in a separate thread. The logsSubscribe event (via CommitmentAggregationData --> NotificationEntry::Bank) is then sent. Due to the asynchronous nature of these two processes, it may happen that when getTransaction is called, the TransactionStatusMeta has not yet been written to the blockstore. We have had the same mechanism for a long time, so I do not think this is a regression.

Recommendations:
This might help a little #3026 |
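Given the root cause above (the status meta is persisted asynchronously after the notification fires), a bounded retry loop is a common client-side mitigation that degrades more gracefully than a single fixed sleep. A sketch, assuming a get_transaction callable that returns None until the meta is persisted; the callable and its signature are illustrative, not a specific client library's API:

```python
import time

def fetch_with_retry(get_transaction, signature, attempts=5, base_delay=0.05):
    """Poll until the transaction status is available, backing off
    exponentially between attempts instead of sleeping once."""
    for attempt in range(attempts):
        result = get_transaction(signature)
        if result is not None:
            return result
        time.sleep(base_delay * (2 ** attempt))
    return None

# Usage with a stand-in client that only becomes consistent on the 3rd call:
calls = {"n": 0}
def fake_get_transaction(sig):
    calls["n"] += 1
    return {"slot": 123, "meta": {}} if calls["n"] >= 3 else None

tx = fetch_with_retry(fake_get_transaction, "some-signature", base_delay=0.001)
```

Compared with a flat sleep, this returns as soon as the meta lands and only pays the worst-case delay when the node is genuinely backed up.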
Problem

We are noticing some latency with txn responses on our mainnet nodes running v1.18.23. We subscribe to logs at finalized commitment and then call getTransaction at confirmed commitment. Some of the hashes return a null, and the rate of nulls we witness decreases when we introduce a sleep between logsSubscribe and getTransaction. We are trying to understand why this would happen when both the websocket and HTTP calls are being served by the same node.

Proposed Solution
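The workaround described in the problem statement, sleeping briefly between the logsSubscribe notification and the getTransaction call, can be sketched as follows; the delay value and the get_transaction callable are placeholders, not a tuned or library-provided API:

```python
import time

def handle_log_notification(signature, get_transaction, delay=0.2):
    """Thread workaround: give the node's TransactionStatusService a
    moment to persist the status meta before fetching the transaction.
    The 200 ms default is illustrative, not a measured value."""
    time.sleep(delay)
    return get_transaction(signature)

# Usage with a stand-in fetcher:
result = handle_log_notification("some-signature",
                                 lambda sig: {"signature": sig},
                                 delay=0.0)
```

As noted in the comments, this only reduces the null rate; a retry loop or a server-side fix in the notification/persistence ordering addresses the race directly.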