
[iota-indexer]: Fix failing test #4171

Closed
sergiupopescu199 opened this issue Nov 21, 2024 · 2 comments · Fixed by #4173
Assignees
Labels
infrastructure (Issues related to the Infrastructure Team), sc-platform (Issues related to the Smart Contract Platform group)

Comments

@sergiupopescu199
Contributor

Description

Sometimes the get_total_transaction_blocks test fails in CI with this error:

```
thread 'read_api::get_total_transaction_blocks' panicked at crates/iota-indexer/tests/rpc-tests/read_api.rs:1167:14:
called `Result::unwrap()` on an `Err` value: Call(ErrorObject { code: InvalidRequest, message: "Verified checkpoint not found for sequence number: 5", data: None })
```
@sergiupopescu199 sergiupopescu199 added infrastructure Issues related to the Infrastructure Team sc-platform Issues related to the Smart Contract Platform group. labels Nov 21, 2024
@sergiupopescu199 sergiupopescu199 self-assigned this Nov 21, 2024
@sergiupopescu199
Contributor Author

sergiupopescu199 commented Nov 26, 2024

@iotaledger/l1-core-infra

Following the recent refork, the iota-indexer has changed how it manages checkpoint synchronization: the process previously relied on the full node's REST API, but that dependency is now optional.

In the default configuration of the TestCluster, we typically have four validators and a single full node. When using the TestClusterBuilder, the specified data_ingestion_path is mapped directly to the NodeConfig. In this setup, checkpoints generated by the validators and the full node are stored as files in the format <checkpoint_seq_num>.chk within the designated directory. If we configure the same data_ingestion_path within the IndexerConfig, the Indexer initiates the process of syncing these checkpoint files into Postgres. Once this synchronization is complete, the Indexer subsequently removes the files from the directory.
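A minimal sketch of the setup described above, using simplified stand-in types (the real TestClusterBuilder and IndexerConfig have more fields and different builder methods; the names here are only illustrative):

```rust
use std::path::PathBuf;

// Hypothetical, simplified stand-ins for the real TestClusterBuilder and
// IndexerConfig; the field and method names are assumptions based on the
// description above, not the actual iota APIs.
#[derive(Default)]
struct TestClusterBuilder {
    data_ingestion_path: Option<PathBuf>,
}

impl TestClusterBuilder {
    fn with_data_ingestion_path(mut self, path: PathBuf) -> Self {
        // Mapped straight into the NodeConfig of validators and the full node,
        // so every component writes `<checkpoint_seq_num>.chk` files here.
        self.data_ingestion_path = Some(path);
        self
    }
}

struct IndexerConfig {
    // When this matches the cluster's directory, the indexer tails the same
    // files, commits each checkpoint into Postgres, and then deletes the file.
    data_ingestion_path: Option<PathBuf>,
    remote_store_url: Option<String>,
}

fn main() {
    let shared_dir = PathBuf::from("/tmp/ingestion");

    let _cluster = TestClusterBuilder::default()
        .with_data_ingestion_path(shared_dir.clone());

    let _indexer = IndexerConfig {
        data_ingestion_path: Some(shared_dir),
        remote_store_url: None, // local file ingestion: the setup that races
    };
}
```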

This is where the bug makes things tricky. The new functionality can lead to the indexer syncing checkpoints faster than the full node. This happens primarily because validators generate and sign checkpoints first, and therefore create the files in the data_ingestion_path ahead of the full node. The CheckpointService running in the validators broadcasts the checkpoints across the network to inform peers. Meanwhile, the full node queries the latest checkpoints from its peers, synchronizes that data, and writes it as files in the data_ingestion_path. But because the indexer has already stored these checkpoints in Postgres, it automatically deletes the corresponding files. The full node, for its part, writes the data into its CheckpointStore, which uses RocksDB; checkpoint data is then served by the get_checkpoint RPC method, which queries the RocksDB tables.

Note

Validators and the full node store checkpoints in the data_ingestion_path using store_checkpoint_locally, which is called upon finalization of a checkpoint by the CheckpointExecutor task.
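A rough sketch of what that local storage amounts to, assuming a plain write of the serialized checkpoint to a <checkpoint_seq_num>.chk file (the actual store_checkpoint_locally signature and serialization format are not reproduced here):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Rough sketch of the behaviour attributed to store_checkpoint_locally above:
// once the CheckpointExecutor finalizes a checkpoint, its bytes end up in the
// ingestion directory as `<checkpoint_seq_num>.chk`. The signature and the
// serialization are assumptions for illustration only.
fn store_checkpoint_locally(dir: &Path, seq: u64, checkpoint_bytes: &[u8]) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    fs::write(dir.join(format!("{seq}.chk")), checkpoint_bytes)
}

fn main() -> io::Result<()> {
    // e.g. the checkpoint with sequence number 5 becomes /tmp/ingestion/5.chk
    store_checkpoint_locally(Path::new("/tmp/ingestion"), 5, b"<serialized checkpoint>")
}
```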

However, a significant distinction emerges in their StateSync processes: the StateSyncEventLoop behaves differently for validators and full nodes, and this loop is responsible for the actual insertion of checkpoints into RocksDB.

Another interesting aspect to consider is that the get_checkpoint function aggregates data from two distinct tables: certified_checkpoints and checkpoint_content. When the full node receives new checkpoints from its peers, it first inserts them into the certified_checkpoints table. Once local verification of these checkpoints completes, they are also added to the checkpoint_content table. A brief interval may elapse between these two insertions, which can occasionally cause the get_checkpoint method to return the error VerifiedCheckpointDigestNotFound: the method successfully fetches the checkpoint summary from the certified_checkpoints table using the checkpoint sequence number, but the subsequent attempt to retrieve the checkpoint content by the summary digest may still fail.
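A simplified sketch of that two-table lookup and the race window it opens; the types, the in-memory table representation, and the error variants are stand-ins for illustration, not the actual iota-node code:

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the two RocksDB tables mentioned above; only the
// lookup order matters here, the real types are far richer.
struct CertifiedSummary {
    content_digest: String,
}

struct CheckpointStore {
    // sequence number -> certified summary (inserted first, on receipt from peers)
    certified_checkpoints: HashMap<u64, CertifiedSummary>,
    // content digest -> checkpoint contents (inserted later, after local verification)
    checkpoint_content: HashMap<String, Vec<u8>>,
}

#[derive(Debug)]
enum RpcError {
    VerifiedCheckpointNotFound(u64),
    VerifiedCheckpointDigestNotFound(String),
}

impl CheckpointStore {
    // Sketch of the two-table aggregation performed by get_checkpoint: landing
    // in the window between the two inserts means the summary resolves but the
    // contents lookup fails with VerifiedCheckpointDigestNotFound.
    fn get_checkpoint(&self, seq: u64) -> Result<(&CertifiedSummary, &[u8]), RpcError> {
        let summary = self
            .certified_checkpoints
            .get(&seq)
            .ok_or(RpcError::VerifiedCheckpointNotFound(seq))?;
        let contents = self
            .checkpoint_content
            .get(&summary.content_digest)
            .ok_or_else(|| RpcError::VerifiedCheckpointDigestNotFound(summary.content_digest.clone()))?;
        Ok((summary, contents.as_slice()))
    }
}

fn main() {
    // Simulate the race: the summary for checkpoint 5 has arrived from peers,
    // but its contents have not been verified and inserted yet.
    let store = CheckpointStore {
        certified_checkpoints: HashMap::from([(
            5,
            CertifiedSummary { content_digest: "abc".to_string() },
        )]),
        checkpoint_content: HashMap::new(),
    };
    println!("{:?}", store.get_checkpoint(5).err());
}
```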

There are two solutions to this problem:

  1. Create a function for the full node, similar to the one we use to wait for a certain checkpoint on the indexer; however, this would require a major refactor of the test cases.
  2. Use the REST API as the synchronization mechanism for the indexer. It will have the same behaviour as before the refork and the refactor is minimal: we only need to add a new field to the IndexerConfig, remote_store_url: Some(format!("{}/api/v1", fullnode_rpc)), and remove the data ingestion path from the TestClusterBuilder (see the sketch below).
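A hedged sketch of option 2, showing the shape of the IndexerConfig change; only the remote_store_url and data_ingestion_path fields come from this thread, everything else is assumed:

```rust
use std::path::PathBuf;

// Sketch of option 2: point the indexer at the full node's REST API instead
// of the shared ingestion directory. Struct and function names are
// illustrative stand-ins, not the real iota-indexer items.
struct IndexerConfig {
    remote_store_url: Option<String>,
    data_ingestion_path: Option<PathBuf>,
}

fn indexer_config_for_tests(fullnode_rpc: &str) -> IndexerConfig {
    IndexerConfig {
        // Sync checkpoints over the REST API, as before the refork.
        remote_store_url: Some(format!("{}/api/v1", fullnode_rpc)),
        // No shared directory, so the indexer can no longer delete checkpoint
        // files out from under the full node.
        data_ingestion_path: None,
    }
}

fn main() {
    let cfg = indexer_config_for_tests("http://127.0.0.1:9000");
    println!("{:?}", cfg.remote_store_url);
}
```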

@kodemartin
Contributor

@sergiupopescu199 thanks for the thorough analysis. That is a very interesting finding that sheds more light on the data-ingestion process.

It is apparent that this stems from the test setup, as it requires all components (validators, fullnode, indexer) to be collocated for the indexer to outpace the fullnode. This will hardly be the case in actual setups though, as validators are typically independent nodes.

At this point I would like to clarify the following:

  • remote_store_url is already there in the IndexerConfig. It is just set to None.
  • data-ingestion distinguishes between two different read_sources: local and remote. While local is the default source (the indexer falls back to a temp dir ingestion path if not set in IndexerConfig), if the IndexerConfig has a remote_store_url value then the remote source is used.

So in order to sync from the fullnode, no other action is needed than to set the remote_store_url in the test_utils to the fullnode.
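To make that concrete, here is a rough sketch of the read-source selection as described in the bullets above; the enum and function names are illustrative, not the actual iota-indexer identifiers:

```rust
use std::path::PathBuf;

// Sketch of the read-source selection described above: the remote source wins
// whenever remote_store_url is set, otherwise the indexer falls back to a
// local ingestion path (a temp dir if none was configured).
enum ReadSource {
    Remote { url: String },
    Local { path: PathBuf },
}

fn select_read_source(
    remote_store_url: Option<String>,
    data_ingestion_path: Option<PathBuf>,
) -> ReadSource {
    match remote_store_url {
        Some(url) => ReadSource::Remote { url },
        None => ReadSource::Local {
            path: data_ingestion_path.unwrap_or_else(std::env::temp_dir),
        },
    }
}

fn main() {
    // With remote_store_url set to the full node, the remote source is used
    // and no other change to the test_utils is needed.
    match select_read_source(Some("http://127.0.0.1:9000/api/v1".into()), None) {
        ReadSource::Remote { url } => println!("syncing from {url}"),
        ReadSource::Local { path } => println!("tailing {}", path.display()),
    }
}
```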

If you think this is less disruptive, then I say we go for it!

Note

This implies that we can choose to use the fullnode ingestion path for local synchronization in collocated setups. This might prove very helpful in reducing the latency of the indexer database. @tomxey I think this makes a perfect alternative setup to the remote synchronization for the latency analysis.

@sergiupopescu199 could you please add permalinks to the code where the following happen in your analysis above? This will provide a direct reference to the code for the future.

This is primarily because validators [...] thereby creating the files in the data_ingestion_path ahead of the full node

Meanwhile, the full node [...] writes them as files in the data_ingestion_path
