---
NEP: 539
Title: Cross-Shard Congestion Control
Authors: Waclaw Banasik, Jakob Meier
Status: New
DiscussionsTo: https://github.com/nearprotocol/neps/pull/539
Type: Protocol
Version: 1.0.0
Created: 2024-03-21
LastUpdated: 2024-05-15
---

## Summary

We propose to limit the transactions and outgoing receipts included in each
chunk. The limits depend on the congestion level of the receiving shard and are
enforced by the protocol.

Congestion is primarily tracked in terms of the total gas of all delayed
receipts. Chunk producers must stop accepting new transactions if the receiver
account is on a shard with a full delayed receipts queue.

Forwarding of receipts that are not directly produced by a transaction, namely
cross-contract calls and delegated receipts, is limited by the receiver's
overall congestion level. This includes the gas of delayed receipts and the gas
of receipts that have not yet been forwarded due to congestion control
restrictions. Additionally, the memory consumption of both receipt types can
also cause congestion.

This proposal extends the local congestion control rules already in place. It
keeps the transaction pool size limit as is but replaces the old delayed
receipt count limit with limits on the gas and size of receipts.

## Motivation

We want to guarantee that the Near Protocol blockchain operates stably even
during congestion.

Today, when users send transactions at a higher rate than what the network can
process, receipts accumulate without limit. This leads to unbounded memory
consumption on chunk producers' and validators' machines. Furthermore, the
delay for a transaction from when it is accepted to when it finishes execution
becomes larger and larger, as new receipts need to queue behind those already
in the system.

First attempts to limit memory consumption were added without protocol
changes. This is known as "local congestion control" and can be summarized in
two rules:

- Limit the transaction pool to 100 MB.
  https://github.com/near/nearcore/pull/9172
- Once more than 20k delayed receipts have accumulated in a shard,
  chunk producers for that shard stop accepting more transactions.
  https://github.com/near/nearcore/pull/9222

But these rules do not put any limits on what other shards accept. For
example, when a particular contract is popular, the contract's shard will
eventually stop accepting new transactions, but all other shards will keep
accepting more and more transactions that produce receipts for the popular
contract. Therefore, the number of delayed receipts keeps growing indefinitely.

Cross-shard congestion control addresses this issue by stopping new
transactions at the source and delaying receipt forwarding when the receiving
shard has reached its congestion limit.

## Specification

The proposal adds fields to chunk headers, adds a new trie column, changes the
transaction selection rules, and changes the chunk execution flow. After the
concepts section below, the next four sections specify each of these changes in
more detail.

### Concepts

Below are high-level descriptions of important concepts to make the following
sections easier to digest.
**Delayed receipts** are all receipts that were ready for execution in the
previous chunk but were delayed due to gas or compute limits. They are stored
in the delayed receipts queue, which itself is stored in the state trie. There
is exactly one delayed receipts queue per shard.

**Outgoing receipts** are all newly produced receipts, resulting from executing
receipts or from converting transactions to non-local receipts. In the absence
of congestion, they are all stored in the output of applying receipts, as a
simple list of receipts.

The **outgoing receipts buffer** is a data structure added by cross-shard
congestion control. Each shard has one instance of it in the state trie for
every other shard. We use this buffer to store outgoing receipts temporarily
when receipt forwarding limits are reached.

**Receipt limits** are measured in gas and size. Gas in this context refers to
the maximum amount of gas that could be burnt when applying the receipt. The
receipt size is how much space it takes in memory, measured in bytes.

The **congestion level** is an indicator between 0 and 1 that determines how
strong the congestion prevention measures should be. Congestion level 1
represents the maximum measurable congestion. This value is defined separately
for each shard and computed as the maximum of the following congestion types.

- **Incoming congestion** increases linearly with the amount of gas in
  delayed receipts.
- **Outgoing congestion** increases linearly with the amount of gas in any of
  the outgoing buffers of the shard.
- **Memory congestion** increases linearly with the total size of all delayed
  and buffered outgoing receipts.
- **Missed chunk congestion** increases linearly with the number of missed
  chunks since the last successful chunk, measured as the block height
  difference.

### Chunk header changes

We change the chunk header to include congestion information, adding four
indicators.

```rust
/// sum of gas of all receipts in the delayed receipts queue, at the end of chunk execution
delayed_receipts_gas: u128,
/// sum of gas of all receipts in all outgoing receipt buffers, at the end of chunk execution
buffered_receipts_gas: u128,
/// sum of all receipt sizes in the delayed receipts queue and all outgoing buffers
receipt_bytes: u64,
/// if the congestion level is 1.0, the only shard that can forward receipts to this shard in the next chunk
/// not relevant if the congestion level is < 1.0
allowed_shard: u16,
```

The exact header structure and the reasons for the particular integer sizes are
described in more detail in the [reference
implementation](#efficiently-computing-the-congestion-information) section.
Usage of these fields is described in [Changes to chunk
execution](#changes-to-chunk-execution).

This adds 42 bytes to the chunk header, increasing it from 374 bytes to 416
bytes in the borsh representation (assuming no validator proposals are
included). This in turn increases the block size by 42 bytes *per shard*, as
all chunk headers are fully included in blocks.

Including all this information in the chunk header enables efficient
validation. Using the previous chunk header (or alternatively, the state),
combined with the list of receipts applied and forwarded, a validator can check
that the congestion rules described in this NEP are fulfilled.
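For illustration, here is a minimal sketch of how a validator might derive a
shard's congestion level from these header fields (plus the missed-chunk count
taken from block headers). The constants are the parameter values proposed
later in this NEP; the function and variable names are ours, not the reference
implementation's API.

```rust
/// Parameter values proposed in the fine-tuning notes of this NEP.
const MAX_CONGESTION_INCOMING_GAS: u128 = 20_000_000_000_000_000; // 20 Pgas
const MAX_CONGESTION_OUTGOING_GAS: u128 = 2_000_000_000_000_000; // 2 Pgas
const MAX_CONGESTION_MEMORY_CONSUMPTION: u64 = 1_000_000_000; // 1000 MB
const MAX_CONGESTION_MISSED_CHUNKS: u64 = 10;

/// Derive a shard's congestion level from its chunk header fields plus the
/// missed-chunk count computed from block headers.
fn congestion_level(
    delayed_receipts_gas: u128,
    buffered_receipts_gas: u128,
    receipt_bytes: u64,
    missed_chunks_count: u64,
) -> f64 {
    let incoming = delayed_receipts_gas as f64 / MAX_CONGESTION_INCOMING_GAS as f64;
    let outgoing = buffered_receipts_gas as f64 / MAX_CONGESTION_OUTGOING_GAS as f64;
    let memory = receipt_bytes as f64 / MAX_CONGESTION_MEMORY_CONSUMPTION as f64;
    let missed = missed_chunks_count as f64 / MAX_CONGESTION_MISSED_CHUNKS as f64;
    // The congestion level is the maximum of the four types, clamped to
    // [0, 1]; 1.0 means fully congested.
    incoming.max(outgoing).max(memory).max(missed).min(1.0)
}
```

This is the value used both for transaction filtering and for computing the
receipt forwarding limits in the sections that follow.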
### Changes to transaction selection

We change transaction selection to reject new transactions when the system is
congested, to reduce the total workload in the system.

Today, transactions are taken from the chunk producer's pool until `tx_limit`
is reached, where `tx_limit` is computed as follows.

```python
# old
tx_limit = 500 Tgas if len(delayed_receipts) < 20_000 else 0
```

We replace this formula so that the transaction limit depends on the
`incoming_congestion` variable (between 0 and 1) computed in the previous
chunk of the same shard:

```python
# new
MIN_TX_GAS = 20 Tgas
MAX_TX_GAS = 500 Tgas
tx_limit = mix(MAX_TX_GAS, MIN_TX_GAS, incoming_congestion)
```

This smoothly limits the acceptance of new work, to prioritize reducing the
backlog of delayed receipts.

In the pseudocode above, we borrow the [`mix`](https://docs.gl/sl4/mix)
function from GLSL for linear interpolation.

> `mix(x, y, a)`
>
> `mix` performs a linear interpolation between x and y using a to weight
> between them. The return value is computed as $x \times (1 - a) + y \times a$.

More importantly, we add a more targeted rule to reject all transactions
*targeting* a shard with a congestion level above a certain threshold.

```python
def filter_tx(tx):
    REJECT_TX_CONGESTION_THRESHOLD = 0.25
    if congestion_level(tx.receiver_shard_id) > REJECT_TX_CONGESTION_THRESHOLD:
        tx.reject()
    else:
        tx.accept()
```

Here, `congestion_level()` is the maximum of incoming, outgoing, memory, and
missed chunk congestion.

This stops (some) new incoming work at the source, when a shard is using too
much memory to store unprocessed receipts, or when there is already too much
work piled up for that shard.

Chunk validators must validate that the two rules above are respected in a
produced chunk.

### Changes to chunk execution

We change chunk execution to hold back receipt forwarding to congested shards.
This has two effects.

1. It prevents the memory consumption of the congested shard from growing, at
   the expense of buffering the pending receipts on the sending shards.
2. When user demand is consistently higher than what the system can handle,
   this mechanism lets backpressure propagate shard by shard until it reaches
   the shards responsible for accepting too many receipts, causing transaction
   filtering to kick in.

To accomplish this, we add four new steps to chunk execution (enumerated as 1,
2, 3, and 7 below) and modify how outgoing receipts are treated in the
transaction conversion step (4) and in the receipt execution step (5).

The new chunk execution then follows this order.

1. (new) Read congestion information for *all* shards from the previous chunk
   headers.

   ```rust
   // congestion info for each shard, as computed in the last included chunk of the shard
   {
       delayed_receipts_gas: u128,
       buffered_receipts_gas: u128,
       receipt_bytes: u64,
       allowed_shard: u16,
       // extended congestion info, as computed from the latest block header
       missed_chunks_count: u64
   }
   ```

2. (new) Compute bandwidth limits to other shards based on the congestion
   information.
   The formula is:

   ```python
   for receiver in other_shards:
       MAX_CONGESTION_INCOMING_GAS = 20 Pgas
       incoming_congestion = delayed_receipts_gas[receiver] / MAX_CONGESTION_INCOMING_GAS

       MAX_CONGESTION_OUTGOING_GAS = 2 Pgas
       outgoing_congestion = buffered_receipts_gas[receiver] / MAX_CONGESTION_OUTGOING_GAS

       MAX_CONGESTION_MEMORY_CONSUMPTION = 1000 MB
       memory_congestion = receipt_bytes[receiver] / MAX_CONGESTION_MEMORY_CONSUMPTION

       MAX_CONGESTION_MISSED_CHUNKS = 10
       missed_chunk_congestion = missed_chunks_count[receiver] / MAX_CONGESTION_MISSED_CHUNKS

       congestion = max(incoming_congestion, outgoing_congestion, memory_congestion, missed_chunk_congestion)

       if congestion >= 1.0:
           # Maximum congestion: reduce to minimum speed
           if current_shard == allowed_shard[receiver]:
               outgoing_gas_limit[receiver] = 1 Pgas
           else:
               outgoing_gas_limit[receiver] = 0
       else:
           # Green or Amber
           # linear interpolation based on congestion level
           MIN_GAS_FORWARDING = 1 Pgas
           MAX_GAS_FORWARDING = 300 Pgas
           outgoing_gas_limit[receiver] = mix(MAX_GAS_FORWARDING, MIN_GAS_FORWARDING, congestion)
   ```

3. (new) Drain receipts in the outgoing buffer from the previous round.
   - Subtract `receipt.gas()` from `outgoing_gas_limit[receipt.receiver]` for
     each receipt drained.
   - Keep receipts in the buffer if the gas limit would become negative.
   - Subtract `receipt.gas()` from `outgoing_congestion` and `receipt.size()`
     from `receipt_bytes` for the local shard for every forwarded receipt.
   - Add the removed receipts to the outgoing receipts of the new chunk.
4. Convert all transactions included in the chunk to receipts.
   - Local receipts, which are receipts where the sender account id is equal
     to the receiver id, are set aside for step 5.
   - Non-local receipts up to `outgoing_gas_limit[receipt.receiver]` for the
     respective shard go to the outgoing receipts list of the chunk.
   - (new) Non-local receipts above `outgoing_gas_limit[receipt.receiver]` for
     the respective shard go to the outgoing receipts buffer.
   - (new) For each receipt added to the outgoing buffer, add `receipt.gas()`
     to `outgoing_congestion` and `receipt.size()` to `receipt_bytes` for
     the local shard.
5. Execute receipts in the order of `local`, `delayed`, `incoming`,
   `yield-resume time-out`.
   - Stop only once all receipts are executed or more than 1000 Tgas have
     been burnt. Burnt gas includes the burnt gas from step 4.
   - Outgoing receipts up to what is left in
     `outgoing_gas_limit[receipt.receiver]` per shard (after step 3) go to the
     outgoing receipts list of the chunk.
   - (new) Outgoing receipts above `outgoing_gas_limit[receipt.receiver]`
     go to the outgoing receipts buffer.
   - (new) For each executed delayed receipt, subtract `receipt.gas()` from
     `incoming_congestion` and `receipt.size()` from `receipt_bytes`.
6. Remaining local or incoming receipts are added to the end of the `delayed`
   receipts queue.
   - (new) For each receipt added to the delayed receipts queue, add
     `receipt.gas()` to `incoming_congestion` and `receipt.size()` to
     `receipt_bytes`.
7. (new) Write the shard's own congestion information into the result, to be
   included in the next chunk header.
   - If the congestion level is >= 1.0, the `allowed_shard` can be chosen
     freely by the chunk producer. Selecting the shard's own id means nobody
     can send. The reference implementation uses a round-robin assignment of
     all other shards. Further optimization can be done without requiring
     protocol changes.
   - If the congestion level is < 1.0, the `allowed_shard` value does not
     affect congestion control, but the chunk producer must set it to the
     shard's own id in this case.

In the formulas above, the receipt gas and the receipt size are defined as:

```python
def gas(receipt):
    return receipt.attached_gas + receipt.exec_gas

def size(receipt):
    return len(borsh(receipt))
```

### Changes to Trie

We store the buffered outgoing receipts in the trie, similar to delayed
receipts but in their own separate column. Instead of a single queue per
shard, we add one queue for each other shard in the current sharding layout.

We add two trie columns:

- `BUFFERED_RECEIPT_INDICES: u8 = 13;`
- `BUFFERED_RECEIPT: u8 = 14;`

The `BUFFERED_RECEIPT_INDICES` column only has one value, which stores a
borsh-serialized instance of `BufferedReceiptIndices`, defined as follows:

```rust
pub struct BufferedReceiptIndices {
    pub shard_buffer_indices: BTreeMap<ShardId, ShardBufferedReceiptIndices>,
}

pub struct ShardBufferedReceiptIndices {
    // First inclusive index in the queue.
    pub first_index: u64,
    // Exclusive end index of the queue.
    pub next_available_index: u64,
}
```

The `BUFFERED_RECEIPT` column stores receipts keyed by
`TrieKey::BufferedReceipt{ receiving_shard: ShardId, index: u64 }`.

The `BufferedReceiptIndices` map defines which queues exist, which changes
during resharding. For each existing queue, all receipts in the range
`[first_index, next_available_index)` (inclusive start, exclusive end) must
exist under the key with the corresponding shard.

### Notes on parameter fine-tuning

Below are the reasons why each parameter is set to the specific value given
above.

For more details, a spreadsheet with the full analysis can be found here:
https://docs.google.com/spreadsheets/d/1Vt_-sgMdX1ncYleikYY8uFID_aG9RaqJOqaVMLQ37tQ/edit#gid=0

#### Queue sizes

The parameters are chosen to strike a balance between guarantees for short
queues and utilization. 20 Pgas of delayed receipts means that incoming
receipts have to wait at most 20 chunks to be applied. And it can guarantee
100% utilization as long as the ratio between burnt and attached gas in
receipts is above 1 to 20.

A shorter delayed queue would result in lower delays, but in our model
simulations we saw reduced utilization even for simple and balanced workloads.

The 1 GB of memory is a target value for the control algorithm to try and stay
below. With receipts in the normal range of sizes seen in today's traffic, we
should never even get close to 1 GB. But the protocol allows a single receipt
to be multiple MBs. In those cases, a limit of 1 GB still gives us almost 100%
utilization but prevents queues from growing larger than what validators can
keep in memory.

#### Receipt forwarding limits

`MIN_GAS_FORWARDING = 1 Pgas` and `MAX_GAS_FORWARDING = 300 Pgas` give us a
large range to smooth out how much should be forwarded to other shards.
Depending on the burnt-to-attached gas ratio of the workload, the limit will
settle at different values for each directed shard pair. This gives the
algorithm adaptability to many workload patterns.

For the forwarding to work smoothly, we need a bit of an outgoing buffer
queue. We found in simulations that `MAX_CONGESTION_OUTGOING_GAS = 2 Pgas` is
enough for the forwarding limit to settle in the perfect spot before we are
restricted by transaction rejection. Higher values did not yield better
results but did increase delays in some cases, hence we propose 2 Pgas.
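To make the forwarding-limit interpolation concrete, here is a minimal sketch
(function names are illustrative; the constants are the ones proposed above,
expressed in Pgas). For example, a receiver at congestion level 0.5 is granted
`mix(300, 1, 0.5) = 150.5` Pgas of forwarded receipts in the next chunk.

```rust
/// Linear interpolation, as in GLSL's `mix`.
fn mix(x: f64, y: f64, a: f64) -> f64 {
    x * (1.0 - a) + y * a
}

/// Forwarding limit in Pgas towards a receiver shard, given its congestion
/// level and whether the local shard is its `allowed_shard`.
fn outgoing_gas_limit_pgas(congestion: f64, is_allowed_shard: bool) -> f64 {
    const MIN_GAS_FORWARDING: f64 = 1.0; // Pgas
    const MAX_GAS_FORWARDING: f64 = 300.0; // Pgas
    if congestion >= 1.0 {
        // Fully congested: only the allowed shard may forward, at minimum speed.
        if is_allowed_shard { MIN_GAS_FORWARDING } else { 0.0 }
    } else {
        mix(MAX_GAS_FORWARDING, MIN_GAS_FORWARDING, congestion)
    }
}
```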
#### Limiting new transactions

The remaining parameters work together to adapt how much new workload we
accept into the system, based on how congested the chain already is.

`REJECT_TX_CONGESTION_THRESHOLD = 0.25` defines how quickly we start rejecting
new workload for a shard. Combined with the 20 Pgas limit on the delayed
receipts queue, we only reject new work if at least 5 Pgas of excess workload
has already been sent to that shard.

This is more aggressive than the other mechanisms simply because rejecting new
workload for shards known to be congested is the most effective tool to
prevent the system from accumulating more transactions. The sooner we do it,
the shorter the delays experienced by users who got their transactions
accepted.

`MIN_TX_GAS = 20 Tgas` and `MAX_TX_GAS = 500 Tgas` give a large range to
smooth out the split between gas spent on new transactions versus delayed
receipts. This only looks at how many delayed receipts the local shard has,
not at the receiving shard. Depending on the workload, it will settle at
different values.

Note that hitting `REJECT_TX_CONGESTION_THRESHOLD`, which looks at the
congestion of the receiving shard, overrules this range and stops all
transactions to the congested shard.

`MIN_TX_GAS = 20 Tgas` guarantees that we can still accept a decent amount of
new transactions to shards that are not congested, even if the local shard
itself is fully congested. This gives fairness properties under certain
workloads which we could not achieve with any of the other congestion control
strategies we tried. It is also useful for adding transaction priority in
[NEP-541](https://github.com/near/NEPs/pull/541), as we can always auction off
the available space for new transactions without altering the congestion
control algorithm.

## Reference Implementation

A reference implementation is available in this PR against nearcore:
https://github.com/near/nearcore/pull/10918

Here are the most important details which are not already described in the
specification above but are defined in the reference implementation.

### Efficiently computing the congestion information

The congestion information is computed based on the gas and size of the
incoming queue and the outgoing buffers. A naive implementation would just
iterate over all of the receipts in the queue and buffers and sum up the
relevant metrics. This approach is slow and, in the context of stateless
validation, would add too much to the state witness size. In order to prevent
those issues we consider two alternative optimizations. Both use the same
principle of caching the previously calculated metrics and updating them based
on the changes to the incoming queue and outgoing buffers.

After applying a chunk, we store detailed information about the shard in the
chunk extra. Unlike the chunk header, this is only stored on the shard and not
shared globally.

The new fields in the chunk extra are included in `ChunkExtraV3`.

```rust
pub struct ChunkExtraV3 {

    // ...all fields from ChunkExtraV2

    pub congestion_info: CongestionInfo,
}

pub struct CongestionInfo {
    delayed_receipts_gas: u128,
    buffered_receipts_gas: u128,
    receipt_bytes: u64,
    allowed_shard: u16,
}
```

This implementation makes it possible to efficiently update the
`CongestionInfo` during chunk application by starting with the information of
the previous chunk and applying only the changes.
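As a minimal sketch of this delta-based bookkeeping, assuming a simplified
`Receipt` type and the `gas`/`size` definitions from the specification (the
method names here are ours, not the reference API):

```rust
struct Receipt {
    attached_gas: u128,
    exec_gas: u128,
    serialized: Vec<u8>, // borsh-serialized representation
}

impl Receipt {
    fn gas(&self) -> u128 {
        self.attached_gas + self.exec_gas
    }
    fn size(&self) -> u64 {
        self.serialized.len() as u64
    }
}

impl CongestionInfo {
    // A receipt enters the delayed receipts queue.
    fn receipt_delayed(&mut self, r: &Receipt) {
        self.delayed_receipts_gas += r.gas();
        self.receipt_bytes += r.size();
    }
    // A delayed receipt is executed and leaves the queue.
    fn delayed_receipt_executed(&mut self, r: &Receipt) {
        self.delayed_receipts_gas -= r.gas();
        self.receipt_bytes -= r.size();
    }
    // A receipt enters an outgoing buffer.
    fn receipt_buffered(&mut self, r: &Receipt) {
        self.buffered_receipts_gas += r.gas();
        self.receipt_bytes += r.size();
    }
    // A buffered receipt is forwarded and leaves the buffer.
    fn buffered_receipt_forwarded(&mut self, r: &Receipt) {
        self.buffered_receipts_gas -= r.gas();
        self.receipt_bytes -= r.size();
    }
}
```

Because only the receipts that actually move between queues are touched, no
iteration over stored receipts is needed and nothing extra ends up in the
state witness.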
Regarding integer sizes, `delayed_receipts_gas` and `buffered_receipts_gas`
use 128-bit unsigned integers because 64 bits would not always be enough.
`u64::MAX` would only be enough to store `18_446 Pgas`. This translates to
roughly 5 hours of work, assuming 1 Pgas per second. While the proposed
congestion control strategy should prevent congestion from ever reaching such
high levels, it is not possible to rule it out completely.

For `receipt_bytes`, a `u64` is more than enough; we have other problems if we
need to store millions of terabytes of receipts.

For the id of the allowed shard, we chose a `u16`, which is large enough for
65_535 shards.

### Bootstrapping

The previous section explains how the gas and bytes information of unprocessed
receipts is computed based on what it was for the previous chunk. But for the
first chunk with this feature enabled, the information for the previous chunk
is not available.

In this specific case, we detect that the previous information is not
available and therefore trigger an iteration over the existing queues to
compute the correct values.

This computed `CongestionInfo` only applies locally. But the next value of it
will be shared in the chunk header, and other shards will start using it to
limit the transactions they accept and the receipts they forward.

The congestion info of other shards is assumed to be 0 for all values for the
first block with the cross-shard congestion control feature enabled.

### Missing Chunks

When a chunk is missing, we use the congestion information of the last
available chunk header for the shard. In practical terms this simply means we
take the chunk header available in the block, even if the included height is
not the latest.

Additionally, we include the number of missed chunks as part of the congestion
formula, treating a shard with 10 or more missed chunks the same way as an
otherwise fully congested shard. This is to prevent sending even more receipts
to a shard that already struggles to produce chunks.

### Validation Changes

The following fields in the chunk header must be validated:

- `receipt_bytes` must be equal to `receipt_bytes` of the previous chunk, plus
  all bytes of new receipts added to the delayed or buffered receipts, minus
  all bytes of receipts removed from those same queues.
- `delayed_receipts_gas` must be equal to `delayed_receipts_gas` of the
  previous chunk, plus the gas of receipts added to the delayed receipts
  queue, minus the gas of receipts removed from the delayed receipts queue.
- `buffered_receipts_gas` must be equal to `buffered_receipts_gas` of the
  previous chunk, plus the gas of receipts added to any of the outgoing
  receipts buffers, minus the gas of all forwarded buffered receipts.
- `allowed_shard` must be a valid shard id.
- `allowed_shard` must be equal to the chunk's shard id if the congestion
  level is below 1.

The balance checker also needs to take into account the balances stored in
buffered receipts.

## Security Implications

With cross-shard congestion control enabled, malicious users could try to find
patterns that clog up the system. This could potentially lead to cheaper
denial-of-service attacks compared to today.

If such patterns exist, most likely today's implementation would suffer from
different problems, such as validators requiring unbounded amounts of memory.
Therefore, we believe this feature is a massive step forward in terms of
security, all things considered.
## Integration with state sync

What we described in [Efficiently computing the congestion
information](#efficiently-computing-the-congestion-information) creates a
dependence on the previous block when processing a block. For a fully synced
node this requirement is always fulfilled because we keep at least 3 epochs of
blocks. However, in state sync we start processing from an arbitrary place in
the chain without access to the full history.

In order to integrate the congestion control and state sync features, we will
add extra steps to state sync to download the blocks that may be needed in
order to finalize state sync.

The blocks that are needed are the `sync hash` block, the `previous block`,
where state sync creates the chunk extra in order to kick off block sync, and
the `previous previous block`, which is now needed in order to process the
`previous block`. On top of that, we may need to download further blocks to
ensure that every shard has at least one new chunk in the blocks leading up to
the sync hash block.

## Integration with resharding

Resharding is a process wherein we change the shard layout - the assignment of
account ids to shards. The centerpiece of resharding is moving the trie /
state records from parent shards to children shards. It's important to
preserve the ability to perform resharding while adding other protocol
features such as congestion control. Below is a short description of how
resharding and congestion control can be integrated, in particular how to
reshard the new trie columns - the outgoing buffers.

For simplicity, we'll only consider splitting a single parent shard into
multiple children shards, which is currently the only supported operation.

The actual implementation of this integration will be done independently and
outside of this effort.

Importantly, resharding affects both the shard that is being split and all the
other shards.

### Changes to the shard under resharding

The outgoing buffers of the parent shard can be split among the children by
iterating over all of the receipts in each buffer and inserting each into the
appropriate child shard's buffer. The assignment can in theory be arbitrary,
e.g. all receipts could be reassigned to a single child. In practice, it would
make sense to either split the receipts equally between the children or assign
them based on the sender account id of the receipt.

Special consideration should be given to refund receipts, where the sender
account is "system", which may belong to neither the parent nor the children
shards. Any assignment of such receipts is fine.

### Changes to the other shards

The other shards, that is all shards that are not under resharding, have an
outgoing buffer to the shard under resharding. This buffer should be split
into one outgoing buffer per child shard. The buffer can be split by iterating
over the receipts and reassigning each to one of the child shards. Each
receipt can be reassigned based on its receiver account id and the new shard
layout.

## Alternatives

A wide range of alternatives has been discussed. It would be too much to list
all suggested variations of all strategies. Instead, here is a list of the
different directions that were explored, with a representative strategy for
each of them.

1. Use transaction fees and an open market to reduce the workload added to the
   system.

   - Problem 1: This does not prevent unbounded memory requirements of
     validators; it just makes them more expensive to cause.
   - Problem 2: In a sharded environment like Near Protocol, it is hard to
     implement this fairly. Because it's impossible to know where a
     transaction burns most of its gas, the only simple solution would require
     all shards to pay the price for the most congested shard.

2. Set fixed limits for delayed receipts and drop receipts beyond that.

   - Problem 1: Today, smart contracts rely on receipts never being lost. This
     network-level failure mode would be completely new.
   - Problem 2: We either need to allow resuming with external inputs, or
     roll back changes on all shards to still have consistent states in smart
     contracts. Both solutions mean we are doing extra work when congested,
     inevitably reducing the available throughput for useful work at the times
     when demand is the largest.

3. Stop accepting transactions globally when any shard has too long of a
   delayed receipts queue. ([See this
   issue](https://github.com/near/nearcore/issues/9228).)

   - Problem 1: This gives poor utilization in many workloads, as our model
     simulations confirmed.
   - Problem 2: A global stop conflicts with plans to add fee-based
     transaction priorities, which should allow sending transactions even
     under heavy congestion.

4. Reduce newly accepted transactions solely based on gas in delayed queues,
   without adding new buffers or queues to the system. Gas is tracked per
   shard of the transaction signer. ([Zulip
   Discussion](https://near.zulipchat.com/#narrow/stream/295558-core/topic/congestion.20control/near/429973223)
   and [idea in code](https://github.com/near/nearcore/pull/10894).)

   - Problem 1: This requires `N` gas numbers in each chunk header, or `N*N`
     numbers per block, where `N` is the number of shards.
   - Problem 2: We did not have time to simulate it properly. But on paper, it
     seems each individual delayed queue can still grow indefinitely as the
     number of shards in the system grows.

5. Smartly track and limit buffer space across different shards. Only accept
   new transactions if enough buffer space can be reserved ahead of time.

   - Problem 1: Without knowing which shards a transaction touches and how
     large receipts will be, we have to pessimistically reserve more space
     than most receipts will actually need.
   - Problem 2: If buffer space is shared globally, individual queues can
     still grow really large, even indefinitely if we assume the number of
     shards grows over time.
   - Problem 3: If buffer space is on a per-shard basis, we run into deadlocks
     when two shards have no more space left but both need to send to the
     other shard to make progress.

6. Require users to define which shards their transactions will touch and how
   much gas is burnt in each. Then use this information for global scheduling
   such that congestion is impossible.

   - Problem 1: This requires lots of changes across the infrastructure
     stack. It would take too long to implement, as we are already facing
     congestion problems today.
   - Problem 2: This would have a strong impact on usability, and it is
     unclear if gas usage estimation services could close the gap to make it
     acceptable.

7. An alternative to what is described in [Efficiently computing the
   congestion information](#efficiently-computing-the-congestion-information)
   would be to store the total gas and total size of the incoming queue and
   the outgoing receipts in the state, alongside the respective queue or
   buffer. Those values would be updated as receipts are added to or removed
   from the queue.
   - Pro: In this case, the `CongestionInfo` struct can remain small and only
     reflect the information needed by other shards (3 bytes instead of 42
     bytes).

     ```rust
     pub struct CongestionInfo {
         allowed_shard: u16,
         congestion_level: u8,
     }
     ```

   - Con: Overall, it would result in more state changes per chunk, since the
     congestion value needs to be read before applying receipts anyway. In
     light of stateless validation, this would be worse for the state witness
     size.

## Future possibilities

While this proposal treats all requests the same, it sets the base for a
proper transaction priority implementation. We co-designed this proposal with
[NEP-541](https://github.com/near/NEPs/pull/541), which adds a transaction
priority fee. On a very high level, the fee is used to auction off a part of
the available gas per chunk to the highest bidders.

We also expect that this proposal alone will not be the final solution for
congestion control. Rather, we want to build a solid foundation in this NEP
and allow future optimizations to take place on top of it.

For example, estimations of how much gas is burnt on each shard could help
with better load distribution in some cases.

We also foresee that the round-robin scheduling of the shard allowed to
forward even under full congestion is not perfect. It is a key feature to make
deadlocks provably impossible, since every shard is guaranteed to make a
minimum of progress after N rounds. But it could be beneficial to allocate
more bandwidth to shards that actually have something to forward, or perhaps
it would be better to stop forwarding anything for a while. The current
proposal allows chunk producers to experiment with this without a protocol
version change.

Lastly, a future optimization could do better transaction rejection for meta
transactions. Instead of looking only at the outer transaction receiver, we
could also look at the receiver of the delegate action, which is most likely
where most gas is going to be burnt, and use this for transaction rejection.

## Consequences

### Positive

- Accepted transactions have lower latencies under congestion compared to
  today.
- Storage and memory requirements on validators for storing receipts are
  bounded.

### Neutral

- More transactions are rejected at the chunk producer level.

### Negative

- Users need to resend transactions more often.

### Backwards Compatibility

There are no observable changes on the smart contract, wallet, or API level.
Thus, there are no backwards-compatibility concerns.

## Unresolved Issues (Optional)

These congestion problems are out of scope for this proposal:

- Malicious patterns can still cause queues to grow beyond the parameter
  limits.
- There is no way to pay for higher priority.
  ([NEP-541](https://github.com/near/NEPs/pull/541) adds it.)

We also considered adding postponed receipts to `receipt_bytes`. But at this
point, it seems better not to include them, to avoid further complications
with potential deadlocks, since postponed receipts can only be executed when
incoming receipts are allowed to come in.

Following the same logic, yielded receipts are also excluded from the size
limits, as they require incoming receipts to resume.

A solution that also addresses the memory space of postponed and yielded
receipts could be added in future proposals but is not considered necessary
for this first iteration of cross-shard congestion control.
## Changelog

[The changelog section provides historical context for how the NEP developed
over time. Initial NEP submission should start with version 1.0.0, and all
subsequent NEP extensions must follow [Semantic
Versioning](https://semver.org/). Every version should have the benefits and
concerns raised during the review. The author does not need to fill out this
section for the initial draft. Instead, the assigned reviewers (Subject Matter
Experts) should create the first version during the first technical review.
After the final public call, the author should then finalize the last version
of the decision context.]

### 1.0.0 - Initial Version

> Placeholder for the context about when and who approved this NEP version.

#### Benefits

> List of benefits filled by the Subject Matter Experts while reviewing this
> version:

- Benefit 1
- Benefit 2

#### Concerns

> Template for Subject Matter Experts review for this version:
> Status: New | Ongoing | Resolved

| #   | Concern | Resolution | Status |
| --: | :------ | :--------- | -----: |
| 1   |         |            |        |
| 2   |         |            |        |

## Copyright

Copyright and related rights waived via
[CC0](https://creativecommons.org/publicdomain/zero/1.0/).