Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: store heartbeat metric in RaftMetrics and RaftDataMetrics #1177

Merged
merged 1 commit into from
Jul 22, 2024

Conversation

SteveLauC
Copy link
Collaborator

@SteveLauC SteveLauC commented Jul 18, 2024

Checklist

  • Updated guide with pertinent info (may not always apply).
  • Squash down commits to one or two logical commits which clearly describe the work you've done.
  • Unittest is a friend:)

Will check the above stuff when this PR is no longer a draft:)


What does this PR do

This PR attempts to implement the metric discussed in this thread,
adding the heartbeat information (time of the last replication or heartbeat) to
RaftDataMetrics and RaftMetrics so that users can subscribe to it.

It is a draft PR as I want to ensure it looks basically correct, then I will
try to add tests, code documents, and polish interfaces.

How it is implemented

For the time type, I used RaftTypeConfig::AsyncRuntime::Instant as it is used
widely.

And I found the last applied log ID is stored in struct ProcessEntry, so I
stored this time information there as well. When a response to a replication
request is received, the replication handler will update it in
ReplicationHandler::update_matching() using the sending_time of the replication
result.

And these metrics will be sent to the tokio watch channel in RaftCore::report_metrics().

Unsolved problems/Questions

There are several unsolved problems with this PR, which I want to discuss with
the maintainer before taking any further actions.

  1. The serde::{Deserialize, Serialize} bounds

    RaftDataMetris and RaftMetrics need to satisfy these bounds when the serde
    feature of Openraft is enabled, but neither std::time::Instant nor
    tokio::time::Instant does this.

  2. Should we use the result.sending_time, or we should let the follower pass
    the time when it receives the replication request? The later seems more
    accurate.

  3. Tests

    Looks like most unit tests (related to replication) use special human-made
    log ID to emulate specific cases, should this time info be done similarly?
    I guess no as it is not that important compared to the log ID.


This change is Reviewable

@SteveLauC SteveLauC changed the title feat: store heartbeat metric in RaftMetrics and RaftDataMetrics Feature: store heartbeat metric in RaftMetrics and RaftDataMetrics Jul 18, 2024
@drmingdrmer drmingdrmer self-requested a review July 18, 2024 09:33
@drmingdrmer
Copy link
Member

It looks like you can not use Instant in RaftMetrics which requires serde.

The metrics may have to store the interval since the last acked heartbeat, just like millis_since_quorum_ack does:
https://github.com/datafuselabs/openraft/blob/9dbcd4c0b6723fd4ac5d38ed032d8e656d0abc8d/openraft/src/metrics/raft_metrics.rs#L70

Copy link
Member

@drmingdrmer drmingdrmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 0 of 6 files reviewed, 1 unresolved discussion (waiting on @SteveLauC)


openraft/src/progress/entry/mod.rs line 26 at r1 (raw file):

    /// The id of the last matching log on the target following node.
    pub(crate) matching: Option<LogId<C::NodeId>>,
    pub(crate) time: Option<InstantOf<C>>,

You do not need to store the ack time. there is already a field for it:
https://github.com/datafuselabs/openraft/blob/55d009c897d689534cb5cc3b6321b1f42d9b2066/openraft/src/proposer/leader.rs#L61

clock_progress stores the last timestamp which is acked by each follower/learner.

It can be transform to a BTreeMap as it does with progress -> replications:

https://github.com/datafuselabs/openraft/blob/55d009c897d689534cb5cc3b6321b1f42d9b2066/openraft/src/core/raft_core.rs#L520-L528

@SteveLauC
Copy link
Collaborator Author

Thanks for the review and guiding me!


It looks like you can not use Instant in RaftMetrics which requires serde.

The metrics may have to store the interval since the last acked heartbeat, just like millis_since_quorum_ack does:

Done

You do not need to store the ack time. there is already a field for it:

It can be transform to a BTreeMap as it does with progress -> replications:

Done

Copy link
Collaborator

@schreter schreter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 7 files at r2, all commit messages.
Reviewable status: 6 of 7 files reviewed, 2 unresolved discussions (waiting on @drmingdrmer and @SteveLauC)


openraft/src/progress/entry/mod.rs line 26 at r1 (raw file):

    /// The id of the last matching log on the target following node.
    pub(crate) matching: Option<LogId<C::NodeId>>,
    pub(crate) time: Option<InstantOf<C>>,

BTW, independent of the solution, naming a member time and especially w/o documentation is very confusing. Always add a documentation comment telling what this is and name members and methods appropriately.

For instance, this might be last_known_alive_timestamp or something like that.

Copy link
Member

@drmingdrmer drmingdrmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to create a test to verify the functionality of these features.

Please include the heartbeat tests in this module:
https://github.com/datafuselabs/openraft/blob/ca156ead4af0624e990d1e1e069aab929a5379a3/tests/tests/metrics/main.rs#L10

Reviewed 7 of 7 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @SteveLauC)


openraft/src/metrics/raft_metrics.rs line 77 at r2 (raw file):

    /// The heartbeats states. It is Some() only when this node is leader.
    pub heartbeat: Option<HeartbeatMetrics<C>>,

Please enhance the documentation to clarify the meaning of the value. For example: "The value represents the interval since the timestamp of the last heartbeat acknowledged by a follower."

@SteveLauC
Copy link
Collaborator Author

naming a member time and especially w/o documentation is very confusing.

Please enhance the documentation to clarify the meaning of the value

Yeah, I will handle them, just want to ensure this PR is basically ok before doing these things.

It is a draft PR as I want to ensure it looks basically correct, then I will
try to add tests, code documents, and polish interfaces.

@SteveLauC
Copy link
Collaborator Author

SteveLauC commented Jul 19, 2024

Hi, I just added the code documents.


We need to create a test to verify the functionality of these features.

Please include the heartbeat tests in this module:

https://github.com/datafuselabs/openraft/blob/ca156ead4af0624e990d1e1e069aab929a5379a3/tests/tests/metrics/main.rs#L10

Should I add tests in this module or the module tests/tests/metrics/t30_leader_metrics.rs, looks that the later is for metrics and the replication tests are put in it:

$ rg "replication"
t30_leader_metrics.rs
52:                if let Some(ref q) = x.replication {
61:            "no replication with 1 node cluster",
95:                if let Some(ref q) = x.replication {
102:            "replication metrics to 4 nodes",
128:        "--- replication metrics should reflect the replication state"
137:                    if let Some(ref q) = x.replication {
144:                "replication metrics to 3 nodes",
159:        n1.wait(timeout()).metrics(|x| x.replication.is_some(), "node-1 starts replication").await?;
161:        n0.wait(timeout()).metrics(|x| x.replication.is_none(), "node-0 stopped replication").await?;

Or maybe it should be added in tests/tests/metrics/t10_server_metrics_and_data_metrics.rs?

let mut data_metrics = leader.data_metrics();
let mut prev_node1 = 0;
let mut prev_node2 = 0;
for _ in 0..30 {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, since the heartbeat is disabled, the last ack time of node 1 and node 2 should never change, so I expect the time since the last ack should increase during the loop, but it won't sometimes, I am not sure about the reason:<

running 1 test
panicked at tests/tests/metrics/t10_server_metrics_and_data_metrics.rs:109:13:
interval should increase since last ack won't change
tests/tests/metrics/t10_server_metrics_and_data_metrics.rs:109:13
backtrace is disabled without --features 'bt'

test t10_server_metrics_and_data_metrics::heartbeat_metrics ... FAILED

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any change to RaftMetrics would result in changed() to be ready, it is possible other fields changed but the heartbeat did not change.

I dumped the node-0 metrics from the log (at the bottom of the page):
cat ut.2024-07-19-02 | grep heartbeat_metrics | grep report_metrics:.* -o | grep id:0

But it looks lite impl Display for RaftMetrics is not updated I can not see the heartbeat field:

report_metrics: Metrics{id:0, Learner, term:0, vote:<T0-N0:->, last_log:None, last_applied:None, leader:None(since_last_ack:None ms), membership:{log_id:None, {voters:[], learners:[]}}, snapshot:None, purged:None, replication:{}}
report_metrics: Metrics{id:0, Learner, term:0, vote:<T0-N0:->, last_log:None, last_applied:None, leader:None(since_last_ack:None ms), membership:{log_id:None, {voters:[], learners:[]}}, snapshot:None, purged:None, replication:{}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:1, last_applied:None, leader:0(since_last_ack:0 ms), membership:{log_id:T0-N0.0, {voters:[{0:()}], learners:[]}}, snapshot:None, purged:None, replication:{0:T1-N0.1}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:1, last_applied:T1-N0.1, leader:0(since_last_ack:0 ms), membership:{log_id:T0-N0.0, {voters:[{0:()}], learners:[]}}, snapshot:None, purged:None, replication:{0:T1-N0.1}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:2, last_applied:T1-N0.1, leader:0(since_last_ack:0 ms), membership:{log_id:T1-N0.2, {voters:[{0:()}], learners:[1:()]}}, snapshot:None, purged:None, replication:{0:T1-N0.2,1:None}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:2, last_applied:T1-N0.2, leader:0(since_last_ack:0 ms), membership:{log_id:T1-N0.2, {voters:[{0:()}], learners:[1:()]}}, snapshot:None, purged:None, replication:{0:T1-N0.2,1:None}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:3, last_applied:T1-N0.2, leader:0(since_last_ack:0 ms), membership:{log_id:T1-N0.3, {voters:[{0:()}], learners:[1:(),2:()]}}, snapshot:None, purged:None, replication:{0:T1-N0.3,1:None,2:None}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:3, last_applied:T1-N0.3, leader:0(since_last_ack:0 ms), membership:{log_id:T1-N0.3, {voters:[{0:()}], learners:[1:(),2:()]}}, snapshot:None, purged:None, replication:{0:T1-N0.3,1:None,2:None}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:3, last_applied:T1-N0.3, leader:0(since_last_ack:0 ms), membership:{log_id:T1-N0.3, {voters:[{0:()}], learners:[1:(),2:()]}}, snapshot:None, purged:None, replication:{0:T1-N0.3,1:T1-N0.3,2:T1-N0.3}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:4, last_applied:T1-N0.3, leader:0(since_last_ack:14 ms), membership:{log_id:T1-N0.4, {voters:[{0:()},{0:(),1:(),2:()}], learners:[]}}, snapshot:None, purged:None, replication:{0:T1-N0.4,1:T1-N0.3,2:T1-N0.3}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:4, last_applied:T1-N0.4, leader:0(since_last_ack:10 ms), membership:{log_id:T1-N0.4, {voters:[{0:()},{0:(),1:(),2:()}], learners:[]}}, snapshot:None, purged:None, replication:{0:T1-N0.4,1:T1-N0.3,2:T1-N0.4}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:5, last_applied:T1-N0.4, leader:0(since_last_ack:19 ms), membership:{log_id:T1-N0.5, {voters:[{0:(),1:(),2:()}], learners:[]}}, snapshot:None, purged:None, replication:{0:T1-N0.5,1:T1-N0.4,2:T1-N0.4}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:5, last_applied:T1-N0.5, leader:0(since_last_ack:9 ms), membership:{log_id:T1-N0.5, {voters:[{0:(),1:(),2:()}], learners:[]}}, snapshot:None, purged:None, replication:{0:T1-N0.5,1:T1-N0.4,2:T1-N0.5}}
report_metrics: Metrics{id:0, Leader, term:1, vote:<T1-N0:Q>, last_log:5, last_applied:T1-N0.5, leader:0(since_last_ack:8 ms), membership:{log_id:T1-N0.5, {voters:[{0:(),1:(),2:()}], learners:[]}}, snapshot:None, purged:None, replication:{0:T1-N0.5,1:T1-N0.5,2:T1-N0.5}}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any change to RaftMetrics would result in changed() to be ready, it is possible other fields changed but the heartbeat did not change.

Actually and surprisingly, the interval value in the next loop can even be smaller than the previous one:

$ cargo t t10_server_metrics_and_data_metrics::heartbeat_metrics -- --nocapture
running 1 test
heartbeat: {0: Some(2), 1: Some(9), 2: Some(6)}
heartbeat: {0: Some(2), 1: Some(6), 2: Some(5)}
panicked at tests/tests/metrics/t10_server_metrics_and_data_metrics.rs:111:13:
interval should increase since last ack won't change
tests/tests/metrics/t10_server_metrics_and_data_metrics.rs:111:13
backtrace is disabled without --features 'bt'

test t10_server_metrics_and_data_metrics::heartbeat_metrics ... FAILED

But it looks lite impl Display for RaftMetrics is not updated I can not see the heartbeat field:

Sorry for missing that, it has just been added:)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, if another AppendEntries or heartbeat response is received, this interval will become smaller.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heartbeat is disabled in this test cluster, so there shouldn't be heartbeat response. Then I guess it is made by a replication during the cluster initialization?

Copy link
Collaborator Author

@SteveLauC SteveLauC Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this test, do we need more tests or the existing 2 tests are sufficient, let me know what you think:)

@drmingdrmer drmingdrmer requested a review from schreter July 19, 2024 07:42
Copy link
Member

@drmingdrmer drmingdrmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r3, 1 of 1 files at r5, all commit messages.
Reviewable status: 7 of 8 files reviewed, 5 unresolved discussions (waiting on @schreter and @SteveLauC)


openraft/src/metrics/raft_metrics.rs line 114 at r5 (raw file):

        write!(
            f,
            "membership:{}, heartbeat:{{{}}}, snapshot:{}, purged:{}, replication:{{{}}}",

The Display implementations should be consistent: always put heartbeat after replication.

@SteveLauC
Copy link
Collaborator Author

The Display implementations should be consistent: always put heartbeat after replication.

Sure

@SteveLauC
Copy link
Collaborator Author

https://github.com/datafuselabs/openraft/actions/runs/10005367869/job/27655934359?pr=1177

failures:

---- t11_add_learner::check_learner_after_leader_transferred stdout ----
panicked at /home/runner/work/openraft/openraft/openraft/src/raft_state/log_state_reader.rs:25:13:
assertion failed: Some(log_id) <= self.committed()
/home/runner/work/openraft/openraft/openraft/src/raft_state/log_state_reader.rs:25:13
backtrace is disabled without --features 'bt'
panicked at /home/runner/work/openraft/openraft/openraft/src/raft_state/log_state_reader.rs:25:13:
assertion failed: Some(log_id) <= self.committed()
/home/runner/work/openraft/openraft/openraft/src/raft_state/log_state_reader.rs:25:13
backtrace is disabled without --features 'bt'
panicked at /home/runner/work/openraft/openraft/openraft/src/raft_state/log_state_reader.rs:25:13:
assertion failed: Some(log_id) <= self.committed()
/home/runner/work/openraft/openraft/openraft/src/raft_state/log_state_reader.rs:25:13
backtrace is disabled without --features 'bt'
Error: timeout after 3s when node in new cluster finally commit at least one blank leader-initialize log .applied_index>=10 latest: Metrics{id:3, Leader, term:2, vote:<T2-N3:Q>, last_log:10, last_applied:T1-N0.9, leader:3(since_last_ack:None ms), membership:{log_id:T1-N0.9, {voters:[{1:(),3:(),4:()}], learners:[2:()]}}, snapshot:None, purged:None, replication:{1:None,2:None,3:T2-N3.10,4:None}, heartbeat:{1:None,2:None,3:76,4:None}}

This test failure seems irrelevant.

Copy link
Collaborator

@schreter schreter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 7 files at r2, 1 of 2 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @drmingdrmer and @SteveLauC)


openraft/src/metrics/raft_metrics.rs line 78 at r5 (raw file):

    /// Heartbeat metrics. It is Some() only when this node is leader.
    ///
    /// This field records a mapping between a node's ID and milliseconds since

I already opened an issue for this and similar topics. Using "millisecond since" or other relative times is unreliable. We should in general use just timestamps via standard Instant (well, the one from AsyncRuntime) and Duration.

There was just a question on how to serialize it for users who need serde. Possibly, we need a specialized serde serializer to serialize it properly. See also this issue on serde.

Code quote:

milliseconds since

openraft/src/metrics/raft_metrics.rs line 119 at r5 (raw file):

                .as_ref()
                .map(|x| { x.iter().map(|(k, v)| format!("{}:{}", k, DisplayOption(v))).collect::<Vec<_>>().join(",") })
                .unwrap_or_default(),

Ouch! Allocations in Display and similar traits are a very bad idea! Yes, I know there is some legacy stuff with summary() and similar, but continuing it is not good.

Please rewrite allocation-free, e.g., by adding properly-implemented Display trait to HeartbeatMetrics.

Suggestion:

            DisplayOption(&self.heartbeat),

openraft/src/metrics/raft_metrics.rs line 214 at r5 (raw file):

            self.heartbeat
                .as_ref()
                .map(|x| { x.iter().map(|(k, v)| format!("{}:{}", k, DisplayOption(v))).collect::<Vec<_>>().join(",") })

Same here, though it was broken also for replication. Feel free to fix it.


tests/tests/metrics/t10_server_metrics_and_data_metrics.rs line 96 at r4 (raw file):

Hmmm, if another AppendEntries or heartbeat response is received, this interval will become smaller.

That's why use Instant and not some sort of Duration...

@SteveLauC
Copy link
Collaborator Author

Please rewrite allocation-free, e.g., by adding properly-implemented Display trait to HeartbeatMetrics.

This would make HeartbeatMetrics a wrapper type, not an alias since we cannot impl std::fmt::Display for std::collections::BTreeMap due to the orphan rule, I will proceed and implement it (for HeartbeatMetric and ReplicationMetrics) if you guys are ok with this change.

@drmingdrmer
Copy link
Member

@schreter
Shall we consider upgrading Duration in milliseconds to Instant in a separate PR to keep the changes in this PR more manageable?

And GPT-4o has given me a working example of serializing Instant 🤔 :

Below is a complete example that includes both serialization and deserialization for std::time::Instant using serde. We will create a newtype wrapper around Instant and implement both Serialize and Deserialize.

  1. Add dependencies: Ensure you have the necessary dependencies in your Cargo.toml:

    [dependencies]
    serde = { version = "1.0", features = ["derive"] }
    serde_json = "1.0"
  2. Create wrapper and implement serialization and deserialization:

    use serde::{Serialize, Deserialize, Serializer, Deserializer};
    use serde::de::{self, Visitor};
    use std::time::{Instant, SystemTime, UNIX_EPOCH, Duration};
    use std::fmt;
    
    // Newtype wrapper around `Instant`
    #[derive(Debug)]
    struct SerializableInstant(Instant);
    
    impl Serialize for SerializableInstant {
        fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
        where
            S: Serializer,
        {
            // Convert `Instant` to `SystemTime`
            let system_time = SystemTime::now() + (self.0 - Instant::now());
            // Get the duration since the Unix epoch
            let duration = system_time.duration_since(UNIX_EPOCH).expect("Time went backwards");
            // Serialize the duration as seconds since the epoch
            serializer.serialize_u64(duration.as_secs())
        }
    }
    
    impl<'de> Deserialize<'de> for SerializableInstant {
        fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
        where
            D: Deserializer<'de>,
        {
            struct InstantVisitor;
    
            impl<'de> Visitor<'de> for InstantVisitor {
                type Value = SerializableInstant;
    
                fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
                    formatter.write_str("a u64 representing seconds since the Unix epoch")
                }
    
                fn visit_u64<E>(self, value: u64) -> Result<Self::Value, E>
                where
                    E: de::Error,
                {
                    let system_time = UNIX_EPOCH + Duration::new(value, 0);
                    let now = SystemTime::now();
                    let instant = if system_time > now {
                        Instant::now() + (system_time.duration_since(now).unwrap())
                    } else {
                        Instant::now() - (now.duration_since(system_time).unwrap())
                    };
                    Ok(SerializableInstant(instant))
                }
            }
    
            deserializer.deserialize_u64(InstantVisitor)
        }
    }
    
    // Example usage
    fn main() {
        let now = Instant::now();
        let serializable_now = SerializableInstant(now);
    
        // Serialize to JSON
        let json = serde_json::to_string(&serializable_now).unwrap();
        println!("Serialized instant: {}", json);
    
        // Deserialize from JSON
        let deserialized_instant: SerializableInstant = serde_json::from_str(&json).unwrap();
        println!("Deserialized instant: {:?}", deserialized_instant);
    }

Explanation

  1. Serialization:

    • We convert Instant to SystemTime by calculating the difference between self.0 (the stored Instant) and Instant::now(), and then adding it to SystemTime::now().
    • We then get the duration since the Unix epoch and serialize it as a u64 representing seconds since the epoch.
  2. Deserialization:

    • We define a Visitor to handle the deserialization of a u64 value.
    • We convert the u64 back to a SystemTime by adding the duration to the Unix epoch.
    • We then convert SystemTime back to Instant by adjusting from the current SystemTime to the given SystemTime.

This ensures that we can both serialize an Instant to a JSON representation and deserialize it back to an Instant. The Instant is stored as the number of seconds since the Unix epoch, which is a common and portable format for timestamps.

@SteveLauC
Copy link
Collaborator Author

Shall we consider upgrading Duration in milliseconds to Instant in a separate PR to keep the changes in this PR more manageable?

And I am actually not sure if that allocation refactor should be done in this PR, it deserves a separate PR if we want a PR to only does one thing 🤪

Copy link
Collaborator

@schreter schreter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we consider upgrading Duration in milliseconds to Instant in a separate PR to keep the changes in this PR more manageable?

Yes, that's OK. And yes, the suggestion is a similar to one in the serde issue I linked. But not in seconds, better microseconds :-).

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @drmingdrmer and @SteveLauC)


openraft/src/metrics/raft_metrics.rs line 119 at r5 (raw file):

Previously, schreter wrote…

Ouch! Allocations in Display and similar traits are a very bad idea! Yes, I know there is some legacy stuff with summary() and similar, but continuing it is not good.

Please rewrite allocation-free, e.g., by adding properly-implemented Display trait to HeartbeatMetrics.

@SteveLauC I'd strongly suggest to keep discussions pertaining to code locations in Reviewable - it's much simpler to follow and you can easily track "done" items.

Regarding your question about wrappers, there is no need to make HearbeatMetrics a wrapper.

But, you can do the same as for DisplayOption - introduce a helper which wraps the struct only for Display purposes (DisplayBTreeMap, for example).

Suggestion:

DisplayOption(self.heartbeat.as_ref.map(DisplayBTreeMap)),

Copy link
Collaborator

@schreter schreter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I am actually not sure if that allocation refactor should be done in this PR, it deserves a separate PR if we want a PR to only does one thing 🤪

Since it's broken already for the other metrics, that's OK, but you know, technical debt...

My problem is that openraft is crashing all our malfunction tests which simulate OOMs (i.e., it is panicking instead of "proper" error handling, bubbling up the error to the caller and correctly shutting down). There will be some serious clean up needed :-). Unfortunately, I can't yet open-source our malfunction testing framework (which would be VERY handy for many library writers).

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @drmingdrmer and @SteveLauC)

@SteveLauC
Copy link
Collaborator Author

I just finished that display refactor:)

Copy link
Member

@drmingdrmer drmingdrmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r7, 3 of 3 files at r8, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @SteveLauC)


openraft/src/display_ext/display_btreemap_opt_value.rs line 28 at r8 (raw file):

            }
            None => { /* do nothing, as map length is 0 */ }
        }

This snippet could be simplified to:

Suggestion:

        let len = self.0.len();
        for (idx, (key, value)) in self.0.iter().enumerate() {
            write!(f, "{}:{}", key, DisplayOption(value))?;
            if idx + 1 != len {
                write!(f, ",")?;
            }
        }

openraft/src/metrics/raft_metrics.rs line 119 at r8 (raw file):

            DisplayOption(&self.snapshot),
            DisplayOption(&self.purged),
            DisplayOption(&self.replication.as_ref().map(DisplayBtreeMapOptValueExt::display)),

Given that DisplayBTreeMapOptValue is already available, you can simplify the expression like this:

DisplayOption(&self.replication.as_ref().map(DisplayBTreeMapOptValue)),

@SteveLauC
Copy link
Collaborator Author

Given that DisplayBTreeMapOptValue is already available, you can simplify the expression like this

I wrote it in that way so that there won't be a allow unused attribute in display_btreemap_opt_value.rs, only one in display_ext.rs, to be consistent with other display_xxx.rs files, I agree it's kinda weird, can change to approach you suggested if you don't mind another unused attribute.

@drmingdrmer
Copy link
Member

Given that DisplayBTreeMapOptValue is already available, you can simplify the expression like this

I wrote it in that way so that there won't be a allow unused attribute in display_btreemap_opt_value.rs, only one in display_ext.rs, to be consistent with other display_xxx.rs files, I agree it's kinda weird, can change to approach you suggested if you don't mind another unused attribute.

I do not quite mind about the unused import :)

Copy link
Collaborator

@schreter schreter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r8, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @drmingdrmer and @SteveLauC)


openraft/src/display_ext.rs line 9 at r8 (raw file):

pub(crate) mod display_slice;

#[allow(unused_imports)]

I don't quite get it. Why are these imports unused? Are they just written for future use and not really used in the crate (yet)?

A possible path to handle such things might be a prelude module. Such as, using all these traits via use crate::display_ext::prelude::*; where all the use statements are pub-exports of the helpers. If the prelude itself is pub(crate) mod prelude;, it's still hidden for the rest of the world, but hopefully you won't get any warnings for unused imports.


openraft/src/display_ext/display_btreemap_opt_value.rs line 28 at r8 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

This snippet could be simplified to:

Or even further to:

        for (idx, (key, value)) in self.0.iter().enumerate() {
            if idx > 0 {
                write!(f, ",")?;
            }
            write!(f, "{}:{}", key, DisplayOption(value))?;
        }

@SteveLauC
Copy link
Collaborator Author

This snippet could be simplified to:

Done


I do not quite mind about the unused import :)

Done


I don't quite get it. Why are these imports unused? Are they just written for future use and not really used in the crate (yet)?

I am not sure about this as well, I just follow the convention:

  • define your wrapper type
  • define your XxxExt trait
  • re-export both of them in display_ext.rs

Previously, drmingdrmer (张炎泼) wrote…

Or even further to:

They are basically same, so I just use the one suggested by @drmingdrmer.

@SteveLauC SteveLauC marked this pull request as ready for review July 20, 2024 01:27
@SteveLauC
Copy link
Collaborator Author

When you guys this PR is in its last review round, ping me so that I will squash my commits:)

@drmingdrmer drmingdrmer requested a review from schreter July 20, 2024 02:42
Copy link
Member

@drmingdrmer drmingdrmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r9, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @SteveLauC)

@SteveLauC
Copy link
Collaborator Author

Gentle ping on @schreter, can I trouble you for another review?

@drmingdrmer
Copy link
Member

Gentle ping on @schreter, can I trouble you for another review?

It's been good enough to me. Please squash the commits and let's merge it.

@schreter sometimes got delays for giving feedback. Do not worry. If any refinements are needed, he will provide feedback later soon, and you can update it in the next PR. This has become a routine working pattern:)

@SteveLauC SteveLauC force-pushed the feat/heartbeat_metric branch from b3a23a8 to ca1fcfe Compare July 22, 2024 02:22
@SteveLauC
Copy link
Collaborator Author

SteveLauC commented Jul 22, 2024

Rust nightly 1.82 renamed std::panic::PanicInfo to std::panic::PanicHookInfo and mark the old name deprecated, this change is currently exclusive to nightly toolchain 1.82.0 and higher.

Looks like we have to use something like rustversion to handle it? 🤔 Or we can #[allow(deprecated)] it.

@drmingdrmer
Copy link
Member

Rust nightly 1.82 renamed std::panic::PanicInfo to std::panic::PanicHookInfo and mark the old name deprecated, this change is currently exclusive to nightly toolchain 1.82.0 and higher.

Looks like we have to use something like rustversion to handle it? 🤔 Or we can #[allow(deprecated)] it.

It looks like a single #[allow(deprecated)] would be good enough.

@SteveLauC
Copy link
Collaborator Author

It looks like a single #[allow(deprecated)] would be good enough.

Let me do this in a separate PR.

@drmingdrmer drmingdrmer merged commit d857a85 into databendlabs:main Jul 22, 2024
25 of 31 checks passed
@SteveLauC SteveLauC deleted the feat/heartbeat_metric branch July 22, 2024 03:15
@drmingdrmer
Copy link
Member

It looks like a single #[allow(deprecated)] would be good enough.

Let me do this in a separate PR.

I'm gonna include this fixup in this:

@SteveLauC
Copy link
Collaborator Author

I'm gonna include this fixup in this:

Feel free to close #1188 if you need to 😁

Copy link
Collaborator

@schreter schreter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gentle ping on @schreter, can I trouble you for another review?

I actually already did, but since I didn't add any additional comments you probably didn't get any mail. I just removed the blocker/marked the Display formatting as "done" in Reviewable.

The refactoring with Instant instead of some sort of duration at multiple places is a separate thing, I just added some random comments for that (no blockers).

Reviewed 3 of 3 files at r9, 4 of 4 files at r10, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions

@SteveLauC
Copy link
Collaborator Author

I actually already did, but since I didn't add any additional comments you probably didn't get any mail. I just removed the blocker/marked the Display formatting as "done" in Reviewable.

Ahh, sorry for missing that, looks like I really need to learn how to use Reviewable. 😶‍🌫️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants