Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unexpected max latency in output of stat #349

Merged
merged 2 commits into from
Sep 4, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 24 additions & 10 deletions src/Service/ConnectionHandler.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -687,20 +687,34 @@ void ConnectionHandler::updateStats(const Coordination::ZooKeeperResponsePtr & r
if (response->xid != Coordination::WATCH_XID && response->getOpNum() != Coordination::OpNum::Heartbeat
&& response->getOpNum() != Coordination::OpNum::SetWatches && response->getOpNum() != Coordination::OpNum::Close)
{
Int64 elapsed = Poco::Timestamp().epochMicroseconds() / 1000 - response->request_created_time_ms;
auto current_time = getCurrentTimeMilliseconds();
Int64 elapsed = current_time - response->request_created_time_ms;
if (elapsed < 0)
Copy link
Contributor

@JackyWoo JackyWoo Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why elapsed can be -1? Caused by different clock? If it is, we must use stedy_lock in all places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe different clock, so I use getCurrentTimeMilliseconds which is steady_clock for request in all places.

Copy link
Contributor

@JackyWoo JackyWoo Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe caused by clock diff in different servers. When leader append_entries the create_time is attached to the log and when the follower takes the log the timestamp in leader is used. If there is a clock diff in the servers -1 may raise.

Copy link
Contributor

@JackyWoo JackyWoo Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment to help readers understand the code.

Copy link
Contributor Author

@lzydmxy lzydmxy Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe caused by clock diff in different servers. When leader append_entries the create_time is attached to the log and when the follower takes the log the timestamp in leader is used. If there is a clock diff in the servers -1 may raise.

When followers takes the log from leader, them won't updateLatency for the log. We used to use system_clock to measure request time cost. Clock rollback can cause this issue with system_clock.

{
conn_stats.updateLatency(elapsed);
if (unlikely(elapsed > 10000))
LOG_WARNING(
log,
"Slow request detected #{}#{}#{}, time costs {}ms, please take care.",
toHexString(session_id.load()),
response->xid,
Coordination::toString(response->getOpNum()),
elapsed);
LOG_WARNING(
Copy link
Contributor

@JackyWoo JackyWoo Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just delete the logging, for it is confused for the system mantainers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

log,
"Request #{}#{}#{} finish time {} is before than created time {}, please take care.",
toHexString(session_id.load()),
response->xid,
Coordination::toString(response->getOpNum()),
current_time,
response->request_created_time_ms);
elapsed = 0;
}

conn_stats.updateLatency(elapsed);
if (unlikely(elapsed > 10000))
LOG_WARNING(
log,
"Slow request detected #{}#{}#{}, time costs {}ms, please take care.",
toHexString(session_id.load()),
response->xid,
Coordination::toString(response->getOpNum()),
elapsed);

keeper_dispatcher->updateKeeperStatLatency(elapsed);


last_op.set(std::make_unique<LastOp>(LastOp{
.name = Coordination::toString(response->getOpNum()),
.last_cxid = response->xid,
Expand Down
8 changes: 4 additions & 4 deletions src/Service/KeeperDispatcher.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -170,7 +170,7 @@ bool KeeperDispatcher::pushSessionRequest(const Coordination::ZooKeeperRequestPt
request_info.session_id = internal_id;

using namespace std::chrono;
request_info.create_time = duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
request_info.create_time = getCurrentTimeMilliseconds();

LOG_TRACE(
log,
Expand Down Expand Up @@ -198,7 +198,7 @@ bool KeeperDispatcher::pushRequest(const Coordination::ZooKeeperRequestPtr & req
request_info.session_id = session_id;

using namespace std::chrono;
request_info.create_time = duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
request_info.create_time = getCurrentTimeMilliseconds();

LOG_TRACE(log, "Push user request #{}#{}#{}", toHexString(session_id), request->xid, Coordination::toString(request->getOpNum()));
/// Put close requests without timeouts
Expand All @@ -220,7 +220,7 @@ bool KeeperDispatcher::pushForwardRequest(size_t server_id, size_t client_id, Fo
RequestForSession && request_info = request->requestForSession();

using namespace std::chrono;
request_info.create_time = duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
request_info.create_time = getCurrentTimeMilliseconds();

request_info.server_id = server_id;
request_info.client_id = client_id;
Expand Down Expand Up @@ -460,7 +460,7 @@ void KeeperDispatcher::deadSessionCleanThread()
request_info.session_id = dead_session;
// request_info.is_internal = true;
using namespace std::chrono;
request_info.create_time = duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
request_info.create_time = getCurrentTimeMilliseconds();
{
std::lock_guard lock(push_request_mutex);
if (!requests_queue->push(std::move(request_info)))
Expand Down
2 changes: 1 addition & 1 deletion src/Service/Metrics.h
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ namespace RK

inline UInt64 getCurrentTimeMilliseconds()
{
return duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
return duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now().time_since_epoch()).count();
};


Expand Down
2 changes: 1 addition & 1 deletion src/Service/NuRaftStateMachine.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -147,7 +147,7 @@ ptr<RequestForSession> NuRaftStateMachine::createRequestSession(ptr<log_entry> &
request_for_session->request->readImpl(buffer);

if (buffer.eof())
request_for_session->create_time = std::chrono::system_clock::now().time_since_epoch() / std::chrono::milliseconds(1);
request_for_session->create_time = getCurrentTimeMilliseconds();
else
Coordination::read(request_for_session->create_time, buffer);

Expand Down
Loading