Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix KvStore unclean exit with thriftSyncTimer fired after thriftPeers…
…_ destroyed Summary: # background Open/R has been haunted with unclean exit issue carried over multiple oncall rotations. Especially the notorious one in `KvStore`, like the following gdb traces happened inside T158806075 ``` [[email protected] ~/local/fbsource (470cb762a)]$ fboss-dbg-helper gdb fsw008.p062.f01.rva2 0) /home/xiangxu1121/local/fboss_dbg/fsw008.p062.f01.rva2/snapshot_2023-07-21-15:42:58 Enter the file's number: 0 0) /home/xiangxu1121/local/fboss_dbg/fsw008.p062.f01.rva2/snapshot_2023-07-21-15:42:58/cores/dogpile.unknown_tw_task.servicerouter.1583324.QUEUE_LAG.230715-025056 - Sat Jul 15 02:50:56 2023 1) /home/xiangxu1121/local/fboss_dbg/fsw008.p062.f01.rva2/snapshot_2023-07-21-15:42:58/cores/dynocat - Fri Jul 21 15:05:59 2023 2) /home/xiangxu1121/local/fboss_dbg/fsw008.p062.f01.rva2/snapshot_2023-07-21-15:42:58/cores/openr - Wed Jul 19 09:07:29 2023Enter the file's number: 2 INFO:root:Received output from file command: INFO:root:/home/xiangxu1121/local/fboss_dbg/fsw008.p062.f01.rva2/snapshot_2023-07-21-15:42:58/cores/openr: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/openr --v=1 --vmodule=BgpSerializer*=1,FiberBgp*=2 --logging=DBG1;def', real uid: 36662, effective uid: 36662, real gid: 36337, effective gid: 36337, execfn: '/usr/sbin/openr', platform: 'x86_64' (gdb) bt #0 0x00000000035aa5b4 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, openr::KvStoreDb<apache::thrift::Client<openr::thrift::OpenrCtrlCpp> >::KvStorePeer>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, openr::KvStoreDb<apache::thrift::Client<openr::thrift::OpenrCtrlCpp> >::KvStorePeer> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () at ./fbcode/third-party-buck/platform010-compat/build/libgcc/include/c++/trunk/bits/hashtable_policy.h:431 facebook#1 0x00000000035baf56 in openr::KvStoreDb<apache::thrift::Client<openr::thrift::OpenrCtrlCpp> >::processThriftFailure(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, folly::basic_fbstring<char, std::char_traits<char>, std::allocator<char>, folly::fbstring_core<char> > const&, std::chrono::duration<long, std::ratio<1l, 1000l> >) () at ./fbcode/third-party-buck/platform010-compat/build/libgcc/include/c++/trunk/bits/unordered_map.h:869 facebook#2 0x00000000035c4fac in openr::KvStoreDb<apache::thrift::Client<openr::thrift::OpenrCtrlCpp> >::requestThriftPeerSync()::{lambda(folly::exception_wrapper const&)facebook#1}::operator()(folly::exception_wrapper const&) const () at fbcode/openr/kvstore/KvStore-inl.h:1753 facebook#3 0x00000000035c4d0c in _ZZNO5folly6FutureINS_4UnitEE13thenErrorImplIZN5openr9KvStoreDbIN6apache6thrift6ClientINS4_6thrift12OpenrCtrlCppEEEE21requestThriftPeerSyncEvEUlRKNS_17exception_wrapperEE_EENSt9enable_ifIXntsr20isFutureOrSemiFutureINS_13invoke_detail6traitsIT_E6resultISD_EEEE5valueES2_E4typeEOSK_NS_7futures6detail18InlineContinuationEENUlONS_8Executor9KeepAliveISU_EEONS_3TryIS1_EEE_clESX_S10_ () at fbcode/folly/futures/Future-inl.h:137 facebook#4 0x00000000035c4bea in _ZN5folly6detail8function14FunctionTraitsIFvRNS_7futures6detail8CoreBaseEONS_8Executor9KeepAliveIS7_EEPNS_17exception_wrapperEEE7callBigIZNS4_4CoreINS_4UnitEE11setCallbackIZNOS_6FutureISH_E13thenErrorImplIZN5openr9KvStoreDbIN6apache6thrift6ClientINSN_6thrift12OpenrCtrlCppEEEE21requestThriftPeerSyncEvEUlRKSB_E_EENSt9enable_ifIXntsr20isFutureOrSemiFutureINS_13invoke_detail6traitsIT_E6resultISB_EEEE5valueESL_E4typeEOS12_NS4_18InlineContinuationEEUlSA_ONS_3TryISH_EEE_EEvS18_OSt10shared_ptrINS_14RequestContextEES19_EUlS6_SA_SC_E_EEvS6_SA_SC_RNS1_4DataE () at fbcode/folly/futures/detail/Core.h:619 warning: Could not find DWO CU buck-out/v2/gen/fbcode/ad98fc21e927c889/folly/futures/detail/__core__/__objects__/Core.cpp.o(0xb8563f2cc6214b1a) referenced by CU at offset 0x10ac73 [in module /data/users/xiangxu1121/fboss_dbg/packages/openr:325/openr.debuginfo] facebook#5 0x0000000002408dc9 in folly::futures::detail::CoreBase::doCallback(folly::Executor::KeepAlive<folly::Executor>&&, folly::futures::detail::State)::$_0::operator()(folly::Executor::KeepAlive<folly::Executor>&&) () at fbcode/folly/Function.h:375 warning: Could not find DWO CU buck-out/v2/gen/fbcode/ad98fc21e927c889/folly/io/async/__async_base__/__objects__/EventBase.cpp.o(0x7990ef623e52da16) referenced by CU at offset 0x1029d6 [in module /data/users/xiangxu1121/fboss_dbg/packages/openr:325/openr.debuginfo] facebook#6 0x000000000266571d in folly::EventBase::loopMain(int, bool) () at fbcode/folly/Function.h:375 facebook#7 0x0000000002d20cc3 in folly::EventBase::loopOnce(int) () at fbcode/folly/io/async/EventBase.cpp:345 facebook#8 0x0000000002542c12 in folly::EventBase::~EventBase() () at fbcode/folly/io/async/EventBase.cpp:195 facebook#9 0x00000000035a420e in openr::KvStore<apache::thrift::Client<openr::thrift::OpenrCtrlCpp> >::~KvStore() () at fbcode/openr/kvstore/KvStore.h:629 facebook#10 0x0000000003585a9e in main () at ./fbcode/third-party-buck/platform010-compat/build/libgcc/include/c++/trunk/bits/unique_ptr.h:85 ``` # RCA High level speaking, the issue happens with the following order: - Step1: `KvStoreDb::stop()` destructed `thriftPeers_` obj - Step2: `KvStore::stop()` invokes `OpenrEventBase::stop()`, which destructs `folly::EventBase`, which will call `loopOnce` to clean up injected callbacks in the eventbase. - Step3: In crash scenario, there are >1 callbacks being invoked(e.g. `processThriftSuccess` or `processThriftFailure`). - Step4: Sample failures are: - `processThriftSuccess` P576623566 - `processThriftFailure` P592005831 We can see there are callbacks being invoked with: - `requestThriftPeerSync` - `keepAlive` # Fix The fix should be straight-forward, which is to cancel the scheduled cb from `folly::AsyncTimeout` perspective via `reset()` or `cancelTimeout()` call. NOTE: for the keepAlive `AsyncTimeout`, we can consider remove it by leveraging the socket keepalive option and simplify this logic. Reviewed By: TangoRoxy Differential Revision: D47691545 fbshipit-source-id: 67108399a7ce9f96d08682d45d03799e6ef608fd
- Loading branch information