memory leak #151

Closed
reqshark opened this issue Mar 2, 2016 · 38 comments

@reqshark (Collaborator) commented Mar 2, 2016

PollSendSocket leaks. At least valgrind thinks so.

==4428== 2,112 bytes in 12 blocks are possibly lost in loss record 996 of 1,041
==4428==    at 0x4C2CC70: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4428==    by 0x985CF78: PollSendSocket(Nan::FunctionCallbackInfo<v8::Value> const&) (in /home/reqshark/node_modules/nanomsg/build/Release/node_nanomsg.node)
==4428==    by 0x985C555: Nan::imp::FunctionCallbackWrapper(v8::FunctionCallbackInfo<v8::Value> const&) (in /home/reqshark/node_modules/nanomsg/build/Release/node_nanomsg.node)
==4428==    by 0x913C21: v8::internal::FunctionCallbackArguments::Call(void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) (in /home/reqshark/.nvm/versions/node/v5.6.0/bin/node)
==4428==    by 0x93E100: v8::internal::MaybeHandle<v8::internal::Object> v8::internal::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::(anonymous namespace)::BuiltinArguments<(v8::internal::BuiltinExtraArguments)1>&) (in /home/reqshark/.nvm/versions/node/v5.6.0/bin/node)
==4428==    by 0x93E5F1: v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) (in /home/reqshark/.nvm/versions/node/v5.6.0/bin/node)
==4428==    by 0x1D5F9040839A: ???
==4428==    by 0x1D5F90644AB0: ???
==4428==    by 0x1D5F906448B7: ???
==4428==    by 0x1D5F90408296: ???
==4428==    by 0x1D5F9063F5EA: ???
==4428==    by 0x1D5F9066B954: ???

cc: @nickdesaulniers, @kkoopa, @tcr

@reqshark added the bug label Mar 2, 2016
@kkoopa (Contributor) commented Mar 2, 2016

False alarm. It says possibly lost. Valgrind does not know about finalizers (and cannot reasonably be expected to). V8's garbage collector will clean it up when it sees fit to do so.

@reqshark (Collaborator, Author) commented Mar 2, 2016

I don't see the allocation here getting freed: https://github.com/nickdesaulniers/node-nanomsg/blob/master/src/node_nanomsg.cc#L216 (starting a new libuv poll context)

Why would that out-of-scope allocation be V8's responsibility?

And I doubt it would be libuv's responsibility to free it.
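
For context, the allocation path in question looks roughly like this (a hedged sketch, not the actual node_nanomsg.cc source; poll_cb is a hypothetical callback name, and the struct is the one shown later in this thread):

static void poll_cb(uv_poll_t *handle, int status, int events);  // hypothetical

NAN_METHOD(PollSendSocket) {
  // The calloc that valgrind reports as "possibly lost":
  nanomsg_socket_t *context =
      static_cast<nanomsg_socket_t *>(calloc(1, sizeof(nanomsg_socket_t)));
  // ... initialize context->sockfd and context->callback ...
  uv_poll_init_socket(uv_default_loop(), &context->poll_handle,
                      context->sockfd);
  uv_poll_start(&context->poll_handle, UV_WRITABLE, poll_cb);
  // The pointer goes back to JS wrapped in a Buffer; because WrapPointer
  // installs a no-op free callback, nothing on either side frees it.
  info.GetReturnValue().Set(WrapPointer(context, sizeof(nanomsg_socket_t)));
}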

@reqshark (Collaborator, Author) commented Mar 2, 2016

@kkoopa I think we've got a legit leak here, but yeah, I see your point: not knowing about finalizers means there's a fair chance it's a false alarm.

Is there a way I can trigger a GC?

I want to find out what valgrind thinks right after a GC event.

@kkoopa (Contributor) commented Mar 2, 2016

Aha, I thought it was related to that patch I submitted a week ago. Never mind, then. The problem here is that WrapPointer always sets an empty free function, which is only appropriate for static memory, since that should not be freed. Make another WrapPointer function which does free the Buffer's memory, and use it here. Since the memory is allocated with calloc, you need not set a free function for the Buffer constructor, as the default is to call free.
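
A hedged sketch of that suggestion (WrapOwnedPointer is a hypothetical name; the point is that Nan::NewBuffer's default free callback calls free on the data, while the existing wrap_pointer_cb is a no-op for static memory):

inline static v8::Local<v8::Value> WrapPointer(void *ptr, size_t length) {
  // Existing behavior: no-op free callback, appropriate only for
  // memory that must not be freed.
  return Nan::NewBuffer(static_cast<char *>(ptr), length,
                        wrap_pointer_cb, nullptr).ToLocalChecked();
}

inline static v8::Local<v8::Value> WrapOwnedPointer(void *ptr, size_t length) {
  // Owning variant for calloc'd memory: with no callback given, Nan's
  // default free callback calls free(ptr) when the Buffer is collected.
  return Nan::NewBuffer(static_cast<char *>(ptr), length).ToLocalChecked();
}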

To force garbage collection, run node with --expose-gc and insert a call to gc() in JavaScript.

@reqshark (Collaborator, Author) commented Mar 2, 2016

OK cool.. 🚀 I'm gonna do some digging around and report back 🔧

@kkoopa (Contributor) commented Mar 2, 2016

PollReceiveSocket appears to have the same problem. As WrapPointer is only used in those two functions, you can probably change the definition of WrapPointer not to set a free callback. However, it then becomes the responsibility of the JavaScript code to keep a reference to the socket around so it won't get garbage collected. UnwrapPointer only appears to be used from PollStop, which takes the socket as an argument from JavaScript, so there should not be a problem there: it cannot have been garbage collected if you still have it around to pass to PollStop in the first place.

@reqshark (Collaborator, Author) commented Mar 3, 2016

Totally agree, but for one minor point:

UnwrapPointer only appears used from PollStop, which takes the socket as an argument

It's accepting the wrapped context; see here where that gets returned, and in JS where it's later used in PollStop.

I think when we start doing multiplexing we ought to check for the presence of that wrapped pointer context first and reuse it, rather than unconditionally allocating a new one.

Otherwise, I think we have to clean it up and deallocate the context in PollStop.

@reqshark (Collaborator, Author) commented Mar 3, 2016

Also, we're setting that buffer (context) to null in JS without deallocating it; that's where it's getting lost.

@kkoopa (Contributor) commented Mar 3, 2016

Context / socket, tomato / tomato. I must confess I did not actually read the code; I am merely speculating. The easiest way is probably to maintain an invariant on the JavaScript side: set the variable to null when it is not needed any longer. Then there should be no more references around to the socktext; it will eventually get garbage collected and the free callback will get called, which should call free on the data, since that is what the default free callback does when none has been explicitly specified.

@reqshark (Collaborator, Author) commented Mar 3, 2016

I must confess I did not actually read the code.

Haha, that's awesome. Yeah, one of the big nanomsg benefits is that sockets are no longer contexts (as they are in ZeroMQ); here they're just an int you can pass to nn_recv, nn_send, nn_bind, etc.

The easiest way is probably to maintain an invariant on the JavaScript side.

For now I think the simplest fix will be to follow what's being represented up in JS land, and later we might try to optimize the module by reducing the number of system calls associated with memory management.

@reqshark (Collaborator, Author) commented Mar 3, 2016

Ugh, easier said than done...

Freeing the context seems to eliminate the memory leak, but crashes with inproc in a weird way:

NAN_METHOD(PollStop) {
  nanomsg_socket_t *context = UnwrapPointer<nanomsg_socket_t *>(info[0]);
  int r = uv_poll_stop(&context->poll_handle);
  free(context);
  info.GetReturnValue().Set(Nan::New<Number>(r));
}

If wrapping the pointer imbues it with a V8 reference, am I trying to free V8-managed memory after the call to uv_poll_stop?

@kkoopa (Contributor) commented Mar 3, 2016

Yes. Just keep the code as is, apart from changing https://github.com/nickdesaulniers/node-nanomsg/blob/master/src/node_pointer.h#L22 to return Nan::NewBuffer(static_cast<char *>(ptr), length) and removing wrap_pointer_cb.

@reqshark (Collaborator, Author) commented Mar 3, 2016

OK, interesting.

Yeah, freeing that structure seems problematic, especially once its Nan::Callback gets initialized:

typedef struct nanomsg_socket_s {
  uv_poll_t poll_handle;
  uv_os_sock_t sockfd;
  Nan::Callback *callback;
} nanomsg_socket_t;

@kkoopa (Contributor) commented Mar 3, 2016

Add a destructor to it.
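
A hedged sketch of that suggestion: since destructors only run via delete (not free), the allocation would also have to move from calloc/free to new/delete.

typedef struct nanomsg_socket_s {
  uv_poll_t poll_handle;
  uv_os_sock_t sockfd;
  Nan::Callback *callback;

  // Runs on `delete context`; deleting a null callback is a no-op, so
  // this is safe even if the callback was never initialized.
  ~nanomsg_socket_s() { delete callback; }
} nanomsg_socket_t;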


@k-keisuke

I have encountered this and tried to fix it by referring to your proposed code.
The memory leak is resolved, but a segmentation fault sometimes occurs.
I'm analyzing the cause. Could you advise me, please?

NAN_METHOD(PollStop) {
  nanomsg_socket_t *context = UnwrapPointer<nanomsg_socket_t *>(info[0]);
  int r = uv_poll_stop(&context->poll_handle);
  delete context->callback; // <- added
  info.GetReturnValue().Set(Nan::New<Number>(r));
}

inline static v8::Local<v8::Value> WrapPointer(void *ptr, size_t length) {
  return Nan::NewBuffer(static_cast<char *>(ptr), length).ToLocalChecked(); // <- changed
}

@reqshark (Collaborator, Author)

@k-keisuke thanks for fixing the memory leak. I'm testing your fix today.

The memory leak is resolved but segmentation fault sometimes occurs.

What does the output from the segmentation fault look like? Is there an assertion or debug message, or does the segfault go down with a silent exit?

@kkoopa (Contributor) commented Oct 14, 2016

That ToLocalChecked is a likely culprit.


@k-keisuke

Thanks.
The output message is just "Segmentation fault", but I obtained a backtrace.
It seems something goes wrong during garbage collection triggered by arbitrary JavaScript app code.
(A double free?)

#0  0x088091fd in v8::internal::Heap::Scavenge() ()
#1  0x08810673 in v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) ()
#2  0x08810c07 in v8::internal::Heap::CollectGarbage(v8::internal::GarbageCollector, char const*, char const*, v8::GCCallbackFlags) ()
#3  0x087ceece in v8::internal::Factory::NewRawOneByteString(int, v8::internal::PretenureFlag) ()
#4  0x0851a23e in v8::internal::JsonParser<true>::ParseJsonValue() ()
#5  0x08519a17 in v8::internal::JsonParser<true>::ParseJsonObject() ()
#6  0x0851a28f in v8::internal::JsonParser<true>::ParseJsonValue() ()
#7  0x0851a69e in v8::internal::JsonParser<true>::ParseJson() ()
#8  0x089e8938 in v8::internal::Runtime_ParseJson(int, v8::internal::Object**, v8::internal::Isolate*) ()
(remaining stack frames omitted)

That ToLocalChecked is a likely culprit.

Sorry, I don't understand.
(ToLocalChecked is only an assert and a getter, I think.)

@nickdesaulniers (Owner) commented Oct 19, 2016

While the stack trace mentions JSON parsing, I doubt it was the culprit. I bet things went like:

  1. Allocate an object off of v8's heap.
  2. Give it to v8's GC to manage.
  3. Free/delete it via our reference, without letting v8 know.

...some time later...

  1. parse some json
  2. allocate some memory for the json parser
  3. memory pressure triggers gc
  4. gc double frees dangling pointer
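
In other words (a hypothetical illustration, not the project's actual code):

#include <nan.h>
#include <cstdlib>

static void double_free_sketch() {
  void *ptr = calloc(1, 64);

  // Nan::NewBuffer's default free callback takes ownership: V8 will
  // call free(ptr) when the Buffer is garbage collected.
  v8::Local<v8::Object> buf =
      Nan::NewBuffer(static_cast<char *>(ptr), 64).ToLocalChecked();
  (void)buf;

  free(ptr);  // native code frees it too; ptr now dangles

  // Later, an unrelated allocation (e.g. JSON.parse) triggers a GC, the
  // Buffer is collected, and free(ptr) runs a second time: crash.
}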

@nickdesaulniers (Owner)

We could build this lib with -fsanitize=address to help find this.

@k-keisuke

I found that the segmentation fault occurs only with this fix:

inline static v8::Local<v8::Value> WrapPointer(void *ptr, size_t length) {
  return Nan::NewBuffer(static_cast<char *>(ptr), length).ToLocalChecked(); // <- changed
}

I think this fix means I delegate freeing ptr to the V8 garbage collector (referencing the following link):
https://github.com/nodejs/nan/blob/master/doc/buffers.md#api_nan_new_buffer
But did the garbage collector free ptr twice?

@deepakprabhakara (Contributor)

I am seeing steady growth in heap and RSS memory in one of our production services that heavily uses the req-rep protocol. Are there any updates on a fix for this issue?

@nickdesaulniers (Owner)

Does someone have a simple JS test case that repros this? I'm happy to hook up ASAN to try and spot the leak.

@deepakprabhakara (Contributor)

@nickdesaulniers Here's a simple test that sends a req request every 10ms: https://github.com/redsift/node-nanomsg-leak-test . You can take heap snapshots by sending a 'kill -USR2 '. Inspecting the heap in Chrome dev tools shows no leaks in JS code, but the 'System Objects' size increases steadily. Thanks for looking into this.

@reqshark (Collaborator, Author)

What I'm seeing after running @deepakprabhakara's test case:

-------- <LEAK> -------
{"growth":20719656,"reason":"heap growth over 5 consecutive GCs (9m 50s) - 120.57 mb/hr"}
-------- </LEAK> -------
-------- <LEAK> -------
{"growth":22130304,"reason":"heap growth over 5 consecutive GCs (10m 22s) - 122.15 mb/hr"}
-------- </LEAK> -------
-------- <LEAK> -------
{"growth":28205736,"reason":"heap growth over 5 consecutive GCs (13m 17s) - 121.5 mb/hr"}
-------- </LEAK> -------

nickdesaulniers added a commit that referenced this issue Nov 26, 2016
build with `npm i --asan=true`.  This will use a build that links in
asan at runtime.

The odd split in .gyp files exists because we want to compile nanomsg
with asan instrumentation, but when we link our node_nanomsg.node shared
object we want to remind gyp to link asan's runtime library.

This can likely also be done for Linux, but I haven't tested.

Also, we segfault off the bat with this.  I don't have time to look into
it, but at least it builds correctly.
reqshark pushed a commit that referenced this issue Nov 26, 2016
@deepakprabhakara (Contributor)

Hi @nickdesaulniers, just wondering if you had any success spotting the leak using ASAN.

@deepakprabhakara (Contributor) commented Dec 8, 2016

I tried running it with asan=true but ended up with this on macOS Sierra:

==8057==ERROR: Interceptors are not working. This may be because AddressSanitizer is loaded too late (e.g. via dlopen). Please launch the executable with:
DYLD_INSERT_LIBRARIES=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/8.0.0/lib/darwin/libclang_rt.asan_osx_dynamic.dylib
==8057==AddressSanitizer CHECK failed: /Library/Caches/com.apple.xbs/Sources/clang_compiler_rt/clang-800.0.42.1/src/projects/compiler-rt/lib/sanitizer_common/sanitizer_mac.cc:690 "(("interceptors not installed" && 0)) != (0)" (0x0, 0x0)

Any idea how I can fix this?

@nickdesaulniers (Owner)

@deepakprabhakara sorry, see my commit message for the asan patch (latest). You don't run with any env vars set; you build the library with a gyp arg, and then everything should just work. The problem is we get a segfault right off the bat, and I don't have time to track down what that is.

@m-ohuchi

I sent a pull request to fix this memory leak, as above. Please check the details. Thanks.

@reqshark (Collaborator, Author)

Thanks @m-ohuchi, I'm looking at it.

@deepakprabhakara (Contributor)

Thanks @m-ohuchi, nice work!

@nickdesaulniers (Owner)

@deepakprabhakara, @reqshark how long do you have to run the test case to repro? I can run the test against master with no leaks. I did change the require to require('..').

@nickdesaulniers (Owner)

memwatch-next doesn't even compile for node v7.2.1. What version of node was used to repro this? marcominetti/node-memwatch#24

@nickdesaulniers (Owner)

Ah, got it: you have to run it for a very long time. We should find a way to speed that up and add it to our test suite.

@nickdesaulniers (Owner) commented Dec 18, 2016

I really, really don't want to have to maintain code with synchronization. The double call of free_context is weird, and it seems heavy-handed to use mutexes for what is essentially reference counting.

At the risk of having to maintain that, I've authored #169, which takes the same approach of using uv_close rather than uv_poll_stop, but uses no synchronization. It passes make check and has been running the memory leak test for half an hour now with no leak. EDIT: multiple hours now, and I haven't seen memory peak above 11 MB.

It's a smaller change and removes more code. I think it will be more maintainable.

I would like to add some kind of memory leak test to our test suite. And I do very much appreciate @m-ohuchi taking a crack at fixing this.
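
For reference, a hedged sketch of the uv_close shape (assumed details, not the exact contents of #169): the handle is closed and cleanup is deferred to the close callback, which libuv invokes only once it is finished with the handle, so no locking is required.

static void close_cb(uv_handle_t *handle) {
  // poll_handle is the struct's first member, so the handle pointer is
  // also the context pointer.
  nanomsg_socket_t *context = reinterpret_cast<nanomsg_socket_t *>(handle);
  delete context->callback;
  // If the JS-side Buffer owns `context` (default free callback), do
  // not free it here as well; that recreates the double free above.
}

NAN_METHOD(PollStop) {
  nanomsg_socket_t *context = UnwrapPointer<nanomsg_socket_t *>(info[0]);
  // uv_close stops the poll handle and schedules close_cb; uv_close
  // itself returns no status, so report success.
  uv_close(reinterpret_cast<uv_handle_t *>(&context->poll_handle), close_cb);
  info.GetReturnValue().Set(Nan::New<v8::Number>(0));
}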

@m-ohuchi

Thanks @nickdesaulniers for accepting my code. I'm glad I could contribute to this project.

I really really don't want to have to maintain code with synchronization. The double call of free_context is weird and it seems to be heavy handed to use mutexes for doing essentially reference counting.

I think so, too. I don't mind your changes to my code; I know synchronization often causes problems, and the free operation is weird. I think my code should be improved in a much safer way.

@nickdesaulniers (Owner)

Thanks for your code pointing to where the leaks were occurring. Thanks all; it's good to get long-standing issues resolved.

@deepakprabhakara (Contributor)

Thanks everyone, excellent work closing the leak!
