Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy more efficiently in Vector::realloc. #26

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

gsivek
Copy link

@gsivek gsivek commented Apr 29, 2020

The only two Vectors in grimoire::trie::State have char and History as their value types, which are both trivially copyable. Copying via memcpy is significantly faster than iteratively copying individual elements.

} else {
for (std::size_t i = 0; i < size_; ++i) {
new (&new_objs[i]) T(objs_[i]);
}
}
for (std::size_t i = 0; i < size_; ++i) {
objs_[i].~T();
Copy link
Contributor

@glebm glebm May 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop can be moved inside the else clause above.

If a class is trivially copyable, it must have a trivial destructor:

Requirements

Trivial destructors are no-op:

Objects with trivial destructors don't require a delete-expression and may be disposed of by simply deallocating their storage.
https://en.cppreference.com/w/cpp/language/destructor#Trivial_destructor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are quite a few other ~T calls, so it seems inconsistent to avoid it here, but not the other places.

The loop is optimized away with -O1 or higher anyway: https://godbolt.org/z/-zH4gd

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit confused by the destructor calls in the trivially copyable case. It made me stop, question my understanding of C++, and look up the definition of trivially copyable.

#else
const bool trivially_copyable = false;
#endif
if (trivially_copyable) {
Copy link
Contributor

@glebm glebm May 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this with SFINAE we can avoid both the trivially_copyable bool and including <cstring> for C++98.

private:
#if __cplusplus >= 201103L
  template <typename T,
            std::enable_if<std::is_trivially_copyable<T>::value, int>::type = 0>
  static void realloc_move(T *dest, const T *src, std::size_t size) {
    std::memcpy(reinterpret_cast<void*>(dest),
                reinterpret_cast<const void*>(src),
                sizeof(T) * size);
  }

  template <typename T,
            std::enable_if<!std::is_trivially_copyable<T>::value, int>::type = 0>
#endif
  static void realloc_move(T *dest, const T *src, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
      new (&dest[i]) T(src[i]);
    }
    for (std::size_t i = 0; i < size; ++i) {
      src[i].~T();
    }
  }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This greatly increases the level of C++ knowledge required to understand the code. All to avoid one #include?

Copy link
Contributor

@glebm glebm May 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I find the template easier to read but it's subjective.

As this code already uses placement-new and calls destructors explicitly I think we can assume that the reader has to be an expert C++ user :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More concretely: I don't understand std::enable_if well enough to want to use it in code I write, and I wrote this. :-)

@@ -231,8 +235,19 @@ class Vector {
MARISA_DEBUG_IF(new_buf.get() == NULL, MARISA_MEMORY_ERROR);
T *new_objs = reinterpret_cast<T *>(new_buf.get());

for (std::size_t i = 0; i < size_; ++i) {
new (&new_objs[i]) T(objs_[i]);
#if __cplusplus >= 201103L
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just require C++11?

It looks is_trivially_copyable has been implemented by libstdc++ for 6 years and libc++ for a similar amount of time.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60132

@s-yata s-yata self-assigned this Jun 14, 2020
@s-yata
Copy link
Owner

s-yata commented Jun 14, 2020

I'm not sure about "Copying via memcpy is significantly faster than iteratively copying individual elements" in actual cases. Do you have any data about this?

State's Vector<T> members (key_buf_ and history_) are not assumed to be long. In addition, realloc is not called so many times if Agent is reused adequately.

@jmr
Copy link
Contributor

jmr commented Jun 15, 2020

Do you have any data about this?

We did get about a 0.5-1% speedup on a large benchmark that uses marisa-trie as a small part. Microbenchmarking wasn't done or was inadequately documented, but could be done.

In addition, realloc is not called so many times if Agent is reused adequately.

Lack of reuse is probably part of the problem, but harder to solve because higher level APIs wrapping marisa-trie might not operate on the Agent paradigm.

Initially, I thought it would be overkill, but now I'm thinking a nicer solution would be to rewrite this using std::uninitialized_move_n (assuming that's as fast as memcpy) and providing a substitute for pre-C++17.

Similarly, destroy_n etc could be used for the other places.

const bool trivially_copyable = false;
#endif
if (trivially_copyable) {
std::memcpy(reinterpret_cast<void*>(new_objs),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be memmove.

I was just playing around with godbolt.org and noticed that std::copy_n, uninitialized_copy_n and uninitialized_move_n all compile to memmove for char. This definitely makes the uninitialized_move_n plus compatibility wrapper seem nicer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memcpy is usually faster than memmove (but requires the memory regions to not overlap).

I'm surprised about std::copy_n, it should compile to memcpy as it also forbids overlap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I just assumed that the optimizer was doing the most efficient thing, since it usually does, so got memcpy and memmove switched in my head.

My initial example was actually insufficient, because my pointers could have overlapped. When one comes from new, memcpy is used.

https://godbolt.org/z/HZ5VU5

s-yata added a commit that referenced this pull request Jun 22, 2020
-R, --reuse-on   reuse agents (default)
-r, --reuse-off  don't reuse agents

Ref: #26
@s-yata
Copy link
Owner

s-yata commented Jun 23, 2020

Vector<T> is used with the following types:

  • Primitive types: UInt32, UInt64
  • Trivial types: RankIndex, WeightedRange, TerminalIdPair, History
  • Non-trivial types: Entry, ReverseKey, Key, Cache

The above non-trivial types have user-defined copy constructors and assignment operators, but they only copy member variables.
There is actually no reason to define them.
Maybe we can remove them (use default ones) and optimize Vector<T> for trivially copyable types.

@s-yata
Copy link
Owner

s-yata commented Jun 24, 2020

Hmm, I could not confirm significant difference among for-loop, std::memcpy, and std::copy, even if agent reuse is disabled...

gsivek added 3 commits July 21, 2020 11:17
The only two Vectors in grimoire::trie::State have char and History as their value types, which are both trivially copyable. Copying via memcpy is significantly faster than iteratively copying individual elements.
@jmr
Copy link
Contributor

jmr commented Sep 29, 2020

Hmm, I could not confirm significant difference among for-loop, std::memcpy, and std::copy, even if agent reuse is disabled...

I saw marisa-benchmark results ranging from 0.5% slower on prefix search to 3.7% faster on predict search, but unfortunately without any -O options. When I tried to enable -O3, the differences went away.

We are seeing a few percent faster other benchmarks, though. Maybe the optimizer is doing different things between the different benchmarks and it's hard to reproduce. Would you accept that there's some situation where memcpy is faster? Would you merge a version that used something like the realloc_move function that @glebm suggested?

Merge recent changes from master into the patch-1 branch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants