
Commit

Merge branch 'release/1.2.2'
mjansson committed Jan 24, 2018
2 parents 0f87f54 + d69fb71 commit 82fbcc0
Showing 25 changed files with 1,555 additions and 481 deletions.
3 changes: 2 additions & 1 deletion .gitattributes
@@ -21,6 +21,7 @@
*.rtf diff=astextplain
*.RTF diff=astextplain

-# Ignore build scripts in language stats
+# Ignore build scripts and test cases in language stats
build/* linguist-vendored
configure.py linguist-vendored=true
+test/* linguist-vendored
6 changes: 3 additions & 3 deletions BENCHMARKS.md
@@ -5,9 +5,9 @@ https://github.com/rampantpixels/rpmalloc-benchmark

Benchmarks are run with parameters `benchmark <num threads> <mode> <distribution> <cross-thread rate> <loop count> <block count> <op count> <min size> <max size>`. The benchmark runs the given number of threads, allocating blocks of random or fixed size in the `[<min size>, <max size>]` byte range. The `<mode>` parameter controls whether sizes are random or fixed. For random sizes, the `<distribution>` parameter controls whether sizes are evenly distributed, have a linear falloff rate with size, or have an exponential falloff rate with size. Each thread performs `<loop count>` loops, allocating up to `<block count>` blocks. In every loop iteration, `<op count>` blocks are deallocated, scattered across the entire set of blocks, then another `<op count>` blocks are allocated, also scattered across the entire set of slots in the block array (deallocating any previous block in that slot). Every `<cross-thread rate>` loop iterations, `<op count>` blocks are allocated and handed off to another thread for cross-thread deallocation (memory allocated in one thread is freed in another thread).
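
For a concrete invocation, mirroring the parameter set used in the thread cache case study in CACHE.md (illustrative only; consult the benchmark repository for the exact encoding of the `<mode>` and `<distribution>` values): `benchmark 8 0 0 0 1000 150000 75000 16 16000` runs 8 threads allocating 150000 blocks in the `[16, 16000]` byte range with a linear falloff size distribution, performing 1000 loops in which 75000 blocks are freed and reallocated in a scattered pattern each iteration, with no cross-thread deallocations.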

-The benchmark also measures the maximum requested allocated size and the used virtual memory by the process to calculate a overhead percentage. This is done at the end of the loop iterations, once all cross-thread deallocations are processed. This will naturally introduce overhead in allocator implementations that have some form of caching of free blocks, which is intented.
+The benchmark also measures the maximum requested allocated size and the virtual memory used by the process to calculate an overhead percentage. This is done at the end of the loop iterations, once all cross-thread deallocations are processed. This will naturally introduce overhead in allocator implementations that have some form of caching of free blocks, which is intended.

-The loop sequence is run four times in all threads, first two times in each thread with a full free of all blocks inbetween. The threads are then terminated and restarted, running another set of similar two loop sequences. The idea is to measure both how well allocators perform at starting up fresh threads, as well as reusing memory within a thread and when warm starting another thread.
+The loop sequence is run four times in all threads, the first two times in each thread with a full free of all blocks in between. The threads are then terminated and restarted, running another set of two similar loop sequences. The idea is to measure how well allocators perform at starting up fresh threads, at reusing memory within a thread, and at warm starting another thread.

Below is a collection of benchmark results for various allocation sizes. The machines running the Windows 10 and Linux benchmarks have 8 cores (4 physical cores with HT) and 12GiB RAM. The macOS machine is a MacBook with 2 cores (1 physical with HT) and 8GiB RAM. (Windows and macOS results coming soon)

@@ -30,7 +30,7 @@ Linear falloff distributed sizes in `[16, 8000]` range, 20000 loops with 50000 b
![Ubuntu 16.10 random [16, 8000] bytes, 8 cores](https://docs.google.com/spreadsheets/d/1NWNuar1z0uPCB5iVS_Cs6hSo2xPkTmZf0KsgWS_Fb_4/pubchart?oid=301017877&format=image)
![Ubuntu 16.10 random [16, 8000] bytes, 8 cores](https://docs.google.com/spreadsheets/d/1NWNuar1z0uPCB5iVS_Cs6hSo2xPkTmZf0KsgWS_Fb_4/pubchart?oid=1224595675&format=image)

-Ignoring the lockfree-malloc and jemalloc memory overhead range and focusing on the other allocators we can see they are pretty close in memory overhead factors, with most of the multihreaded cases hovering around the 10-15% overhead mark.
+Ignoring the lockfree-malloc and jemalloc memory overhead range and focusing on the other allocators, we can see they are fairly close in memory overhead factors, with most of the multithreaded cases hovering around the 10-15% overhead mark.

![Ubuntu 16.10 random [16, 8000] bytes, 8 cores](https://docs.google.com/spreadsheets/d/1NWNuar1z0uPCB5iVS_Cs6hSo2xPkTmZf0KsgWS_Fb_4/pubchart?oid=812830245&format=image)

2 changes: 1 addition & 1 deletion CACHE.md
@@ -1,7 +1,7 @@
# Thread caches
rpmalloc has a thread cache of free memory blocks which can be used in allocations without interfering with other threads or going to the system to map more memory, as well as a global cache shared by all threads to let pages flow between threads. Configuring the size of these caches can be crucial to obtaining good performance while minimizing memory overhead blowup. Below is a simple case study using the benchmark tool to compare different thread cache configurations for rpmalloc.

-The rpmalloc thread cache is configured to be unlimited, performance oriented as meaning default values, size oriennted where both thread cache and global cache is reduced significantly, or disabled where both thread and global caches are disabled and completely free pages are directly unmapped.
+The rpmalloc thread cache is configured as unlimited, performance oriented (meaning the default values), size oriented (where both the thread cache and the global cache are reduced significantly), or disabled (where both thread and global caches are disabled and completely free pages are unmapped directly).

The benchmark is configured to run threads allocating 150000 blocks distributed in the `[16, 16000]` byte range with a linear falloff probability. It runs 1000 loops, and in every iteration 75000 blocks (50%) are freed and allocated in a scattered pattern. There are no cross-thread allocations/deallocations. Parameters: `benchmark n 0 0 0 1000 150000 75000 16 16000`. The benchmarks are run on an Ubuntu 16.10 machine with 8 cores (4 physical, HT) and 12GiB RAM.

16 changes: 16 additions & 0 deletions CHANGELOG
@@ -1,3 +1,18 @@
+1.2.2
+
+Add configurable memory mapper providing map/unmap of memory pages. Default to VirtualAlloc/mmap if none provided. This allows rpmalloc to be used in contexts where memory is provided by internal means.
+
+Avoid using explicit memory map addresses to mmap on POSIX systems. Instead use overallocation of virtual memory space to gain 64KiB alignment of spans. Since extra pages are never touched this should have no impact on real memory usage and remove the possibility of contention in virtual address space with other uses of mmap.
+
+Detect system memory page size at initialization, and allow page size to be set explicitly in initialization. This allows the allocator to be used as a sub-allocator where the page granularity should be lower to reduce risk of wasting unused memory ranges, and adds support for modern iOS devices where page size is 16KiB.
+
+Add build time option to use memory guards, surrounding each allocated block with a dead zone which is checked for consistency when block is freed.
+
+Always finalize thread on allocator finalization, fixing issue when re-initializing allocator in the same thread.
+
+Add basic allocator test cases


1.2.1

Split library into rpmalloc only base library and preloadable malloc wrapper library.
@@ -10,6 +25,7 @@ Improve preload compatibility on Apple platforms by using pthread key for TLS in

Fix ABA issue in orphaned heap linked list


1.2

Dual license under MIT
26 changes: 21 additions & 5 deletions README.md
@@ -9,7 +9,7 @@ Platforms currently supported:
- Linux
- Android

-The code should be easily portable to any platform with atomic operations and an mmap-style virtual memory management API.
+The code should be easily portable to any platform with atomic operations and an mmap-style virtual memory management API. The API used to map/unmap memory pages can be configured at runtime to a custom implementation.

This library is put in the public domain; you can redistribute it and/or modify it without any restrictions. Or, if you choose, you can use it under the MIT license.

@@ -18,7 +18,7 @@ Please consider our Patreon to support our work - https://www.patreon.com/rampan
Created by Mattias Jansson ([@maniccoder](https://twitter.com/maniccoder)) / Rampant Pixels - http://www.rampantpixels.com

# Performance
-We believe rpmalloc is faster than most popular memory allocators like tcmalloc, hoard, ptmalloc3 and others without causing extra allocated memory overhead in the thread caches. We also believe the implementation to be easier to read and modify compared to these allocators, as it is a single source file of ~1700 lines of C code.
+We believe rpmalloc is faster than most popular memory allocators like tcmalloc, hoard, ptmalloc3 and others without causing extra allocated memory overhead in the thread caches. We also believe the implementation to be easier to read and modify compared to these allocators, as it is a single source file of ~1800 lines of C code.

Contained in a parallel repository is a benchmark utility that performs interleaved allocations (both aligned to 8 or 16 bytes, and unaligned) and deallocations (both in-thread and cross-thread) in multiple threads. It measures the number of memory operations performed per CPU second, as well as memory overhead by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The number of threads, cross-thread deallocation rate and allocation size limits are configured by command line arguments.

@@ -37,6 +37,8 @@ The easiest way to use the library is simply adding rpmalloc.[h|c] to your proje

__rpmalloc_initialize__ : Call at process start to initialize the allocator

+__rpmalloc_initialize_config__ : Optional entry point to call at process start to initialize the allocator with a custom memory mapping backend and/or memory page size

__rpmalloc_finalize__: Call at process exit to finalize the allocator

__rpmalloc_thread_initialize__: Call at each thread start to initialize the thread local data for the allocator
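
A minimal lifecycle sketch using the entry points above (hedged: whether `rpmalloc_initialize` already initializes the calling thread should be verified against `rpmalloc.h`; `rpmalloc`, `rpfree` and `rpmalloc_thread_finalize` are used here as named in this repository):

```c
#include "rpmalloc.h"

int main(void) {
	rpmalloc_initialize();         // once at process start
	rpmalloc_thread_initialize();  // at each thread start (possibly redundant
	                               // here if initialize covers the calling thread)
	void* block = rpmalloc(128);   // allocate 128 bytes from the thread heap
	rpfree(block);                 // in-thread free is the fast path
	rpmalloc_thread_finalize();    // when the thread stops using the allocator
	rpmalloc_finalize();           // once at process exit
	return 0;
}
```
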
@@ -70,11 +72,13 @@ Integer safety checks on all calls are enabled if __ENABLE_VALIDATE_ARGS__ is de

Asserts are enabled if __ENABLE_ASSERTS__ is defined to 1 (default is 0, or disabled), either on compile command line or by setting the value in `rpmalloc.c`.

+Overwrite and underwrite guards are enabled if __ENABLE_GUARDS__ is defined to 1 (default is 0, or disabled), either on the compile command line or by setting the value in `rpmalloc.c`. This will introduce up to 32 bytes of overhead on each allocation to store magic numbers, which will be verified when freeing the memory block. The actual overhead is dependent on the requested size compared to size class limits.

# Quick overview
The allocator is similar in spirit to tcmalloc from the [Google Performance Toolkit](https://github.com/gperftools/gperftools). It uses separate heaps for each thread and partitions memory blocks according to a preconfigured set of size classes, up to 2MiB. Larger blocks are mapped and unmapped directly. Allocations for different size classes will be served from different sets of memory pages, each "span" of pages being dedicated to one size class. Spans of pages can flow between threads when the thread cache overflows and are released to a global cache, or when the thread ends. Unlike tcmalloc, single blocks do not flow between threads, only entire spans of pages.

# Implementation details
-The allocator is based on 64KiB page alignment, where all runs of memory pages are mapped to 64KiB boundaries. On Windows this is automatically guaranteed by the VirtualAlloc granularity, and on mmap systems it is achieved by atomically incrementing the address where pages are mapped to. By aligning to 64KiB boundaries the free operation can locate the header of the memory block without having to do a table lookup (as tcmalloc does) by simply masking out the low 16 bits of the address.
+The allocator is based on 64KiB page alignment and 16 byte block alignment, where all runs of memory pages are mapped to 64KiB boundaries. On Windows this is automatically guaranteed by the VirtualAlloc granularity, and on mmap systems it is achieved by oversizing the mapping and aligning the returned address internally (see the memory mapping section below). By aligning to 64KiB boundaries the free operation can locate the header of the memory block without having to do a table lookup (as tcmalloc does) by simply masking out the low 16 bits of the address.
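
The masking trick can be illustrated with a short sketch (the granularity constant and helper name here are assumptions for illustration; the real definitions live in `rpmalloc.c`):

```c
#include <stdint.h>

// Spans of pages are mapped on 64KiB boundaries, so the low 16 bits of any
// block address are an offset within its span.
#define SPAN_GRANULARITY ((uintptr_t)64 * 1024)

// Locate the span header for a block without any table lookup by masking
// out the low 16 bits of the block address.
static void* span_for_block(void* block) {
	return (void*)((uintptr_t)block & ~(SPAN_GRANULARITY - 1));
}
```
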

Memory blocks are divided into three categories. Small blocks are [16, 2032] bytes, medium blocks (2032, 32720] bytes, and large blocks (32720, 2097120] bytes. The three categories are further divided into size classes.
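
Given the 16 byte block alignment, the small category plausibly maps to size classes in 16 byte steps; a sketch under that assumption (the actual granularities of all three categories are defined in `rpmalloc.c` and not reproduced here):

```c
#include <stddef.h>

#define SMALL_GRANULARITY 16   // assumed: small blocks sized in 16 byte steps
#define SMALL_SIZE_LIMIT  2032 // upper bound of the small category

// Map a small request size to a size class index: class 0 serves 16 byte
// blocks, class 126 serves 2032 byte blocks.
static size_t small_size_class(size_t size) {
	if (size < SMALL_GRANULARITY)
		size = SMALL_GRANULARITY;
	return ((size + SMALL_GRANULARITY - 1) / SMALL_GRANULARITY) - 1;
}
```
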

@@ -86,6 +90,18 @@ Each span for a small and medium size class keeps track of how many blocks are a

Large blocks, or super spans, are cached in two levels. The first level is a per thread list of free super spans. The second level is a global list of free super spans.

+# Memory mapping
+By default the allocator uses OS APIs to map virtual memory pages as needed, either `VirtualAlloc` on Windows or `mmap` on POSIX systems. If you want to use your own custom memory mapping provider you can use __rpmalloc_initialize_config__ and pass function pointers to map and unmap virtual memory. These functions should reserve and free the requested number of bytes.
+
+The functions do not need to deal with alignment; this is done by rpmalloc internally. However, the map function should ideally return pages aligned to 64KiB boundaries in order to avoid extra mapping requests (see the caveats section below). If the first map call returns an address that is not 64KiB aligned, rpmalloc will immediately unmap that block, issue a new mapping request with the size increased by 64KiB, and then perform the alignment internally. To avoid this double mapping, always return blocks aligned to 64KiB.
+
+Memory mapping requests are always done in multiples of the memory page size. You can specify a custom page size when initializing rpmalloc with __rpmalloc_initialize_config__, or pass 0 to let rpmalloc determine the system memory page size using OS APIs. The page size MUST be a power of two in the [512, 16384] range.
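
A sketch of wiring in a custom mapper on a POSIX system (hedged: the `rpmalloc_config_t` field names `memory_map`, `memory_unmap` and `page_size`, and the callback signatures below, are assumptions based on this description and must be checked against `rpmalloc.h`):

```c
#include <sys/mman.h>
#include "rpmalloc.h"

// Assumed shape: reserve 'size' bytes of virtual memory, or return 0 on failure.
static void* my_memory_map(size_t size) {
	void* ptr = mmap(0, size, PROT_READ | PROT_WRITE,
	                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return (ptr == MAP_FAILED) ? 0 : ptr;
}

// Assumed shape: release a range previously returned by the map callback.
static void my_memory_unmap(void* address, size_t size) {
	munmap(address, size);
}

int main(void) {
	rpmalloc_config_t config = {0};
	config.memory_map = my_memory_map;     // assumed field name
	config.memory_unmap = my_memory_unmap; // assumed field name
	config.page_size = 0;                  // 0 = detect system page size
	rpmalloc_initialize_config(&config);
	void* block = rpmalloc(100);
	rpfree(block);
	rpmalloc_finalize();
	return 0;
}
```
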

+# Memory guards
+If you define __ENABLE_GUARDS__ to 1, all memory allocations will be padded with extra guard areas before and after the memory block (while still honoring the requested alignment). These dead zones are filled with a pattern and checked when the block is freed. If the patterns are not intact, the callback set in the initialization config is called, or, if no callback is set, an assert is fired.
+
+Note that the end of the memory block in this case is defined by the total usable size of the block as returned by `rpmalloc_usable_size`, which can be larger than the size passed to the allocation request due to size class buckets.
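
A sketch of hooking the guard failure callback (compile with `ENABLE_GUARDS` defined to 1; the callback field name `memory_overwrite` and its signature are assumptions and must be verified against `rpmalloc.h`):

```c
#include <stdio.h>
#include "rpmalloc.h"

// Assumed signature: called with the address of a block whose guard
// pattern was found corrupted when the block was freed.
static void guard_failure(void* address) {
	fprintf(stderr, "guard bytes corrupted around block %p\n", address);
}

int main(void) {
	rpmalloc_config_t config = {0};
	config.memory_overwrite = guard_failure; // assumed field name
	rpmalloc_initialize_config(&config);
	char* data = rpmalloc(32);
	// Writing past rpmalloc_usable_size(data) bytes would clobber the
	// trailing dead zone and trigger guard_failure on rpfree.
	rpfree(data);
	rpmalloc_finalize();
	return 0;
}
```
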

# Memory fragmentation
There is no memory fragmentation by the allocator in the sense that it will not leave unallocated and unusable "holes" in the memory pages by calls to allocate and free blocks of different sizes. This is due to the fact that the memory pages allocated for each size class are split up into perfectly aligned blocks which are not reused for a request of a different size. A block freed by a call to `rpfree` will always be immediately available for an allocation request within the same size class.

@@ -106,12 +122,12 @@ Since each thread cache maps spans of memory pages per size class, a thread that

An application that has a producer-consumer scheme between threads, where one thread performs all allocations and another frees all memory, will have sub-optimal performance, since blocks crossing thread boundaries are freed in a two step process - first deferred to the allocating thread, then freed when that thread needs more memory pages for the requested size class. However, depending on the use case, the performance overhead might be small.

-Threads that perform a lot of allocations and deallocations in a pattern that have a large difference in high and low water marks, and that difference is larger than the thread cache size, will put a lot of contention on the global cache. What will happen is the thread cache will overflow on each low water mark causing pages to be released to the global cache, then underflow on high water mark causing pages to be re-aqcuired from the global cache.
+Threads that perform a lot of allocations and deallocations in a pattern that has a large difference between high and low water marks, where that difference is larger than the thread cache size, will put a lot of contention on the global cache. The thread cache will overflow at each low water mark, causing pages to be released to the global cache, then underflow at the high water mark, causing pages to be re-acquired from the global cache.

# Caveats
Cross-thread deallocations are more costly than in-thread deallocations, since the spans are completely owned by the allocating thread. The free operation will be deferred using an atomic list operation and the actual free operation will be performed when the owner thread requires a new block of the corresponding size class.

-VirtualAlloc has an internal granularity of 64KiB. However, mmap lacks this granularity control, and the implementation instead keeps track and atomically increases a running address counter of where memory pages should be mapped to in the virtual address range. If some other code in the process uses mmap to reserve a part of virtual memory spance this counter needs to catch up and resync in order to keep the 64KiB granularity of span addresses, which could potentially be time consuming.
+VirtualAlloc has an internal granularity of 64KiB. However, mmap lacks this granularity control, so the implementation instead oversizes the memory mapping by 64KiB in order to always return a memory area with this alignment. Since the extra memory pages are never touched this will not result in extra committed physical memory pages, but only increases the virtual memory address space used.

The free, realloc and usable size functions all require the passed pointer to lie within the first 64KiB span of the memory block; you cannot pass an arbitrary interior pointer from the memory block address range.

3 changes: 1 addition & 2 deletions build/ninja/android.py
@@ -3,7 +3,6 @@
"""Ninja toolchain abstraction for Android platform"""

import os
-import urlparse
import subprocess

import toolchain
@@ -87,7 +86,7 @@ def initialize_toolchain(self):
				self.hostarchname = 'linux-x86_64'
			else:
				self.hostarchname = 'linux-x86'
-		elif self.host.is_macosx():
+		elif self.host.is_macos():
			self.hostarchname = 'darwin-x86_64'

	def build_toolchain(self):
