Related to #337 (this option is simpler to implement and reason about) - this would instead have a small fixed number of locks (e.g. `NUM_SHARDS=4`) and either:

- A cache entry would use `cache_entry_index % NUM_SHARDS` as the sharded lock id - that should reduce write contention. This would improve both read and write throughput in most cases (as long as reads/writes are spread out over different keys, which is very common).
- The process id (or ZTS thread id) would be used to reduce memory contention for memory reads, with `initial_process_or_thread_id % NUM_SHARDS` - a cache write would need to acquire all write locks in order to find and write to an entry. PHP code that benefits from it could optionally be given an API to override the lock id, e.g. `pcntl_set_lock_id_hint(int $id)` to set the lock id to `$id % NUM_SHARDS` (treating `$id` as unsigned) after `pcntl_fork`, though uses of `pcntl_fork` similar to the included benchmark are probably rare. This would improve read throughput a lot at the cost of write throughput, and would have worse performance by default with `pcntl_fork` (calling `getmypid()` in every call to apcu would probably harm performance).

Anything needing to write to the entire cache would also acquire all `NUM_SHARDS` locks, always in the same order (to avoid deadlock). Padding the lock values to at least the cache line size of 64 bytes (e.g. with a C union) may also help in reducing contention - see https://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size. A rough sketch of this layout is below.
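As a rough illustration of the idea, here's a minimal C sketch of what the sharded locks could look like. The names (`apc_sharded_lock_t`, `lock_for_entry`, etc.) and layout are assumptions for illustration only, not APCu's actual internals:

```c
/* Minimal sketch only - names and layout are assumptions, not APCu's real code. */
#include <pthread.h>
#include <stddef.h>

#define NUM_SHARDS 4
#define CACHE_LINE_SIZE 64

/* Pad each lock to a full cache line so that lock traffic on one shard
 * doesn't cause false sharing with the neighbouring shard's lock. */
typedef union {
    pthread_rwlock_t lock;
    char pad[CACHE_LINE_SIZE];
} apc_sharded_lock_t;

/* In practice these would live in the shared memory segment. */
static apc_sharded_lock_t shard_locks[NUM_SHARDS];

static void init_shard_locks(void) {
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    /* The locks are shared between forked php processes. */
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    for (int i = 0; i < NUM_SHARDS; i++) {
        pthread_rwlock_init(&shard_locks[i].lock, &attr);
    }
    pthread_rwlockattr_destroy(&attr);
}

/* Option 1: pick the shard from the cache entry's index/slot. */
static pthread_rwlock_t *lock_for_entry(size_t cache_entry_index) {
    return &shard_locks[cache_entry_index % NUM_SHARDS].lock;
}

/* Option 2: pick the shard from the initial process/thread id; readers
 * only touch "their" lock, but writers must take all NUM_SHARDS locks. */
static pthread_rwlock_t *lock_for_reader(unsigned long process_or_thread_id) {
    return &shard_locks[process_or_thread_id % NUM_SHARDS].lock;
}

/* Whole-cache writes (clear, expunge, and all writes under option 2) take
 * every shard lock in ascending index order so concurrent callers can't
 * deadlock by acquiring the same locks in opposite orders. */
static void lock_all_for_write(void) {
    for (int i = 0; i < NUM_SHARDS; i++) {
        pthread_rwlock_wrlock(&shard_locks[i].lock);
    }
}

static void unlock_all(void) {
    for (int i = NUM_SHARDS - 1; i >= 0; i--) {
        pthread_rwlock_unlock(&shard_locks[i].lock);
    }
}
```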
Existing notes on APCu performance

I started looking into this because I didn't see it documented/referenced anywhere.
For read throughput, I was seeing 1,200,000*8 operations per second for scalar values when running 8 forked php processes on Linux (4 physical core cpu, 8 logical cores) - a rough estimate of 10 million scalar read operations per second when using pthread_rwlock_t. Scalars were used for this benchmark to reduce time spent in places other than locking, to get an idea of locking throughput.

- Some other locking implementations, such as mutexes, would have worse read throughput, since a mutex is always held exclusively and can't be shared by concurrent readers.
- If I run independent php process groups (e.g. start multiple instances of the benchmarking script as separate process groups, or run multiple apache worker pools), there is no interference between their apcu usage - throughput is unaffected.
- I assume there's some memory contention from the writes to the lock's own memory that acquiring even a read lock requires, as well as from atomic increments of stats when individual cache entries are accessed simultaneously (see the read-path sketch below).
- Processes can share read locks, so I'd hope the size of the data being unserialized/copied shouldn't affect lock throughput, but I haven't checked.

This isn't a bottleneck for me - even if a web application were serving 100 requests per second on each core of a 128 core CPU, and each request made 100 calls to apcu_fetch in a single apache (httpd) pool, that would be 100 * 128 * 100 = 1,280,000 operations per second, well below the roughly 10 million read operations per second measured above.
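To make the contention note above concrete, this is roughly the shape of the read path being benchmarked. This is a hedged sketch: `cache_find`, `copy_value`, and the `nhits` field are placeholders for illustration, not APCu's actual functions or struct layout:

```c
/* Sketch of a read path; cache_find/copy_value/nhits are placeholders. */
#include <pthread.h>
#include <stdint.h>

typedef struct {
    uint64_t nhits;   /* per-entry hit counter (stats) */
    /* ... key, value, ttl, etc. ... */
} cache_entry_t;

extern pthread_rwlock_t cache_lock;                    /* process-shared rwlock */
extern cache_entry_t *cache_find(const char *key);     /* hash table lookup */
extern void copy_value(void *dst, const cache_entry_t *entry);

int cache_fetch(const char *key, void *dst) {
    /* Any number of processes may hold the read lock at once (unlike a
     * mutex), but acquiring it still writes to the lock's shared state,
     * so the cache line holding the lock bounces between cores. */
    pthread_rwlock_rdlock(&cache_lock);
    cache_entry_t *entry = cache_find(key);
    if (entry) {
        /* Stats updates are another source of contention when the same
         * entry is fetched from many cores at the same time. */
        __atomic_fetch_add(&entry->nhits, 1, __ATOMIC_RELAXED);
        /* Copying/unserializing happens while the read lock is held, but
         * since readers share the lock this shouldn't limit how many
         * readers run in parallel. */
        copy_value(dst, entry);
    }
    pthread_rwlock_unlock(&cache_lock);
    return entry != NULL;
}
```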
If someone were to hit that as a bottleneck (e.g. hosting their site on a single many-core server and using apcu similarly to how memcached/redis is used, or fetching a lot of configuration from apcu without caching it), they'd have workarounds:

- If possible, run multiple web server pools with separate process groups (this means they will use separate apcu shared memory) on different ports.
Benchmarking code

Run on a 4 physical core (8 logical core) CPU in PHP 7.4:
```php
<?php
// apcu was recompiled without cache

function test_apcu(int $n, $value) {
    for ($i = 0; $i < 16; $i++) {
        apcu_store('key' . $i, $value);
    }
    if (!apcu_store('key', $value)) {
        var_export(apcu_sma_info());
        throw new RuntimeException("Failed to store value\n");
    }
    $ser = serialize($value);
    $start = hrtime(true);
    for ($i = 0; $i < $n; $i++) {
        // apcu_store throughput, empty new stdClass():
        //   100000 per second with 8 processes (4 real cores with 2 threads - effectively 4 since all of this is CPU-bound)
        //   230000 per second with 4 processes
        //   490000 per second with 2 processes
        //  3070000 per second with 1 process
        //apcu_store('key', $value); // using different keys doesn't have much difference

        // unserialize only, no apcu (cpu benchmarking only):
        //  1420000 per second with 8 processes
        //  3080000 per second with 4 processes
        //  3680000 per second with 2 processes
        //  3770000 per second with 2 processes
        //unserialize($ser);

        // apcu_fetch throughput, empty new stdClass(), default serializer:
        //   350000 per second with 32
        //   860000 per second with 8 processes (97 no lock)
        //  1290000 per second with 4 processes (145 no lock, 170 no stats, 190 separate keys)
        //  1760000 per second with 2 processes (230 no lock)
        //  2660000 per second with 1 process (260 no lock)

        // apcu_fetch throughput, value=42, storing raw value:
        //  1200000 per second with 8 processes
        //  1790000 per second with 4 processes (4100000 no lock)
        //  3000000 per second with 2 processes
        //  9850000 per second with 1 process
        apcu_fetch('key' . ($i & 15), $value);
    }
    $end = hrtime(true);
    $elapsed = ($end - $start) / 1e9;
    printf("APCu Elapsed: %.6f throughput %10.0f\n", $elapsed, (1 / $elapsed * $n));
    //printf("count=%d\n", count((array)apcu_fetch('key', $value)));
}

const N = 4000000;
const NUM_PROCESSES = 4;
$value = 42; //new stdClass();
printf("Testing with %d processes\n", NUM_PROCESSES);
$pids = [];
for ($i = 0; $i < NUM_PROCESSES; $i++) {
    $result = pcntl_fork();
    if ($result <= 0) {
        //ob_start();
        //$outname = "out-" . getmypid() . ".out";
        //file_put_contents($outname, ob_get_clean());
        //test_shm(N, new stdClass());
        test_apcu(N, $value);
        //file_put_contents($outname, ob_get_clean());
        exit(0);
    } else {
        $pids[] = $result;
        echo "Child pid is $result\n";
    }
}
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
    echo "status of $pid is $status\n";
}
```
TysonAndre/immutable_cache-pecl@9967889 is what I'd had in mind - there's around a 2x speedup even for read-only workloads on an 8 logical core (4 physical core) machine, and it'd probably be more noticeable on servers with really high processor counts.

Also, writes would no longer block reads to other shards, which would improve read throughput on servers.

There would be a global lock that's only acquired for writes, in addition to acquiring the lock for `key_hash % shards` (shards==slots) - roughly as in the sketch below.
I'm not completely sure whether there are edge cases I'm missing, since APCu is much more complex (eviction, gc, full cache clears, etc.) - e.g. which read operations would trigger writes via garbage collection.
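For reference, here is that locking scheme reduced to a small C sketch. The function and variable names are my own placeholders (not immutable_cache's or APCu's real API), and it glosses over the eviction/gc/clear details mentioned above:

```c
/* Sketch of "global lock taken only for writes, plus a per-shard lock chosen
 * by key_hash % shards" - names are assumptions, not the real API. */
#include <pthread.h>

#define NUM_SHARDS 4

extern pthread_mutex_t  global_write_lock;         /* taken by writers only */
extern pthread_rwlock_t shard_locks[NUM_SHARDS];   /* one rwlock per shard */

/* Readers only touch their shard's lock, so a write to one shard no longer
 * blocks reads of the other shards. */
static void shard_read_lock(unsigned long key_hash) {
    pthread_rwlock_rdlock(&shard_locks[key_hash % NUM_SHARDS]);
}

static void shard_read_unlock(unsigned long key_hash) {
    pthread_rwlock_unlock(&shard_locks[key_hash % NUM_SHARDS]);
}

/* Writers take the global lock first, then their shard's write lock.
 * A cache-wide operation (clear, gc) can hold the global lock and then take
 * every shard's write lock in ascending order without risking deadlock. */
static void shard_write_lock(unsigned long key_hash) {
    pthread_mutex_lock(&global_write_lock);
    pthread_rwlock_wrlock(&shard_locks[key_hash % NUM_SHARDS]);
}

static void shard_write_unlock(unsigned long key_hash) {
    pthread_rwlock_unlock(&shard_locks[key_hash % NUM_SHARDS]);
    pthread_mutex_unlock(&global_write_lock);
}
```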