- Overview
  1.1 Prerequisites
  1.2 How to run test
- CPU and Memory analysis
  2.1 Execution environment
  2.2 CPU analysis - 500,000 numbers, 1500 threads
  2.3 CPU analysis - 50,000 numbers, 50,000 threads
  2.4 Heap analysis - 50,000 numbers, 50,000 threads
  2.5 Lock analysis
- Automation - benchmark for performance comparison
  3.1 Java Microbenchmark Harness (JMH)
  3.2 JMH configuration
  3.2.1 JMH Benchmark mode
  3.2.2 Warmup iterations
  3.2.3 Avoiding dead code elimination by JVM
  3.2.4 SetUp / TearDown
- Code enhancements
  4.1 "isPrime(...)" method
  4.2 Concurrency level and execution
  4.3 Redundant use of BigIntegerIterator
- Existing algorithms
- Experiments
  6.1 Test configuration
  6.2 Baseline - benchmark of original implementation
  6.3 Elimination of excessive objects
  6.4 Change of concurrency level and management
- Results
  7.1 Visualization
  7.2 Comparison of implemented algorithms
- Observed issues
  8.1 MacBook Air M1 - thread limitations
  8.2 YourKit - allocation profiling issue
- Troubleshooting
  9.1 Missing /META-INF/BenchmarkList @ JMH start up
Qualification, characterization and improvement of the performance of a prime number calculator.
CPU, RAM and Lock analysis has been done via the YourKit Java profiler; the comparison between different implementations has been done via Java Microbenchmark Harness (JMH). In order to simplify the comparison between builds, automation for the visualization of JMH results has been implemented.
The repository contains the original code sample, YourKit's CPU and Allocation profiling data, its analysis and the code enhancements.
To view the final results, please see 7.2 Comparison of implemented algorithms.
- JRE 8. Please note that a JRE is sufficient to use the solution itself, yet the performance test automation (Java Microbenchmark Harness) requires a JDK.
- (optional) YourKit Java Profiler - visualization of CPU / allocation samples.
- (optional) Python 3 - visualization of results.
To build the solution and execute the performance tests, run the following command:
./gradlew clean build jmh
The results will be available in CSV format at build/reports/benchmarks.csv. To change the format or file location, modify the jmh {...} section of build.gradle:
jmh {
...
resultFormat = 'csv'
resultsFile = file('build/reports/benchmarks.csv')
}
This section contains details of the performance analysis conducted on the original implementation of the prime number calculator.
Each test was executed locally on laptops that were available to me.
- Max prime number: 50,000 - 500,000. This range was chosen in order to get a sufficient number of CPU samples during profiling.
- JVM Options.
-Xms4096m -Xmx4096m -Xss1024k
The available hardware made some code changes necessary for testing purposes:
MacBook Air (M1, 2020)
- The JVM supports only about 4,000 threads. Therefore, the thread pool's capacity was modified in some experiments (see: 2.2 CPU analysis - 500,000 numbers, 1500 threads and 8.1 MacBook Air M1 - thread limitations).
Prior to the execution of the experiment, the thread pool's capacity was reduced. In the original implementation, it was either 3000 or maxPrime / 100, whichever was higher. Such a configuration led to a Java OOM on my MacBook Air (M1, 2020).
After multiple failed attempts to eliminate it (see: 8.1 MacBook Air M1 - thread limitations), I decided to reduce the number of threads in the pool to 1500 rather than lower maxPrime; otherwise the number of samples would be relatively small, which would make it harder to analyze the results of CPU and Heap profiling.
public static List<Integer> getPrimes(int maxPrime) throws InterruptedException {
// ExecutorService executors = Executors.newFixedThreadPool(Math.max(maxPrime / 100, 3000));
ExecutorService executors = Executors.newFixedThreadPool(1500);
...
}
Afterwards, the application was profiled using YourKit.
A simpler view is available in the form of a flame graph:
Looking at 500k-primes-1500-threads-PrimeCalculator-2022-07-28.snapshot, the following observations were made:
- Issue 1. The samples link to the removal of non-prime numbers - primeNumbers.remove(toRemove). Given that primeNumbers is an instance of LinkedList, each removal requires traversing the list to look up the element and remove it, which is an O(N) operation. Setting aside changes to the business logic itself (such as not using a collection to store non-prime numbers at all), a more suitable collection for this use case would be HashSet, as it removes an element in O(1) time on average.
- Issue 2 & 3. Both are related to inefficient control of the application workflow via Exception instances. Depending on the stack trace, stack depth and type, creating an Exception instance is expensive. Considering how frequently they are created in our case, the impact on performance (CPU, Heap and, as a result, GC) is inevitable. A simpler and less expensive approach would be to use the boolean type.
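To illustrate Issue 1, here is a minimal sketch that removes the same values from a `LinkedList` and from a `HashSet`. The class and method names are hypothetical and not taken from the repository; the timings it prints depend on the machine and are not a substitute for the profiling data above.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original PrimeCalculator.
class RemovalSketch {

    // Fills the given collection with the integers 2..max (inclusive).
    static void fill(Collection<Integer> target, int max) {
        for (int i = 2; i <= max; i++) {
            target.add(i);
        }
    }

    // Removes every even number greater than 2 and returns the elapsed nanoseconds.
    static long removeEvens(Collection<Integer> target, int max) {
        long start = System.nanoTime();
        for (int i = 4; i <= max; i += 2) {
            // Through a Collection reference this is always remove(Object):
            // LinkedList scans for the element, HashSet does a hash lookup.
            target.remove(Integer.valueOf(i));
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int max = 20_000;
        List<Integer> linked = new LinkedList<>();
        Set<Integer> hashed = new HashSet<>();
        fill(linked, max);
        fill(hashed, max);
        System.out.println("LinkedList removal, ns: " + removeEvens(linked, max));
        System.out.println("HashSet removal, ns:    " + removeEvens(hashed, max));
    }
}
```

Both collections end up with the same content; only the cost of each `remove(...)` call differs.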
Snapshot: 500k-primes-1500-threads-PrimeCalculator-2022-07-28.snapshot
Unlike 2.2 CPU analysis - 500,000 numbers, 1500 threads, this experiment was conducted on a Windows laptop, which I was able to use for a day. In that environment, the JVM was capable of running ~50,000 threads, which is enough for efficient profiling of the application.
Thus, the logic from the original implementation was used: the thread pool's capacity is either 3000 or maxPrime / 100, whichever is higher.
public static List<Integer> getPrimes(int maxPrime) throws InterruptedException {
ExecutorService executors = Executors.newFixedThreadPool(Math.max(maxPrime / 100, 3000));
...
}
Looking at 50k-primes-cpu-PrimeCalculator.snapshot, the following observations could be made:
- Most (92%) of the operations within the CPU samples are dedicated to thread initialization, while only 7% are dedicated to the actual business logic - the determination of prime numbers.
- 77% of such operations are related to thread creation.
Therefore, the following changes could be suggested:
- Reduce the concurrency level. Each thread requires its own stack. Such an immense number of threads - 3000 at minimum - introduces significant context switching time and requires memory allocation.
- Use a different thread pool implementation. The thread pool used within the original implementation - newFixedThreadPool(...) - creates a fixed number of worker threads (nThreads, at least 3000 here) and a single queue of tasks (each task checks whether a number is prime). All the tasks are put onto one blocking queue (see the implementation of newFixedThreadPool(...) below); thus, given our concurrency level, significant contention within the queue is unavoidable.
...
public static ExecutorService newFixedThreadPool(int nThreads) {
return new ThreadPoolExecutor(nThreads, nThreads,
0L, TimeUnit.MILLISECONDS,
new LinkedBlockingQueue<Runnable>());
}
...
Looking at snapshots/allocation/60k-allocation-profiling-original.csv, the following observations could be made:
- Redundant use of BigIntegerIterator. We generate BigIntegerIterator instances for the range [2; maxPrime] and append them to a Collection. The following aspects could be identified: (a) BigIntegerIterator contains two collections with an excessive initial capacity of 500 elements, while we append at most 1 element to each; (b) we don't need two separate collections to store Integer values (reference) and their string representations (contain); (c) the BigIntegerIterator class is redundant - it could be replaced by operations on the collection itself.
class BigIntegerIterator {
private final List<String> contain = new ArrayList<>(500);
private final List<Integer> reference = new ArrayList<>(500);
...
}
...
List<BigIntegerIterator> myFiller = Stream.generate(new Supplier<BigIntegerIterator>() {
...
@Override
public BigIntegerIterator get() {
return new BigIntegerIterator(i++);
}
}).limit(maxPrime).collect(Collectors.toList());
- Creation of Runnable tasks. Given the nature of the application - concurrent determination of prime numbers - the creation of such objects is reasonable. Yet, as stated in the CPU profiling section (2.3 CPU analysis - 50,000 numbers, 50,000 threads), the concurrency level and thread pool implementation could be changed.
- Conversion of String to Integer. As stated in (1), there's no need to store an integer value and its string representation separately. Thus, this part of the code could be eliminated.
- Creation of a lambda for the executor service task. As stated in (2), given the concurrent nature of the application, it's reasonable to have such objects on the heap.
- Creation of a sublist and iteration over it. The logic for prime number determination creates a sublist of divisors for each prime number candidate. Given that we know the bounds of the range of potential candidates, there's no need to generate a separate collection for it on each call of isPrime(...).
private static void isPrime(List<Integer> primeNumbers, Integer candidate) throws Exception {
for (Integer j : primeNumbers.subList(0, candidate - 2)) {
if (candidate % j == 0) {
throw new Exception();
}
}
...
}
- Creation of Exception instances to control the application workflow. As stated in 2.2 CPU analysis - 500,000 numbers, 1500 threads, the use of exceptions is redundant, especially considering the performance impact it causes via additional CPU and Heap pressure. The generation of Exception instances could be replaced with returning a primitive boolean value from isPrime(...).
- Excessive concurrency level. The original implementation creates a pool of at least 3000 threads. Depending on the environment, this might lead to excessive use of RAM (thread stacks) and native OS threads. As a result, in multiple environments, such an approach would lead to a Java OutOfMemoryError due to the inability to create a new thread.
ExecutorService executors = Executors.newFixedThreadPool(Math.max(maxPrime / 100, 3000));
...
java.lang.OutOfMemoryError: unable to create new native thread
Excessive allocation of objects could be found in:
- Creation of excessive threads. Each thread occupies its own stack, which leads to OOM; reducing the concurrency level would avoid that.
During CPU profiling, YourKit reported a potential deadlock.
My assumption is that it's not a logical deadlock, but rather an indicator that multiple threads were waiting to acquire resources for more than 10 seconds. Such behavior is caused by the combination of the following factors:
- Executors.newFixedThreadPool(...) uses LinkedBlockingQueue for executable tasks.
- primeNumbers are stored within a synchronized LinkedList.
- primeNumbersToRemove are stored in a synchronized LinkedList and accessed within a synchronized block.
...
List<Integer> primeNumbers = Collections.synchronizedList(new LinkedList<>());
List<Integer> primeNumbersToRemove = Collections.synchronizedList(new LinkedList<>());
synchronized (primeNumbersToRemove) {
...
}
...
The following points for improvement could be made:
- primeNumbers is always accessed by a single thread, thus it could be a regular collection.
- primeNumbersToRemove is a synchronizedList, yet it's being modified within a synchronized(primeNumbersToRemove) block, making the application logically single-threaded in this area. We would have needed synchronized if we had been iterating over primeNumbersToRemove, but we iterate over primeNumbers instead and call only the add(...) method, which uses a mutex internally.
static class SynchronizedList<E> ... {
public void add(int index, E element) {
synchronized (mutex) {list.add(index, element);}
}
...
}
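As a small illustration of the point about `add(...)`, the sketch below has several threads append to a `Collections.synchronizedList(...)` wrapper without any external `synchronized` block. The class name, thread count and element count are arbitrary assumptions made for the demo, not values from the repository.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

// Hypothetical demo, not part of the original PrimeCalculator.
class SynchronizedListSketch {

    // Starts `threads` threads that each append `perThread` values to one
    // synchronized list and returns the final size of that list.
    static int concurrentAdds(int threads, int perThread) {
        List<Integer> shared = Collections.synchronizedList(new LinkedList<>());
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    shared.add(i); // the wrapper locks its internal mutex here
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for workers", e);
            }
        }
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(concurrentAdds(8, 10_000)); // prints 80000
    }
}
```

No element is lost even though the callers never synchronize themselves. An external `synchronized` block would still be required when iterating over the wrapped list.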
Based on CPU, RAM and Lock analysis, we could make enhancements to the application: data structures, concurrency, application workflow.
To compare performance meaningfully, it would be useful to understand how a user would perceive it.
Profiling of the application provides significant benefits while investigating its behavior: resource consumption (CPU / RAM / Heap / off-heap), object allocation, state of threads. Using the conclusions drawn from such analysis, developers could enhance application performance and stability.
While being useful for the analysis, profiling data might be hard to use for comparing the performance of 2 (or more) versions of the application, since the majority of profilers use CPU sampling instead of wall-clock time. Instead, in order to determine the performance of the application in different use-cases, we could use benchmarking - a programmatic way to configure, execute and measure the useful work of business logic from the user's perspective.
As an outcome of benchmarking, we'd obtain numerical characteristics, which we could use to qualify and characterize the performance of different versions of the app.
Java Microbenchmark Harness (JMH) is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.
Within the implementation, we use the following JMH-related dependencies:
- jmh-core - business logic of the microbenchmark harness.
- jmh-generator-annprocess - annotation processor for simplified configuration and usage of the JMH API.
This section describes the core configuration options for JMH, the tool used within the automation for performance tests.
JMH has the following modes of execution (java doc):
- Throughput - measures the number of operations per second, i.e. the number of times per second the method could be executed. Given the nature of the application (concurrent detection of prime numbers), it would be better to focus on duration rather than throughput.
- Average time - measures the average time of a single execution. "Average" wouldn't be a representative metric due to GC pauses. It might be more convenient for us to get a complete distribution of measurements (1st - 100th percentiles).
- Sample time - measures how long it takes for the benchmark method to execute, including max, min time etc. Such a distribution of values should be convenient for our case.
- Single shot time - measures how long a single benchmark method execution takes, which doesn't include JVM warm up. Given the nature of our application, a single method execution should be a sufficient measurement.
- All - runs all benchmark modes. This is mostly useful for internal JMH testing due to significant overhead.
Given all the above, Sample time mode would provide duration metrics, which we'd be able to use for the comparison of different PrimeCalculator versions.
The distribution of such values (1st - 100th percentile) would allow us to make a precise comparison and leave out the intermittent pauses during runtime.
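To show why a percentile can be more robust than an average when occasional pauses occur, here is a minimal sketch using the nearest-rank definition of a percentile. The sample values are invented for illustration, the class is hypothetical, and JMH may compute its percentiles differently.

```java
import java.util.Arrays;

// Hypothetical helper; the sample durations below are made up.
class PercentileSketch {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static double percentile(double[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Nine ordinary durations (ms) and one outlier, e.g. caused by a GC pause.
        double[] samples = {1.2, 1.3, 1.1, 1.4, 1.2, 1.3, 1.2, 1.1, 1.3, 250.0};
        double mean = Arrays.stream(samples).average().orElse(Double.NaN);
        System.out.println("mean: " + mean);                      // about 26.11, dominated by the outlier
        System.out.println("p50:  " + percentile(samples, 50));   // 1.2
        System.out.println("p90:  " + percentile(samples, 90));   // 1.4
        System.out.println("p100: " + percentile(samples, 100));  // 250.0
    }
}
```

The single outlier shifts the mean by an order of magnitude, while the 50th and 90th percentiles stay close to the typical duration.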
Given that PrimeCalculator is a Java application, its first invocation would be slower than the following ones.
During the initial execution, additional time is spent on lazy class loading and JIT compilation.
By running a number of iterations that are not included in the measurement - a "warmup" - all classes would be loaded and cached beforehand, so they can be accessed without delay at runtime during the primary phase of the benchmark.
While conducting performance experiments, it is useful to simulate a workload that is close to a real-world scenario.
If the result of the benchmarked method - getPrimes(...) - isn't used anywhere, the JVM could detect that and apply related optimizations (dead code elimination), which would distort the performance measurements.
To exclude such situations, JMH provides the Blackhole object, which can be used as a consumer of the output of the benchmarked method.
That prevents unwanted dead code elimination by the JVM.
Given the nature of the original method and the fact that it generates the sequence of numbers on demand, no set up or tear down actions are needed.
This section lists the enhancements that were made to the original implementation of the prime number calculator.
private static void isPrime(List<Integer> primeNumbers, Integer candidate) throws Exception {
for (Integer j : primeNumbers.subList(0, candidate - 2)) {
if (candidate % j == 0) {
throw new Exception();
}
}
}
- Problem description: control flow via exceptions and creation of a sublist on each call (see 2.2 and 2.4).
- Replace subList(...) with a for-loop. Even divisors are skipped in the loop.
New:
private static boolean isPrime(int number) {
....
// sequentially check for other numbers
for (int i = 3; i < number; i+= 2) {
if (number % i == 0) {
return false;
}
}
return true;
}
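For reference, a complete variant of the method could look like the sketch below. The part elided above is not shown in this document, so the handling of values below 2, of 2 itself and of even numbers is an assumption about what it does; the class name is hypothetical as well.

```java
// Hypothetical, self-contained variant; the guard clauses are assumptions
// about the part that is elided in the listing above.
class IsPrimeSketch {

    static boolean isPrime(int number) {
        if (number < 2) {
            return false; // 0, 1 and negative numbers are not prime
        }
        if (number == 2) {
            return true;  // the only even prime
        }
        if (number % 2 == 0) {
            return false; // every other even number is composite
        }
        // sequentially check odd divisors only
        for (int i = 3; i < number; i += 2) {
            if (number % i == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrime(2));  // true
        System.out.println(isPrime(9));  // false
        System.out.println(isPrime(97)); // true
    }
}
```

The result is communicated via the return value, so no `Exception` instance is created for composite numbers.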
The original implementation uses thread pool with at least 3000 threads.
public static List<Integer> getPrimes(int maxPrime) throws InterruptedException {
...
ExecutorService executors = Executors.newFixedThreadPool(Math.max(maxPrime / 100, 3000));
...
}
ThreadPoolExecutor has one queue (LinkedBlockingQueue) of tasks.
During execution, each worker thread locks the queue, dequeues a task and releases the lock.
If tasks are short (non-IO-bound), there's a lot of contention within the queue.
One alternative would be a lock-free queue; however, that would result in different issues related to the distribution of tasks.
Another alternative is work stealing, where each thread has its own queue. When a thread runs out of tasks, it "steals" tasks from other threads. Thus, the contention between threads is lower.
In Java, the work-stealing technique is implemented within the ForkJoin framework.
ForkJoinPool, according to its documentation, attempts to keep the given number of threads active at any moment of time.
We could create a pool with the ForkJoin work-stealing model via the newWorkStealingPool(...) method. Unlike ForkJoinPool.commonPool(), it creates a pool in asynchronous mode with first-in-first-out (FIFO) queue configuration, which reduces contention between idle workers.
As a result, the code had been changed to the following:
public static List<Integer> getPrimes(int maxPrime) throws InterruptedException {
...
final int cores = Runtime.getRuntime().availableProcessors();
ExecutorService executors = Executors.newWorkStealingPool(cores);
...
}
A note on concurrency level: by default, I'm using the number of available processors as the concurrency level here. However, assuming the algorithm would be a part of server-side logic, instead of hard-coding the value, I'd let the user set it via configuration options.
Thus, if the application is launched in different, shared environments, such as a Kubernetes cluster, the user would be able to explicitly define the concurrency level.
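One possible way to make the concurrency level configurable is sketched below. The system property name `primes.concurrency` and the class are hypothetical; the repository does not define such an option. The fallback is the number of available processors, as in the snippet above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; "primes.concurrency" is an invented property name.
class ConcurrencyConfigSketch {

    static final String PROPERTY = "primes.concurrency";

    // Returns the configured level if it is a positive integer, else the fallback.
    static int resolveConcurrency(String configured, int fallback) {
        if (configured == null || configured.trim().isEmpty()) {
            return fallback;
        }
        try {
            int parsed = Integer.parseInt(configured.trim());
            return parsed > 0 ? parsed : fallback;
        } catch (NumberFormatException e) {
            return fallback; // ignore malformed values instead of failing at startup
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int level = resolveConcurrency(System.getProperty(PROPERTY), cores);
        ExecutorService executors = Executors.newWorkStealingPool(level);
        System.out.println("Concurrency level: " + level);
        executors.shutdown();
    }
}
```

With this sketch, the level could be passed at launch, for example via `-Dprimes.concurrency=4`. This can matter in containers, where the processor count reported to the JVM may not match the CPU share the deployment is meant to use.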
As stated in both CPU and Heap analysis, because we use 2 collections within each BigIntegerIterator instance and keep a separate collection with all non-prime numbers that have to be removed (primeNumbersToRemove), we introduce significant CPU / RAM pressure and, therefore, a performance penalty.
In that regard, the following enhancements have been made:
- The BigIntegerIterator class has been eliminated.
- Generation of a Collection containing all integers from the range [2; maxPrime] prior to the determination of prime numbers has been replaced with a simple for-loop.
- The separate collection for non-prime numbers has been eliminated. Now we store only prime numbers, which are used as the return value.
public static List<Integer> getPrimes(int maxPrime) throws InterruptedException {
...
ConcurrentLinkedQueue<Integer> primeNumbersQueue = new ConcurrentLinkedQueue<>();
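// one countDown() per candidate in [2; maxPrime], i.e. maxPrime - 1 tasks
CountDownLatch latch = new CountDownLatch(maxPrime - 1);
for (int i = 2; i <= maxPrime; i++) {
// a lambda may only capture effectively final variables
final int candidate = i;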
executors.submit(() -> {
if (isPrime(candidate)) {
primeNumbersQueue.add(candidate);
}
latch.countDown();
});
}
...
return Arrays.asList(primeNumbersQueue.toArray(new Integer[0]));
}
An alternative to enhancing the current approach would be to use an existing algorithm for the determination of prime numbers.
An example of such an algorithm is the Sieve of Eratosthenes, which finds prime numbers up to a given limit. It's based on sequentially marking the multiples of each prime as composite.
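A minimal single-threaded sketch of the Sieve of Eratosthenes is shown below. It is written for this explanation and is not necessarily identical to the implementation benchmarked in the repository; the class name is hypothetical, and the code assumes limits in the range used by the experiments (far below `Integer.MAX_VALUE`).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, intended for limits well below Integer.MAX_VALUE.
class SieveSketch {

    // Returns all primes in [2; maxPrime] in ascending order.
    static List<Integer> getPrimes(int maxPrime) {
        List<Integer> primes = new ArrayList<>();
        if (maxPrime < 2) {
            return primes;
        }
        // composite[n] == true means n has been marked as a multiple of a smaller prime
        boolean[] composite = new boolean[maxPrime + 1];
        for (int i = 2; (long) i * i <= maxPrime; i++) {
            if (!composite[i]) {
                // multiples below i * i were already marked by smaller primes
                for (int j = i * i; j <= maxPrime; j += i) {
                    composite[j] = true;
                }
            }
        }
        for (int i = 2; i <= maxPrime; i++) {
            if (!composite[i]) {
                primes.add(i);
            }
        }
        return primes;
    }

    public static void main(String[] args) {
        System.out.println(getPrimes(30)); // [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    }
}
```

Unlike the per-candidate divisibility check, the sieve never performs a trial division: every composite number is simply marked while iterating over the multiples of smaller primes.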
This section contains details about the conducted experiments and their configurations. System.out.println (standard output) was excluded from the measurement, since the ways to deliver the results may vary (serialization, sending over the wire, etc.).
- Input argument - max prime number. With the original implementation, I observed significant (1 second+) GC pauses when specifying ~15,000 as the max prime number. Thus, the following options were chosen to provide repeatable results while still yielding a sufficient number of CPU / Heap samples during profiling: 1000, 10000, 50000.
@Param({"1000", "10000", "50000"})
public int maxPrimeNumber;
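- Warmup iterations. The motivation behind warmup iterations has been described in 3.2.2 Warmup iterations. In order to decrease the impact of lazy class loading and JIT compilation on the results, it was decided to use 3 warmups. Note that the warmups attribute of @Fork configures warmup forks (whole forked runs whose results are discarded), while per-fork warmup iterations are configured via @Warmup.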
@Fork(value = 1, warmups = 3)
- Heap size. If the Java heap size is not fixed, the JVM adjusts the heap size at each GC iteration to keep a reasonable amount of free space (MinHeapFreeRatio, MaxHeapFreeRatio) for live objects (see: Java SE 8 documentation @ Oracle). To increase the repeatability of the results, it's possible to prevent the JVM from making heap sizing decisions by specifying equal minimal and maximal heap size, 4GB each: -Xms4000m -Xmx4000m.
To quantify the performance enhancements made via code changes, it's necessary to determine the baseline - the performance of the original implementation.
Results (raw output of JMH): experiment-0-baseline.txt
Changes made:
- Exceptions. As stated in 4.1 "isPrime(...)" method, the control of the application flow was changed from using Exception to a boolean value.
- Collections (BigIntegerIterator, primeNumbersToRemove). As stated in 4.3 Redundant use of BigIntegerIterator, the change was aimed at the removal of the BigIntegerIterator class, as well as other unnecessary collections mentioned in 2.4 Heap analysis - 50,000 numbers, 50,000 threads.
Results (raw output of JMH): experiment-2-linked-list.txt
Changes made:
- Concurrency level. Following the suggestions made in 4.2 Concurrency level and execution, which are based on 2. CPU and Memory analysis, the concurrency level was reduced from "at least 3000" to the number of available processors.
- Concurrency management. As suggested in 4.2 Concurrency level and execution, the work-stealing approach was applied to our use case.
Results (raw output of JMH): experiment-4-work-stealing-pool.txt
Algorithm | maxPrime - 1000 | maxPrime - 5000 | maxPrime - 10000 |
---|---|---|---|
Original implementation, duration, ms/op, 95th percentile | 138.77 | 3339.5 | 13019.12 |
Enhancement - removal of redundant objects, duration, ms/op, 95th percentile | 1.25 (-99.09%) | 3.47 (-99.90%) | 6.72 (-99.94%) |
Enhancement - work stealing thread pool, ms/op, 95th percentile | 1.22 (-99.12%) | 2.26 (-99.93%) | 4.02 (-99.97%) |
Looking at the results, the following observations could be made:
- Removal of redundant collections and no longer using exceptions as a return value significantly increased the performance. This is expected: creation and handling of exceptions, storage of the collections within each BigIntegerIterator instance, and operations for the removal of non-prime numbers from the list each introduce higher CPU and heap pressure, and thus a higher performance penalty.
- The effect of using work stealing within the thread pool increases in proportion to the specified max prime number. This could be explained by the fact that a smaller range [2; maxPrime] produces fewer tasks in the thread pool's queue, and thus less contention.
In order to simplify the comparison, automation for the visualization of JMH measurements has been implemented.
Please refer to visualization.
This section compares the results of the performance experiments for each implemented algorithm.
Algorithm | maxPrime - 1000 | maxPrime - 10000 | maxPrime - 50000 |
---|---|---|---|
Original implementation, duration, ms/op, 95th percentile | 131.27 | 508.454 | 1942.28 |
Enhanced implementation, duration, ms/op, 95th percentile | 1.36 (-98.96%) | 3.56 (-99.30%) | 39.66 (-97.96%) |
Sieve of Eratosthenes, duration, ms/op, 95th percentile | 0.01 (-99.99%) | 0.08 (-99.98%) | 0.59 (-99.96%) |
Looking at the results, the following observations could be made:
- Differences in duration. Within the test environment used, the most significant difference between the original and enhanced solutions was observed with 10,000 as the max prime number - a 142-times decrease (-99.30%), while 1000 and 50,000 showed roughly 100-times (-98.96%) and 50-times (-97.96%) decreases respectively.
- Comparison of algorithms - concurrent vs single-threaded. Within the environment used, the single-threaded algorithm - Sieve of Eratosthenes - showed the best performance - up to a 3292-times decrease (-99.96%) in duration.
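The "N-times decrease" and percentage figures quoted above can be derived from the table as follows. The helper class and method names are hypothetical; the input values are taken from the 95th-percentile durations in the table.

```java
import java.util.Locale;

// Hypothetical helper showing how the quoted ratios relate to the table values.
class SpeedupSketch {

    // How many times shorter the improved duration is compared to the baseline.
    static double speedup(double baselineMs, double improvedMs) {
        return baselineMs / improvedMs;
    }

    // Relative change in percent; negative values mean a shorter duration.
    static double percentChange(double baselineMs, double improvedMs) {
        return (improvedMs - baselineMs) / baselineMs * 100.0;
    }

    static String describe(double baselineMs, double improvedMs) {
        return String.format(Locale.ROOT, "%.1fx faster, %.2f%% change",
                speedup(baselineMs, improvedMs), percentChange(baselineMs, improvedMs));
    }

    public static void main(String[] args) {
        // maxPrime = 10000: original vs enhanced implementation
        System.out.println(describe(508.454, 3.56)); // 142.8x faster, -99.30% change
        // maxPrime = 1000 and maxPrime = 50000
        System.out.println(describe(131.27, 1.36));
        System.out.println(describe(1942.28, 39.66));
    }
}
```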
In conclusion, the efficient use of resources (CPU / RAM) and their sharing across threads is important, yet the most significant changes in performance are achieved by choosing a suitable algorithm.
On MacBook Air (M1, 2020), the original implementation of PrimeCalculator crashed whenever maxPrime was set to a value higher than ~2000, while most Windows laptops were capable of handling 250,000+ threads.
Thus, the sample size collected during profiling of the application was insufficient.
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.$$YJP$$start0(Native Method)
at java.lang.Thread.start0(Thread.java)
...
Error occurred during initialization of VM
java.lang.OutOfMemoryError: unable to create new native thread
Process finished with exit code 1
Using the implemented ThreadCountTest.java, it was determined that the JVM on MacBook Air (M1, 2020) supports 4051 threads.
...
Thread count: 4049
Thread count: 4051
...
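The source of ThreadCountTest.java is not included in this document. A probe of this kind could look like the hypothetical sketch below, which starts parked daemon threads until either a cap is reached or the JVM refuses to create another native thread. The class name, the cap parameter and the default of 100 are assumptions made for the sketch.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical reconstruction; the real ThreadCountTest.java may differ.
class ThreadCountSketch {

    // Starts up to `cap` parked daemon threads and returns how many were started.
    static int countStartableThreads(int cap) {
        CountDownLatch release = new CountDownLatch(1);
        int started = 0;
        try {
            while (started < cap) {
                Thread t = new Thread(() -> {
                    try {
                        release.await(); // keep the thread alive until the probe ends
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                t.setDaemon(true);
                t.start(); // throws OutOfMemoryError once no native thread can be created
                started++;
            }
        } catch (OutOfMemoryError e) {
            // the limit of the environment has been reached
        } finally {
            release.countDown(); // let all parked threads finish
        }
        return started;
    }

    public static void main(String[] args) {
        int cap = args.length > 0 ? Integer.parseInt(args[0]) : 100;
        System.out.println("Thread count: " + countStartableThreads(cap));
    }
}
```

Passing a large cap (for example 100000) turns the sketch into a limit probe; the small default keeps an accidental run harmless.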
Looking at the kernel configuration, we can see that it can handle 10,240 threads.
% sysctl kern.num_threads
kern.num_threads: 10240
Looking at the ulimit statistics for the current user, we can see that the current limit is 1333 processes (-u).
andreykoltsov@Andreys-MacBook-Air ~ % ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8176
-c: core file size (blocks) 0
-v: address space (kbytes) unlimited
-l: locked-in-memory size (kbytes) unlimited
-u: processes 1333
-n: file descriptors 2560
Looking at the kernel limit, we could assume that it's (currently) the hard limit on the maximum number of processes for the user.
sysctl -a |grep kern | grep proc
kern.maxproc: 2000
In order to modify the hard limit on macOS Big Sur, we should register a new launch daemon. The process is the following:
# 1. Create manifest for launch daemon.
$ sudo vi /Library/LaunchDaemons/com.startup.sysctl.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.startup.sysctl</string>
<key>LaunchOnlyOnce</key>
<true/>
<key>ProgramArguments</key>
<array>
<string>/usr/sbin/sysctl</string>
<string>kern.maxproc=50000</string>
</array>
<key>RunAtLoad</key>
<true/>
</dict>
</plist>
# 2. Register launch daemon
sudo chown root:wheel /Library/LaunchDaemons/com.startup.sysctl.plist
sudo launchctl load /Library/LaunchDaemons/com.startup.sysctl.plist
Unfortunately, it didn't work - the limit stayed at 2k even after the reboot.
An alternative approach is to run macOS in server performance mode:
sudo nvram boot-args="serverperfmode=1 $(nvram boot-args 2>/dev/null | cut -f 2-)"
Afterwards, attempts to set maxproc to a value higher than the hard limit didn't return an error (unlike previous attempts), yet the value remained the same - 2000 processes.
As an alternative, in order to be able to use the hard limit (maxproc 2000 instead of 1333), I launched IntelliJ IDEA as root:
sudo /Applications/IntelliJ\ IDEA\ CE.app/Contents/MacOS/idea
Allocation profiling is started within YourKit, yet no related data is shown.
Port statistics:
$ lsof -i :10001
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
profiler 16735 andreykoltsov 27u IPv6 0xd62e9a0c1d349b07 0t0 TCP localhost:58978->localhost:scp-config (ESTABLISHED)
java 16833 andreykoltsov 9u IPv4 0xd62e9a0c22a56627 0t0 TCP localhost:scp-config (LISTEN)
java 16833 andreykoltsov 37u IPv4 0xd62e9a0c1aa0fa57 0t0 TCP localhost:scp-config->localhost:58978 (ESTABLISHED)
Once a memory snapshot is captured, none of the objects have any allocation information.
Allocation profiling configuration:
It seems that each object was recorded, yet the memory snapshot contains mostly unreachable objects with unknown allocations.
On Windows with 60k primes as input, allocation profiling worked, but when I tried to capture a memory snapshot, the following message appeared while the application was still running:
[Image: allocation profiling with 60k primes and Thread.sleep(10000) added in order to start allocation profiling]
The following error might occur while launching JMH benchmark:
Exception in thread "main" java.lang.RuntimeException: ERROR: Unable to find the resource: /META-INF/BenchmarkList
at org.openjdk.jmh.runner.AbstractResourceReader.getReaders(AbstractResourceReader.java:96)
at org.openjdk.jmh.runner.BenchmarkList.find(BenchmarkList.java:104)
at org.openjdk.jmh.runner.Runner.internalRun(Runner.java:256)
at org.openjdk.jmh.runner.Runner.run(Runner.java:206)
at com.test.BTest.main(BTest.java:24)
...
In order to fix that, please:
- Ensure all dependencies specified within build.gradle are loaded.
- If (1) didn't help, please, explicitly specify JDK path into gradle.properties. Please, ensure the path points to JDK, not JRE.
$ cat gradle.properties
org.gradle.java.home=<full path to JDK>