Intel Neural Compressor provides the `incbench` command to launch performance benchmarks on Intel CPUs.
To reach peak performance on Intel Xeon CPUs, a single instance should not span more than one NUMA node.
Therefore, by default, `incbench` launches 1 instance on the first NUMA node.

Platform | Status |
---|---|
Linux | ✔ |
Windows | ✔ |

Parameters | Default | Comments |
---|---|---|
num_instances | 1 | Number of instances |
num_cores_per_instance | None | Number of cores in each instance |
C, cores | 0-${num_cores_on_NUMA-1} | Range of cores visible to the instances |
cross_memory | False | Whether to allocate memory across NUMA nodes |

Note: `cross_memory` is set to True only when memory is insufficient.

- `incbench main.py`: run 1 instance on NUMA:0.
- `incbench --num_i 2 main.py`: run 2 instances on NUMA:0.
- `incbench --num_c 2 main.py`: run multi-instances with 2 cores per instance on NUMA:0.
- `incbench -C 24-47 main.py`: run 1 instance on COREs:24-47.
- `incbench -C 24-47 --num_c 4 main.py`: run multi-instances with 4 COREs per instance on COREs:24-47.

> Note:
> - `num_i` works the same as `num_instances`
> - `num_c` works the same as `num_cores_per_instance`

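
To make the `-C 24-47 --num_c 4` case concrete, here is a minimal sketch of how a visible core range could be divided into per-instance core groups. It is only an illustration, not the code `incbench` runs: the helper names `parse_core_range` and `split_by_cores_per_instance` are hypothetical, and dropping leftover cores that do not fill a whole group is an assumption.

```python
from typing import List


def parse_core_range(spec: str) -> List[int]:
    """Expand an inclusive range such as "24-47" into a list of core ids."""
    start, end = (int(part) for part in spec.split("-"))
    return list(range(start, end + 1))


def split_by_cores_per_instance(cores: List[int], cores_per_instance: int) -> List[List[int]]:
    """Group cores into consecutive chunks of equal size.

    Assumption: cores left over after the last full chunk are not used.
    """
    usable = len(cores) - len(cores) % cores_per_instance
    return [cores[i:i + cores_per_instance] for i in range(0, usable, cores_per_instance)]


# 24 visible cores with 4 cores per instance give 6 groups.
groups = split_by_cores_per_instance(parse_core_range("24-47"), 4)
for index, group in enumerate(groups):
    print("instance {}: cores {}-{}".format(index, group[0], group[-1]))
```

Under these assumptions the first instance gets cores 24-27 and the last gets cores 44-47.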
To merge benchmark results from multiple instances, `incbench` automatically scans the log files for "throughput" and "latency" messages that match the following patterns.
throughput_pattern = r"[T,t]hroughput:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"
latency_pattern = r"[L,l]atency:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"
print("Throughput: {:.3f} samples/sec".format(throughput))
print("Latency: {:.3f} ms".format(latency * 10**3))
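
The sketch below shows how lines printed in this format are captured by the two patterns, and one way the per-instance values might be combined. The patterns are copied from above. The helper `merge_logs` and the aggregation rule (sum the throughputs, average the latencies) are assumptions made for illustration and are not taken from the `incbench` source.

```python
import re
from statistics import mean

throughput_pattern = r"[T,t]hroughput:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"
latency_pattern = r"[L,l]atency:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"


def merge_logs(logs):
    """Extract one throughput and one latency value per log and combine them.

    Assumption: throughput adds up across instances, latency is averaged.
    """
    throughputs, latencies = [], []
    for text in logs:
        throughput_match = re.search(throughput_pattern, text)
        latency_match = re.search(latency_pattern, text)
        if throughput_match:
            throughputs.append(float(throughput_match.group(1)))
        if latency_match:
            latencies.append(float(latency_match.group(1)))
    return {
        "throughput": sum(throughputs) if throughputs else None,
        "latency": mean(latencies) if latencies else None,
    }


# Hypothetical log contents from two instances.
logs = [
    "Throughput: 100.500 samples/sec\nLatency: 9.500 ms\n",
    "Throughput: 99.500 samples/sec\nLatency: 10.500 ms\n",
]
merged = merge_logs(logs)
print(merged)
```

In each match, group 1 holds the numeric value and group 2 holds the unit string (for example `samples/sec` or `ms`), so a script only needs to print lines in the `Throughput: <value> <unit>` and `Latency: <value> <unit>` form to be picked up.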