Running benchmarks#

Pre-requisite: an installation of Black that’s importable in the current environment (please make sure the task you’re using supports your installed version of Black).

The simplest way of running benchmarks is to call the run command providing a filepath to dump results to:

dev@example:~/blackbench$ blackbench run example.json
[*] Versions: blackbench: 21.7.dev2, pyperf: 2.2.0, black: 21.7b0
[*] Created temporary workdir at `/tmp/blackbench-workdir-67vki43p`.
[*] Alright, let's start!
[*] Running `fmt-black/__init__` benchmark (1/17)
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (546 ms) is 30% of the mean (1.84 sec)
* the maximum (4.64 sec) is 153% greater than the mean (1.84 sec)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

fmt-black/__init__: Mean +- std dev: 1.84 sec +- 0.55 sec
[*] Took 166.059 seconds.

[snipped ...]

[*] Cleaning up.
[*] Results dumped.
[*] Blackbench run finished in 818.794 seconds.

Note how there’s a “WARNING: the benchmark result may be unstable” line in the output. This leads perfectly into an important topic when running benchmarks: stability and reliability.

Benchmark stability#

While blackbench is supposed to be rather easy to use, one must understand the basics of stable benchmarks and their importance. Stable as in the data doesn’t vary all over the place for absolutely no good reason (it’s sorta like avoiding flaky tests). Unstable benchmarks don’t produce quality data or allow for accurate comparisons.

For a good backgrounder on stable benchmarks, I’d recommend Victor Stinner’s “My journey to stable benchmark” series. In particular, the My journey to stable benchmark, part 1 (system) and My journey to stable benchmark, part 3 (average) articles. More generally, pyperf has good documentation on tuning your system to increase benchmark stability.

Warning

Note that the suggested modifications may not be supported in your specific environment and can also be annoying to undo (a simple reboot should clear them, though).
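
If you’d rather not reboot, pyperf should also be able to revert the changes it applied itself (this won’t undo manual tweaks like kernel boot parameters, which only take effect at boot anyway):

$ python -m pyperf system reset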

Some concrete advice is to 1) use pyperf’s great system tuning features. Not only does it have the automagical pyperf system tune command, there’s also the lovely pyperf system show command which emits relevant system information and even some advice to further tweak your system! 2) Even if you can’t isolate a CPU core, always use CPU pinning via pyperf’s --affinity. This avoids the noise caused by the worker process being constantly assigned and reassigned to different CPU cores over time. 3) If you feel like it, try learning pyperf’s benchmark parameters and see what works well for you (e.g. maybe two warmups work better than one for you!).
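
For example (the results filename here is purely illustrative), CPU pinning and an extra warmup can be requested by forwarding the relevant pyperf options after -- (see pyperf configuration below):

$ blackbench run pinned.json -- --affinity 1 --warmups 2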

If you’re curious what I, Richard aka @ichard26, do to tune my system in preparation, here’s a summary:

Personal tuning steps

System notes: it’s a dual-core running Ubuntu 20.04 LTS :P

  • Reboot

  • At the boot menu, add the following Linux kernel parameters: isolcpus=1, nohz_full=1, and rcu_nocbs=1

  • Once booted, run pyperf system tune

  • Then run my-custom-script.bash:

    # This has to be run under a root shell
    # Switch to the userspace governor so the CPU frequency can be set manually.
    echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo userspace > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
    # Pin both cores to 2.1 GHz (setspeed plus matching min/max bounds).
    echo 2100000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
    echo 2100000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed
    echo 2100000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
    echo 2100000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq
    echo 2100000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    echo 2100000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
    # Cap perf event sampling at 1 sample/second to minimize its overhead.
    echo 1 > /proc/sys/kernel/perf_event_max_sample_rate
    echo "System tuned :D"
    

    This exists because my laptop’s cooling capacity isn’t good enough to handle the performance scaling governor that pyperf system tune sets. Eventually the CPU frequency would gradually go down, killing any hope of reliable results. So instead I lock the CPU frequency at 2.1 GHz. There’s also a perf event config but that’s less cool :)

  • Run pyperf system show to verify I haven’t missed anything dumb

Task & target selection#

By default, all targets will be selected (i.e. --targets all) with the fmt task. If you’d like to use a different task and/or a specific kind of target, there are options for that:

--task

Choices are parse, fmt-fast, and fmt.

--targets

Choices are micro, normal, and all.
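
For instance, to run the parse task against only the micro targets (the results filename is arbitrary):

$ blackbench run parse-micro.json --task parse --targets micro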

See also

Tasks & targets

Blackbench’s slowness#

Blackbench can be quite slow; this is because pyperf favours rigour over speed. Many data points are collected over a series of worker processes, which increases accuracy but also increases total benchmark duration.

You can pass --fast (which is actually an alias for -- --fast) to ask pyperf to collect fewer values for faster result turnaround at the price of result quality. With a well-tuned system though, the reduction in benchmarking time is well worth the (not too bad) drop in result quality.
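
For example (the filename is just illustrative), a quick run over only the micro targets could look like:

$ blackbench run quick.json --targets micro --fast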

pyperf configuration#

pyperf is the library handling the benchmarking work, and while its defaults are excellent (blackbench just leaves everything on default), sometimes you’ll need to modify the benchmark settings for stability or time requirement reasons. It’s possible to pass all [1] pyperf.Runner CLI options and flags.

Just call blackbench run with your usual arguments PLUS -- and then any pyperf arguments. The -- is strongly recommended since anything that comes after it will be left unprocessed and won’t be treated as options to blackbench.

Examples include:

$ blackbench run example.json -- --fast
$ blackbench run example2.json --task parse -- --affinity 3
$ blackbench run example3.json --targets micro --format-config "experimental_string_processing=True" -- --values 3 --warmups 2

Benchmark customization#

If you’re using a format type task, you can use --format-config to pass custom formatting options to Black during benchmarking. The value is substituted into black.FileMode({VALUE}) so it must be valid Python argument code. The substitution context has the black package imported (so the value can reference black attributes if needed). For example, passing a custom line length can be done with --format-config "line_length=79". The generated benchmark script will look something like this:

# In reality, there's some more supporting code, but it's irrelevant here.

import black

def format_func(code):
    try:
        black.format_file_contents(code, fast=True, mode=black.FileMode(line_length=79))
    except black.NothingChanged:
        pass

runner.bench_func("example-task-example-target", format_func, code)
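
Multiple options can be passed at once since the value is dropped straight into the black.FileMode(...) call. For example (assuming the FileMode of your installed Black accepts these arguments):

$ blackbench run custom.json --format-config "line_length=100, string_normalization=False"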

[1] Although note that not all options will play nicely with blackbench’s integration with pyperf. Examples include --help, --output, and --append.