Analyzing results

Blackbench’s output is actually JSON emitted directly by pyperf; the file can be loaded as an instance of pyperf.BenchmarkSuite. Anyway, this means all analysis of the results happens with pyperf directly. Don’t worry, pyperf should have been installed alongside blackbench.
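
If you’d rather poke at the results programmatically, the same file can be loaded with pyperf’s Python API. A minimal sketch (the filename is just an example):

import pyperf

suite = pyperf.BenchmarkSuite.load("normal.json")
for bench in suite.get_benchmarks():
    # Mean and standard deviation across the measured values, in the
    # benchmark's unit (seconds for blackbench results).
    print(f"{bench.get_name()}: {bench.mean():.3f} +- {bench.stdev():.3f} {bench.get_unit()}")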

Analyzing a single run

Summary

For a short summary, you can use pyperf show and pass it the JSON file.

dev@example:~/blackbench$ pyperf show normal.json
fmt-black/__init__: Mean +- std dev: 1.50 sec +- 0.05 sec
fmt-black/brackets: Mean +- std dev: 479 ms +- 13 ms
fmt-black/comments: Mean +- std dev: 382 ms +- 6 ms
fmt-black/linegen: Mean +- std dev: 1.52 sec +- 0.04 sec
fmt-black/lines: Mean +- std dev: 1.09 sec +- 0.05 sec
fmt-black/mode: Mean +- std dev: 167 ms +- 3 ms
fmt-black/nodes: Mean +- std dev: 1.23 sec +- 0.04 sec
fmt-black/output: Mean +- std dev: 161 ms +- 6 ms
fmt-black/strings: Mean +- std dev: 282 ms +- 8 ms
fmt-comments: Mean +- std dev: 163 ms +- 7 ms
fmt-dict-literal: Mean +- std dev: 227 ms +- 8 ms
fmt-flit/install: Mean +- std dev: 740 ms +- 75 ms
fmt-flit/sdist: Mean +- std dev: 381 ms +- 17 ms
fmt-flit_core/config: Mean +- std dev: 993 ms +- 48 ms
fmt-list-literal: Mean +- std dev: 134 ms +- 4 ms
fmt-nested: Mean +- std dev: 141 ms +- 29 ms
fmt-strings-list: Mean +- std dev: 43.2 ms +- 1.7 ms

In-depth statistics

For more in-depth information, pyperf stats works wonders [1]:

ichard26@acer-ubuntu:~/programming/oss/blackbench$ pyperf stats example.json
normal
======

Number of benchmarks: 17
Total duration: 2 min 6.1 sec
Start date: 2021-07-26 16:49:33
End date: 2021-07-26 16:52:24

fmt-black/__init__
------------------

Total duration: 18.6 sec
Start date: 2021-07-26 16:49:33
End date: 2021-07-26 16:49:55
Raw value minimum: 1.45 sec
Raw value maximum: 1.56 sec

Number of calibration run: 1
Number of run with values: 5
Total number of run: 6

Number of warmup per run: 1
Number of value per run: 1
Loop iterations per value: 1
Total number of values: 5

Minimum:         1.45 sec
Median +- MAD:   1.49 sec +- 0.04 sec
Mean +- std dev: 1.50 sec +- 0.05 sec
Maximum:         1.56 sec

  0th percentile: 1.45 sec (-4% of the mean) -- minimum
  5th percentile: 1.45 sec (-3% of the mean)
 25th percentile: 1.48 sec (-2% of the mean) -- Q1
 50th percentile: 1.49 sec (-1% of the mean) -- median
 75th percentile: 1.55 sec (+3% of the mean) -- Q3
 95th percentile: 1.55 sec (+3% of the mean)
100th percentile: 1.56 sec (+4% of the mean) -- maximum

Number of outlier (out of 1.38 sec..1.65 sec): 0

fmt-black/brackets
------------------

[snipped ...]
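
If you want statistics pyperf doesn’t print (or you’d like to feed the numbers into a plotting library), the raw values behind each benchmark are reachable from Python too. A rough sketch, again with an example filename:

import statistics

import pyperf

suite = pyperf.BenchmarkSuite.load("example.json")
for bench in suite.get_benchmarks():
    values = bench.get_values()  # measured values only, warmups are excluded
    q1, median, q3 = statistics.quantiles(values, n=4)
    print(f"{bench.get_name()}: Q1={q1:.3f} median={median:.3f} Q3={q3:.3f}")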

Histogram

pyperf hist is rather useful if you’re curious about how unstable the data is:

ichard26@acer-ubuntu:~/programming/oss/blackbench$ pyperf hist example.json
fmt-black/__init__
==================

1.52 sec:  3 ######################
1.56 sec:  4 #############################
1.60 sec: 11 ###############################################################################
1.64 sec: 11 ###############################################################################
1.67 sec:  9 #################################################################
1.71 sec:  8 #########################################################
1.75 sec:  5 ####################################
1.79 sec:  3 ######################
1.83 sec:  1 #######
1.87 sec:  0 |
1.91 sec:  1 #######
1.95 sec:  1 #######
1.99 sec:  0 |
2.03 sec:  1 #######
2.06 sec:  0 |
2.10 sec:  0 |
2.14 sec:  0 |
2.18 sec:  0 |
2.22 sec:  0 |
2.26 sec:  0 |
2.30 sec:  0 |
2.34 sec:  0 |
2.38 sec:  1 #######
2.41 sec:  0 |
2.45 sec:  0 |
2.49 sec:  1 #######

fmt-black/brackets
==================

[snipped ...]

Tip

You can extract the results for a single benchmark via pyperf convert in.json --include-benchmark "${BENCHMARK}" -o out.json. Also, most pyperf commands should support --benchmark to select only one or a few benchmarks when passed a BenchmarkSuite JSON file.
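
For example (assuming --benchmark is supported as described above), showing only the strings-list microbenchmark from the earlier summary would look something like this:

dev@example:~/blackbench$ pyperf show normal.json --benchmark fmt-strings-list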

Tip

If you need the source of a task or a target for even deeper analysis, you can call blackbench dump ${name}.

Comparing multiple runs

Comparisons between different runs can be done via pyperf compare_to. You can pass as many files as you’d like, but note that the order matters: the first file is the baseline that the rest are compared against.

dev@example:~/blackbench$ pyperf compare_to normal.json with-esp.json
fmt-black/__init__: Mean +- std dev: [normal] 1.50 sec +- 0.05 sec -> [with-esp] 1.68 sec +- 0.03 sec: 1.12x slower
fmt-black/brackets: Mean +- std dev: [normal] 479 ms +- 13 ms -> [with-esp] 515 ms +- 11 ms: 1.07x slower
fmt-black/comments: Mean +- std dev: [normal] 382 ms +- 6 ms -> [with-esp] 400 ms +- 11 ms: 1.05x slower
fmt-black/mode: Mean +- std dev: [normal] 167 ms +- 3 ms -> [with-esp] 175 ms +- 5 ms: 1.05x slower
fmt-black/strings: Mean +- std dev: [normal] 282 ms +- 8 ms -> [with-esp] 298 ms +- 6 ms: 1.06x slower
fmt-dict-literal: Mean +- std dev: [normal] 227 ms +- 8 ms -> [with-esp] 244 ms +- 6 ms: 1.08x slower
fmt-list-literal: Mean +- std dev: [normal] 134 ms +- 4 ms -> [with-esp] 156 ms +- 19 ms: 1.17x slower
fmt-strings-list: Mean +- std dev: [normal] 43.2 ms +- 1.7 ms -> [with-esp] 184 ms +- 4 ms: 4.25x slower

Benchmark hidden because not significant (9): fmt-black/linegen, fmt-black/lines, fmt-black/nodes, fmt-black/output, fmt-comments, fmt-flit/install, fmt-flit/sdist, fmt-flit_core/config, fmt-nested

Geometric mean: 1.12x slower

Note how pyperf determines whether two samples differ significantly (using a two-sample, two-tailed Student’s t-test at a 95% confidence level). This helps out a lot by hiding non-meaningful differences, but there’s more to know! Getting stable numbers is really hard, so there’s a chance that “significant” results are still just noise (or are real, but so small as to be meaningless). In that case, applying a cutoff might be a good idea (you can ask pyperf to do this for you via --min-speed). What cutoff to use depends on which benchmarks you ran - a 5% perf improvement on a microbenchmark most likely isn’t as meaningful as one on a (normal) benchmark - and on how stable your data was (if your system was very noisy, then maybe the great results you’re seeing aren’t actually real …). One final tip: keep an eye on the “Geometric mean” value; if you see a general speedup of, say, 10% across the suite, it’s likely you have a nice win on your hands!
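
For example, asking compare_to to hide anything that didn’t change by at least 5% (the exact threshold here is just an illustration, pick one that fits your benchmarks):

dev@example:~/blackbench$ pyperf compare_to normal.json with-esp.json --min-speed=5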

Table view

While compare_to’s default format is nicely compact, it can be a bit hard to parse. Using --table fixes that:

dev@example:~/blackbench$ pyperf compare_to normal.json with-esp.json --table
+--------------------+----------+------------------------+
| Benchmark          | normal   | with-esp               |
+====================+==========+========================+
| fmt-black/__init__ | 1.50 sec | 1.68 sec: 1.12x slower |
+--------------------+----------+------------------------+
| fmt-black/brackets | 479 ms   | 515 ms: 1.07x slower   |
+--------------------+----------+------------------------+
| fmt-black/comments | 382 ms   | 400 ms: 1.05x slower   |
+--------------------+----------+------------------------+
| fmt-black/mode     | 167 ms   | 175 ms: 1.05x slower   |
+--------------------+----------+------------------------+
| fmt-black/strings  | 282 ms   | 298 ms: 1.06x slower   |
+--------------------+----------+------------------------+
| fmt-dict-literal   | 227 ms   | 244 ms: 1.08x slower   |
+--------------------+----------+------------------------+
| fmt-list-literal   | 134 ms   | 156 ms: 1.17x slower   |
+--------------------+----------+------------------------+
| fmt-strings-list   | 43.2 ms  | 184 ms: 4.25x slower   |
+--------------------+----------+------------------------+
| Geometric mean     | (ref)    | 1.12x slower           |
+--------------------+----------+------------------------+

Benchmark hidden because not significant (9): fmt-black/linegen, fmt-black/lines, fmt-black/nodes, fmt-black/output, fmt-comments, fmt-flit/install, fmt-flit/sdist, fmt-flit_core/config, fmt-nested

Tip

Passing the -G flag causes compare_to’s output to be organized into groups of faster / slower / not significant. This usually makes the output more readable.

Todo

Provide more examples and improve their quality. Perhaps also add some more prose discussing how to use this data to draw inferences and conclusions (as much as that makes this ever closer to some sort of statistics 101 primer).

Todo

Provide an example demonstrating pyperf metadata once blackbench injects useful metadata.


[1] I gave up trying to make my hastily gathered data (I asked pyperf to collect only five values per benchmark!) look normal, so please don’t @ me if your data doesn’t look like mine :P