# Analyzing results

Blackbench's output is actually JSON produced directly by {pypi}`pyperf`. The JSON file can be loaded as an instance of {py:class}`pyperf.BenchmarkSuite`. Anyway, this means that all analysis of the results happens with pyperf directly. Don't worry, pyperf should have been installed alongside blackbench.

```{seealso}
[pyperf's excellent docs on analyzing benchmark results](https://pyperf.readthedocs.io/en/latest/analyze.html).
```

## Analyzing a single run

### Summary

For a short summary, you can use {ref}`pyperf show` and pass it the JSON file.

```console
dev@example:~/blackbench$ pyperf show normal.json
fmt-black/__init__: Mean +- std dev: 1.50 sec +- 0.05 sec
fmt-black/brackets: Mean +- std dev: 479 ms +- 13 ms
fmt-black/comments: Mean +- std dev: 382 ms +- 6 ms
fmt-black/linegen: Mean +- std dev: 1.52 sec +- 0.04 sec
fmt-black/lines: Mean +- std dev: 1.09 sec +- 0.05 sec
fmt-black/mode: Mean +- std dev: 167 ms +- 3 ms
fmt-black/nodes: Mean +- std dev: 1.23 sec +- 0.04 sec
fmt-black/output: Mean +- std dev: 161 ms +- 6 ms
fmt-black/strings: Mean +- std dev: 282 ms +- 8 ms
fmt-comments: Mean +- std dev: 163 ms +- 7 ms
fmt-dict-literal: Mean +- std dev: 227 ms +- 8 ms
fmt-flit/install: Mean +- std dev: 740 ms +- 75 ms
fmt-flit/sdist: Mean +- std dev: 381 ms +- 17 ms
fmt-flit_core/config: Mean +- std dev: 993 ms +- 48 ms
fmt-list-literal: Mean +- std dev: 134 ms +- 4 ms
fmt-nested: Mean +- std dev: 141 ms +- 29 ms
fmt-strings-list: Mean +- std dev: 43.2 ms +- 1.7 ms
```

### In-depth statistics

For more in-depth information, {ref}`pyperf stats` works wonders[^1]:

```console
ichard26@acer-ubuntu:~/programming/oss/blackbench$ pyperf stats example.json
normal
======

Number of benchmarks: 17
Total duration: 2 min 6.1 sec
Start date: 2021-07-26 16:49:33
End date: 2021-07-26 16:52:24

fmt-black/__init__
------------------

Total duration: 18.6 sec
Start date: 2021-07-26 16:49:33
End date: 2021-07-26 16:49:55
Raw value minimum: 1.45 sec
Raw value maximum: 1.56 sec

Number of calibration run: 1
Number of run with values: 5
Total number of run: 6

Number of warmup per run: 1
Number of value per run: 1
Loop iterations per value: 1
Total number of values: 5

Minimum: 1.45 sec
Median +- MAD: 1.49 sec +- 0.04 sec
Mean +- std dev: 1.50 sec +- 0.05 sec
Maximum: 1.56 sec

0th percentile: 1.45 sec (-4% of the mean) -- minimum
5th percentile: 1.45 sec (-3% of the mean)
25th percentile: 1.48 sec (-2% of the mean) -- Q1
50th percentile: 1.49 sec (-1% of the mean) -- median
75th percentile: 1.55 sec (+3% of the mean) -- Q3
95th percentile: 1.55 sec (+3% of the mean)
100th percentile: 1.56 sec (+4% of the mean) -- maximum

Number of outlier (out of 1.38 sec..1.65 sec): 0

fmt-black/brackets
------------------

[snipped ...]
```
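If you'd rather dig into the numbers programmatically, the same data is available through pyperf's Python API. Here's a minimal sketch (using the `normal.json` file from the example above) that loads the suite and prints each benchmark's mean and standard deviation, roughly mirroring `pyperf show`:

```python
import pyperf

# Load the BenchmarkSuite JSON written by blackbench.
# "normal.json" is just the file name used in the example above.
suite = pyperf.BenchmarkSuite.load("normal.json")

for bench in suite.get_benchmarks():
    # mean() and stdev() are computed over all collected values (in seconds).
    mean, stdev = bench.mean(), bench.stdev()
    print(f"{bench.get_name()}: {mean * 1000:.1f} ms +- {stdev * 1000:.1f} ms")
```

This comes in handy if you want to feed the results into your own plots or spreadsheets.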
### Histogram

{ref}`pyperf hist` is rather useful if you're curious about how unstable the data is:

```console
ichard26@acer-ubuntu:~/programming/oss/blackbench$ pyperf hist example.json
fmt-black/__init__
==================
1.52 sec: 3 ######################
1.56 sec: 4 #############################
1.60 sec: 11 ###############################################################################
1.64 sec: 11 ###############################################################################
1.67 sec: 9 #################################################################
1.71 sec: 8 #########################################################
1.75 sec: 5 ####################################
1.79 sec: 3 ######################
1.83 sec: 1 #######
1.87 sec: 0 |
1.91 sec: 1 #######
1.95 sec: 1 #######
1.99 sec: 0 |
2.03 sec: 1 #######
2.06 sec: 0 |
2.10 sec: 0 |
2.14 sec: 0 |
2.18 sec: 0 |
2.22 sec: 0 |
2.26 sec: 0 |
2.30 sec: 0 |
2.34 sec: 0 |
2.38 sec: 1 #######
2.41 sec: 0 |
2.45 sec: 0 |
2.49 sec: 1 #######

fmt-black/brackets
==================

[snipped ...]
```

```{tip}
You can extract the results for a single benchmark via [`pyperf convert in.json --include-benchmark "${BENCHMARK}" -o out.json`][pyperf-convert]. Also, most pyperf commands should support `--benchmark` to select only one or a few benchmarks when given a BenchmarkSuite JSON file.
```

```{tip}
If you need the source for either a task or a target for even deeper analysis, you can call `blackbench dump ${name}`.
```

## Comparing multiple runs

Comparisons between different runs can be done via {ref}`pyperf compare_to`. You can pass as many files as you'd like, although pay attention to the order: the first file is treated as the reference the others are compared against.

```console
dev@example:~/blackbench$ pyperf compare_to normal.json with-esp.json
fmt-black/__init__: Mean +- std dev: [normal] 1.50 sec +- 0.05 sec -> [with-esp] 1.68 sec +- 0.03 sec: 1.12x slower
fmt-black/brackets: Mean +- std dev: [normal] 479 ms +- 13 ms -> [with-esp] 515 ms +- 11 ms: 1.07x slower
fmt-black/comments: Mean +- std dev: [normal] 382 ms +- 6 ms -> [with-esp] 400 ms +- 11 ms: 1.05x slower
fmt-black/mode: Mean +- std dev: [normal] 167 ms +- 3 ms -> [with-esp] 175 ms +- 5 ms: 1.05x slower
fmt-black/strings: Mean +- std dev: [normal] 282 ms +- 8 ms -> [with-esp] 298 ms +- 6 ms: 1.06x slower
fmt-dict-literal: Mean +- std dev: [normal] 227 ms +- 8 ms -> [with-esp] 244 ms +- 6 ms: 1.08x slower
fmt-list-literal: Mean +- std dev: [normal] 134 ms +- 4 ms -> [with-esp] 156 ms +- 19 ms: 1.17x slower
fmt-strings-list: Mean +- std dev: [normal] 43.2 ms +- 1.7 ms -> [with-esp] 184 ms +- 4 ms: 4.25x slower

Benchmark hidden because not significant (9): fmt-black/linegen, fmt-black/lines, fmt-black/nodes, fmt-black/output, fmt-comments, fmt-flit/install, fmt-flit/sdist, fmt-flit_core/config, fmt-nested

Geometric mean: 1.12x slower
```

Note how pyperf determines whether two samples differ significantly (using a Student's two-sample, two-tailed t-test with alpha equal to 0.95). This helps out a lot by ignoring non-meaningful differences, but there's more to know! Getting stable numbers is really hard, so there's a possibility that "significant" results are still just noise (or are real results, but so small as to be meaningless). In that case, applying a cutoff might be a good idea (you can ask pyperf to do this for you via `--min-speed`).
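If you'd rather eyeball the raw ratios (or apply a cutoff by hand), here's a rough sketch using pyperf's Python API. The file names are simply the ones from the example above, and, unlike `pyperf compare_to`, it naively compares means without any significance testing:

```python
import pyperf

# File names taken from the example above; substitute your own runs.
# This assumes both files contain the same set of benchmarks.
old = pyperf.BenchmarkSuite.load("normal.json")
new = pyperf.BenchmarkSuite.load("with-esp.json")

CUTOFF = 0.05  # ignore changes smaller than 5%

for name in old.get_benchmark_names():
    ratio = new.get_benchmark(name).mean() / old.get_benchmark(name).mean()
    if abs(ratio - 1) < CUTOFF:
        continue  # treat small changes as noise
    if ratio > 1:
        print(f"{name}: {ratio:.2f}x slower")
    else:
        print(f"{name}: {1 / ratio:.2f}x faster")
```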
What cutoff to use depends on which benchmarks you ran (a 5% perf improvement on a microbenchmark most likely isn't as meaningful as one on a normal benchmark) and on how stable your data was (if your system was very noisy, then maybe the great results you're seeing aren't actually real ...). One final tip is to look at the "Geometric mean" value: if you see a general speedup of 10%, then it seems likely you've got a nice win on your hands!

### Table view

While compare_to's default format is neatly compact, it can be a bit hard to parse. Using `--table` fixes that:

```console
dev@example:~/blackbench$ pyperf compare_to normal.json with-esp.json --table
+--------------------+----------+------------------------+
| Benchmark          | normal   | with-esp               |
+====================+==========+========================+
| fmt-black/__init__ | 1.50 sec | 1.68 sec: 1.12x slower |
+--------------------+----------+------------------------+
| fmt-black/brackets | 479 ms   | 515 ms: 1.07x slower   |
+--------------------+----------+------------------------+
| fmt-black/comments | 382 ms   | 400 ms: 1.05x slower   |
+--------------------+----------+------------------------+
| fmt-black/mode     | 167 ms   | 175 ms: 1.05x slower   |
+--------------------+----------+------------------------+
| fmt-black/strings  | 282 ms   | 298 ms: 1.06x slower   |
+--------------------+----------+------------------------+
| fmt-dict-literal   | 227 ms   | 244 ms: 1.08x slower   |
+--------------------+----------+------------------------+
| fmt-list-literal   | 134 ms   | 156 ms: 1.17x slower   |
+--------------------+----------+------------------------+
| fmt-strings-list   | 43.2 ms  | 184 ms: 4.25x slower   |
+--------------------+----------+------------------------+
| Geometric mean     | (ref)    | 1.12x slower           |
+--------------------+----------+------------------------+

Benchmark hidden because not significant (9): fmt-black/linegen, fmt-black/lines, fmt-black/nodes, fmt-black/output, fmt-comments, fmt-flit/install, fmt-flit/sdist, fmt-flit_core/config, fmt-nested
```

```{tip}
Passing the `-G` flag causes compare_to's output to be organized in groups of faster / slower / not significant. This usually makes the output more readable.
```

```{todo}
Provide more examples and also improve their quality. Perhaps also add some more prose and discussion on using this data to draw inferences and conclusions (as much as that makes this ever closer to some sort of statistics 101 primer).
```

```{todo}
Provide an example demonstrating `pyperf metadata` once blackbench injects useful metadata.
```

[^1]: I gave up trying to make my hastily gathered (I asked pyperf to collect like only five values per benchmark!) data look normal, so please don't @ me if your data doesn't look like mine :P