picobenchmarks - very short sequences of a few instructions. They measure the latency and throughput of individual instructions, usually in assembler or in simple, obvious C code. Sometimes the same instructions are run in a different order, to measure pipeline effects (see the first sketch below). Mainly of interest to compiler writers and low-level software developers. Often cannot be used to compare very different CPU architectures. Datasets usually fit entirely in the L1 cache or in registers. Example: xor ax, ax; xor bx, bx vs. xor ax, ax; mov bx, ax.

nanobenchmarks - a bit bigger, usually with some kernel-like structure: looping, vectorization of loops, implementations of basic functions such as math special functions or string functions. The dataset often fits in the L1 cache, and the access patterns are very regular and easy to prefetch, automatically or manually. Examples: LU decomposition of medium-size matrices, a strstr implementation, a sin(x) implementation, a float-to-string conversion, the classic Dhrystone benchmark, Fibonacci, Mandelbrot set generation.

synthetic - a type of nanobenchmark that measures key aspects of a computer system with an abstract workload. Often nonsense code that just mimics other programs statistically, in instruction mix and code patterns. Also often used for database, file-system, and storage tests with lots of small random operations. Examples: memory bandwidth tests, PostMark, stress-ng, Whetstone, NBench.

microbenchmarks - usually synthetic, and rarely exercise more than one or maybe two subsystems at a time. They can often be used to measure different systems, but are usually designed for comparing multiple implementations or algorithms that solve the same problem. Compiler optimizations can easily undo a lot of assumptions, so care needs to be taken (see the second sketch below). The measured code is often important in other applications, but results can be skewed by assembler optimizations (x86 SSE code paths, for instance, are more common than ARM ones) or by dedicated hardware acceleration (cryptography is a good example). Examples: FFT, N-Queens, Sieve of Eratosthenes, Blowfish, zlib compression, a small ray tracer, the A* algorithm, audio beat detection, a JSON parser.

minibenchmarks - fragments of real-world applications, reworked to be easy to run. More open code, with lots of conditions and jumps, and a high dependence on input data. They usually have no dataset or a very small one, or it is generated at runtime. Usually no I/O, with all input and output kept entirely in memory. Minibenchmarks rarely do real multithreading testing; instead they run multiple copies of the same benchmark with (almost) no shared state beyond maybe the input data (see the third sketch below). They exercise far more of the memory subsystem, branch prediction, speculation, etc. Code and data almost never fit in L1, but on some modern CPUs the code can fit in L2 or L3. Binary code is usually between 100 kB and 500 kB, rarely much bigger. Examples: the gcc compiler's parser, the sqlite3 database, a ray tracer with some complex shaders, structure from motion, HTML DOM manipulation from JavaScript, PDF rendering, video object detection. Geekbench, JavaScript Octane, and the SPEC suites are examples of such benchmarks. Minibenchmarks can be very deceiving, especially since many of them come in suites with dozens of subtests but often report a weighted geometric mean, which can be heavily influenced by a single subtest. The same minibenchmark can also be run with different input sizes, which can move it from compute-bound to memory-bound and test very different things.
macrobenchmarks - usually complex applications, often end-user applications with their own utility, where benchmarking is not a primary or even secondary purpose. The most realistic, with the highest number of variables. Big code, big datasets, I/O, networking, encryption, libraries, graphics, multithreading with complex synchronization, lots of random memory allocations, etc. They may require long runs to remove variance, warmups, or elaborate setup. They easily go out of date due to old APIs or the proprietary nature of some parts. Macrobenchmarks can often also be used as stress-test tools, to assess the limits of the system or the software; such runs often take hours, to ensure stability, measure tail latency (see the last sketch below), check for memory leaks, and so on. Code and data almost never fit even in the L2 or L3 caches, though some small sub-parts can be considered hot. Binary code is between 5 MB and 100 MB, with about half of it cold or dead and the other half actually executed. A lot of the code is complex due to glue, configuration, error handling, business logic, special-case handling, etc. Examples: Cyberpunk 2077, a PostgreSQL 17 stress test, Linux kernel compilation (highly parallel but varied), a GROMACS molecular dynamics simulation, the Blender Cycles renderer (embarrassingly parallel), a LaTeX run on a big, complex document, a complex web app in Python or PHP with MySQL, behind Apache and an nginx reverse proxy.
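
First sketch, for the picobenchmark entry: a minimal C program, assuming a 64-bit Linux/glibc toolchain, that compares one serial dependency chain of adds against four independent chains the pipeline can overlap. The empty __asm__ barriers (GCC/Clang syntax) emit no instructions; they only stop the compiler from folding the loops into a closed form.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        const uint64_t N = 1000000000ull;

        /* Latency-bound: each add must wait for the previous one. */
        uint64_t a = 1;
        double t0 = now_sec();
        for (uint64_t i = 0; i < N; i++) {
            a += i;
            __asm__ volatile("" : "+r"(a));   /* keep the chain live */
        }
        double t1 = now_sec();

        /* Throughput-bound: four independent chains, same total adds. */
        uint64_t b0 = 1, b1 = 1, b2 = 1, b3 = 1;
        double t2 = now_sec();
        for (uint64_t i = 0; i < N; i += 4) {
            b0 += i; b1 += i; b2 += i; b3 += i;
            __asm__ volatile("" : "+r"(b0), "+r"(b1), "+r"(b2), "+r"(b3));
        }
        double t3 = now_sec();

        printf("dependent:   %.3f ns/add (checksum %llu)\n",
               (t1 - t0) / N * 1e9, (unsigned long long)a);
        printf("independent: %.3f ns/add (checksum %llu)\n",
               (t3 - t2) / N * 1e9, (unsigned long long)(b0 + b1 + b2 + b3));
        return 0;
    }

On a typical out-of-order core the independent version should come out severalfold faster per add, which is exactly the kind of effect this class of benchmark exists to expose.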
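
Second sketch, for the microbenchmark caveat about compiler optimizations: a sieve microbenchmark where a volatile sink forces each repetition's result to actually be produced, so the optimizer cannot delete or hoist the measured work. The repetition count and limit here are arbitrary choices, not anything prescribed above.

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define LIMIT 1000000
    static unsigned char composite[LIMIT + 1];

    /* Sieve of Eratosthenes; returns the prime count so the result is used. */
    static int sieve(void) {
        int count = 0;
        memset(composite, 0, sizeof composite);
        for (long p = 2; p * p <= LIMIT; p++)
            if (!composite[p])
                for (long m = p * p; m <= LIMIT; m += p)
                    composite[m] = 1;
        for (long n = 2; n <= LIMIT; n++)
            count += !composite[n];
        return count;
    }

    int main(void) {
        const int reps = 100;
        volatile int sink = 0;      /* volatile: every store must happen */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            sink = sieve();         /* work cannot be elided or deduplicated */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%d primes below %d, %.2f ms/rep\n", sink, LIMIT, ns / reps / 1e6);
        return 0;
    }

Without the volatile sink, an aggressive compiler is free to notice that nothing observes the sieve's output and to delete the loop entirely, producing an impressively meaningless near-zero time.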
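
Third sketch, for the minibenchmark pattern of running independent copies rather than truly shared-state multithreading: a minimal POSIX-threads harness (threads are an assumption here; many suites use separate processes instead) where each copy owns purely private state and the work loop is just a stand-in.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Build with: cc -O2 -pthread copies.c */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* One copy of the "benchmark": purely private state, nothing shared. */
    static void *copy_main(void *arg) {
        unsigned seed = (unsigned)(size_t)arg;   /* private PRNG state */
        unsigned long acc = 0;
        for (long i = 0; i < 100000000L; i++) {  /* stand-in for real work */
            seed = seed * 1103515245u + 12345u;
            acc += seed >> 16;
        }
        return (void *)(size_t)(acc & 0xffff);   /* result escapes the thread */
    }

    int main(int argc, char **argv) {
        int n = argc > 1 ? atoi(argv[1]) : 4;    /* number of copies */
        pthread_t tid[64];
        if (n < 1 || n > 64) return 1;
        double t0 = now_sec();
        for (int i = 0; i < n; i++)
            pthread_create(&tid[i], NULL, copy_main, (void *)(size_t)(i + 1));
        for (int i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
        double t1 = now_sec();
        printf("%d copies in %.2f s (%.2f copies/s)\n", n, t1 - t0, n / (t1 - t0));
        return 0;
    }

If copies/s stops scaling as n grows, the bottleneck is a shared resource (memory bandwidth, SMT, thermals), not synchronization, because there is no synchronization to contend on.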
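
Last sketch, for the macrobenchmark point about tail latency: a minimal percentile report. do_one_request() is a hypothetical placeholder for a single operation against the system under test, stubbed with busywork so the sketch runs standalone.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Hypothetical stand-in for one request against the system under test. */
    static void do_one_request(void) {
        volatile int x = 0;
        for (int i = 0; i < 1000; i++) x += i;
    }

    static int cmp_double(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        enum { SAMPLES = 100000 };
        static double lat_ns[SAMPLES];
        struct timespec t0, t1;

        for (int i = 0; i < SAMPLES; i++) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            do_one_request();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            lat_ns[i] = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
        }
        qsort(lat_ns, SAMPLES, sizeof lat_ns[0], cmp_double);
        printf("p50   %.0f ns\n", lat_ns[SAMPLES / 2]);
        printf("p99   %.0f ns\n", lat_ns[(int)(SAMPLES * 0.99)]);
        printf("p99.9 %.0f ns\n", lat_ns[(int)(SAMPLES * 0.999)]);
        return 0;
    }

A mean hides the outliers entirely; p99 and p99.9 are where the hours-long macrobenchmark runs mentioned above earn their keep, since rare slow requests only show up with enough samples.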