Infinitely faster initial Rust builds with DOCKER_HOST (and BuildKit)
Following my first ever blog post on a similar subject, I found cargo-wharf, a cacheable and efficient Docker image builder for Rust.
It is an alternative Dockerfile frontend syntax implementation for Rust: it turns a Cargo.toml (the file listing a project's cargo / Rust dependencies and metadata) into a docker build-able recipe simply by adding this # syntax line (plus some caveats):
# syntax = denzp/cargo-wharf-frontend:v0.1.0-alpha.2
[package]
...
Then, with the following command, one can create a Docker image from this Cargo.toml:
$ DOCKER_BUILDKIT=1 docker build -t service:latest -f Cargo.toml .
Not demonstrated in that repo, but supposedly supported, is building binaries:
$ DOCKER_BUILDKIT=1 docker build --platform=local --output=. --file=Cargo.toml .
# or even:
$ DOCKER_BUILDKIT=1 docker build --platform=local --output=. https://github.com/some/repo.git#master:sub/context
Note that using the Docker context sub/context isn’t yet supported by BuildKit…
As noted in that repo’s README:
Every dependency is built in its isolated environment and cached independently from others.
Rust projects are notoriously slow to build, especially the initial build.
- Incremental builds are fast enough for a short dev loop, though!
- Using “thin” LTO helps, but build times can still be a hindrance when using cross.
As a fan of DOCKER_HOST, I was immediately all ears!
A mutualized build artifact cache
Say you want to work on some large Rust project for the first time. You clone it, then run DOCKER_BUILDKIT=1 docker build --platform=local ... with DOCKER_HOST=ssh://some_machine.com set (sketched below). Now your build runs on a beefy machine somewhere and sends the outputs back to you.
Not only
- did the build take a fraction of the time it would take on your machine,
- but it also reaped the benefits of a lightning-fast connection to the dependency cache!
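Concretely, that workflow could look something like this (a rough sketch; the host name and repository URL are placeholders):

$ export DOCKER_HOST=ssh://some_machine.com
$ export DOCKER_BUILDKIT=1
$ git clone https://github.com/some/repo.git && cd repo
# The build runs on the remote machine; --output=. streams the resulting binaries back:
$ docker build --platform=local --output=. --file=Cargo.toml .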
The Rust community and its backers
- could support such a cache (for example by paying for compute, bandwidth and storage),
- and share the burden of building dependencies (given specific flags / target triple / rustc version / …).
Unsure your remote-built project wasn’t backdoored by a malicious some_machine.com / cache / middle person? Unset DOCKER_HOST, re-run the command, and compare sha256(remote-built) with sha256(locally-built).
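For instance (a sketch, assuming the build produces a single binary named service):

# Remote build, with DOCKER_HOST still pointing at some_machine.com:
$ DOCKER_BUILDKIT=1 docker build --platform=local --output=remote-built --file=Cargo.toml .
# Local rebuild:
$ unset DOCKER_HOST
$ DOCKER_BUILDKIT=1 docker build --platform=local --output=locally-built --file=Cargo.toml .
# Both digests should match:
$ sha256sum remote-built/service locally-built/service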
In fact, all language communities should be sharing a build cache, provided:
- they often suffer from long build times,
- their tooling allows for hermetic builds,
- they trust their tools.
I’m obviously not a genius. Here are some other people’s takes on this:
- 5x Faster Rust Docker Builds with cargo-chef
  - a cargo-subcommand to speed up Rust Docker builds using Docker layer caching
  - note how it only caches dependencies, maybe even just the public ones
- Mozilla’s sccache is ccache with cloud storage
- Gradle’s Build Cache
  - a (centralized) cache is provided for money
  - Did we just find a[nother] financial incentive to support developer communities?
- Nix’s Binary Cache
  - see also Cachix
Towards a distributed crate cache
cf. rust-lang/cargo#1997, which mentions sccache.
Building Rust code with cargo and verbosity toggled on, one sees rustc calls such as:
❯ cargo --verbose install cargo-edit
# ...
rustc \
--crate-name autocfg $HOME/.cargo/registry/src/github.com-1ecc6299db9ec823/autocfg-1.1.0/src/lib.rs \
--error-format=json \
--json=diagnostic-rendered-ansi,future-incompat \
--crate-type lib \
--emit=dep-info,metadata,link \
-C embed-bitcode=no \
-C debug-assertions=off \
-C metadata=6e4def821aa49e9d \
-C extra-filename=-6e4def821aa49e9d \
--out-dir $PWD \
-L dependency=$PWD \
--cap-lints allow
# which produces the files
# -rw-rw-r-- autocfg-6e4def821aa49e9d.d
# -rw-rw-r-- libautocfg-6e4def821aa49e9d.rlib
# -rw-rw-r-- libautocfg-6e4def821aa49e9d.rmeta
which builds a crate. These arguments,
- plus what crate features are requested,
- plus whether a build.rs file is involved,
- plus the Rust toolchain versions or hashes,
- plus hashes of all OS-specific libs and binaries that some crates rely on,
- plus ???
should provide enough information to build a crate’s cache key (humongous caveat: in hopefully most cases).
From there it’s hermetic builds, content-addressable storage and distributed compilation turtles all the way down. Easy!
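To make that concrete, here is one naive way such a key could be assembled (a sketch only, not how cargo or sccache actually compute their hashes; $RUSTC_ARGS is a hypothetical variable standing in for the argument list shown above):

# Hash the toolchain, the full rustc invocation and the crate sources together:
$ {
    rustc --version --verbose    # toolchain version, commit hash, host triple
    echo "$RUSTC_ARGS"           # the full rustc argument list (flags, features, metadata)
    cat $HOME/.cargo/registry/src/github.com-1ecc6299db9ec823/autocfg-1.1.0/src/*.rs
  } | sha256sum
# -> a candidate cache key for this particular build of autocfg 1.1.0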
Figuring out a crate’s cache key depends on that project’s specific build plan, which itself might depend on some environment variables, the version of the current shell… In short, once a crate relies on a build.rs file for compilation (meaning some non-cargo/rustc code from the Internet gets executed on your machine!), all cache key bets are off.
Caching should be possible for some (most?) crates, but not for these ones anyway.
The RUSTC_WRAPPER setting should help get there, although looking at sccache’s caveats…
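For the record, this is how sccache hooks into cargo through that very setting:

# Route every rustc invocation through sccache:
$ export RUSTC_WRAPPER=sccache
$ cargo build --release
# Inspect cache hits and misses:
$ sccache --show-stats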
UPDATE: my attempt: cargo-green
Private code & security
These are the privacy and security concerns I can see from my echo-chamber-slash-comfy-chair:
- Building non-public code / private dependencies
  - and not contributing these back to the build cache
- Not being able to reconstruct a private project from the cache, nor to tell how it accesses the cache
- Asserting a given build artifact was not tampered with
To address these points I see:
- Tooling should separate public from private
  - the tools closer to the language have the best semantics
  - a private cache should be easy to set up (in any Continuous Integration system)
    - with fallback on the single public cache on misses
- A Merkle tree (sketched after this list)
  - content-addressable storage
  - with granular enough blocks (BitTorrent v2 says >=16kiB)
  - a dependency would be associated with its root hash in the tree and resolve to many blocks
  - with numerous enough accesses to hide a single actor’s usage
  - Bazil seems related, has good security buzzwords
    - it does not seem to be aimed at being a cache (e.g. pruning should be harmless)
- Stochastic re-building / cache pruning
  - should provide a reliable way of Proof-of-Work
  - by computing builds multiple times (and achieving the same results)
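As a strawman for that content-addressable layout (purely illustrative; file names, block size and hashing scheme are made up for the example), a built artifact could be split into blocks, each stored under its own hash, with a root hash tying them together:

# Split a build artifact into >=16kiB blocks and store each block under its hash:
$ mkdir -p cas && split -b 16k libautocfg-6e4def821aa49e9d.rlib block.
$ for b in block.*; do
    h=$(sha256sum "$b" | cut -d' ' -f1)
    mv "$b" "cas/$h" && echo "$h" >> manifest    # keep the original block order
  done
# The dependency's root hash is then the hash of its ordered block manifest:
$ sha256sum manifest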