Infinitely faster initial Rust builds with DOCKER_HOST (and BuildKit)
Following my first ever blog post on a similar subject, I found cargo-wharf, a cacheable and efficient Docker image builder for Rust.
It is an alternative Dockerfile frontend syntax implementation for Rust: it turns a Cargo.toml (the file listing a project's cargo / Rust dependencies and metadata) into a docker build-able recipe simply by adding this # syntax line (plus some caveats):
# syntax = denzp/cargo-wharf-frontend:v0.1.0-alpha.2
[package]
...
Then, with the following command, one can create a Docker image from this Cargo.toml:
$ DOCKER_BUILDKIT=1 docker build -t service:latest -f Cargo.toml .
Not demonstrated in that repo, but supposedly supported, is building binaries:
$ DOCKER_BUILDKIT=1 docker build --platform=local --output=. --file=Cargo.toml .
# or even:
$ DOCKER_BUILDKIT=1 docker build --platform=local --output=. https://github.com/some/repo.git#master:sub/context
Note that using the Docker context sub/context isn’t yet supported by BuildKit…
As noted in that repo’s README:
Every dependency is built in its isolated environment and cached independently from others.
Rust projects are notoriously slow to build, especially the initial build.
- Incremental builds are fast enough for a short dev loop, though!
- Using “thin” LTO helps, but build times can still be a hindrance when using cross.
As a fan of DOCKER_HOST, I was immediately all ears!
A mutualized build artifact cache
Say you want to work on some large Rust project for the first time. You clone it, then run DOCKER_BUILDKIT=1 docker build --platform=local ... with DOCKER_HOST=ssh://some_machine.com set (sketched below). Now your build runs on a beefy machine somewhere and sends the outputs back to you.
Not only
- did the build take a fraction of the time it would take on your machine,
- but it also reaped the benefits of a lightning-fast connection to the dependency cache!
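Concretely, that workflow could look something like this (a rough sketch; the host name and repository URL are placeholders):

$ export DOCKER_HOST=ssh://some_machine.com
$ export DOCKER_BUILDKIT=1
$ git clone https://github.com/some/repo.git && cd repo
# The build runs on the remote machine; --output=. streams the resulting binaries back:
$ docker build --platform=local --output=. --file=Cargo.toml .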
The Rust community and its backers
- could support such a cache (for example by paying for compute, bandwidth and storage),
- and share the burden of building dependencies (given specific flags / target triple / rustc version / …).
Unsure your remote-built project wasn’t backdoored by a malicious some_machine.com / cache / middle person? Unset DOCKER_HOST, re-run the command, and compare sha256(remote-built) with sha256(locally-built).
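For instance (a sketch, assuming the build produces a single binary named service):

# Remote build, with DOCKER_HOST still pointing at some_machine.com:
$ DOCKER_BUILDKIT=1 docker build --platform=local --output=remote-built --file=Cargo.toml .
# Local rebuild:
$ unset DOCKER_HOST
$ DOCKER_BUILDKIT=1 docker build --platform=local --output=locally-built --file=Cargo.toml .
# Both digests should match:
$ sha256sum remote-built/service locally-built/service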
In fact, all language communities should be sharing a build cache, provided:
- they often suffer from long build times,
- their tooling allows for hermetic builds,
- they trust their tools.
I’m obviously not a genius. Here are some other people’s takes on this:
- 5x Faster Rust Docker Builds with cargo-chef
  - a cargo-subcommand to speed up Rust Docker builds using Docker layer caching
  - note how it only caches dependencies, maybe even just the public ones
- Mozilla’s sccache is ccache with cloud storage
- Gradle’s Build Cache
  - a (centralized) cache is provided for money
  - Did we just find a[nother] financial incentive to support developer communities?
- Nix’s Binary Cache
  - see also Cachix
Towards a distributed crate cache
cf. rust-lang/cargo#1997, which mentions sccache.
Building Rust code with cargo and verbosity toggled on, one sees rustc calls such as:
❯ cargo --verbose install cargo-edit
# ...
rustc \
--crate-name autocfg $HOME/.cargo/registry/src/github.com-1ecc6299db9ec823/autocfg-1.1.0/src/lib.rs \
--error-format=json \
--json=diagnostic-rendered-ansi,future-incompat \
--crate-type lib \
--emit=dep-info,metadata,link \
-C embed-bitcode=no \
-C debug-assertions=off \
-C metadata=6e4def821aa49e9d \
-C extra-filename=-6e4def821aa49e9d \
--out-dir $PWD \
-L dependency=$PWD \
--cap-lints allow
# which produces the files
# -rw-rw-r-- autocfg-6e4def821aa49e9d.d
# -rw-rw-r-- libautocfg-6e4def821aa49e9d.rlib
# -rw-rw-r-- libautocfg-6e4def821aa49e9d.rmeta
which builds a crate. These arguments,
- plus what crate features are requested,
- plus whether a build.rs file is involved,
- plus the Rust toolchain versions or hashes,
- plus hashes of all OS-specific libs and binaries that some crates rely on,
- plus ???
should provide enough information to build a crate’s cache key (humongous caveat: in hopefully most cases).
From there it’s hermetic builds, content-addressable storage and distributed compilation turtles all the way down. Easy!
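To make that concrete, here is one naive way such a key could be assembled (a sketch only, not how cargo or sccache actually compute their hashes; $RUSTC_ARGS is a hypothetical variable standing in for the argument list shown above):

# Hash the toolchain, the full rustc invocation and the crate sources together:
$ {
    rustc --version --verbose    # toolchain version, commit hash, host triple
    echo "$RUSTC_ARGS"           # the full rustc argument list (flags, features, metadata)
    cat $HOME/.cargo/registry/src/github.com-1ecc6299db9ec823/autocfg-1.1.0/src/*.rs
  } | sha256sum
# -> a candidate cache key for this particular build of autocfg 1.1.0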
Figuring out a crate’s cache key depends on that project’s specific build plan, which itself might depend on some environment variables, the version of the current shell… In short, once a crate relies on a build.rs file for compilation (meaning some non-cargo/rustc code from the Internet gets executed on your machine!), all cache key bets are off.
Caching should be possible for some (most?) crates, but not for these ones anyway.
The RUSTC_WRAPPER setting should help get there, although looking at sccache’s caveats…
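For the record, this is how sccache hooks into cargo through that very setting:

# Route every rustc invocation through sccache:
$ export RUSTC_WRAPPER=sccache
$ cargo build --release
# Inspect cache hits and misses:
$ sccache --show-stats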
UPDATE: my attempt: cargo-green
Private code & security
These are the privacy and security concerns I can see from my echo-chamber-slash-comfy-chair:
- Building non-public code / private dependencies
  - and not contributing these back to the build cache
- Not being able to reconstruct a private project from the cache, nor to tell how it accesses the cache
- Asserting a given build artifact was not tampered with
To address these points I see:
- Tooling should separate public from private
  - the tools closer to the language have the best semantics
  - a private cache should be easy to set up (in any Continuous Integration system)
    - with fallback on the single public cache on misses
- A Merkle tree (sketched after this list)
  - content-addressable storage
  - with granular enough blocks (BitTorrent v2 says >=16kiB)
  - a dependency would be associated with its root hash in the tree and resolve to many blocks
  - with numerous enough accesses to hide a single actor’s usage
  - Bazil seems related, has good security buzzwords
    - it does not seem to be aimed at being a cache (e.g. pruning should be harmless)
- Stochastic re-building / cache pruning
  - should provide a reliable way of Proof-of-Work
  - by computing builds multiple times (and achieving the same results)
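As a strawman for that content-addressable layout (purely illustrative; file names, block size and hashing scheme are made up for the example), a built artifact could be split into blocks, each stored under its own hash, with a root hash tying them together:

# Split a build artifact into >=16kiB blocks and store each block under its hash:
$ mkdir -p cas && split -b 16k libautocfg-6e4def821aa49e9d.rlib block.
$ for b in block.*; do
    h=$(sha256sum "$b" | cut -d' ' -f1)
    mv "$b" "cas/$h" && echo "$h" >> manifest    # keep the original block order
  done
# The dependency's root hash is then the hash of its ordered block manifest:
$ sha256sum manifest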