The other week, an acquaintance of mine was kvetching on Twitter
about how the Rust compiler is
written in Rust, and so to get started with the language you have to
download a binary, and there’s no way to validate it—you could use the
binary plus the matching compiler source to recreate the binary, but that
doesn’t prove anything, and also if the compiler were
really out to get you, you would be screwed the moment you ran
the binary.
This is not a new problem, nor is it a Rust-specific problem; it is
Ken Thompson's "trusting trust" problem, described in his 1984 paper
"Reflections on Trusting Trust." I
recall having essentially the same issue back in 2000, give or take,
with GNAT, the Ada
front-end for GCC. It is written in Ada, and (at the time, anyway) not
just any Ada compiler would do, you had to have a roughly
contemporaneous version of … GNAT. It was especially infuriating
compared to the rest of GCC, which (again, at the time) bent over
backward to be buildable with any C compiler you could get your
hands on, even a pre-ANSI "traditional"
one that didn't support all of the
1989 language standard. But even that is problematic for someone who
would rather not trust any machine code they didn’t verify
themselves.
One way around the headache is diverse
recompilation (David A. Wheeler calls it "diverse double-compiling"),
in which you compile the same compiler source with
two different compilers, then recompile it with
itself-as-produced-by-each, and compare the results: if neither seed
compiler is playing games, the two final binaries come out
bit-for-bit identical. But this requires
you to have two different compilers in the first place. As of this
writing there is only one Rust compiler. There aren’t that many
complete implementations of C++ out there, either, and you need one of
those to build LLVM (which Rust depends on). I think you could devise a
compiler virus that propagates itself via both LLVM and
GCC, for instance.
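Here's a sketch of the check, in Python for concreteness. Everything
in it is a stand-in: the seed compilers ("cc-a", "cc-b"), the source
name, the build command line. A real run would also have to pin
flags, environment, and timestamps, because the whole scheme assumes
builds are reproducible in the first place.

```python
# Sketch of diverse recompilation. The seed compilers ("cc-a", "cc-b"),
# source file, and command line are all hypothetical stand-ins.
import hashlib
import subprocess

def build(compiler, source, output):
    # Stand-in for however the compiler under test really gets built;
    # flags and environment would need to be pinned for reproducibility.
    subprocess.run([compiler, source, "-o", output], check=True)

def digest(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Stage 1: build the compiler under test with each seed compiler.
# The two stage-1 binaries will differ; that's expected.
build("cc-a", "compiler.src", "stage1-a")
build("cc-b", "compiler.src", "stage1-b")

# Stage 2: rebuild the compiler with each stage-1 binary. Each result
# is "the compiler, as compiled by itself," so if neither seed smuggled
# anything in, the two stage-2 binaries are bit-for-bit identical.
build("./stage1-a", "compiler.src", "stage2-a")
build("./stage1-b", "compiler.src", "stage2-b")

assert digest("stage2-a") == digest("stage2-b"), "stage-2 binaries differ!"
```

Note the limits, though: this catches a backdoor carried by only one
of the seeds. Two seeds colluding on the identical backdoor would
still pass, which is why the seeds need to be as unrelated as
possible.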
What’s needed, I think, is an independent root of correctness. A
software environment built from scratch to be verifiable, maybe even
provably correct, and geared specifically to host independent
implementations of compilers for popular languages. They need not be
terribly good at optimizing, because the only thing you’d ever use them
for is to be one side of a diversely-recompiled bootstrap sequence. It
has to be a complete and isolated environment, though, because it
wouldn’t be impossible to propagate a compiler virus through the
operating system kernel, which can see every block of I/O, after
all.
And it seems to me that this environment naturally divides into four
pieces. First, a tiny virtual machine. I’m thinking a FORTH
interpreter, which is small enough that one programmer can code it by
hand in assembly language, and having done that, another programmer can
audit it by hand. You need multiple implementations of this, so
you can check them against each other to guard against malicious lower
layers—it could run on the bare metal, maybe, but the bare metal
has an awful lot of clever embedded in it these days. But hopefully this
is the only thing you need to implement more than once.
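To give a sense of how little machinery that bottom layer needs,
here's a toy FORTH-style interpreter. (In Python, which misses the
point of writing it in auditable assembly; it's only meant to show
the shape: a data stack, a handful of primitive words, and colon
definitions for building new words out of old ones.)

```python
# Toy FORTH-style interpreter: a data stack, four primitive words,
# and colon definitions. The real thing would be hand-written,
# hand-audited assembly; this only shows the shape.
def forth(program):
    stack, dictionary = [], {}
    prims = {
        "+":   lambda: stack.append(stack.pop() + stack.pop()),
        "*":   lambda: stack.append(stack.pop() * stack.pop()),
        "dup": lambda: stack.append(stack[-1]),
        ".":   lambda: print(stack.pop()),
    }
    def run(word):
        if word in prims:
            prims[word]()
        elif word in dictionary:          # user-defined word: run its body
            for w in dictionary[word]:
                run(w)
        else:
            stack.append(int(word))       # anything else must be a number
    tokens = iter(program.split())
    for tok in tokens:
        if tok == ":":                    # colon definition: ": name body ;"
            name, body = next(tokens), []
            for t in tokens:
                if t == ";":
                    break
                body.append(t)
            dictionary[name] = body
        else:
            run(tok)

forth(": square dup * ; 7 square .")      # prints 49
```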
Second, you use the FORTH interpreter as the substratum for a more
powerful language. If there’s a language in which each program is its
own proof of correctness, that would be the obvious choice, but my
mental allergy to arrow
languages has put me off following that branch of PL research. Lisp
is generally a good language to write compilers in, so a small dialect
of that would be another obvious choice. (Maybe leave out the call/cc.)
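For a sense of scale, the sort of dialect I mean fits in an evaluator
like this toy, with Python again standing in for what would really be
bootstrapped from FORTH words: integers, symbols, if, lambda, and not
much else.

```python
# Toy evaluator for a small Lisp dialect: integers, symbols, if,
# lambda, and whatever the environment provides. Illustrative only.
def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        lst = []
        while tokens[0] != ")":
            lst.append(parse(tokens))
        tokens.pop(0)                      # discard the ")"
        return lst
    return int(tok) if tok.lstrip("-").isdigit() else tok

def evaluate(x, env):
    if isinstance(x, str):                 # symbol: variable reference
        return env[x]
    if isinstance(x, int):                 # literal
        return x
    op, *args = x
    if op == "if":
        cond, then, alt = args
        return evaluate(then if evaluate(cond, env) else alt, env)
    if op == "lambda":                     # (lambda (params...) body)
        params, body = args
        return lambda *vals: evaluate(body, {**env, **dict(zip(params, vals))})
    f = evaluate(op, env)                  # function application
    return f(*[evaluate(a, env) for a in args])

env = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
print(evaluate(parse(tokenize("((lambda (x) (* x x)) 6)")), env))  # 36
```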
Third, you write compilers in the more powerful language, with both
the FORTH interpreter and more conventional execution environments as
code-generation targets. These compilers can then be used to compile
other stuff to run in the environment, and conversely, you can build
arbitrary code within the environment and export it to your more
conventional OS.
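The dual-target idea is easy to see in miniature. Here's a
hypothetical front end for arithmetic expressions with two code
generators, one emitting postfix words for the FORTH machine, one
emitting a C expression for a conventional toolchain:

```python
# One hypothetical front end (an AST of nested (op, lhs, rhs) tuples
# with integer leaves), two code generators: postfix words for the
# FORTH machine, a C expression for a conventional toolchain.
def to_forth(ast):
    if isinstance(ast, int):
        return str(ast)
    op, lhs, rhs = ast
    return f"{to_forth(lhs)} {to_forth(rhs)} {op}"   # operands, then operator

def to_c(ast):
    if isinstance(ast, int):
        return str(ast)
    op, lhs, rhs = ast
    return f"({to_c(lhs)} {op} {to_c(rhs)})"

ast = ("+", ("*", 2, 3), 4)
print(to_forth(ast))   # 2 3 * 4 +
print(to_c(ast))       # ((2 * 3) + 4)
```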
The fourth and final piece is a way of getting data in and out of the
environment. I imagine it as strictly batch-oriented, not interactive at
all, simply because that cuts out a huge chunk of complexity from the
FORTH interpreter; similarly it does not have any business talking to
the network, nor having any notion of time, maybe not even
concurrency—most compile jobs are embarrassingly
parallel, but again, huge chunk of complexity. What feels not-crazy
to me is some sort of trivial file system: ar
archive level of trivial, all files write-once, imposed on a linear
array of disk blocks.
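Something on the order of this toy, say, where the "disk" is a flat
byte array, every file is appended once and never touched again, and
the entire on-disk structure is a name table plus block-aligned data
(block size and file names invented for illustration):

```python
# Toy write-once file store: the "disk" is a flat byte array, the
# whole on-disk structure is a name table plus block-aligned data.
# Block size and file names are invented for illustration.
BLOCK = 512

class WriteOnceFS:
    def __init__(self):
        self.disk = bytearray()            # linear array of "disk blocks"
        self.table = {}                    # name -> (offset, length)

    def write(self, name, data):
        if name in self.table:
            raise FileExistsError(name)    # write-once: no second chances
        self.table[name] = (len(self.disk), len(data))
        self.disk += data + b"\0" * (-len(data) % BLOCK)  # pad to a block

    def read(self, name):
        offset, length = self.table[name]
        return bytes(self.disk[offset:offset + length])

fs = WriteOnceFS()
fs.write("hello.src", b"2 3 * 4 + .")
print(fs.read("hello.src"))
```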
It is probably also necessary to reinvent Make,
or at least some sort of batch job control
language.
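Not much of one, though; a dependency graph and a deterministic
depth-first walk over it covers most of what these batch jobs need.
A sketch, with made-up rules:

```python
# Toy batch job control: target -> (dependencies, command), resolved
# by a depth-first walk. Rules and commands are made up.
rules = {
    "stage1":  (["compiler.src"], "build stage1 from source"),
    "stage2":  (["stage1", "compiler.src"], "rebuild with stage1"),
    "exports": (["stage2"], "compile user code with stage2"),
}

def build(target, done=None):
    done = set() if done is None else done
    if target in done or target not in rules:
        return                             # plain sources have no rule
    deps, command = rules[target]
    for dep in deps:                       # dependencies first
        build(dep, done)
    print("run:", command)                 # a real version runs a batch job
    done.add(target)

build("exports")                           # stage1, stage2, exports, in order
```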