Cross-checking Tutorial
Introduction
The C2Rust transpiler aims to convert C code to semantically equivalent unsafe Rust code, and later incremental refactoring passes gradually transform this code to Rust code. However, the initial Rust translation might not be a perfect semantic match to the original C code, and the refactoring passes may also change the code in ways that break semantics. Cross-checking is an automated way to verify that the translated program behaves the same as the original C code.
The way cross-checking achieves this goal is by comparing the execution traces of all versions (henceforth called "variants") of the program (original C, unsafe and refactored Rust) and checking for any differences. Our cross-checking implementation modifies the source code of the program at compile-time (separately during C and Rust compiler invocation) so that the variants output the traces at run-time, and then checks the traces against each other either online during execution (using the ReMon MVEE), or offline by comparing log files. The C2Rust cross-checkers currently instrument function entry and exit points, function return values, and function call arguments (currently experimental and disabled by default, but can be enabled per argument, function or file).
Example
To illustrate how cross-checking works, let us take the following code snippet:
int foo() {
return 1;
}
Calling the foo
function will cause the following cross-check events to be emitted:
XCHECK(Ent):193491849/0x0b887389
XCHECK(Exi):193491849/0x0b887389
XCHECK(Ret):8680820740569200759/0x7878787878787877
Building code with cross-checks
C2Rust contains one cross-checking implementation per language, in the form of a compiler plugin in both cases. We provide a clang plugin for C code, and a rustc plugin for Rust code.
Building C code
To build C variants with cross-checks enabled, first build the cross-checking plugin using $C2RUST/scripts/build_cross_checks.py
,
then run clang (or pass it to the build system) with the following options:
-Xclang -load -Xclang $C2RUST/build/clang-xcheck-plugin.$(uname -n)/plugin/CrossChecks.so
to load the plugin-Xclang -add-plugin -Xclang crosschecks
to activate the plugin-Xclang -plugin-arg-crosschecks -Xclang <...>
for every additional option to pass to the plugin-ffunction-sections
may be required to correctly deduplicate some linkonce functions inserted by the plugin
Note that every option passed to clang requires a -Xclang
prefix before the actual option,
so that the compiler driver passes it to the clang backend correctly.
We provide a cc_wrapper.sh
script in the plugin source code directory that inserts these automatically,
as well as several project-specific scripts in directories under examples/
.
Additionally, the following arguments should be passed to the linker:
- The cross-checking runtime library from
$C2RUST/build/clang-xcheck-plugin.$(uname -n)/runtime/libruntime.a
- A cross-checking backend library that provides the
rb_xcheck
function, e.g.,libfakechecks
for offline logging orlibclevrbuf
for online MVEE-based checks
Building Rust code
Building Rust code with cross-checks is simpler that C code, and only requires a few additions
to Cargo.toml
and the main Rust source file. Add the following to your Cargo.toml
file
(replacing $C2RUST
to the actual path to this repository):
[dependencies.c2rust-xcheck-plugin]
path = "$C2RUST/cross-checks/rust-checks/rustc-plugin"
[dependencies.c2rust-xcheck-derive]
path = "$C2RUST/cross-checks/rust-checks/derive-macros"
[dependencies.c2rust-xcheck-runtime]
path = "$C2RUST/cross-checks/rust-checks/runtime"
features = ["libc-hash", "fixed-length-array-hash"]
and this preamble to your lib.rs
or main.rs
:
#![feature(plugin, custom_attribute)]
#![cross_check(yes)]
#[macro_use] extern crate c2rust_xcheck_derive;
#[macro_use] extern crate c2rust_xcheck_runtime;
You may also add #![plugin(c2rust_xcheck_plugin(...))]
to pass additional arguments to the cross-checking plugin.
Cross-check configuration
Cross-checks can be customized at a fine granularity using cross-check configuration files or inline attributes.
Running cross-checked programs
Offline mode
When cross-checking in offline mode, all variants are executed independentely on the same inputs, and their cross-checks are written to either standard output or log files. After running all the variants, divergence can be detected by manually comparing the logs for mismatches. There are several backend libraries that support different types of logging outputs:
libfakechecks
outputs a list of the cross-checks linearly to either standard output or a file (specified using theFAKECHECKS_OUTPUT_FILE
environment variable)zstd-logging
library fromcross-checks/rust-checks/backends
(can also be used with the clang plugin) outputs a binary encoding of the cross-checks that is compressed using zstd, and is much more space-efficient than the text output oflibfakechecks
. The compressed output files can be converted to text using thexcheck-printer
tool.
Before running the C and Rust variants, you may need to load in one of these libraries using LD_PRELOAD
if you
haven't linked against it and passed in its path using -rpath
(this is fairly easy to do for a C build, but
more complicated when using Cargo for Rust code), like this:
$ env LD_PRELOAD=$C2RUST/cross-checks/libfakechecks/libfakechecks.so ./a.out
Running each variant with cross-checks enabled will print a list of cross-check results to the specified output. A simple diff
or cmp
command will show differences in cross-checks, if any.
Online (MVEE) mode
The other execution mode for cross-checks is the online mode, where a monitor program (the MVEE) runs all variants in parallel with exactly the same inputs (by intercepting input system calls like read
and replicating their return values) and cross-checks all the output system calls and instrumentation points inserted by our plugins. This approach has several advantages over offline mode:
- Input operations are fully replicated, including those from stateful resources like sockets; only the master variant performs each actual operation, and each other variant only gets a copy of the data.
- Outputs are cross-checked but not duplicated, so each output operation is only executed by the master variant; the others are only cross-checked for matching outputs. For example, only the master variant opens and writes to output files.
- The lock-step MVEE automatically eliminates most sources of non-determinism, like threading and non-deterministic syscalls, e.g., reading from
/dev/urandom
(see the Troubleshooting section below for more details)
However, the main disadvantage of this approach is that some applications may not run correctly under the MVEE, due to either incomplete support from the MVEE or fundamental MVEE limitations. In such cases, we recommend using offline mode instead.
To run your application inside our MVEE, first build it following the instructions in its README. After building it successfully, write an MVEE configuration file for your application (there is a sample file in the MVEE directory, and a few others in our examples directory), then run the MVEE:
$ ./MVEE/bin/Release/MVEE -f <path/to/MVEE_config.ini> -N<number of variants> -- <variant arguments>
The MVEE.ini
configuration file is fairly self-explanatory, but there are a few notable settings that are important:
xchecks_initially_enabled
disables system call replication and cross-checks up to the first function cross-check (usually for themain
function), and should befalse
by default for cross-language checks. This is because the Rust runtime performs a few additional system calls that C code does not, and the MVEE would terminate with divergence if cross-checks were enabled.relaxed_mman_checks
andunsynced_brk
disable MVEE cross-checks on themmap
family of calls andbrk
, respectively, and should both be set totrue
if the Rust code performs significantly different memory allocations.path
specifies the path to the variant's executable, and should be specified separately per variant (ReMon also supports running multiple variants for the same binary, but with different command line arguments; this is not used by C2Rust cross-checks). All variants should be files in the same directory, otherwise the MVEE will abort with a divergence inside the ELF loader.argv
specifies the arguments to pass to each variant (can be configured per-variant or globally for all variants).env
specifies the environment variables to pass to the variants, and should at least contain aLD_LIBRARY_PATH
entry for thelibclevrbuf.so
library, and aLD_PRELOAD
entry for the zeroing allocatorlibzero_malloc.so
, like this:
{
"variant": {
"global": {
"exec": {
"env": [
"LD_LIBRARY_PATH=../../../cross-checks/ReMon/libclevrbuf",
"LD_PRELOAD=../../../cross-checks/zero-malloc/target/release/libzero_malloc.so"
]
}
}
}
}
Troubleshooting
In case you run into any issues while building or running the variants, please refer to this section for possible fixes.
Build failures
Builds may occasionally fail because of partially or completely unsupported features in our plugins:
- Bitfields in C structures: these do not currently have a Rust equivalent, and the transpiler converts them to regular integer types. The clang plugin will exit with an error when trying to cross-check a bitfield.
- Variadic functions: Rust does not support these yet, and our clang plugin cannot handle them either.
- Fixed-sized arrays of large or unusual sizes: as of the writing of this document, Rust does not have const generics yet, and we need them to support arbitrary-sized arrays on the Rust side. Until then, the rustc plugin runtime only supports cross-checking on fixed-sized arrays from a limited set of sizes (all integers up to 32, all powers of 2 up to 1024).
- Function pointers with more than 12 arguments: these require variadic generics in Rust.
In all these cases, we recommend that you either disable cross-checks for values of these types, or manually provide a custom cross-check function (see the cross-check configuration for more details).
Inline functions in C headers
Cross-checker output for C variants may sometimes contain additional cross-checks for inline functions from system headers which are missing from the corresponding Rust translation. In such cases, we recommend manually disabling cross-checks for the C inline functions using an external cross-check configuration file.
Non-determinism
Ideally, divergence (cross-checks differences between variants) is only caused by actual semantic differences between the variants. However, there is another cause of unintended divergence: nondeterminism in program execution. In most cases, non-determinism will simply cause each variant to produce different cross-checks, but it may also occasionally cause crashes. There are many causes of program non-determinism that interfere with cross-checking:
- Calls to time, RNG, PID (
getpid()
and friends) and other non-deterministic OS functions and operations, e.g., reads from/dev/urandom
- File descriptors. We have hard-coded a fixed cross-check value for all
FILE*
objects for this exact reason. - Threading, i.e., non-deterministic thread scheduling
- Pointer-to-integer casts under ASLR.
Generally, running the variants in online mode inside the MVEE fixes these issues by replicating all system calls between the variants, which ensures that they all receive the same values from the OS.
In case the MVEE does not support a specific application and you need to run it in offline mode (or for any other reason), the recommended fix is to remove all such non-determinism from your code manually, e.g., replace all reads from /dev/urandom
with constant values.
To verify that non-determinism truly is the cause of divergence, we recommend running each separate variant multiple times and cross-checking it against itself. If non-determinism really is the problem, each run will produce different cross-checks.
A note on ASLR and pointers: our cross-checkers currently check pointers by dereference instead of by address, thereby making the checks insensitive to ASLR. However, manually casting pointers to integers poses problems because integers cannot be dereferenced. Integers are cross-checked by value regardless of their source, and their values will differ across runs when they originate from pointers with ASLR enabled.
Uninitialized memory
Uninitialized memory is one common source of non-determinism, since an uninitialized value may have different actual values across different runs of a program. Since our plugins cross-check pointers by dereferencing them, invalid pointers can also crash our cross-checking runtime.
To eliminate this problem, we force zero-initialization for all C and Rust values.
The plugins enforce this for stack (function-local values), and all global values are already zero-initialized (as required by the C standard), which only leaves heap-allocated values, i.e., those allocated using malloc
.
We provide a zeroing malloc
wrapper under cross-checks/zero-malloc
which can be preloaded into an application using LD_PRELOAD
. This wrapper library intercepts all memory allocation calls, and zeroes the allocated buffer after each call. To use this in Rust executables in place of the default jemalloc
, add the following lines to your code to use the system allocator, which our library intercepts:
#[global_allocator]
static A: ::std::alloc::System = ::std::alloc::System;
We recommend that you use our zeroing memory allocator for all cross-checks.
Pointer aliasing
Some C code will use pointers of some type T
to refer to values of another type U
, tricking our runtime into cross-checking the values incorrectly.
This may not only cause divergence, but also potential crashes when our runtime attempts to cross-check actual integers as invalid pointers (see example below). If the value of the integer incidentally represents a valid memory address, the runtime will try to cross-check that memory as a T
; otherwise, the runtime will most likely crash.
struct T {
int n;
};
struct U {
char *s;
};
void foo(struct U *x) {
// Cross-check problem here: will try to cross-check `x` as a `struct U*`,
// when it's a `struct T*`, so the check will most likely crash
// when attempting to dereference x->s
}
int main() {
T x = { 0x1234 };
foo((struct U*)&x);
return 0;
}
Our cross-checking runtimes can recover from attempts to dereference invalid pointers, but rely on the pointer-tracer
tool that uses ptrace
to check and restart all invalid pointer dereferences.
To use this recovery feature, you must pointer-tracer
to start the variants:
$ $C2RUST/cross-checks/pointer-tracer/target/release/pointer-tracer ./a.out
Alternatively, some issues caused by pointer aliasing can be fixed by disabling cross-checks altogether for certain types and values, or by providing custom cross-check functions for certain types. For example, one common pattern is the tagged union, where multiple structures have an overlapping prefix with a tag, followed by a type-specific part:
enum Tag {
TAG_A,
TAG_B
};
struct TypeA {
// Common part
Tag tag;
char *foo;
// Specific part
int valA;
};
struct TypeB {
// Common part
Tag tag;
char *foo;
// Specific part
void *valB;
};
For this example, you can either disable cross-checks manually for all the type-specific types, e.g., valA
and valB
above, or provide a custom cross-check function that cross-checks each value based on its tag.
End-of-buffer pointers
Another common C pattern related to memory allocations is the pointer to end of buffer pattern:
struct Buf {
char *start;
char *end;
};
void alloc(struct Buf *buf, unsigned size) {
buf->start = malloc(size);
buf->end = buf->start + size;
}
In this example, cross-checks will diverge on the end
pointer, since it points to the first byte after the allocation returned by malloc
. Since we only require the allocation itself to be zero-initialized, the value of that byte is undefined, and could change at any time during the execution of the variant.
For any pointers outside allocated memory, we recommend disabling cross-checks altogether.
Benign dangling pointers
The cross-checking runtimes may attempt to cross-check pointers to values that have been deallocated, e.g., by calling free
, but still linger in memory without being used by the program.
This means that the runtimes may dereference these pointers, even if the program never does, which may lead to divergence since the allocator is free to reuse that memory.
struct Data {
int allocated;
char *buf;
};
struct Data *free_data(struct Data *data) {
data->allocated = 0;
free(data->buf);
// Potential non-determinism: data->buf has been freed,
// but our runtime will try to dereference it
return data;
}
As of the writing of this document, we have no automatic way to detect when the runtimes attempt to dereference deallocated memory, so we recommend manually disabling cross-checks when this occurs.
Compiler builtins and optimizations
In some cases, clang and rustc may optimize the generated code differently in ways that produce divergence.
For example, clang (more specifically, a LLVM optimization pass enabled by clang) converts single-argument printf
calls to direct calls to puts
. For example, clang converts printf("Hello world!\n");
to puts("Hello world!");
. The two functions have different internal implementations and make different syscalls (mainly the write
syscall), so this optimization causes divergence. We recommend compiling all C code with the -fno-builtin
argument to prevent this.