Is that really the source code for this software?
I've been looking into how easy it is to confirm that a binary package corresponds to a source package. It turns out that it is not easy at all. So I've written down my findings in this blog entry.
I think reproducible builds are of fundamental importance to the free software community and beyond, yet the trustworthiness of binaries built from source code is a neglected topic. We know about tivoization and the reality that code can be open yet unchangeable. What is not sufficiently appreciated is that parties can, quite unchecked, distribute binaries that do not correspond to the alleged source code.
Trust is good, but especially in a post-Snowden world, control is better. Can a person rely on binaries or should we all compile from source? I hope to raise awareness about the need for a reproducible way to create binaries from source code.
Free software means users have the four essential freedoms. Freedom 1 is the freedom to study how the program works and change it so it does your computing as you wish. It also means that the program does not do what you do not want it to do. Instead of having to trust the supplier of the software, you can check that the software works as advertised and does not contain, for example, spyware. Access to the source code is a precondition for this freedom.
Many software packages are distributed in binary form and come with a license that makes the right to the source code explicit. For example, the GNU GPL v2.0 says: "For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable."
A license that promises access to the source code is one thing, but an interesting question is: is the published source code the same source code that was used to create the executable? The straightforward way to find this out is to compile the code and check that the result is the same. Unfortunately, the result of compiling the source code depends on many things besides the source code and build scripts such as which compiler was used. No free software license requires that this information is made available and so it would seem that it is a challenge to confirm if the given source code corresponds to the executable.
Collecting software packages that form a working operating system is one of the services of a distribution. Another service that most provide is compiling that software into executables and shipping those in convenient packages. Most distributions ship two types of packages: source packages and binary packages. A distribution is a complete system that includes all the tools to compile source code. Those tools go beyond the tools that are used in the build scripts from the upstream developer. Distributions contain tools to create binary packages from source packages. Does this mean that it is less of a challenge to confirm if the source code corresponds to the executable?
Doing the test
I have built a binary package from a source package for a number of distributions (Debian, Fedora, and OpenSUSE) and compared the self-built binary package with the one published by the distribution. All tests were run on fresh, minimal installs of the latest version of each distribution using the tools that are recommended by the distributions. To keep the complexity low, one simple package was chosen: tar. Will the self-built package be exactly the same, totally different or only slightly different?
Debian was installed from a downloaded netinstall image: debian-7.0.0-amd64-netinst.iso. The system was installed on a VirtualBox machine. The version of tar that comes with Debian is 1.26+dfsg-0.1. According to the instructions compiling the tar package from source is as simple as running:
apt-get -b source tar
which downloads the source files and compiles them. This results in a file: tar_1.26+dfsg-0.1_amd64.deb. The name of the file is the same as the name of the binary package published by Debian, but the size of the file is different from the size of the published package, 984376 vs 984768. Running the command again in a different empty directory gives yet another size for the deb file. The command apt-get -b source tar is clearly not deterministic.
To investigate what the differences between the packages are, they are unpacked:
ar vx tar_1.26+dfsg-0.1_amd64.deb
tar xf control.tar.gz
tar xf data.tar.gz
and the files in the self-built and the published package are compared. The files /bin/tar, /usr/sbin/rmt-tar and /usr/share/man/man1/tar.1.gz differ. The manual file is the easiest to investigate. It turns out that it has a header with the date and time at which it was created: generated by script on Fri May 24 15:52:20 2013.
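The file-by-file comparison described above can be sketched in a few lines of shell. This is only an illustration: the helper name list_differing_files and the demo directory layout are made up for this example and are not part of any packaging tool.

```shell
#!/bin/sh
# list_differing_files DIR_A DIR_B  (hypothetical helper)
# Prints every regular file under DIR_A that is missing from DIR_B or
# whose contents differ. Files that exist only in DIR_B are not reported;
# run the function a second time with the arguments swapped to catch those.
list_differing_files() (
    dir_a=$(cd "$1" && pwd) || exit 1
    dir_b=$(cd "$2" && pwd) || exit 1
    cd "$dir_a" || exit 1
    find . -type f | LC_ALL=C sort | while IFS= read -r path; do
        if [ ! -f "$dir_b/$path" ]; then
            echo "only in first tree: $path"
        elif ! cmp -s "$path" "$dir_b/$path"; then
            echo "differs: $path"
        fi
    done
)

# Demo on two throwaway trees standing in for the unpacked packages.
demo=$(mktemp -d)
mkdir -p "$demo/selfbuilt/bin" "$demo/published/bin"
mkdir -p "$demo/selfbuilt/usr/share/doc" "$demo/published/usr/share/doc"
printf 'executable with build id A\n' > "$demo/selfbuilt/bin/tar"
printf 'executable with build id B\n' > "$demo/published/bin/tar"
printf 'same text\n' > "$demo/selfbuilt/usr/share/doc/copyright"
printf 'same text\n' > "$demo/published/usr/share/doc/copyright"

list_differing_files "$demo/selfbuilt" "$demo/published"
# prints: differs: ./bin/tar
```

Comparing contents with cmp rather than sizes or modification times matters here, since two files of the same size can still differ in a handful of bytes.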
The files /bin/tar differ in 20 consecutive bytes. The files /usr/sbin/rmt-tar also only differ in 20 consecutive bytes. With readelf -a bin/tar this can be investigated: the difference between the two executables is in the ".note.gnu.build-id" ELF note section. This section can be set with the argument --build-id of ld which defaults to taking the sha1 sum of the linked object files. The build id is derived from the object files. In the Debian build, the object files are created with debug information which is later removed from the executable by stripping. The debug information contains the build path and it is this build path which is the reason for the different build id. If tar is compiled repeatedly in the same directory the binary will be identical. A tar executable compiled in a different directory will have a different build id. The build id could be left out with ld --build-id=none.
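To find out whether two binaries differ in one short run of bytes, as the tar executables do here, the output of cmp -l can be summarised. The sketch below uses a made-up helper name, differing_byte_range, and two small synthetic files instead of real executables.

```shell
#!/bin/sh
# differing_byte_range FILE_A FILE_B  (hypothetical helper)
# Summarises the output of "cmp -l", which prints one line per differing
# byte, starting with the 1-based offset of that byte.
differing_byte_range() {
    cmp -l "$1" "$2" 2>/dev/null | awk '
        NR == 1 { first = $1 }
        { last = $1; count++ }
        END {
            if (count == 0) { print "identical"; exit }
            printf "first=%d last=%d count=%d consecutive=%s\n",
                first, last, count,
                (last - first + 1 == count) ? "yes" : "no"
        }'
}

# Demo: two 64-byte files that differ only in bytes 17 to 36.
demo=$(mktemp -d)
{ printf '%016d' 0; printf '%020d' 0;          printf '%028d' 0; } > "$demo/published"
{ printf '%016d' 0; printf '%020d' 0 | tr 0 1; printf '%028d' 0; } > "$demo/selfbuilt"

differing_byte_range "$demo/published" "$demo/selfbuilt"
# prints: first=17 last=36 count=20 consecutive=yes
```

The reported offsets can then be matched against the section offsets that readelf prints to see which section the differing bytes belong to.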
Apart from these two differences, there is another common difference from the published binary package. A deb archive is an ar archive that contains two tar archives: control.tar.gz and data.tar.gz. The ar archive and the two tar archives contain timestamps. This can be seen with ar tv tar_1.26+dfsg-0.1_amd64.deb and tar tvf control.tar.gz. If a build should be repeatable, the time that is stored should be a time that is taken from the provided files and not from the computer clock. The timestamps, user and group and file mode information can be left out of archives. ar can be run in a deterministic mode: ar qD archive-file file... and tar takes arguments for explicitly setting this information (--mtime, --owner, --group, and --mode).
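As a rough sketch of what such a repeatable archive step could look like, the function below creates a tar.gz whose bytes depend only on the file names and contents. It assumes GNU tar, GNU find and GNU sort. The function name make_deterministic_tarball and the fixed time value are invented for this example, and this is not how dpkg actually builds its archives. Besides the tar options named above, it sorts the file list and passes -n to gzip, which otherwise stores a timestamp of its own in the gzip header.

```shell
#!/bin/sh
# make_deterministic_tarball SRC_DIR FIXED_EPOCH  (hypothetical helper)
# Writes a gzip-compressed tar archive of SRC_DIR to standard output.
make_deterministic_tarball() (
    cd "$1" || exit 1
    # Sort the file list so the member order does not depend on the order
    # in which the file system happens to return directory entries.
    find . -print0 | LC_ALL=C sort -z |
        tar --create --file=- \
            --no-recursion --null --files-from=- \
            --mtime="@$2" \
            --owner=0 --group=0 --numeric-owner \
            --mode=go-w |
        gzip -n
)

# Demo: two trees with the same contents but different modification times.
demo=$(mktemp -d)
for tree in first second; do
    mkdir -p "$demo/$tree/usr/share/doc"
    printf 'hello\n' > "$demo/$tree/usr/share/doc/README"
done
touch -t 201305241552.20 "$demo/first/usr/share/doc/README"

fixed_time=1369000000   # arbitrary value chosen for this example
make_deterministic_tarball "$demo/first"  "$fixed_time" > "$demo/first.tar.gz"
make_deterministic_tarball "$demo/second" "$fixed_time" > "$demo/second.tar.gz"

cmp "$demo/first.tar.gz" "$demo/second.tar.gz" && echo "archives are identical"
```

In a real package the fixed time would have to come from the source package itself, for example from the date of the latest changelog entry, so that anyone rebuilding the package arrives at the same value.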
The binary package that was built from a Debian source package was not identical to the published binary package, but the differences are limited to timestamps and the build id in the executables. Unless the function of the executable relies on this build-id, the self-built tar executable functions in the same way as the published version.
Fedora 18 was installed from a net install. The option 'minimal system' was chosen as the software selection option. This creates a system with 236 packages that take up 625MB. The tar binary and source RPMs were downloaded from the fedora repository and built with:
mock --rebuild tar-1.26-9.fc18.src.rpm
and unpacked with
rpm2cpio tar-1.26-9.fc18.x86_64.rpm | cpio -idmv
The files that differ are /bin/tar and four info files with paths like /usr/share/info/tar.info.gz. Interestingly, the files /usr/share/man/man1/tar.1.gz are the same in published and compiled packages. This is because the man file is taken from the source package: Fedora has modified the man page and ships the generated version in the source rpm. The man pages for tar and gtar are the same file.
The info files give a large diff. This is due to the presence of a timestamp and a lot of generated cross-references. The executables are also very different. The self-compiled tar is 8 bytes larger. The build id is different and there are differences scattered throughout the file. Many of these are just single bytes and are probably different offsets to functions. This idea is consistent with the difference in output of readelf -a tar. All the function names are there in the same order, but many numbers are different.
Just like ar and tar files, rpm files contain timestamps, which can be seen with rpm -qvlp tar-1.26-9.fc18.x86_64.rpm. The compiled files carry the time of the build as their timestamp.
The Fedora package showed more differences with the published package than the Debian package did and unlike the Debian case, not all of the differences can be explained. The executable built from the published sources is so different from the published executable that it is not easy to know if it will function the same way.
A minimal system with OpenSUSE 12.3 was set up from a network installation iso. In OpenSUSE it is also easy to create a binary package from a source package:
rpmbuild --rebuild tar-1.26-14.1.1.src.rpm
Only two files differed: /bin/tar and /usr/share/man/man1/tar.1.gz. The man files differed, as in the deb file, due to their timestamp. The tar binary contained a surprise: the self-built file was 5 times as large as the published version. The debug information was not stripped. Stripping the file completely reduced the difference in size to 48 bytes. The build id was different and the published version contained a .gnu_debuglink entry whereas the self-built file contained a .comment section. Apart from the header and the last 2k bytes the files were identical.
A cherished characteristic of computers is their deterministic behaviour: software gives the same result for the same input. This makes it possible, in theory, to build binary packages from source packages that are bit for bit identical to the published binary packages. In practice, however, building a binary package results in a different file each time. This is mostly due to timestamps stored in the builds. In packages built on OpenSUSE and Fedora, differences are seen that are harder to explain. They may be due to any number of differences in the build environment. If these can be eliminated, the builds will be more predictable. Binary packages would need to contain a description of the environment in which they were built.
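What such a description of the build environment might contain can be sketched as follows. The function name describe_build_env and the key names are invented for this example. They do not follow any existing package format, and a real description would also have to list the exact versions of all installed build dependencies.

```shell
#!/bin/sh
# describe_build_env  (hypothetical helper)
# Prints a few properties of the environment that can influence a build,
# one key=value pair per line.
describe_build_env() {
    echo "build_path=$(pwd)"
    echo "kernel=$(uname -sr)"
    echo "machine=$(uname -m)"
    echo "lang=${LANG:-unset}"
    echo "tz=${TZ:-unset}"
    echo "umask=$(umask)"
    if command -v gcc >/dev/null 2>&1; then
        echo "compiler=$(gcc --version | head -n 1)"
    else
        echo "compiler=not found"
    fi
}

describe_build_env
```

The build path is listed first because, in the Debian test above, it was the build path that changed the build id of the executable.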
Compiling software is resource intensive and it is valuable to have someone compile software for you. Unless it is possible to verify that compiled software corresponds to the source code it claims to correspond to, one has to trust the service that compiles the software. Based on a test with a simple package, tar, there is hope that with relatively minor changes to the build tools it is possible to make bit-perfect builds.