Rob's notes on programming busybox.

What are the goals of busybox?

Busybox aims to be the smallest and simplest correct implementation of the standard Linux command line tools. First and foremost, this means the smallest executable size we can manage. We also want to have the simplest and cleanest implementation we can manage, be standards compliant, minimize run-time memory usage (heap and stack), run fast, and take over the world.

What is the design of busybox?

Busybox is like a swiss army knife: one thing with many functions. The busybox executable can act like many different programs depending on the name used to invoke it. Normal practice is to create a bunch of symlinks pointing to the busybox binary, each of which triggers a different busybox function. (See getting started in the FAQ for more information on usage, and the busybox documentation for a list of symlink names and what they do.)

The "one binary to rule them all" approach is primarily for size reasons: a single multi-purpose executable is smaller then many small files could be. This way busybox only has one set of ELF headers, it can easily share code between different apps even when statically linked, it has better packing efficiency by avoding gaps between files or compression dictionary resets, and so on.

Work is underway on new options such as "make standalone" to build separate binaries for each applet, and a "libbb.so" to make the busybox common code available as a shared library. Neither is ready yet at the time of this writing.

The applet directories

The directory "applets" contains the busybox startup code (applets.c and busybox.c), and several subdirectories containing the code for the individual applets.

Busybox execution starts with the main() function in applets/busybox.c, which sets the global variable bb_applet_name to argv[0] and calls run_applet_by_name() in applets/applets.c. That uses the applets[] array (defined in include/busybox.h and filled out in include/applets.h) to transfer control to the appropriate APPLET_main() function (such as cat_main() or sed_main()). The individual applet takes it from there.

This is why calling busybox under a different name triggers different functionality: main() looks up argv[0] in applets[] to get a function pointer to APPLET_main().

Busybox applets may also be invoked through the multiplexor applet "busybox" (see busybox_main() in applets/busybox.c), and through the standalone shell (grep for STANDALONE_SHELL in applets/shell/*.c). See getting started in the FAQ for more information on these alternate usage mechanisms, which are just different ways to reach the relevant APPLET_main() function.

The applet subdirectories (archival, console-tools, coreutils, debianutils, e2fsprogs, editors, findutils, init, loginutils, miscutils, modutils, networking, procps, shell, sysklogd, and util-linux) correspond to the configuration sub-menus in menuconfig. Each subdirectory contains the code to implement the applets in that sub-menu, as well as a Config.in file defining that configuration sub-menu (with dependencies and help text for each applet), and the makefile segment (Makefile.in) for that subdirectory.

The run-time --help is stored in usage_messages[], which is initialized at the start of applets/applets.c and gets its help text from usage.h. During the build this help text is also used to generate the BusyBox documentation (in html, txt, and man page formats) in the docs directory. See adding an applet to busybox for more information.

libbb

Most non-setup code shared between busybox applets lives in the libbb directory. It's a mess that evolved over the years without much auditing or cleanup. For anybody looking for a great project to break into busybox development with, documenting libbb would be both incredibly useful and good experience.

Common themes in libbb include allocation functions that test for failure and abort the program with an error message so the caller doesn't have to test the return value (xmalloc(), xstrdup(), etc), wrapped versions of open(), close(), read(), and write() that test for their own failures and/or retry automatically, linked list management functions (llist.c), command line argument parsing (getopt_ulflags.c), and a whole lot more.

Adding an applet to busybox

To add a new applet to busybox, first pick a name for the applet and a corresponding CONFIG_NAME. Then do this:

What standards does busybox adhere to?

The standard we're paying attention to is the "Shell and Utilities" portion of the Open Group Base Standards (also known as the Single Unix Specification version 3 or SUSv3). Note that paying attention isn't necessarily the same thing as following it.

SUSv3 doesn't even mention things like init, mount, tar, or losetup, nor commonly used options like echo's '-e' and '-n', or sed's '-i'. Busybox is driven by what real users actually need, not the fact the standard believes we should implement ed or sccs. For size reasons, we're unlikely to include much internationalization support beyond UTF-8, and on top of all that, our configuration menu lets developers chop out features to produce smaller but very non-standard utilities.

Also, Busybox is aimed primarily at Linux. Unix standards are interesting because Linux tries to adhere to them, but portability to dozens of platforms is only interesting in terms of offering a restricted feature set that works everywhere, not growing dozens of platform-specific extensions. Busybox should be portable to all hardware platforms Linux supports, and any other similar operating systems that are easy to do and won't require much maintenance.

In practice, standards compliance tends to be a clean-up step once an applet is otherwise finished. When polishing and testing a busybox applet, we ensure we have at least the option of full standards compliance, or else document where we (intentionally) fall short.

Programming tips and tricks.

Various things busybox uses that aren't particularly well documented elsewhere.

Encrypted Passwords

Password fields in /etc/passwd and /etc/shadow are in a special format. If the first character isn't '$', then it's an old DES style password. If the first character is '$' then the password is actually three fields separated by '$' characters:

  $type$salt$encrypted_password

The "type" indicates which encryption algorithm to use: 1 for MD5 and 2 for SHA1.

The "salt" is a bunch of ramdom characters (generally 8) the encryption algorithm uses to perturb the password in a known and reproducible way (such as by appending the random data to the unencrypted password, or combining them with exclusive or). Salt is randomly generated when setting a password, and then the same salt value is re-used when checking the password. (Salt is thus stored unencrypted.)

The advantage of using salt is that the same cleartext password encrypted with a different salt value produces a different encrypted value. If each encrypted password uses a different salt value, an attacker is forced to do the cryptographic math all over again for each password they want to check. Without salt, they could simply produce a big dictionary of commonly used passwords ahead of time, and look up each password in a stolen password file to see if it's a known value. (Even if there are billions of possible passwords in the dictionary, checking each one is just a binary search against a file only a few gigabytes long.) With salt they can't even tell if two different users share the same password without guessing what that password is and decrypting it. They also can't precompute the attack dictionary for a specific password until they know what the salt value is.

The third field is the encrypted password (plus the salt). For md5 this is 22 bytes.

The busybox function to handle all this is pw_encrypt(clear, salt) in "libbb/pw_encrypt.c". The first argument is the clear text password to be encrypted, and the second is a string in "$type$salt$password" format, from which the "type" and "salt" fields will be extracted to produce an encrypted value. (Only the first two fields are needed, the third $ is equivalent to the end of the string.) The return value is an encrypted password in /etc/passwd format, with all three $ separated fields. It's stored in a static buffer, 128 bytes long.

So when checking an existing password, if pw_encrypt(text, old_encrypted_password) returns a string that compares identical to old_encrypted_password, you've got the right password. When setting a new password, generate a random 8 character salt string, put it in the right format with sprintf(buffer, "$%c$%s", type, salt), and feed buffer as the second argument to pw_encrypt(text,buffer).

Fork and vfork

On systems that haven't got a Memory Management Unit, fork() is unreasonably expensive to implement (and sometimes even impossible), so a less capable function called vfork() is used instead. (Using vfork() on a system with an MMU is like pounding a nail with a wrench. Not the best tool for the job, but it works.)

Busybox hides the difference between fork() and vfork() in libbb/bb_fork_exec.c. If you ever want to fork and exec, use bb_fork_exec() (which returns a pid and takes the same arguments as execve(), although in this case envp can be NULL) and don't worry about it. This description is here in case you want to know why that does what it does.

Implementing fork() depends on having a Memory Management Unit. With an MMU then you can simply set up a second set of page tables and share the physical memory via copy-on-write. So a fork() followed quickly by exec() only copies a few pages of the parent's memory, just the ones it changes before freeing them.

With a very primitive MMU (using a base pointer plus length instead of page tables, which can provide virtual addresses and protect processes from each other, but no copy on write) you can still implement fork. But it's unreasonably expensive, because you have to copy all the parent process' memory into the new process (which could easily be several megabytes per fork). And you have to do this even though that memory gets freed again as soon as the exec happens. (This is not just slow and a waste of space but causes memory usage spikes that can easily cause the system to run out of memory.)

Without even a primitive MMU, you have no virtual addresses. Every process can reach out and touch any other process' memory, because all pointers are to physical addresses with no protection. Even if you copy a process' memory to new physical addresses, all of its pointers point to the old objects in the old process. (Searching through the new copy's memory for pointers and redirect them to the new locations is not an easy problem.)

So with a primitive or missing MMU, fork() is just not a good idea.

In theory, vfork() is just a fork() that writeably shares the heap and stack rather than copying it (so what one process writes the other one sees). In practice, vfork() has to suspend the parent process until the child does exec, at which point the parent wakes up and resumes by returning from the call to vfork(). All modern kernel/libc combinations implement vfork() to put the parent to sleep until the child does its exec. There's just no other way to make it work: the parent has to know the child has done its exec() or exit() before it's safe to return from the function it's in, so it has to block until that happens. In fact without suspending the parent there's no way to even store separate copies of the return value (the pid) from the vfork() call itself: both assignments write into the same memory location.

One way to understand (and in fact implement) vfork() is this: imagine the parent does a setjmp and then continues on (pretending to be the child) until the exec() comes around, then the _exec_ does the actual fork, and the parent does a longjmp back to the original vfork call and continues on from there. (It thus becomes obvious why the child can't return, or modify local variables it doesn't want the parent to see changed when it resumes.)

Note a common mistake: the need for vfork doesn't mean you can't have two processes running at the same time. It means you can't have two processes sharing the same memory without stomping all over each other. As soon as the child calls exec(), the parent resumes.

If the child's attempt to call exec() fails, the child should call _exit() rather than a normal exit(). This avoids any atexit() code that might confuse the parent. (The parent should never call _exit(), only a vforked child that failed to exec.)

(Now in theory, a nommu system could just copy the _stack_ when it forks (which presumably is much shorter than the heap), and leave the heap shared. Even with no MMU at all In practice, you've just wound up in a multi-threaded situation and you can't do a malloc() or free() on your heap without freeing the other process' memory (and if you don't have the proper locking for being threaded, corrupting the heap if both of you try to do it at the same time and wind up stomping on each other while traversing the free memory lists). The thing about vfork is that it's a big red flag warning "there be dragons here" rather than something subtle and thus even more dangerous.)

Short reads and writes

Busybox has special functions, bb_full_read() and bb_full_write(), to check that all the data we asked for got read or written. Is this a real world consideration? Try the following:

while true; do echo hello; sleep 1; done | tee out.txt

If tee is implemented with bb_full_read(), tee doesn't display output in real time but blocks until its entire input buffer (generally a couple kilobytes) is read, then displays it all at once. In that case, we _want_ the short read, for user interface reasons. (Note that read() should never return 0 unless it has hit the end of input, and an attempt to write 0 bytes should be ignored by the OS.)

As for short writes, play around with two processes piping data to each other on the command line (cat bigfile | gzip > out.gz) and suspend and resume a few times (ctrl-z to suspend, "fg" to resume). The writer can experience short writes, which are especially dangerous because if you don't notice them you'll discard data. They can also happen when a system is under load and a fast process is piping to a slower one. (Such as an xterm waiting on x11 when the scheduler decides X is being a CPU hog with all that text console scrolling...)

So will data always be read from the far end of a pipe at the same chunk sizes it was written in? Nope. Don't rely on that. For one counterexample, see rfc 896 for Nagle's algorithm, which waits a fraction of a second or so before sending out small amounts of data through a TCP/IP connection in case more data comes in that can be merged into the same packet. (In case you were wondering why action games that use TCP/IP set TCP_NODELAY to lower the latency on their their sockets, now you know.)

Memory used by relocatable code, PIC, and static linking.

The downside of standard dynamic linking is that it results in self-modifying code. Although each executable's pages are mmaped() into a process' address space from the executable file and are thus naturally shared between processes out of the page cache, the library loader (ld-linux.so.2 or ld-uClibc.so.0) writes to these pages to supply addresses for relocatable symbols. This dirties the pages, triggering copy-on-write allocation of new memory for each processes' dirtied pages.

One solution to this is Position Independent Code (PIC), a way of linking a file so all the relocations are grouped together. This dirties fewer pages (often just a single page) for each process' relocations. The down side is this results in larger executables, which take up more space on disk (and a correspondingly larger space in memory). But when many copies of the same program are running, PIC dynamic linking trades a larger disk footprint for a smaller memory footprint, by sharing more pages.

A third solution is static linking. A statically linked program has no relocations, and thus the entire executable is shared between all running instances. This tends to have a significantly larger disk footprint, but on a system with only one or two executables, shared libraries aren't much of a win anyway.

You can tell the glibc linker to display debugging information about its relocations with the environment variable "LD_DEBUG". Try "LD_DEBUG=help /bin/true" for a list of commands. Learning to interpret "LD_DEBUG=statistics cat /proc/self/statm" could be interesting.

Who are the BusyBox developers?

The following login accounts currently exist on busybox.net. (I.E. these people can commit patches into subversion for the BusyBox, uClibc, and buildroot projects.)

aldot     :Bernhard Fischer
andersen  :Erik Andersen      <- uClibc and BuildRoot maintainer.
bug1      :Glenn McGrath
davidm    :David McCullough
gkajmowi  :Garrett Kajmowicz  <- uClibc++ maintainer
jbglaw    :Jan-Benedict Glaw
jocke     :Joakim Tjernlund
landley   :Rob Landley        <- BusyBox maintainer
lethal    :Paul Mundt
mjn3      :Manuel Novoa III
osuadmin  :osuadmin
pgf       :Paul Fox
pkj       :Peter Kjellerstedt
prpplague :David Anders
psm       :Peter S. Mazinger
russ      :Russ Dill
sandman   :Robert Griebl
sjhill    :Steven J. Hill
solar     :Ned Ludd
timr      :Tim Riker
tobiasa   :Tobias Anderberg
vapier    :Mike Frysinger

The following accounts used to exist on busybox.net, but don't anymore so I can't ask /etc/passwd for their names. (If anybody would like to make a stab at it...)

aaronl
beppu
dwhedon
erik    : Also Erik Andersen?
gfeldman
jimg
kraai
markw
miles
proski
rjune
tausq
vodz      :Vladimir N. Oleynik