From b29ceb8bd0f99134fe215eebc531dbcd7717e8ae Mon Sep 17 00:00:00 2001 From: Rob Landley Date: Thu, 9 Nov 2006 19:19:37 -0500 Subject: Web site updates, and a design document. --- www/design.html | 226 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 226 insertions(+) create mode 100644 www/design.html (limited to 'www/design.html') diff --git a/www/design.html b/www/design.html new file mode 100644 index 00000000..458d12f7 --- /dev/null +++ b/www/design.html @@ -0,0 +1,226 @@ +

Design goals

+ +

Toybox should be simple, small, and fast. Often, these things need to be +balanced off against each other. In general, simple is slightly more +important than small, and small is slightly more important than fast, but +it should be possible to get 80% of the way to each goal before they really +start to fight.

+ +

Fast

+ +

It's easy to say lots about optimizing for speed (which is why this section +is so long), but at the same time it's the one we care the least about. +The essence of speed is being as efficient as possible, which means doing as +little work as possible. A design that's small and simple gets you 90% of the +way there, and most of the rest is either fine-tuning or more trouble than +it's worth (and often actually counterproductive). Still, here's some +advice:

+ +

First, understand the darn problem you're trying to solve. You'd think +I wouldn't have to say this, but I do. Trying to find a faster sorting +algorithm is no substitute for figuring out a way to skip the sorting step +entirely. The fastest way to do anything is not to have to do it at all, +and _all_ optimization boils down to avoiding unnecessary work.

+ +

Speed is easy to measure; there are dozens of profiling tools for Linux +(although personally I find the "time" command a good starting place). +Don't waste too much time trying to optimize something you can't measure, +and there's no much point speeding up things you don't spend much time doing +anyway.

+ +

Understand the difference between throughput and latency. Faster +processors improve throughput, but don't always do much for latency. +After 30 years of Moore's Law, most of the remaining problems are latency, +not throughput. (There are of course a few exceptions, like data compression +code, encryption, rsync...) Worry about throughput inside long-running +loops, and worry about latency everywhere else. (And don't worry too much +about avoiding system calls or function calls or anything else in the name +of speed unless you are in the middle of a tight loop that's you've already +proven isn't running fast enough.)

+ +

"Locality of reference" is generally nice, in all sorts of contexts. +It's obvious that waiting for disk access is 1000x slower than doing stuff in +RAM (and making the disk seek is 10x slower than sequential reads/writes), +but it's just as true that a loop which stays in L1 cache is many times faster +than a loop that has to wait for a DRAM fetch on each iteration. Don't worry +about whether "&" is faster than "%" until your executable loop stays in L1 +cache and the data access is fetching cache lines intelligently. (To +understand DRAM, L1, and L2 cache, read Hannibal's marvelous ram guid at Ars +Technica: +part one, +part two, +part three, +plus this +article on +cacheing, and this one on +bandwidth +and latency. +And there's more where that came from.) +Running out of L1 cache can execute one instruction per clock cycle, going +to L2 cache costs a dozen or so clock cycles, and waiting for a worst case dram +fetch (round trip latency with a bank switch) can cost thousands of +clock cycles. (Historically, this disparity has gotten worse with time, +just like the speed hit for swapping to disk. These days, a _big_ L1 cache +is 128k and a big L2 cache is a couple of megabytes. A cheap low-power +embedded processor may have 8k of L1 cache and no L2.)

+ +

Learn how virtual memory and memory managment units work. Don't touch +memory you don't have to. Even just reading memory evicts stuff from L1 and L2 +cache, which may have to be read back in later. Writing memory can force the +operating system to break copy-on-write, which allocates more memory. (The +memory returned by malloc() is only a virtual allocation, filled with lots of +copy-on-write mappings of the zero page. Actual physical pages get allocated +when the copy-on-write gets broken by writing to the virtual page. This +is why checking the return value of malloc() isn't very useful anymore, it +only detects running out of virtual memory, not physical memory.)

+ +

Don't think that just because you don't have a swap file the system can't +start swap thrashing: any file backed page (ala mmap) can be evicted, and +there's a reason all running programs require an executable file (they're +mmaped, and can be flushed back to disk when memory is short). And long +before that, disk cache gets reclaimed and has to be read back in. When the +operating system really can't free up any more pages it triggers the out of +memory killer to free up pages by killing processes (the alternative is the +entire OS freezing solid). Modern operating systems seldom run out of +memory gracefully.

+ +

Also, it's better to be simple than clever. Many people think that mmap() +is faster than read() because it avoids a copy, but twiddling with the memory +management is itself slow, and can cause unnecessary CPU cache flushes. And +if a read faults in dozens of pages sequentially, but your mmap iterates +backwards through a file (causing lots of seeks, each of which your program +blocks waiting for), the read can be many times faster. On the other hand, the +mmap can sometimes use less memory, since the memory provided by mmap +comes from the page cache (allocated anyway), and it can be faster if you're +doing a lot of different updates to the same area. The moral? Measure, then +try to speed things up, and measure again to confirm it actually _did_ speed +things up rather than made them worse. (And understanding what's really going +on underneath is a big help to making it happen faster.)

+ +

In general, being simple is better than being clever. Optimization +strategies change with time. For example, decades ago precalculating a table +of results (for things like isdigit() or cosine(int degrees)) was clearly +faster because processors were so slow. Then processors got faster and grew +math coprocessors, and calculating the value each time became faster than +the table lookup (because the calculation fit in L1 cache but the lookup +had to go out to DRAM). Then cache sizes got bigger (the Pentium M has +2 megabytes of L2 cache) and the table fit in cache, so the table became +fast again... Predicting how changes in hardware will affect your algorithm +is difficult, and using ten year old optimization advice and produce +laughably bad results. But being simple and efficient is always going to +give at least a reasonable result.

+ +

The famous quote from Ken Thompson, "When in doubt, use brute force", +applies to toybox. Do the simple thing first, do as little of it as possible, +and make sure it's right. You can always speed it up later.

+ +

Small

+

Again, simple gives you most of this. An algorithm that does less work +is generally smaller. Understand the problem, treat size as a cost, and +get a good bang for the byte.

+ +

Understand the difference between binary size, heap size, and stack size. +Your binary is the executable file on disk, your heap is where malloc() memory +lives, and your stack is where local variables (and function call return +addresses) live. Optimizing for binary size is generally good: executing +fewer instructions makes your program run faster (and fits more of it in +cache). On embedded systems, binary size is especially precious because +flash is expensive (and its successor, MRAM, even more so). Small stack size +is important for nommu systems because they have to preallocate their stack +and can't make it bigger via page fault. And everybody likes a small heap.

+ +

Measure the right things. Especially with modern optimizers, expecting +something to be smaller is no guarantee it will be after the compiler's done +with it. Binary size isn't the most accurate indicator of the impact of a +given change, because lots of things get combined and rounded during +compilation and linking. Matt Mackall's bloat-o-meter is a python script +which compares two versions of a program, and shows size changes in each +symbol (using the "nm" command behind the scenes). To use this, run +"make baseline" to build a baseline version to compare against, and +then "make bloatometer" to compare that baseline version against the current +code.

+ +

Avoid special cases. Whenever you see similar chunks of code in more than +one place, it might be possible to combine them and have the users call shared +code. (This is the most commonly cited trick, which doesn't make it easy.)

+ +

Some specific advice: Using a char in place of an int when doing math +produces significantly larger code on some platforms (notably arm), +because each time the compiler has to emit code to convert it to int, do the +math, and convert it back. Bitfields have this problem on most platforms. +Because of this, using char to index a for() loop is probably not a net win, +although using char (or a bitfield) to store a value in a structure that's +repeated hundreds of times can be a good tradeoff of binary size for heap +space.

+ +

Simple

+ +

Complexity is a cost, just like code size or runtime speed. Treat it as +a cost, and spend your complexity budget wisely.

+ +

Simplicity has lots of benefits. Simple code is easy to maintain, easy to +port to new processors, easy to audit for security holes, and easy to +understand. (Comments help, but they're no substitute for simple code.)

+ +

Joel +Spolsky argues against throwing code out and starting over, and he has +good points: an existing debugged codebase contains a huge amount of baked +in knowledge about strange real-world use cases that the designers didn't +know about until users hit the bugs, and most of this knowledge is never +explicitly stated anywhere except in the source code.

+ +

That said, the Mythical Man-Month's "build one to throw away" advice points +out that until you've solved the problem you don't properly understand it, and +about the time you finish your first version is when you've finally figured +out what you _should_ have done. (The corrolary is that if you build one +expecting to throw it away, you'll actually wind up throwing away two. You +don't understand the problem until you _have_ solved it.)

+ +

Joel is talking about what closed source software can afford to do: Code +that works and has been paid for is a corporate asset not lightly abandoned. +Open source software can afford to re-implement code that works, over and +over from scratch, for incremental gains. Before toybox, the unix command line +has already been reimplemented from scratch several in a row (the +original Unix and BSD tools, the GNU tools, BusyBox...) +but maybe toybox can do a better job. :)

+ +

P.S. How could I resist linking to an article about +why +programmers should strive to be lazy and dumb?

+ +

Portability issues

+ +

Platforms

+

Toybox should run on every hardware platform Linux runs on. Other +posix/susv3 environments (perhaps MacOS X or newlib+libgloss) are vaguely +interesting but only if they're easy to support, I'm not going to spend much +effort on them.

+ +

I don't do windows.

+ +

32/64 bit

+

Toybox should work on both 32 bit and 64 bit systems. By the end of 2008 +64 bit hardware will be the new desktop standard, but 32 bit hardware will +continue to be important in embedded devices for years to come.

+ +

Toybox relies on the fact that on any Unix-like platform, pointer and long +are always the same size (on both 32 and 64 bit). Pointer and int are _not_ +the same size on 64 bit systems, but pointer and long are.

+ +

This is guaranteed by the LP64 memory model, a Unix standard (which Linux +and MacOS X implements). See +the LP64 standard and +the LP64 +rationale for details.

+ +

Note that Windows doesn't work like this, and I don't care. +The +insane legacy reasons why this is broken on Windows are explained here.

+ +

Signedness of char

+

On platforms like x86, variables of type char default to unsigned. On +platforms like arm, char defaults to signed. This difference can lead to +subtle portability bugs, and to avoid them we specify which one we want by +feeding the compiler -funsigned-char.

+ +

The reason to pick "unsigned" is that way we're 8-bit clean by default.

-- cgit v1.2.3