- There are several design levels at which a program can be optimized, with varying payoffs:
- System structures: The architecture of the software. How do modules communicate with each other, etc.
- Intramodular structure: choice of algorithms, data structures, etc. Large speedups can often be had at this level.
- Writing efficient code: Usually gives about 5x to 10x improvement.
- Assembly: In general, good compilers produce as good assembly as people do, but specific cases can be improved.
- System calls: Probably not changeable by the programmer.
- Hardware: Go to more specialized/expensive/etc. hardware if necessary.

- Trading space for time:

- Data structure augmentation: “The time required for common operations on data can often be reduced by augmenting the structure with extra information or by changing the information within the structure so that it can be accessed more easily.” (e.g. reference counters, hints [if the hint applies, use a quick algorithm that relies on it; otherwise revert to the slow algorithm])
- Store precomputed results: take the computation out of the loop, store it in a table (e.g. fast sin() table, or a character mapping, cache precomputed results [i.e. when computing Fibonacci(5), store Fib(0), Fib(1), ... Fib(4)])
- Caching: “Data that is accessed most often should be the cheapest to access.”
- Lazy evaluation: Don’t evaluate something until it is necessary. You could make a fast sin() function that on startup computes all the possible sine values and stores them. This would be a waste of time if sin() isn’t called very frequently. (In this case, it would be better to hardcode them and use rule 2)
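As a concrete sketch of storing precomputed results (rule 2), here is a hypothetical memoized Fibonacci in C; the table name and bound are assumptions for illustration, not from the book. It also shows lazy evaluation: entries are filled only when first needed.

```c
/* Store precomputed results: every fib(k) computed on the way to
   fib(n) is kept in a table, so later calls are table lookups. */
#define FIB_MAX 90
static unsigned long long fib_table[FIB_MAX + 1] = { 0, 1 };
static int fib_filled = 1;   /* highest index computed so far */

unsigned long long fib(int n)
{
    if (n < 0 || n > FIB_MAX)
        return 0;            /* outside the table's range */
    while (fib_filled < n) { /* fill lazily, once, up to n */
        fib_filled++;
        fib_table[fib_filled] =
            fib_table[fib_filled - 1] + fib_table[fib_filled - 2];
    }
    return fib_table[n];
}
```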

- Trading time for space:

- Packing: “Dense storage representations can decrease storage costs by increasing the time required to store and retrieve data” (e.g. bit vectors, a compressed filesystem, overlaying (e.g. C unions) etc.)
- Interpreters: “The space required to represent a program can often be decreased by the use of interpreters in which common sequences of operations are represented compactly.” Bentley gives the example of someone writing a terminal emulator. Space was at a premium but time wasn’t (people don’t interact frequently on a CPU time scale). He made it very small by building an interpreter for the emulator (i.e., an interpreter for the interpreter).
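A minimal packing sketch in C, using a bit vector so each boolean flag costs one bit instead of one word; all names (`bit_set`, `bit_test`, etc.) are made up for illustration:

```c
#include <limits.h>

/* Packing: n flags fit in n / (bits per word) words instead of n words.
   The price is a shift and a mask on every access. */
enum { NBITS = 1024 };
#define BITS_PER_WORD (sizeof(unsigned) * CHAR_BIT)
static unsigned bits[NBITS / (sizeof(unsigned) * CHAR_BIT) + 1];

#define WORD(i) ((i) / BITS_PER_WORD)
#define MASK(i) (1u << ((i) % BITS_PER_WORD))

void bit_set(int i)   { bits[WORD(i)] |= MASK(i); }
void bit_clear(int i) { bits[WORD(i)] &= ~MASK(i); }
int  bit_test(int i)  { return (bits[WORD(i)] & MASK(i)) != 0; }
```

This is the time-for-space trade in miniature: each test costs a divide, a modulo, and a mask rather than a plain load.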

- Of course, these are reversible, too: getting rid of those redundant speed-improving fields in rule 1 will shrink the data structure.
- Loop rules

- Move code out of the loops:

// Before: sqrt(Pi/2) is recomputed on every iteration
for (i = 0; i < n; ++i)
    x[i] = x[i] + sqrt(Pi/2);

// After: compute sqrt(Pi/2) once, before the loop
t = sqrt(Pi/2);
for (i = 0; i < n; ++i)
    x[i] = x[i] + t;

- Combining tests: “An efficient inner loop should contain as few tests as possible, and preferably only one. The programmer should therefore try to simulate some of the exit conditions of the loop by other exit conditions”

// Before: two tests per iteration
i = 1;
while (i <= n && x[i] != t)
    i++;
if (i <= n)
    ... // found
else
    ... // not found

// After: a sentinel at x[n+1] guarantees the scan stops,
// so one test does both jobs
i = 1;
x[n+1] = t;
while (x[i] != t)
    i++;
if (i <= n)
    ... // found
else
    ... // not found

- Loop unrolling: “A large cost of some short loops is in modifying the loop indices. That cost can often be reduced by unrolling the loop.” E.g., do for i = 0 to N/5 { A; A; A; A; A; } instead of for i = 0 to N { A; }.
- Transfer-driven loop unrolling: “If a large cost of an inner loop is devoted to trivial assignments, then those assignments can often be removed by repeating the code and changing the use of variables. Specifically, to remove the assignment i = j, the subsequent code must treat j as though it were i.”
- Unconditional branch removal: “A fast loop should contain no unconditional branches. An unconditional branch at the end of the loop can be removed by “rotating” the loop to have a conditional branch at the bottom” (this is most applicable in low-level languages)
- Loop fusion: “If two nearby loops operate on the same set of elements, then combine their operational parts and use only one set of loop control operations.”
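The unrolling rule can be sketched in C as a four-way unrolled array sum (the function name and the factor of four are arbitrary choices, not from the book):

```c
/* Loop unrolling: process four elements per iteration, cutting the
   index-update and loop-test overhead; a cleanup loop handles the
   leftover 0-3 elements. */
double sum_unrolled(const double *x, int n)
{
    double s = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4)   /* main unrolled body */
        s += x[i] + x[i + 1] + x[i + 2] + x[i + 3];
    for (; i < n; i++)          /* remainder */
        s += x[i];
    return s;
}
```

Modern compilers often do this themselves, which is in keeping with the note under the assembly rule: measure before unrolling by hand.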

- Logic rules

- Exploit algebraic identities
- Short-circuiting monotonic functions: “If we wish to test whether some [monotonically increasing] function of several variables is over a certain threshold, then we need not evaluate any of the variables once the threshold has been reached.”
- Reordering tests: “Logical tests should be arranged such that the inexpensive and often successful tests precede expensive and rarely successful tests.”
- Precompute logical functions: “A logical function over a small finite domain can be replaced by a lookup table that represents the functions.” (e.g. a character type function)
- Boolean variable elimination:

// Before: S1 runs, then the test chooses a branch
S1;
if (test)
    S2;
else
    S3;

// After: S1 is copied into each branch
if (test)
    { S1; S2; }
else
    { S1; S3; }
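The “precompute logical functions” rule can be sketched as a character-class lookup table in C; the table name, initializer, and macro are illustrative, not the book's:

```c
/* Precompute a logical function: a 256-entry table answers
   "is this character a decimal digit?" with one array index,
   replacing a pair of comparisons on every call. */
static char is_digit_table[256];

void init_digit_table(void)
{
    int c;
    for (c = '0'; c <= '9'; c++)
        is_digit_table[c] = 1;
}

#define IS_DIGIT(c) (is_digit_table[(unsigned char)(c)])
```

This is exactly what `<ctype.h>` implementations have traditionally done for `isdigit()` and friends.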

- Procedure rules

- Collapsing procedure hierarchies: “The run times of the elements of a set of procedures that (nonrecursively) call themselves can often be reduced by rewriting procedures in line and binding the passed variables.”
- Exploit common cases: “Procedures should be organized to handle all cases correctly and common cases efficiently”
- Transformations on recursive procedures: tail recursion can often be replaced with a loop. “If the procedure contains only one recursive call on itself, then it is not necessary to store the return address on the stack ... It is necessary, though, to keep track of the call depth in some other way.” “It is often more efficient to solve small subproblems by use of an auxiliary procedure, rather than by recurring down to problems of size zero or one.”
- Parallelism: “A program should be structured to exploit as much of the parallelism as possible in the underlying hardware.” (In modern hardware, making sure that arrays are organized to efficiently exploit the cache is helpful)
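A sketch of the tail-recursion transformation, using gcd as a stand-in example (the example is mine, not Bentley's):

```c
/* Tail-recursive gcd: the recursive call is the last action,
   so no state needs to survive it. */
unsigned gcd_rec(unsigned a, unsigned b)
{
    return b == 0 ? a : gcd_rec(b, a % b);
}

/* The same computation with the tail call turned into a loop:
   no stack frames, identical results. */
unsigned gcd_loop(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}
```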

- Expression rules

- Compile-time initialization (a.k.a. constant propagation): (i.e., C++ global const variables) This tries to let the compiler replace expressions with a constant.
- Exploit algebraic identities
- Common subexpression elimination: “If the same expression is evaluated twice with none of its variables altered between evaluations, then the second evaluation can be avoided by storing the result of the first and using that in place of the second.”
- Pairing computation: “If two similar expressions are frequently evaluated together, then we should make a new procedure that evaluates them as a pair.” He gives an example that sine and cosine are expensive to compute separately, but that if you are already calculating one, the other can be calculated for a little extra cost.
- Exploit word parallelism: “Use the full word width of the underlying computer architecture to evaluate expensive expressions” (i.e. packing)
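Common subexpression elimination in miniature, on a made-up expression (both functions are illustrative; compilers frequently do this automatically):

```c
/* Before: x*x + 1 is evaluated three times. */
double poly_naive(double x)
{
    return (x*x + 1) * (x*x + 1) + (x*x + 1);
}

/* After: the common subexpression is computed once and reused. */
double poly_cse(double x)
{
    double t = x*x + 1;   /* evaluated once */
    return t*t + t;
}
```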

- Applying the techniques:

- Code simplification: “Most fast programs are simple. Therefore, keep code simple to make it faster.”
- Problem simplification: “To increase the efficiency of a program, simplify the problem it solves.”
- Relentless suspicion: “Question the necessity of each instruction in a time-critical piece of code and each field in a space-critical data structure.”
- Early binding: “Move work forward in time. Specifically do work now just once in hope of avoiding doing it many times later.”

Review: 8

Well written but a little on the simplistic side. The rules are useful, but for anyone who understands what the rules mean, the explanations are probably too simplistic to help much.