## CSx35 Computer Architecture Sheet 4

**1.** Computationally efficient approximations for the magnitude function $\sqrt{a^2+b^2}$ are presented in Table 1.

(a) Show that approximation 2 remains within $\pm 3\%$ of the correct result for any values of $a$ and $b$.

(b) Give three alternative architectures that implement the algorithm and compare them in terms of datapath resources, cycles per data item, longest path, and control overhead. Assume input data remain valid as long as you need them, but plan for a registered output. Begin by drawing the DDG.

| Name | aka | Formula |
|-----------------------|------------------------|--------------------------------------------------------|
| lesser | $\ell^{-\infty}$-norm | $l = \min(\lvert a\rvert, \lvert b\rvert)$ |
| sum | $\ell^1$-norm | $s = \lvert a\rvert + \lvert b\rvert$ |
| magnitude (reference) | $\ell^2$-norm | $m = \sqrt{a^2 + b^2}$ |
| greater | $\ell^{\infty}$-norm | $g = \max(\lvert a\rvert, \lvert b\rvert)$ |
| approximation 1 | | $m \approx m_1 = \frac{3}{8}s + \frac{5}{8}g$ |
| approximation 2 [35] | | $m \approx m_2 = \max(g, \frac{7}{8}g + \frac{1}{2}l)$ |

Table 1: Approximations for computing magnitudes

**2.** Reconsider the third-order correlator of Figure 1.a.

(a) To boost performance, try to retime and pipeline the isomorphic architecture without prior reversal of the adder chain. How does the circuit so obtained compare with Figure 1.d? Give estimates for datapath resources, cycles per data item, longest path, latency, and control overhead.

(b) Next, assume your prime concern is area occupation. What architectures qualify?

Figure 1: Nonlinear time-invariant third-order correlator. Original DDG (a), with adder chain reversed by associativity transform (b), after retiming (c), with pipelining added on top so as to obtain a systolic architecture (d).

**3.** Figure 2 shows a viable architecture for a transversal filter.
Before this architecture can be coded using an HDL, one must work out the missing details about clocking, register clear, register enable, and multiplexed control signals. Establish a schedule that lists, clock cycle by clock cycle, what data items the various computational units are supposed to work on, what data items or states the various registers are supposed to hold, and what logic values the various control signals must assume to marshal the interplay of all those hardware items. Samples are to be processed as specified by Fig. 2.11a.

Figure 2: Third-order transversal filter. Isomorphic architecture (a) and a more economic alternative obtained by combining time-sharing with iterative decomposition (b) (simplified).

**4.** Arithmetic mean $\overline{x}$ and standard deviation $\sigma$ are defined as

$$\overline{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$$

$$\sigma^2 = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \overline{x})^2$$

Assume samples $x_n$ arrive sequentially, one at a time. More specifically, each clock cycle sees a new $w$-bit data item appear. Find a dedicated architecture that computes $\overline{x}$ and $\sigma^2$ after $N$ clock cycles, where $N$ is some integer power of two, say 32.

The definitions in the above equations suggest one needs to store up to $N - 1$ past values of $x$. Can you make do with less? What mathematical properties do you call on? What is the impact on datapath word width?

This is actually an old problem, the solution of which was made popular by early scientific pocket calculators such as the HP-45. Yet it nicely shows the difference between a crude and a more elaborate way of organizing a computation.
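
As a hedged illustration for Problem 4, the behavioral model below sketches one possible way to avoid storing past samples. It is not a definitive solution. It relies on the identity $\sum_{n=1}^{N}(x_n - \overline{x})^2 = \sum_{n=1}^{N} x_n^2 - \frac{1}{N}\left(\sum_{n=1}^{N} x_n\right)^2$, so that only two accumulators are needed. The names `running_stats`, `acc_sum`, and `acc_sq` are hypothetical, and the samples are assumed to be unsigned $w$-bit integers.

```python
from fractions import Fraction


def running_stats(samples, w):
    """Model two accumulators updated once per clock cycle.

    Assumptions (not from the problem text): samples are unsigned
    w-bit integers and N = len(samples) is a power of two.
    Returns (mean, variance) as exact fractions.
    """
    n = len(samples)
    assert n >= 2 and n & (n - 1) == 0, "N is assumed to be a power of two"
    log2_n = n.bit_length() - 1

    acc_sum = 0  # running sum of x_n
    acc_sq = 0   # running sum of x_n squared
    for x in samples:  # one iteration corresponds to one clock cycle
        assert 0 <= x < (1 << w), "sample does not fit in w bits"
        acc_sum += x
        acc_sq += x * x

    # Word-width impact: the accumulators must be wider than the input.
    assert acc_sum < (1 << (w + log2_n))      # w + log2(N) bits suffice
    assert acc_sq < (1 << (2 * w + log2_n))   # 2w + log2(N) bits suffice

    # Dividing by N is a mere shift because N is a power of two.
    mean = Fraction(acc_sum, n)
    # sigma^2 = (sum(x^2) - sum(x)^2 / N) / (N - 1), kept exact here.
    variance = Fraction(n * acc_sq - acc_sum * acc_sum, n * (n - 1))
    return mean, variance


if __name__ == "__main__":
    mean, variance = running_stats(list(range(32)), w=8)
    print(mean, variance)
```

For the 32 samples 0, 1, ..., 31 the model yields a mean of 31/2 and a variance of 88. Note that the division by $N - 1$ is not a shift, so a hardware version would need a divider or a constant multiplier at the output, which this sketch does not model.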