CSE 30 -- Lecture 8 -- Oct 23


Assignment 4:

Write a MIPS assembly program to do base conversion. The program should loop, repeatedly asking for the input base (entered in base 10), an input value (entered in the input base), and an output base (entered in base 10), and then print the entered input value in the desired output base. The program terminates when any of the base values is less than 2, the input value is less than 0, or the input value contains a character outside of the range of the input base. You should use 0,1,...,9,A,B,...,Z for the ``numerals'', up to base 36; if the input base is 12, the valid numerals are 0,1,...,9,A,B.
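For instance, the numeral mapping described above can be sketched in C. This is a hypothetical helper for illustration only, not the provided sample program:

```c
#include <assert.h>

/* Map a character to its digit value, or return -1 if it is not a
 * valid numeral in the given base.  Numerals are 0-9 then A-Z, so
 * this covers bases up to 36; e.g. in base 12 only 0-9, A, B pass. */
int digit_value(char ch, int base)
{
	int v;

	if (ch >= '0' && ch <= '9')
		v = ch - '0';
	else if (ch >= 'A' && ch <= 'Z')
		v = ch - 'A' + 10;
	else
		return -1;
	return (v < base) ? v : -1;
}
```

A digit that maps to -1 is exactly the ``character outside of the range of the input base'' that should terminate the program.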

A C program that performs this base conversion is provided. You do not have to use it as the starting point (i.e., convert it into assembler), but your program should essentially behave the same way with respect to the I/O. Efficiency matters in this assignment -- while you don't have to shave every instruction that you can out of the code, you should not be doing anything really inefficient. (The original program allowed digits not in the base to be entered; this is something you should look for and disallow, according to the homework specification, so I've modified the C program to match.)

This assignment is due on Oct 30, before class. Use electronic handin.

Clarification: You are required to write a MIPS assembly language program which conforms to the given I/O specification. The sample C program satisfies the requirements, except for the fact that it is in C and not MIPS assembler. I make no requirements on the internal structure of your program, so you can code it in whatever way you like, with the following proviso: clarity and efficiency matter. Your code should not be spaghetti code unless you have a very good reason for it, and you must clearly comment the code. If a smart but unknowledgeable reader cannot figure out what your code does, points will be taken off. The hypothetical reader knows (superficially) the MIPS assembly language, but does not know what algorithm you are using.

The midterm date has been announced. It will be on Nov 4th, and on Oct 30 we will review the material during discussion section. Everything in lecture, in the Web-based ``handouts'', and in the readings from the textbook will be covered. The midterm will be open book.

I said in class that there are primarily two reasons most people write code in assembly:

  1. speed, and
  2. using special instructions / registers that the compiler won't let you use -- this is for low level operating system code, such as context switching, exception handling, and hardware register accesses in device drivers.
We're primarily covering the speed issue and hand-optimizing code thus far. Note that the best way to improve the speed of code is through algorithmic improvements: for example, a search through a sorted table can be improved from the naive linear search, which takes O(n) time, to a binary search, which takes O(log(n)) time. Recoding the same algorithm from C/C++ into assembly gets you a factor of 2 or so improvement in run time. When the dataset is not trivially small, the O(n) versus O(log(n)) gap will dominate that constant-factor speedup; this often means that you can use a high-level language implementation of the improved algorithm and still be fast enough, avoiding assembly language altogether (and thus your code will remain portable). Sometimes, of course, you'll still need to squeeze more speed out of the program, and after algorithmic improvements you'll have to recode in assembly to shave those last few cycles.
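To make the algorithmic gap concrete, here is a sketch of both searches in C. These are hypothetical helper functions for illustration, not code from the course materials:

```c
#include <assert.h>

/* O(n): scan every element until the key is found. */
int linear_search(const int *a, int n, int key)
{
	int i;

	for (i = 0; i < n; i++)
		if (a[i] == key)
			return i;
	return -1;
}

/* O(log n): halve the search range each step; a[] must be sorted. */
int binary_search(const int *a, int n, int key)
{
	int lo = 0, hi = n - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (a[mid] == key)
			return mid;
		if (a[mid] < key)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return -1;
}
```

For a table of a million entries, the linear scan examines up to 1,000,000 elements while the binary search examines about 20, a gap no constant-factor assembly speedup can close.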

The new material covered in this lecture was strength reduction. I showed an algebraic example, starting with this code:

int	iarr[4096];

void	fn(void)
{
	int	i;

	for (i = 0; i < 4096; i++) {
		iarr[i] = i * i;
	}
}
Note that the multiplication operation requires several clock cycles to complete. On the MIPS, it can take anywhere from 10 to 18 clock cycles. (Some processors take 5-7, others take a variable number that depends on the number being multiplied -- the 286/386 is variable.) See Kane, 3-10, table 3-7. So, our goal will be to get rid of the i * i operation. To do this, we note that (i+1)^2 = i^2 + 2i + 1, so we transform our code into:
int	iarr[4096];

void	fn(void)
{
	register int	i, j;

	for (i = j = 0; i < 4096; i++) {
		iarr[i] = j;
		j += (i << 1) + 1;
	}
}
The expression (i << 1) is an arithmetic shift of i to the left by 1 bit, effectively multiplying by 2.

To see that this is correct, we check the loop invariant: j = i^2, which holds when the test i < 4096 is performed.

The loop invariant proof works just like induction proofs in mathematics: the base case, when i = j = 0, is obviously true; the inductive step works by assuming that j = i^2 at the top of the loop body, tracking the changes made to all the values through the execution of the body (and the increment of the loop index), and verifying that the expression is still true when we reach the test again:

Assume j = i^2 at the test. We add primes to the new values of i and j to avoid confusion. So, we have j' = j + 2i + 1 and i' = i + 1 after going through the body of the loop. Well, we substitute i^2 for j, and get j' = i^2 + 2i + 1 = (i+1)^2 = i'^2, which is what we wanted to show.

A notationally cleaner way to see this is to be more careful about the variable names and their contents. So, we would name the contents: j contains x^2, and i contains x. The effect of the loop body is to change the contents of j to x^2 + 2x + 1, and to change the contents of i to x + 1. Since x^2 + 2x + 1 = (x+1)^2, the contents of i and j maintain the invariant j = i^2.
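The invariant can also be checked mechanically. This C sketch (a hypothetical test harness, not part of the assignment) runs the strength-reduced update and confirms that j = i^2 holds at every loop test:

```c
#include <assert.h>

/* Run the strength-reduced loop for n iterations and report whether
 * the invariant j == i*i held at every loop test.  The body replaces
 * the multiply with the cheap update j += 2i + 1. */
int invariant_holds(int n)
{
	int i, j;

	for (i = j = 0; i < n; i++) {
		if (j != i * i)		/* invariant check at the test */
			return 0;
		j += (i << 1) + 1;	/* j becomes (i+1)^2 */
	}
	return 1;
}
```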
This is an algebraic strength reduction that most compilers cannot do for you. One that the compiler can do is the following:
int	iarr[4096];

void	fn(void)
{
	register int	i, j, *ip;

	for (i = j = 0, ip = iarr; i < 4096; i++) {
		*ip++ = j;
		j += (i << 1) + 1;
	}
}
This saves the array indexing operation. Let's see how expensive computing the address of iarr[i] -- where the result is to be stored -- actually is. iarr[i]'s (byte) address is iarr + 4 * i (not C addition -- otherwise the type of iarr causes the addition to scale -- but byte address addition). So, this means that a naive compiler's generated code will need to take i, multiply it by 4, then add it to the base address iarr to obtain the byte address of the ith entry in the array. The multiply will be done using an arithmetic shift, which takes one cycle, and so the whole thing requires two cycles. The pointer version uses only one. This strength reduction optimization relies on the fact that array elements are stored consecutively in memory.
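The byte-address arithmetic can be demonstrated directly in C. This is a hypothetical illustration (it assumes a 4-byte int, as on the MIPS under discussion):

```c
#include <assert.h>
#include <stddef.h>

int iarr[4096];

/* Byte offset of iarr[i] from the base of the array.  Casting to
 * char * defeats C's scaled pointer arithmetic, exposing the raw
 * byte addresses: the offset is 4 * i for 4-byte ints, which is the
 * shift-and-add work the naive indexing code pays on every access. */
ptrdiff_t byte_offset(int i)
{
	return (char *)&iarr[i] - (char *)iarr;
}
```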

Even though a compiler often can do this for you, you'll need to be able to do this strength reduction yourself when you are recoding in assembler and don't have the benefit of compiler optimizations. I also mentioned in class that it is not always beneficial to apply all the strength reduction optimizations: it will depend on the expense of the operations avoided and the expense of the cheaper code, as well as whether you will force register spills to occur (where too many variables are used, so they can not all be kept in registers, and the assembly code must swap variables into/out of memory [in stack frame] from the registers; this memory traffic can also be expensive).

I started to go over a two dimensional array example, with the original C code:

int	tarr[4096][100];	/* in lecture I used 128 instead of 100 */
int	colsum[100], rowsum[4096];

void	do_sums(void)
{
	int	r, c;

	for (r = 0; r < 4096; r++) rowsum[r] = 0;
	for (c = 0; c < 100; c++) colsum[c] = 0;
	for (r = 0; r < 4096; r++) {
		for (c = 0; c < 100; c++) {
			colsum[c] += tarr[r][c];
			rowsum[r] += tarr[r][c];
		}
	}
}
The two dimensional array is laid out in memory thus:
	 addr			  memory
	           fffffffc fffffffd fffffffe ffffffff
		  +--------+--------+--------+--------+
	ffff ffff |	   |	    |	     |	      |
		  |	   |	    |	     |	      |
		  /	   /	    /	     /	      /
		  \	   \	    \	     \	      \
		  /	   /	    /	     /	      /
tarr[2] abcd f220 |	   |	    |	     |	      |
		  +--------+--------+--------+--------+
                  |        |        |        |        |
                  |        |        |        |        |
tarr[1]	abcd f090 |        |        |        |        |
		  +--------+--------+--------+--------+
                  |        |        |        |        |
                  |        |        |        |        |
tarr[0]	abcd ef00 |        |        |        |        |
		  +--------+--------+--------+--------+
		  |	   |	    |	     |	      |
		  /	   /	    /	     /	      /
		  \	   \	    \	     \	      \
		  /	   /	    /	     /	      /
	0000 0004 |	   |	    |	     |	      |
	0000 0000 |	   |	    |	     |	      |
		  +--------+--------+--------+--------+
		   00000000 00000001 00000002 00000003
The type of tarr[r] is an array of 100 ints, so each row occupies 400 bytes.

To compute the address of tarr[r][c], the machine must compute the expression tarr + 4 * 100 * r + 4 * c . The multiplication by 4 is cheap, but multiplication by 100 is more expensive: either the really expensive mult instruction is used, or the compiler computes (r << 6) + (r << 5) + (r << 2) -- since 100 = 64 + 32 + 4 = 2^6 + 2^5 + 2^2, we can do the multiplication by 100 in 5 cycles instead of 10-18. (Actually, the compiler will multiply by 400, so it will most likely compute (r << 8) + (r << 7) + (r << 4) instead.)
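The shift-and-add decomposition can be sanity-checked in C. This is an illustrative sketch of the transformation the compiler would make, not compiler output:

```c
#include <assert.h>

/* Multiply by 100 using 100 = 64 + 32 + 4: three one-cycle shifts
 * and two one-cycle adds (5 cycles) in place of a 10-18 cycle mult. */
int times100(int r)
{
	return (r << 6) + (r << 5) + (r << 2);
}
```

The same idea gives the multiply-by-400 variant: 400 = 256 + 128 + 16, hence (r << 8) + (r << 7) + (r << 4).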

In any case, this address computation is expensive, and must be done 409,600 times -- once per element, for 4096 rows of 100 columns (the same array element is used twice in the loop body, so we assume that the compiler is smart enough to load the value into a register and use it twice).

In the next lecture, I will show the strength reduced version of this program.



bsy@cse.ucsd.edu, last updated Mon Oct 28 16:16:58 PST 1996.
