CSE 30 -- Lecture 17 -- Nov 26

In this lecture, I gave some hints on assignment 6 (look for and eliminate repeated work), reviewed kernel and user threads, and started on a user-level threads design.

A thread is just a virtual CPU running within some address space. The operating system context-switches among the virtual CPUs to provide multitasking. A traditional process is an address space, access rights (which files are open, under whose privileges the program is running, etc.), and a single thread. Multithreaded programs are processes with more than one thread. (See also Lecture 15 notes.)

We're going to design a very spartan user-level threads system. The user threads package will not have thread IDs nor will there be ways to manipulate threads, e.g., for one thread to abort another. We start by writing a C-style interface.

/* Return types are not given in the lecture; void is assumed here. */
void thread_spawn(void	(*fn)(void *),
		  void	*arg,
		  void	*stack);

void thread_yield(void);

void thread_exit(void);

void thread_wait(void);

We will go into more detail over the next couple of lectures on how to implement this.
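
To get a feel for what the four calls imply, here is a minimal sketch built on the POSIX ucontext routines (getcontext, makecontext, swapcontext, setcontext) as found in Linux/glibc. This is not the implementation developed in the lectures; it swaps in a library facility for the hand-written context switch. Several things are assumptions made only for illustration: the void return types, the fixed STACK_SIZE (the interface passes a stack pointer but no size), the MAX_THREADS table, the round-robin order, and the meaning of thread_wait, which is taken here to be "the original thread blocks until every spawned thread has exited". The helpers next_runnable, trampoline, worker, and demo_run are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

#define MAX_THREADS 8            /* assumption: small fixed table; slot 0 is the original thread */
#define STACK_SIZE  (64 * 1024)  /* assumption: every caller-supplied stack is this large */

static ucontext_t ctx[MAX_THREADS];
static void (*entry_fn[MAX_THREADS])(void *);
static void *entry_arg[MAX_THREADS];
static int alive[MAX_THREADS] = { 1 };   /* the original thread is always runnable */
static int nthreads = 1, current = 0;

void thread_exit(void);

/* Round-robin: the next live slot after `current` (slots are never reused). */
static int next_runnable(void)
{
    for (int i = 1; i <= nthreads; i++) {
        int c = (current + i) % nthreads;
        if (alive[c])
            return c;
    }
    return 0;
}

/* First code run on a new thread's stack. */
static void trampoline(void)
{
    entry_fn[current](entry_arg[current]);
    thread_exit();               /* returning from fn is treated as an exit */
}

void thread_spawn(void (*fn)(void *), void *arg, void *stack)
{
    int id = nthreads++;
    assert(id < MAX_THREADS);
    getcontext(&ctx[id]);
    ctx[id].uc_stack.ss_sp = stack;
    ctx[id].uc_stack.ss_size = STACK_SIZE;
    ctx[id].uc_link = NULL;
    entry_fn[id] = fn;
    entry_arg[id] = arg;
    alive[id] = 1;
    makecontext(&ctx[id], trampoline, 0);
}

void thread_yield(void)
{
    int prev = current;
    current = next_runnable();
    if (current != prev)
        swapcontext(&ctx[prev], &ctx[current]);  /* save prev, resume next */
}

/* Must be called from a spawned thread, not from the original one. */
void thread_exit(void)
{
    alive[current] = 0;
    current = next_runnable();
    setcontext(&ctx[current]);   /* never returns to the exiting thread */
}

/* Assumed semantics: the original thread yields until all others have exited. */
void thread_wait(void)
{
    for (;;) {
        int others = 0;
        for (int i = 1; i < nthreads; i++)
            others += alive[i];
        if (!others)
            return;
        thread_yield();
    }
}

/* Demo: two threads each record their letter twice, yielding in between. */
static char trace[16];
static size_t trace_len;

static void worker(void *arg)
{
    for (int i = 0; i < 2; i++) {
        trace[trace_len++] = *(const char *)arg;
        thread_yield();
    }
}

const char *demo_run(void)
{
    void *s1 = malloc(STACK_SIZE), *s2 = malloc(STACK_SIZE);
    thread_spawn(worker, "A", s1);
    thread_spawn(worker, "B", s2);
    thread_wait();
    free(s1);
    free(s2);
    return trace;
}
```

Calling demo_run() once from main interleaves the two workers, so the returned trace reads "ABAB". Because the caller owns the stacks, the package itself never allocates memory, which is one plausible reason for the stack argument in thread_spawn.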

How to Measure Performance for Assignment 6

(This was not part of the lecture.)

To measure how fast your code is, run it on a standard size, e.g.,

Width	Height	Description
70	20	slightly smaller than a standard xterm
160	160	PalmPilot
480	240	WinCE display
1024	768	Desktop display
1280	1024	Large desktop display

for a few repetitions (e.g., 10 reps for the PalmPilot/WinCE sizes), with no intermediate output, and record the instruction count printed (it's a 64-bit hexadecimal value). Then run a null routine in place of life_xform -- see nolife.asm (just run "spim -file nolifeall.asm") -- and see what that number is. Use a hex calculator to compute the difference, then divide by the number of pixels and the number of repetitions to get the instructions-per-pixel value. Note: do not measure your code using the desktop sizes, since it will take too long.

In Unix, you can do the following (these numbers have changed because I modified the driver to make measurements more precise -- the output code, which is always invoked at the end, used to use a different number of instructions depending on how many cells were alive or dead):

$ spim -file nolifeall.asm
SPIM Version 6.0 of July 21, 1997
Copyright 1990-1997 by James R. Larus (larus@cs.wisc.edu).
All Rights Reserved.
See the file README for a full copyright notice.
Loaded: /home/solaris/ieng9/cs30f/public/lib/trap.handler
Enter numbers, <= 0 to use default
Width? 480
Height? 240
Xterm control sequences? (0/1) 0
Show intermediate states? (0/1) 0
Seed? 1
Rep? 10
[ ... output of last generation omitted ... ]
Total instructions since program load = 0x00000000 00d6c103
$ spim -file all.asm
SPIM Version 6.0 of July 21, 1997
Copyright 1990-1997 by James R. Larus (larus@cs.wisc.edu).
All Rights Reserved.
See the file README for a full copyright notice.
Loaded: /home/solaris/ieng9/cs30f/public/lib/trap.handler
Enter numbers, <= 0 to use default
Width? 480
Height? 240
Xterm control sequences? (0/1) 0
Show intermediate states? (0/1) 0
Seed? 1
Rep? 10
[ ... output of last generation omitted ... ]
Total instructions since program load = 0x00000000 18a349f0

Then, you do:

$ bc
scale=5
ibase=16
18A349F0 - D6C103 /* note that you must capitalize the letters */
399280365
ibase=A
./10/480/240
346.59753
^D
$

which tells you that the original slow code that I gave out uses about 346 instructions per pixel, which is awfully slow.
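
If you would rather not do the hex arithmetic in bc, the same calculation can be sketched in C. The function names join_count and instr_per_pixel are made up for this example; the only facts taken from the notes are that SPIM prints the count as two 32-bit hexadecimal halves and that the figure of merit is (measured - null baseline) / (width * height * reps).

```c
#include <stdint.h>

/* Join the two 32-bit halves SPIM prints, e.g. "0x00000000 18a349f0". */
static uint64_t join_count(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Instructions per pixel: subtract the null-routine baseline, then
 * divide by the total number of pixels processed over all repetitions. */
static double instr_per_pixel(uint64_t measured, uint64_t baseline,
                              unsigned width, unsigned height, unsigned reps)
{
    return (double)(measured - baseline) / ((double)width * height * reps);
}
```

With the two counts from the transcript above, instr_per_pixel(join_count(0, 0x18a349f0), join_count(0, 0x00d6c103), 480, 240, 10) gives roughly 346.6, agreeing with the bc result.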

I have a version that achieves 3.4 instructions per pixel when run with the WinCE screen dimensions. I don't expect you to get this level of speedup -- if you get around 15 instructions per pixel, you'll be doing really well. I include this number here only to let you know what is achievable.

I asked Stephane to implement a sample solution using only a little thought and the speed-up techniques that you've been taught in class. His solution will be used as a metric for what is an "A" grade -- he's getting fractionally over 15 instructions per pixel.

Observation: to simply copy pixels, you can copy 32 pixels using only two instructions (a lw / sw pair), plus loop overhead. Ignoring loop overhead and pointer bumping, this is 1/16 of an instruction per pixel.
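
In C terms, the observation corresponds to something like the sketch below. It assumes pixels are packed one bit each into 32-bit words (which is what makes one lw / sw pair move 32 pixels) and that the pixel count is a multiple of 32; copy_pixels and PIXELS_PER_WORD are names invented for this illustration, not part of the assignment's driver.

```c
#include <stddef.h>
#include <stdint.h>

#define PIXELS_PER_WORD 32u   /* assumption: 1 bit per pixel, 32-bit words */

/* Copy npixels (assumed to be a multiple of 32) one word at a time. */
void copy_pixels(uint32_t *dst, const uint32_t *src, size_t npixels)
{
    size_t nwords = npixels / PIXELS_PER_WORD;
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];      /* one load plus one store moves 32 pixels */
}
```

The loop body is the lw / sw pair; the index increment and the branch are the loop overhead and pointer bumping that the 1/16 figure ignores.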

bsy@cse.ucsd.edu, last updated Fri Dec 5 00:32:17 PST 1997.
