Computer Architecture 1
Chapter 1
- What is computer architecture - 4 classes of computers
- personal computer
- server computer
- embedded computer
- supercomputer
 
- What does “PostPC era” mean - personal mobile device(PMD)
- cloud computing
- servers are replaced by cloud computing
- Software as a Service (SaaS)
 
- a portion of the software runs on a PMD, and a portion runs in the cloud
 
- What factors could affect performance - algorithm
- programming language, compiler, architecture
- processor and memory system
- I/O system(including OS and hardware)
 
- Eight great ideas – Know what they mean - 1. Design for Moore's Law
- 2. Use abstraction to simplify design
- 3. Make the common case fast
- 4. Performance via parallelism
- 5. Performance via pipelining
- 6. Performance via prediction
- 7. Hierarchy of memory
- 8. Dependability via redundancy
 
 
- Different levels of program code - HLL (high-level language) –> Assembly language –> ML (machine language)
- 5 components of a computer
- input
- output
- control
- tells datapath, memory, I/O what to do according to the instructions.
 
- datapath
- performs the arithmetic operations
 
- memory
 
- input –>(control, datapath)–>output 
- What is ISA? - instruction set architecture
- the hardware/software interface
 
 
- Understand “Yield” in terms of chip manufacturing? - Proportion of working dies per wafer (a wafer is the silicon disc on which integrated circuits are fabricated)
 
- Response time vs Throughput - Response time
- How long it takes to do one task
 
- Throughput
- Total work done per unit time
 
 
- Elapsed time vs CPU time - elapsed time
- wall clock time
- total response time, including everything
- can be used to determine the system performance
 
 
- cpu time
- the time the CPU spends on a given process
- can be divided into user CPU time and system CPU time
 
 
- Know how to calculate CPU time, CPI_avg, IPC, Clock rate, Clock Cycle Time, and Performance - 1 cycle per second –> 1 Hz
- 1 GHz = 10^9 Hz
 
- clock rate(frequency)
- cycles per second
 
- clock cycle time(period)
- duration of a clock cycle
 
- CPU time
- CPU time = CPU Clock Cycles * Clock Cycle Time = CPU Clock Cycles / Clock Rate
 
- Instruction count
- determined by program, ISA, and compiler
 
- Average cycles per instruction (CPI)
- IPC = 1/CPI
 
- Clock Cycles = Instruction Count * Average Cycles per Instruction
- CPU time = Instruction Count * CPI * Clock Cycle Time
- = Instruction Count * CPI / Clock Rate
- = Instruction Count / (Clock Rate * IPC)
- = (Instructions/Program) * (Cycles/Instruction) * (Seconds/Cycle)
- = IC * CPI * CC
 
- Clock Cycles = sum over instruction classes of (CPI_i * Instruction Count_i)
- how to measure performance
- use instructions/second
- equivalently, instructions/second = Clock Rate / CPI
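The performance equation above can be checked numerically. A minimal sketch with illustrative numbers (the instruction count, CPI, and clock rate below are made up for the example):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = Instruction Count * CPI / Clock Rate (in seconds)."""
    clock_cycles = instruction_count * cpi
    return clock_cycles / clock_rate_hz

# Illustrative numbers: 10^9 instructions, average CPI of 2, 4 GHz clock.
ic = 1_000_000_000
cpi = 2.0
rate = 4e9  # 4 GHz = 4 * 10^9 Hz

t = cpu_time(ic, cpi, rate)
ipc = 1 / cpi  # IPC is the reciprocal of CPI

print(t)    # 0.5 seconds
print(ipc)  # 0.5 instructions per cycle
```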
 
 
- Performance depends on
- Algorithm: affects IC, CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- ISA: affects IC, CPI, CC
 
- if you want an improvement in the execution time
- you reduce the time by that percentage of the old time
- for example
- to improve by 50%, the new time is (100% - 50%) = 50% of the old time, not (100% + 50%) = 150%
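A quick numeric check of the point above (the 10-second old time is an illustrative value):

```python
old_time = 10.0  # seconds (illustrative)

# "Improve execution time by 50%" means the new time is
# (100% - 50%) = 50% of the old time.
new_time = old_time * (1 - 0.50)
print(new_time)  # 5.0 -- twice as fast

# The common mistake: multiplying by 150% instead.
wrong = old_time * 1.50
print(wrong)  # 15.0 -- that would be slower, not faster
```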
 
 
- speed up
- use (old execution time) / (new execution time)
- old time = old instruction count * old CPI / clock rate
- new time = new instruction count * new CPI / clock rate
- old time / new time = (old instruction count * old CPI) / (new instruction count * new CPI), when the clock rate is unchanged
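The speedup ratio above can be sketched directly; the clock rate cancels when it is the same for both versions. The instruction counts and CPIs below are made up for illustration:

```python
def speedup(old_ic, old_cpi, new_ic, new_cpi):
    """Speedup = old time / new time; the clock rate cancels
    when both versions run on the same machine."""
    return (old_ic * old_cpi) / (new_ic * new_cpi)

# Illustrative: a compiler change cuts instruction count by 20%
# but raises average CPI from 2.0 to 2.2.
s = speedup(old_ic=1_000_000, old_cpi=2.0,
            new_ic=800_000, new_cpi=2.2)
print(round(s, 3))  # 1.136 -> the new version is ~13.6% faster
```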
 
 
 
 
- Could explain why “Power Wall” is a problem? - situation - voltage cannot be reduced further, since doing so would make transistors leakier
 
- Power = Capacitive Load * Voltage^2 * Frequency
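The dynamic power equation can be sketched numerically; because voltage enters squared, lowering it is the most effective lever (unit values below are illustrative):

```python
def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power = capacitive load * voltage^2 * frequency."""
    return capacitive_load * voltage ** 2 * frequency

# Illustrative: halve the voltage at the same load and frequency.
p_old = dynamic_power(1.0, 1.0, 1.0)
p_new = dynamic_power(1.0, 0.5, 1.0)
print(p_new / p_old)  # 0.25 -- power drops with the square of voltage
```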
- Understand what are the challenges on multicore processors - Multicore processors: more than one core per chip
- Hard to do
- Programming for performance (not only for correctness)
- Load balancing
- Optimizing communication and synchronization
 
 
- Fallacies and Pitfalls - Fallacy: Some commonly held misconceptions that you might encounter
- Computers at low utilization use little power
- Designing for performance and designing for energy efficiency are unrelated
 
- Pitfall: easily made mistakes
- They are only true in a limited context
- Can help you avoid making the same mistakes
- example:
- If you improve one aspect of a computer, then you would expect a proportional improvement in the overall performance
- Using a subset of the performance equation as a performance metric.
- MIPS as a Performance Metric
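Why MIPS misleads as a metric: it ignores that different machines may need different instruction counts for the same task. A sketch with made-up numbers:

```python
def mips(instruction_count, exec_time_s):
    """MIPS = instruction count / (execution time * 10^6)."""
    return instruction_count / (exec_time_s * 1e6)

# Made-up machines running the same task:
# A executes many simple instructions, B far fewer complex ones.
mips_a = mips(instruction_count=8_000_000, exec_time_s=2.0)  # 4 MIPS
mips_b = mips(instruction_count=2_000_000, exec_time_s=1.0)  # 2 MIPS

# A has the higher MIPS rating, but B finishes the task in half the time.
print(mips_a > mips_b)  # True -- yet B is the faster machine
```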
 
 
 
 
- e.g. MIPS as a performance metric, Amdahl’s Law, etc.
 
Chapter 2
- Understand what is instruction and instruction set?
- instruction set: the vocabulary of all commands the computer understands
- different computers have different instruction sets
 
- instruction: a word of the computer’s language
- instruction set architecture(ISA): ISA serves as the interface between software and hardware
- provide the mechanism by which software tells hardware what should be done
 
 
- Differences between RISC and CISC
- RISC
- reduced instruction set computer
- difference
- fixed instruction length: 32 bits
- load/store instruction set
- limited addressing modes
- limited operations
- simpler, cheaper
- MIPS: typical of RISC ISAs
 
 
- CISC
- x86
 
 
- MIPS ISA - MIPS has 32 32-bit registers - 32-bit data is called a word
- memory is byte addressed
- each address identifies a byte
 
- MIPS is big endian
 
- R and I types of instruction format - R-format
- op(6 bits)-rs(5 bits)-rt(5 bits)-rd(5 bits)-shamt(5 bits)-funct(6 bits)
- add, sub
 
- I-format
- op(6 bits)-rs(5 bits)-rt(5 bits)-constant or offset address(16 bits)
- addi, lw, sw, beq, bne
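The R-format fields above can be packed into a 32-bit word. A sketch encoding `add $t0, $t1, $t2` (MIPS register numbers: $t0 = 8, $t1 = 9, $t2 = 10; add uses opcode 0 and funct 0x20):

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack R-format fields: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $t1, $t2  ->  rd = $t0(8), rs = $t1(9), rt = $t2(10)
word = encode_r(op=0, rs=9, rt=10, rd=8, shamt=0, funct=0x20)
print(hex(word))  # 0x12a4020
```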
 
 
- Big endian, little endian - little endian: least significant byte stored at the lowest address
- big endian: most significant byte stored at the lowest address
 
-  Memory alignment
- although computers are byte-addressable
- memory is typically organised in n-byte lines
- only char[N] arrays are laid out identically in big endian and little endian
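Python's struct module can show how the same 32-bit word is laid out under each byte order, and why a byte array looks the same either way (a minimal sketch):

```python
import struct

word = 0x12345678

big = struct.pack(">I", word)     # big endian: MSB at the lowest address
little = struct.pack("<I", word)  # little endian: LSB at the lowest address

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# A char[N]-style byte string has no multi-byte units to reorder,
# so it packs identically under both byte orders.
s_big = struct.pack(">4s", b"abcd")
s_little = struct.pack("<4s", b"abcd")
print(s_big == s_little)  # True
```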
 
 
- How to represent unsigned and signed integers - What is 2s-complement, how, why
- Most-negative: 1000 0000 … 0000
- Most-positive: 0111 1111 … 1111
- Bit 31 is called “sign bit”
- To negate: invert all bits, then add one
 
 
- Sign extension - leading bit = 1 –> negative
- range
- -2,147,483,648 to +2,147,483,647
 
- Sign extension: replicate the sign bit to the left
- 2^31 - 1 = 0x7FFFFFFF
- -2^31 = 0x80000000
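The “invert and add one” rule and the hex bounds above can be checked with 32-bit masking (a minimal sketch):

```python
MASK = 0xFFFFFFFF  # keep only the low 32 bits

def negate(x):
    """Two's-complement negate: invert all bits, then add one."""
    return ((x ^ MASK) + 1) & MASK

print(hex(negate(1)))        # 0xffffffff -- the encoding of -1
print(hex(2**31 - 1))        # 0x7fffffff -- most positive value
print(hex(negate(2**31)))    # 0x80000000 -- most negative is its own negation
```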
- overflow problem
- for the add instruction
- overflow if the result > 2^31 - 1 (the upper bound)
- or the result < -2^31 (the lower bound)

- so an add overflows only when the result is bigger than 2^31 - 1 or smaller than -2^31
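The overflow check for add can be sketched directly from the 32-bit bounds:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def add_overflows(a, b):
    """A signed 32-bit add overflows when the true sum
    leaves the range [-2^31, 2^31 - 1]."""
    s = a + b
    return s > INT_MAX or s < INT_MIN

print(add_overflows(INT_MAX, 1))   # True  -- positive overflow
print(add_overflows(INT_MIN, -1))  # True  -- negative overflow
print(add_overflows(INT_MAX, -1))  # False -- opposite signs never overflow
```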
 
 
- Logical operations: sll, srl, and, or, nor 
- Conditional operations: beq, bne - beq: branch equal
- bne: branch not equal
 
- Concept of “basic block” - a basic block is a sequence of instructions with no embedded branch, no branch target
 
- How “Procedure calling” is supported - procedure is used to structure program
- each procedure performs a specific task
- working like a black box
 
- Know jal (for calling), and jr (for return) - jal procedureName
- puts address of following instruction in $ra
- jump to target address
- procedure call
 
- jr $ra
- jump-register
- copy $ra to PC
- procedure return
 
 
- Understand the fact(n) assembly example 
- The memory layout of a program 
- J-type instruction format, e.g., j and jal - op(6 bits)-instruction address(26 bits)
 
- Know how to calculate the target of PC-relative addressing, and target of (pseudo) direct jump addressing - target address = (PC + 4) + offset * 4
- pseudo instructions: not real instruction
- move $t0, $t1 –> add $t0, $zero, $t1
- blt $t0, $t1, L –> slt $at, $t0, $t1
- bne $at, $zero, L
 
- $at : assembler temporary
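The PC-relative calculation can be sketched as below; in MIPS the 16-bit offset counts words and is taken relative to PC + 4 (the branch addresses used here are illustrative):

```python
def branch_target(pc, offset):
    """beq/bne target = (PC + 4) + offset * 4."""
    return pc + 4 + offset * 4

# Illustrative: a branch at address 0x1000 with a word offset of 3.
print(hex(branch_target(0x1000, 3)))   # 0x1010
# A backward branch uses a negative offset.
print(hex(branch_target(0x1000, -2)))  # 0xffc
```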
 
 
- Hardware synchronisation instructions - ll rt, offset(rs) - load linked
 
- sc rt, offset(rs) - store conditional
 
- rt is both input and output
 
- Know what information is stored in object modules - the assembler translates the program into machine instructions, which are stored in object modules
 
- Know what are Compiler, Assembler, Linker, Loader used for? - the C program is compiled by the compiler
- the compiler produces an assembly language program
- the assembler then generates an object (machine language module)
- objects (machine language modules and library routines) go to the linker
- the linker statically links them and generates the executable machine language program
- the loader then loads the executable into memory
 
- A few fallacies and pitfalls - instruction count and CPI are not good performance indicators in isolation
- compiler optimisation are sensitive to algorithm
- Java compiled code is significantly faster than JVM-interpreted code
- nothing can fix a dumb algorithm
- use assembly code for high performance
- powerful instruction –> high performance
- importance of binary compatibility => instruction set does not change
- sequential words are at sequential byte addresses
- increment by 4
 
- using a pointer to an automatic variable outside its defining procedure
- e.g. passing a pointer back via the return result
- because the pointer becomes invalid when the stack frame is popped
 
 
 
- MIPS instruction 
- MIPS instructions:

| Name | Mnemonic | Format |
| --- | --- | --- |
| add | add | R |
| subtract | sub | R |
| add immediate | addi | I |
| load word | lw | I |
| store word | sw | I |
| load half | lh | I |
| load half unsigned | lhu | I |
| store half | sh | I |
| load byte | lb | I |
| load byte unsigned | lbu | I |
| store byte | sb | I |
| load linked | ll | I |
| store conditional | sc | I |
| load upper immediate | lui | I |
| and | and | R |
| or | or | R |
| nor | nor | R |
| and immediate | andi | I |
| or immediate | ori | I |
| shift left logical | sll | R |
| shift right logical | srl | R |
| branch on equal | beq | I |
| branch on not equal | bne | I |
| set less than | slt | R |
| set less than immediate | slti | I |
| set less than immediate unsigned | sltiu | I |
| jump | j | J |
| jump register | jr | R |
| jump and link | jal | J |
- thus R-format instructions include - add, sub, and, or, nor, slt, the shifts (sll, srl), jump register (jr), move, mult
 
- the I-format instructions contain
- the loads, stores, immediate operations, and branches
 
- J instruction only have j and jal 
- Assembly code examples - 1. if (i == j) f = g + h; else f = g - h - beq $s3, $s4, Then
- sub $s0, $s1, $s2
- J Exit
- Then: add $s0, $s1, $s2
- Exit: …
 
- 2. while(array[i] == k) i += 1 - Loop: sll $t1, $s3, 2 // t1 = i * 4
- add $t1, $t1, $s6 // t1 = address of array[i]
- lw $t0, 0($t1)
- bne $t0, $s5, EXIT // exit when array[i] != k
- addi $s3, $s3, 1
- j Loop
- EXIT:…
 
- 3. leaf procedure example - int leaf_example(int g, int h, int i, int j){ int f; f = (g + h) - (i + j); return f; }
- MIPS code: - addi $sp, $sp, -4 //save s0 on stack
- sw $s0, 0($sp)
- add $t0, $a0, $a1
- add $t1, $a2, $a3
- sub $s0, $t0, $t1
- add $v0, $s0, $zero //result in $v0
- lw $s0, 0($sp) //restore s0
- addi $sp, $sp, 4
- jr $ra //return
 
 
- 4. non-leaf procedure example - int fact(int n){ if(n < 1){ return 1; }else{ return n * fact(n - 1); } } int main(){ int n = 10; printf("%d", fact(n)); }
- MIPS code: -  fact:
- addi $sp, $sp, -8 //room for $ra and $a0
- sw $ra, 4($sp) //save return address
- sw $a0, 0($sp) //save argument n
- slti $t0, $a0, 1 //n < 1, t0 = 1
- beq $t0, $zero, L1 //if t0 == 0 (means n >= 1) go to L1
- addi $v0, $zero, 1 //base case: return 1
- addi $sp, $sp, 8 //pop the stack frame
- jr $ra //return

-  L1:
- addi $a0, $a0, -1 //n = n - 1
- jal fact //recursion
- lw $a0, 0($sp) //restore n
- lw $ra, 4($sp) //restore return address
- addi $sp, $sp, 8 //pop the stack frame
- mul $v0, $a0, $v0 //n * fact(n - 1)
- jr $ra //return
 
 
 
 
- segment placement (low to high address) - code segment –> data segment –> heap segment (grows upward) –> stack segment (grows downward from the top)
Chapter 3
 
- Given a logic function, know how to draw its logic gate diagram 
- Given a logic gate diagram, know how to write down its logic function 
- Know how a full adder is implemented
- Know what are decoder and multiplexer and how they work
- Understand clock, register, SRAM, DRAM
- SRAM: Static Random Access Memory
- DRAM: Dynamic Random Access Memory