computer architecture final exam review

memory

Memory Hierarchy
- Understand what is memory wall? Why it is a bottleneck?瓶颈
  - 内存壁垒：处理器和内存的悬殊持续增长
  - 瓶颈：好的内存架构设计对系统的性能有越来越高的重要性
- Illusion provided from a memory hierarchy
  - 大的内存都是很慢的，但是caching可以give 错觉他们很快
  - 所以这个错觉可以是
    - caching + virtual memory 能够是的内存变快
  - 传统的内存层级
    - 通过使用局部性的优势
      - 能够尽可能的利用磁盘上能够使用的内存
      - 速度也会很快
- Principle of locality (two types of locality)
  - 定义:程序每次会使用一小部分的地址空间
    - temporal locality
      - 曾经使用过的程序，再次被使用的可能会很大
      - 比如循环中的指令和数据
    - spatial locality
      - 在执行过的程序附近的程序被使用的可能性很大
      - 例如顺序执行中的数组
- DDR RAM? Relationship between memory clock rate and memory bandwidth
  - mobile system require 2X bandwidth every 2 years
- DDR DRAM 叫double data rate DRAM
  - transfer data on rising and falling clock edges
  - E.g., 800MHz DDR –> BW?
- Know how to calculate Average Disk Read Time
  - average disk read time = seek time + rotation latency + transfer time + controller delay
  - rpm = revolution per minutes
  - e.g
    - 512B sector, 15000rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
    - time = 4ms seek time +
      - (½)/(15000/60) = 2ms rotationl latency
      - 512/100MB/s = 5.12 s = 0.00512ms transfer time
      - 0.2ms controller delay
      - = 6.2 ms
Difference between three types of caches:
- Direct-mapped cache
  - 根据索引表映射哪个cache块用来存数据
  - 一个block只能有一个选择存哪里
  - 内存块的那一个位置分配给cache的某个位置，由后三位决定（假设有8个块（2³=8））
    - 001：00001，01001，10001，11001
  - 用valid bit来判断cache的数据是否存在内存那个位置
  - tag用来判断cache的数据在内存哪个位置
  - 重点
    - index不存在cache里面，就像数组不会存索引
    - 假设有一个word：22
      - 首先转成二进制10110
      - 然后后三位位index
      - 前两位为tag
      - 然后把数据存入对应位置
      - 因为没存进来的时候，对应位置没有数据，所以为miss（第一次访问那个内存地址），存入之后，才会变成hit
      - 然后数据valid bit变成Y
  - 假设一个cache有32 bit的地址
    - tag:22bits
    - index:4bits
    - offsets:6bits
    - 那么 cache block size 为2⁶ byte
    - 那么 cache有2⁴ entries
- n-way Set-associative cache
  - 每个set都有n个成员
  - 用块的号数来决定哪个set
  - 在给定的set中搜索全部成员（因为set里面的成员少，所以搜索起来快）
    - 那么就只要比较n次，更加省时间
  - 假设一个cache有2^s个sets，并且每个block有2ⁿ个bytes。
    - 那么block offset为n
    - index为s
    - 假设address长度为m
      - 则tag长度为m-s-n
    - 计算set index
      - block address = memory address / 2ⁿ
      - set index = block address % 2^s
- Fully associative cache
  - cache可以映射到内存中任何一个位置
  - 所有成员都要搜索一次
  - 耗费时间长，每个成员都要对比, 每个block的tag都要对比
For each of the above cache type
- How to calculate a block(or set) index give an address?
  - miss: load 的数据不再memory里面
    - every miss always load the 64 byte
  - 如何选block size
    - 用两种方式
      - early restart
        
        CPU在等第一个byte，但是64byte还没load进去
        
        或者可以先读第一个byte，不用等他读完
      - critical word first
- How to subdivide an address given a cache design, or vice versa?
Write back vs write through（write hit）
- write through
  - 如果write hit，则更新缓存里面的block和memory里面的block
  - 这样缓存和内存里面的数据就同步了
  - 但是会使得数据的写更长时间
    - 解决方案：使用write buffer
      - 把要写入内存的数据hold waiting
- write back
  - 当你更新数据的时候，只改cache里面的数据
  - 速度快
- 两者使用的差别
  - 如果写的频率高的时候，write back快
Write allocate vs no write allocate (on write miss)
- allocate
  - fetch the block
- no allocate
  - do not fetch the block
condition
- if read hit: cpu continue
- if write hit: write through or write back
- if read miss: store pipeline, read data to pipeline and continue
- if write miss: write through (allocate or no allocate)
Know how to calculate AMAT, actual CPI
- AMAT: average memory access time
  - hit time + miss rate * miss penalty
  - 如果是两层
    - AMAT = hit time1 + miss rate1 * (hit time2 + miss rate2 * miss penalty2)
- actual CPI : cpu time divide into 2 parts
  - cpu execution clock cycles
  - memory stall clock cycles
    - (instruction / program)* (misses/introduction) * miss penalty
  - miss penalty = main memory access time / clock cycle
  - Effective CPI = bass CPI + L1 Miss rate * L2 Hit Time + L2 global miss rate * L2 Miss time
  - AMAT = L1 Hit time + L1 miss rate * L2 Hit time + L2 global miss rate * L2 Miss time
  - performance = 慢的/快的
  - e.g
    - CPU base CPI = 1, clock rate = 4GHz
    - Miss rate / instruction = 2%
    - Main memory access time = 100ms
      - then if there is only primary memory
        
        Miss penalty = 100ns / clock cycle
        
        clock cycle = 1/clock rate = 0.25
        
        Miss penalty = 400 cycle
        
        Effective CPI = 1 + 2% * 400 = 9 cycle
    - then suppose we add L-2 cache
      - L2 cache access time = 5ns
      - L2 global miss rate = 0.5%
        
        L2 Hit time = 5ns / clock cycle = 5ns / 0.25 = 20 cycles
        
        L2 Miss time = 400 cycles
        
        Effective CPI = 1 + 2% * 20 + 0.5% * 400 = 3.4 cycle
- E.g. on a machine with one or two level caches
- 3 formulas (two from lectures, one from the HW 6 solution)
Know what is LRU(least recent use) replacement policy
Virtual memory
- 1K = 2¹⁰
  - 1K * 1K = 1M
- What is page table
  - 用来在物理内存里面找一个页的
  - PTE:page table entries
    - indexed by virtual page
  - 如果page不在内存里面
    - PTE存物理page的数字
  - 如果page在内存里面
    - PTE refer to一个磁盘的位置
  - dirty bit set to 1 当一个page被改动过
- Address subdivision
- How to translate a virtual address to a physical address
  - CPU and OS translate virtual address to physical address
  - The CPU sends virtual address to MMU
  - The MMU send physical address to the memory
- Purpose of MMU
  - 转换虚拟地址和物理地址
- What is page fault
  - 其实相当于cache miss
  - 当你有一个page fault，这个page就要被fetch from磁盘到内存
  - 如何减少page fault？
    - 全关联placement
    - 小页面替换算法
- How a page fault handler works?
  - use faulting virtual address to find PTE
  - locate page on disk
  - choose page to replace
    - if dirty, write the page to disk first
  - read page into memory and update page table
  - make process runnable again
    - restart from faulting instruction
假设virtual address = 43 bits, page size = 4KB, PTE size = 4 byte
- 那么对于单层 page table，需要多少个PTES？
  - page size = 4KB = 2¹²
  - 那么最差要2^43-12
- 那么需要存这个page table要多少物理内存？
  - 2^43-12*4 byte
  - 2³¹*2² = 2³³
Know what is TLB
- Translate look-aside Buffer
  - TLB的miss由硬件或者软件控制
  - a spacial cache of page table entries within CPU
- Understand the interaction between TLB and caches
  - cache rage use physical address
    - need to translate before cache
- TLB cache is a page table entries
Know the sources of cache misses (3 Cs model) 必考
- 三种类型的miss
  - compulsory miss
    - 第一次访问会导致这种miss
    - 因为第一次访问的时候数据不会在内存里面
    - 解决方案
      - 假设一个block只能存一个byte，然后我们要load 128 byte的数据
        
        那么我们就要load 128次
        
        但是如果把block增大成128 byte，那么我们只要存一次
  - capacity miss
    - 理论上存在，实际的机器是不存在的
    - 只有full associate cache
    - 空间不足
    - 解决方案
      - 增大cache size
  - conflict miss
    - 假设有两个entry，但是有100个要进去
      - 这个时候就会发生这种miss
    - 也就是资源竞争
    - 解决方案
      - 增加关联性
    - 不存在于full associate cache
- What about the fourth C?
Software optimisation techniques: AVX, Loop Unrolling, Blocking?
- blocking
  - maximise accesses to data before it is replaced
  - 应该就是访问数据之前，先把空间申请了
A few fallacies and pitfalls
- 1.在多核CPU里里面，并且有shared的L2和L3cache
  - 如果关联性不够core多的话，会导致 conflict miss
    - e.g
      - 如果有一个8核CPU，那么至少需要一个8-way的set 关联cache
- 2.用AMAT来评估一个out-of-order processor的性能
  - 忽略effect of non-blocking accesses(continuing execution on misses)
  - instead, evaluate performance by simulation

parallel processor

Amdahl’s law
- 知道公式以及知道如何计算
- 公式
  - speedup = 1/[(1-F)+F/P]
  - F为并行化的百分比
  - P为处理器数量
  - e.g
    - 我们有100个处理器，想要90倍的speedup
    - 1/[(1-F) + F/100] = 90
      - F = 89*100/(99*90) = 99.9%
      - 所以非并行的部分必须要<=0.1%
strong scaling(缩放) and weak scaling
- scaling
  - e.g1 10*10 matrix
    - 从10个处理器加速到100个处理器，speedup是多少
      - 首先在单个处理器：time=(10+100)*Tadd=110*Tadd
      - 10个处理器
        
        time=10*Tadd + (10*10)/10*Tadd = 20*Tadd
        
        speedup = 110/20=5.5
      - 100个处理器
        
        time=10*Tadd + (10*10)/100*Tadd = 11*Tadd
        
        speedup = 110/11 = 10
- strong scaling
  - 数据规模确定
- weak scaling
  - 数据规模和处理器的数量有关
  - 例如
    - 我们有10个处理器，那么数组的规模就是10*10
    - 如果有100个处理器，那么数组的规模就是100*100
good understanding of hardware threading
- 增加单个核的资源利用率
- 在thread里面，dependency handled by scheduling and register renaming
- 不同的thread里面的指令在函数单元available 的时候执行
- 每个cycle都没有thread switching

KING

Do more, say less

Computer Architecture Final Exam Review

memory

parallel processor

Comments