Introduction — Why Computer Architecture Matters
Computer architecture bridges hardware and software: it determines how instruction streams map to physical execution, how memory access latency affects performance, and how parallelism is exposed to programmers. Understanding architecture lets you write faster code, optimize hot paths, and design scalable systems.
This guide covers core concepts interviewers expect: pipelining, caches, branch prediction, parallelism, and performance reasoning — with problems and revealable solutions.
Instruction Set Architectures (ISAs) — RISC vs CISC
The ISA is the programmer-visible contract: registers, instructions, and addressing modes. Two historical camps dominate:
- CISC (Complex Instruction Set Computer): many complex instructions (x86). Dense instruction encodings; microcode often translates complex ops to micro-ops.
- RISC (Reduced Instruction Set Computer): small, fixed-length instructions (ARM, RISC-V). Simpler decode, easier pipelining and parallelism.
Modern x86 internally translates into micro-ops; RISC-V is gaining adoption for its simplicity and extensibility.
Performance Metrics & Amdahl's Law
Key metrics: latency (time to complete a task), throughput (work per unit time), CPI (cycles per instruction), IPC (instructions per cycle), and clock frequency. Use Amdahl's law to reason about speedup:
Speedup = 1 / ((1 - f) + f / S) # f = fraction of execution time improved, S = speedup of that part
Example: if 40% of runtime can be sped up by 2x, overall speedup = 1 / (0.6 + 0.4/2) = 1 / (0.6 + 0.2) = 1.25x.
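Amdahl's law is easy to encode as a quick sanity check. The sketch below is a minimal plain-Python helper (the function name `amdahl_speedup` is our own) that reproduces the worked example:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of runtime is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# 40% of runtime sped up by 2x -> ~1.25x overall, matching the example above
print(amdahl_speedup(0.4, 2))
```

Note how the sequential part (1 - f) bounds the result: even as s grows without limit, the speedup can never exceed 1 / (1 - f).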
Pipelining & Hazards
Pipelining splits instruction execution into stages (fetch, decode, execute, memory, writeback) to increase throughput. Ideal pipeline throughput approaches one instruction per cycle, but hazards limit performance.
Hazards
- Structural hazards: resource conflicts (e.g., single memory port).
- Data hazards: RAW, WAR, WAW dependencies. True (RAW) dependencies are mitigated with forwarding or stalls; WAR and WAW are false dependencies, removed by register renaming in out-of-order cores.
- Control hazards: branches that change PC — cause pipeline flushes.
Example: Stall Calculation
// Load-use hazard: the load takes an extra cycle before its value is available
I1: LOAD R1, [R2]
I2: ADD R3, R1, R4   // depends on R1
// A 1-cycle load latency forces the pipeline to insert a bubble between I1 and I2
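For cycle-count questions like this, a simple closed-form model is enough. The sketch below assumes a classic in-order pipeline with full forwarding, where total cycles = (stages - 1) fill cycles + one cycle per instruction + stall bubbles; it is an illustrative model, not a cycle-accurate simulator:

```python
def pipeline_cycles(n_instructions, n_stalls, stages=5):
    """Cycle count for an ideal in-order pipeline with inserted stall bubbles."""
    return (stages - 1) + n_instructions + n_stalls

# The LOAD/ADD pair above: 2 instructions, 1 load-use stall
print(pipeline_cycles(2, 1))  # 4 + 2 + 1 = 7 cycles
```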
Superscalar, ILP & Out-of-Order Execution
Superscalar processors issue multiple instructions per cycle using multiple execution units. Instruction-level parallelism (ILP) is exploited via out-of-order execution, where instructions are executed as operands become available and later committed in-order to preserve semantics.
Register renaming eliminates false dependencies (WAR, WAW). A reorder buffer (ROB) and reservation stations track instruction status until results are committed in program order.
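The core idea of renaming fits in a few lines. The sketch below is illustrative only (real renamers use a free list of physical registers plus the ROB): every architectural-register write gets a fresh physical register, so WAR and WAW hazards simply disappear.

```python
def rename(instructions):
    """instructions: list of (dest, src1, src2) architectural register names."""
    rat = {}        # register alias table: architectural reg -> physical reg
    next_phys = 0
    renamed = []
    for dest, src1, src2 in instructions:
        p1 = rat.get(src1, src1)      # read the current mapping (or the arch name)
        p2 = rat.get(src2, src2)
        rat[dest] = f"p{next_phys}"   # fresh physical register for every write
        next_phys += 1
        renamed.append((rat[dest], p1, p2))
    return renamed

# The WAW hazard on R1 disappears: its two writes target different physical regs
prog = [("R1", "R2", "R3"), ("R4", "R1", "R5"), ("R1", "R6", "R7")]
print(rename(prog))  # [('p0', 'R2', 'R3'), ('p1', 'p0', 'R5'), ('p2', 'R6', 'R7')]
```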
Memory Hierarchy & Caches — The Real Bottleneck
Memory latency is many CPU cycles; caches bridge CPU and DRAM. Typical hierarchy: registers → L1 (I/D) → L2 → L3 (shared) → DRAM → SSD/HDD. Cache hit rates dominate performance.
Cache Metrics
- Hit rate, miss rate, miss penalty, and average memory access time (AMAT).
- AMAT = HitTime + MissRate * MissPenalty.
Cache Organization
- Direct-mapped, set-associative, fully-associative.
- Block/line size affects spatial locality.
- Write-through vs write-back policies.
// Example: HitTime = 1 cycle, MissRate = 2%, MissPenalty = 100 cycles
AMAT = 1 + 0.02 * 100 = 3 cycles
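The AMAT formula is worth having as a one-liner when comparing cache configurations; this minimal helper (our own naming) checks the example above:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: HitTime + MissRate * MissPenalty."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.02, 100))  # 3.0 cycles, as in the example above
```

For multi-level caches, the same formula nests: L1's miss penalty is itself the AMAT of the L2.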
Branch Prediction & Speculation
Branches disrupt pipelines; predicting their outcome reduces stalls. Predictors range from static predict-not-taken and 1-bit/2-bit saturating counters to two-level adaptive predictors and the more complex neural predictors in high-end CPUs. The misprediction penalty includes flushing all speculative work.
Branch Predictor Example
// 2-bit saturating counter state machine
00 (strong not-taken) -> on taken -> 01 (weak not-taken) -> 10 (weak taken) -> 11 (strong taken)
// Requires two mispredictions to flip from strongly taken to strongly not-taken
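The state machine above can be simulated in a few lines. This sketch encodes the four states as 0..3 (our own encoding; predict taken when the counter is 2 or 3) and counts prediction accuracy over an outcome stream:

```python
def simulate_2bit(outcomes, state=0):
    """2-bit saturating counter: states 0..3, predict taken when state >= 2.
    Increments toward 3 on a taken branch, decrements toward 0 on not-taken."""
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct, len(outcomes)

# A branch taken 9 of every 10 times, predictor starting strongly not-taken
pattern = ([True] * 9 + [False]) * 2
print(simulate_2bit(pattern))  # two warm-up misses, then only the not-taken iterations miss
```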
Parallelism: Multicore, SIMD & GPUs
Parallelism occurs at many levels: instruction-level (ILP), data-level (SIMD), thread-level (multicore), and task-level (distributed systems). GPUs provide massive SIMD parallelism for data-parallel workloads.
SIMD
Single Instruction Multiple Data (SSE, AVX) processes multiple data elements per instruction. Use it for vectorizable loops: align data and avoid branching inside vector bodies.
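Python has no SIMD intrinsics, but the execution model is easy to mimic. The sketch below is purely conceptual: it processes arrays in fixed-width chunks the way a compiler might map a vectorizable loop onto 4-wide lanes (roughly one SSE register of floats); the lane width and function name are our own illustration.

```python
LANES = 4  # assumed vector width, e.g. four 32-bit floats per SSE register

def vector_add(a, b):
    """Process two arrays in fixed-width chunks, mimicking SIMD lane execution."""
    out = []
    for i in range(0, len(a), LANES):
        # One "vector instruction": LANES independent adds, no cross-lane branches
        out.extend(x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES]))
    return out

print(vector_add([1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40, 50, 60, 70, 80]))
```

The key property that makes the loop vectorizable is visible here: each lane's result depends only on its own inputs, with no branches or cross-element dependencies inside a chunk.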
NUMA
Non-uniform memory access means memory latency depends on which socket holds the memory. Optimize thread placement and memory allocation to prefer local nodes.
CPU-Memory-Storage Interaction
Storage performance (SSD/HDD) impacts IO-bound workloads. Use NVMe for low-latency storage. OS and architecture interact: DMA, interrupts, and memory-mapped IO affect CPU utilization and latency.
DMA & Device Offload
Direct Memory Access moves data between devices and memory without CPU intervention. Offload operations to NIC (RDMA) or GPUs to reduce CPU cycles for data movement.
Benchmarks & Profiling
Common benchmarks: SPEC CPU for single-threaded CPU performance, STREAM for memory bandwidth, LINPACK for floating point, and microbenchmarks for cache and branch behavior. Profilers like perf sample hotspots and call stacks to guide optimization.
Microarchitecture Design Trade-offs
Design involves balancing frequency vs IPC, power vs performance, and complexity vs predictability. Increasing pipeline depth may increase frequency but exacerbate branch penalties; larger caches reduce miss rate but increase latency and power.
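The frequency-vs-penalty tension can be made quantitative with a back-of-envelope model. All numbers below are assumed for illustration (branch frequency, misprediction rate, cycle times); the model is effective CPI = 1 + BranchFreq * MispredRate * FlushDepth for a notional 1-IPC core:

```python
def effective_time_per_instr(cycle_time_ns, branch_freq, mispred_rate, flush_depth):
    """Time per instruction for a notional 1-IPC core plus branch-flush cycles."""
    cpi = 1 + branch_freq * mispred_rate * flush_depth
    return cpi * cycle_time_ns

# Assumed workload: 20% branches, 5% mispredicted
shallow = effective_time_per_instr(1.0, 0.2, 0.05, 5)    # 5-stage pipeline, 1.0 ns clock
deep    = effective_time_per_instr(0.6, 0.2, 0.05, 15)   # 15-stage pipeline, faster clock
print(shallow, deep)
```

With these particular numbers the deeper pipeline wins because the clock gain outpaces the extra flush cost; raise the misprediction rate and the conclusion flips, which is exactly the trade-off the text describes.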
Placement Problems — Practice & Reveal
Solve these architecture problems; reveal solutions when ready. These mirror questions asked in interviews at systems-focused teams.
Q1 — AMAT & Cache Improvements (Easy)
A program has memory accesses with HitTime=1ns, MissRate=5%, MissPenalty=100ns. You can improve cache to reduce MissRate to 2% but increase HitTime to 1.2ns. Compute AMAT before and after. Is it worth it?
Q2 — Pipeline Hazards (Medium)
Consider a 5-stage pipeline with forwarding except load-use hazards which cause one-cycle stall. Given instruction sequence: LOAD R1, 0(R2); ADD R3, R1, R4; SUB R5, R6, R7. How many cycles for these three instructions?
Q3 — Amdahl for Parallel Speedup (Medium)
A task has 30% inherently sequential work and 70% perfectly parallelizable. On 8 cores, what's the speedup? What's theoretical max as cores -> infinity?
Q4 — Cache Miss Types (Classic)
Define compulsory, capacity, and conflict misses. Give an example access pattern that causes conflict misses in a direct-mapped cache but not in a 4-way associative cache.
Q5 — Branch Prediction (Advanced)
A loop branch is taken 9 out of 10 times. A 2-bit predictor starts strongly not-taken (00). After the first 20 iterations, what is the steady-state prediction accuracy, and why?
Cheat Sheet & Quick Formulas
# AMAT:              AMAT = HitTime + MissRate * MissPenalty
# IPC:               IPC = Instructions / Cycles
# CPI:               CPI = 1 / IPC (approx. for single-issue)
# Effective CPI:     CPI_eff = sum_i (CPI_i * InstrFrac_i)
# Frequency speedup: Speedup = F2 / F1 (if IPC is constant)
# Amdahl:            S = 1 / ((1 - f) + f / Sf)
Summary & Next Steps
Computer architecture knowledge lets you reason quantitatively about performance. Practice solving pipeline, cache, and parallelism problems, profile real code, and connect software patterns to hardware behaviors. Next: Probability & Analytics page will provide formula-tweak problems and revealable solutions for placement prep.