Introduction — Why Computer Architecture Matters
Computer architecture bridges hardware and software: it determines how instruction streams map to physical execution, how memory access latency affects performance, and how parallelism is exposed to programmers. Understanding architecture lets you write faster code, optimize hot paths, and design scalable systems.
This guide covers core concepts interviewers expect: pipelining, caches, branch prediction, parallelism, and performance reasoning — with problems and revealable solutions.
Instruction Set Architectures (ISAs) — RISC vs CISC
The ISA is the programmer-visible contract: registers, instructions, and addressing modes. Two historical camps dominate:
- CISC (Complex Instruction Set Computer): many complex instructions (x86). Dense instruction encodings; microcode often translates complex ops to micro-ops.
- RISC (Reduced Instruction Set Computer): small, fixed-length instructions (ARM, RISC-V). Simpler decode, easier pipelining and parallelism.
Modern x86 internally translates into micro-ops; RISC-V is gaining adoption for its simplicity and extensibility.
Performance Metrics & Amdahl's Law
Key metrics: latency (time to complete a task), throughput (work per unit time), CPI (cycles per instruction), IPC (instructions per cycle), and clock frequency. Use Amdahl's law to reason about speedup:
Speedup = 1 / ((1 - f) + f / S) # f = fraction of execution time improved, S = speedup of that part
Example: if 40% of runtime can be sped up by 2x, overall speedup = 1 / (0.6 + 0.4/2) = 1 / (0.6 + 0.2) = 1.25x.
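Amdahl's law is easy to encode as a quick sanity check. The sketch below is a minimal plain-Python helper (the function name `amdahl_speedup` is our own) that reproduces the worked example:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of runtime is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# 40% of runtime sped up by 2x -> ~1.25x overall, matching the example above
print(amdahl_speedup(0.4, 2))
```

Note how the sequential part (1 - f) bounds the result: even as s grows without limit, the speedup can never exceed 1 / (1 - f).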
Pipelining & Hazards
Pipelining splits instruction execution into stages (fetch, decode, execute, memory, writeback) to increase throughput. Ideal pipeline throughput approaches one instruction per cycle, but hazards limit performance.
Hazards
- Structural hazards: resource conflicts (e.g., single memory port).
- Data hazards: RAW, WAR, WAW dependencies. True (RAW) dependencies are mitigated with forwarding or stalls; WAR and WAW are false dependencies, removed by register renaming in out-of-order cores.
- Control hazards: branches that change PC — cause pipeline flushes.
Example: Stall Calculation
// Load-use hazard: the load takes an extra cycle before its value is available
I1: LOAD R1, [R2]
I2: ADD R3, R1, R4   // depends on R1
// A 1-cycle load latency forces the pipeline to insert a bubble between I1 and I2
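For cycle-count questions like this, a simple closed-form model is enough. The sketch below assumes a classic in-order pipeline with full forwarding, where total cycles = (stages - 1) fill cycles + one cycle per instruction + stall bubbles; it is an illustrative model, not a cycle-accurate simulator:

```python
def pipeline_cycles(n_instructions, n_stalls, stages=5):
    """Cycle count for an ideal in-order pipeline with inserted stall bubbles."""
    return (stages - 1) + n_instructions + n_stalls

# The LOAD/ADD pair above: 2 instructions, 1 load-use stall
print(pipeline_cycles(2, 1))  # 4 + 2 + 1 = 7 cycles
```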
Superscalar, ILP & Out-of-Order Execution
Superscalar processors issue multiple instructions per cycle using multiple execution units. Instruction-level parallelism (ILP) is exploited via out-of-order execution, where instructions are executed as operands become available and later committed in-order to preserve semantics.
Register renaming eliminates false dependencies (WAR, WAW). A reorder buffer (ROB) and reservation stations track instruction status until results are committed in program order.
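The core idea of renaming fits in a few lines. The sketch below is illustrative only (real renamers use a free list of physical registers plus the ROB): every architectural-register write gets a fresh physical register, so WAR and WAW hazards simply disappear.

```python
def rename(instructions):
    """instructions: list of (dest, src1, src2) architectural register names."""
    rat = {}        # register alias table: architectural reg -> physical reg
    next_phys = 0
    renamed = []
    for dest, src1, src2 in instructions:
        p1 = rat.get(src1, src1)      # read the current mapping (or the arch name)
        p2 = rat.get(src2, src2)
        rat[dest] = f"p{next_phys}"   # fresh physical register for every write
        next_phys += 1
        renamed.append((rat[dest], p1, p2))
    return renamed

# The WAW hazard on R1 disappears: its two writes target different physical regs
prog = [("R1", "R2", "R3"), ("R4", "R1", "R5"), ("R1", "R6", "R7")]
print(rename(prog))  # [('p0', 'R2', 'R3'), ('p1', 'p0', 'R5'), ('p2', 'R6', 'R7')]
```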
Memory Hierarchy & Caches — The Real Bottleneck
Memory latency is many CPU cycles; caches bridge CPU and DRAM. Typical hierarchy: registers → L1 (I/D) → L2 → L3 (shared) → DRAM → SSD/HDD. Cache hit rates dominate performance.
Cache Metrics
- Hit rate, miss rate, miss penalty, and average memory access time (AMAT).
- AMAT = HitTime + MissRate * MissPenalty.
Cache Organization
- Direct-mapped, set-associative, fully-associative.
- Block/line size affects spatial locality.
- Write-through vs write-back policies.
// Example: HitTime = 1 cycle, MissRate = 2%, MissPenalty = 100 cycles
AMAT = 1 + 0.02 * 100 = 3 cycles
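The AMAT formula is worth having as a one-liner when comparing cache configurations; this minimal helper (our own naming) checks the example above:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: HitTime + MissRate * MissPenalty."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.02, 100))  # 3.0 cycles, as in the example above
```

For multi-level caches, the same formula nests: L1's miss penalty is itself the AMAT of the L2.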
Branch Prediction & Speculation
Branches disrupt pipelines; predicting their outcome reduces stalls. Predictors range from static predict-not-taken and 1-bit/2-bit saturating counters to two-level adaptive predictors and the more complex neural predictors in high-end CPUs. The misprediction penalty includes flushing all speculative work.
Branch Predictor Example
// 2-bit saturating counter state machine
00 (strong not-taken) -> on taken -> 01 (weak not-taken) -> 10 (weak taken) -> 11 (strong taken)
// Requires two mispredictions to flip from strongly taken to strongly not-taken
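The state machine above can be simulated in a few lines. This sketch encodes the four states as 0..3 (our own encoding; predict taken when the counter is 2 or 3) and counts prediction accuracy over an outcome stream:

```python
def simulate_2bit(outcomes, state=0):
    """2-bit saturating counter: states 0..3, predict taken when state >= 2.
    Increments toward 3 on a taken branch, decrements toward 0 on not-taken."""
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct, len(outcomes)

# A branch taken 9 of every 10 times, predictor starting strongly not-taken
pattern = ([True] * 9 + [False]) * 2
print(simulate_2bit(pattern))  # two warm-up misses, then only the not-taken iterations miss
```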
Parallelism: Multicore, SIMD & GPUs
Parallelism occurs at many levels: instruction-level (ILP), data-level (SIMD), thread-level (multicore), and task-level (distributed systems). GPUs provide massive SIMD parallelism for data-parallel workloads.
SIMD
Single Instruction Multiple Data (SSE, AVX) processes multiple data elements per instruction. Use it for vectorizable loops: align data and avoid branching inside vector bodies.
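Python has no SIMD intrinsics, but the execution model is easy to mimic. The sketch below is purely conceptual: it processes arrays in fixed-width chunks the way a compiler might map a vectorizable loop onto 4-wide lanes (roughly one SSE register of floats); the lane width and function name are our own illustration.

```python
LANES = 4  # assumed vector width, e.g. four 32-bit floats per SSE register

def vector_add(a, b):
    """Process two arrays in fixed-width chunks, mimicking SIMD lane execution."""
    out = []
    for i in range(0, len(a), LANES):
        # One "vector instruction": LANES independent adds, no cross-lane branches
        out.extend(x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES]))
    return out

print(vector_add([1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40, 50, 60, 70, 80]))
```

The key property that makes the loop vectorizable is visible here: each lane's result depends only on its own inputs, with no branches or cross-element dependencies inside a chunk.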
NUMA
Non-uniform memory access means memory latency depends on which socket holds the memory. Optimize thread placement and memory allocation to prefer local nodes.
CPU-Memory-Storage Interaction
Storage performance (SSD/HDD) impacts IO-bound workloads. Use NVMe for low-latency storage. OS and architecture interact: DMA, interrupts, and memory-mapped IO affect CPU utilization and latency.
DMA & Device Offload
Direct Memory Access moves data between devices and memory without CPU intervention. Offload operations to NIC (RDMA) or GPUs to reduce CPU cycles for data movement.
Benchmarks & Profiling
Common benchmarks: SPEC CPU for single-threaded CPU performance, STREAM for memory bandwidth, LINPACK for floating point, and microbenchmarks for cache and branch behavior. Profilers like perf sample hotspots and call stacks to guide optimization.
Microarchitecture Design Trade-offs
Design involves balancing frequency vs IPC, power vs performance, and complexity vs predictability. Increasing pipeline depth may increase frequency but exacerbate branch penalties; larger caches reduce miss rate but increase latency and power.
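The frequency-vs-penalty tension can be made quantitative with a back-of-envelope model. All numbers below are assumed for illustration (branch frequency, misprediction rate, cycle times); the model is effective CPI = 1 + BranchFreq * MispredRate * FlushDepth for a notional 1-IPC core:

```python
def effective_time_per_instr(cycle_time_ns, branch_freq, mispred_rate, flush_depth):
    """Time per instruction for a notional 1-IPC core plus branch-flush cycles."""
    cpi = 1 + branch_freq * mispred_rate * flush_depth
    return cpi * cycle_time_ns

# Assumed workload: 20% branches, 5% mispredicted
shallow = effective_time_per_instr(1.0, 0.2, 0.05, 5)    # 5-stage pipeline, 1.0 ns clock
deep    = effective_time_per_instr(0.6, 0.2, 0.05, 15)   # 15-stage pipeline, faster clock
print(shallow, deep)
```

With these particular numbers the deeper pipeline wins because the clock gain outpaces the extra flush cost; raise the misprediction rate and the conclusion flips, which is exactly the trade-off the text describes.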
Placement Problems — Practice & Reveal
Solve these architecture problems; reveal solutions when ready. These mirror questions asked in interviews at systems-focused teams.
Q1 — AMAT & Cache Improvements (Easy)
A program has memory accesses with HitTime=1ns, MissRate=5%, MissPenalty=100ns. You can improve cache to reduce MissRate to 2% but increase HitTime to 1.2ns. Compute AMAT before and after. Is it worth it?
Q2 — Pipeline Hazards (Medium)
Consider a 5-stage pipeline with forwarding except load-use hazards which cause one-cycle stall. Given instruction sequence: LOAD R1, 0(R2); ADD R3, R1, R4; SUB R5, R6, R7. How many cycles for these three instructions?
Q3 — Amdahl for Parallel Speedup (Medium)
A task has 30% inherently sequential work and 70% perfectly parallelizable. On 8 cores, what's the speedup? What's theoretical max as cores -> infinity?
Q4 — Cache Miss Types (Classic)
Define compulsory, capacity, and conflict misses. Give an example access pattern that causes conflict misses in a direct-mapped cache but not in a 4-way associative cache.
Q5 — Branch Prediction (Advanced)
A loop branch is taken 9 out of 10 times. A 2-bit predictor starts strongly not-taken (00). After the first 20 iterations, what is the steady-state prediction accuracy, and why?
Cheat Sheet & Quick Formulas
# AMAT:              AMAT = HitTime + MissRate * MissPenalty
# IPC:               IPC = Instructions / Cycles
# CPI:               CPI = 1 / IPC (approx. for single-issue)
# Effective CPI:     CPI_eff = sum_i (CPI_i * InstrFrac_i)
# Frequency speedup: Speedup = F2 / F1 (if IPC is constant)
# Amdahl:            S = 1 / ((1 - f) + f / Sf)
Summary & Next Steps
Computer architecture knowledge lets you reason quantitatively about performance. Practice solving pipeline, cache, and parallelism problems, profile real code, and connect software patterns to hardware behaviors. Next: Probability & Analytics page will provide formula-tweak problems and revealable solutions for placement prep.