1. From C Code to Hardware: The Big Picture

Before designing any hardware, we need to understand what a processor actually does. When you write and run a C program, a multi-stage pipeline silently transforms your human-readable code into binary instructions that a processor can execute, one by one.

Key insight: A processor doesn't understand C, Python, or any high-level language. It only understands binary machine code — sequences of 0s and 1s that encode specific operations. Everything else is an abstraction built on top of this foundation.
[Figure 1 diagram: hello.c (source file) → Preprocessing (expands #include and #define macros; file.i) → Compilation (C code to assembly, with optimisation and type checking; file.s) → Assembly (converts to binary opcodes; file.o) → Linking (combines objects, resolves symbols; a.out / .exe) → OS / RAM (binary loaded into memory, PC → main()) → Processor: Fetch (read instruction) → Decode (interpret opcode) → Execute (ALU operation) → Mem Access (load / store) → Write Back (result → register)]
Figure 1 — The full journey: from C source to processor execution. In a single-cycle processor, the entire five-stage pipeline completes in one clock cycle.

The critical handoff happens when the OS loads the binary into RAM and sets the Program Counter (PC) to the start of main(). From that moment, the processor takes over — fetching, decoding, and executing instructions in a tight loop until the program terminates.

2. The RV32I Instruction Formats

Every instruction processed by our RV32I core is exactly 32 bits long. To handle the wide variety of operations cleanly, RISC-V organises instructions into six fundamental formats. Think of it like a standardised form — the fields are always in the same positions, making hardware decoding simple and fast.

Classroom analogy: To identify a specific instruction, follow three steps — (1) the opcode (bits [6:0]) identifies the broad class, (2) funct7 (bits [31:25]) narrows it to a sub-group, and (3) funct3 (bits [14:12]) pinpoints the exact operation. Like a school where the opcode is the grade, funct7 is the section, and funct3 is the student's roll number.
Type   31–25          24–20   19–15   14–12    11–7          6–0
R      funct7         rs2     rs1     funct3   rd            opcode
I      imm[11:0] (bits 31–20) rs1     funct3   rd            opcode
S      imm[11:5]      rs2     rs1     funct3   imm[4:0]      opcode
B      imm[12,10:5]   rs2     rs1     funct3   imm[4:1,11]   opcode
U      imm[31:12] (bits 31–12)                 rd            opcode
J      imm[20,10:1,11,19:12] (bits 31–12)      rd            opcode

The three register fields appear in fixed bit positions in every format that uses them: rs1 is always at bits [19:15], rs2 at [24:20], and rd at [11:7]. This means the register file can begin reading operands before decoding is complete, which is a key reason RISC-V decodes so efficiently in hardware.

The immediate field (imm) is a constant value embedded in the instruction. For types that use it (I, S, B, U, J), the immediate is scattered across different bit ranges and must be reconstructed and sign-extended to 32 bits by a dedicated Sign Extender block.
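Because the field positions are fixed, slicing an instruction needs nothing more than wire selections. The following is a minimal sketch of that idea; the module name instrFields and its port names are illustrative assumptions, not taken from the project's source.

```verilog
// Minimal sketch: split a 32-bit RV32I instruction into its fixed fields.
// Module and port names are hypothetical.
module instrFields(
  input  [31:0] instr,
  output [6:0]  opcode,
  output [4:0]  rd, rs1, rs2,
  output [2:0]  funct3,
  output [6:0]  funct7
);
  assign opcode = instr[6:0];     // broad instruction class
  assign rd     = instr[11:7];    // destination register
  assign funct3 = instr[14:12];   // operation selector
  assign rs1    = instr[19:15];   // first source register
  assign rs2    = instr[24:20];   // second source register
  assign funct7 = instr[31:25];   // sub-group selector (R-type)
endmodule
```

For instance, add x3, x1, x2 encodes as 0x002081B3, which this sketch would split into opcode = 0x33, rd = 3, funct3 = 0, rs1 = 1, rs2 = 2 and funct7 = 0.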

Format   Opcode (hex)       Typical use                            Example instructions
R-type   0x33               Register ↔ register arithmetic/logic   add, sub, and, or, xor, sll, srl, sra, slt
I-type   0x13, 0x03, 0x67   Immediate arithmetic, loads, jalr      addi (0x13); lw, lh, lb (0x03); jalr (0x67)
S-type   0x23               Memory store                           sw, sh, sb
B-type   0x63               Conditional branch                     beq, bne, blt, bge, bltu, bgeu
U-type   0x37, 0x17         Large immediates                       lui (0x37), auipc (0x17)
J-type   0x6F               Unconditional jump                     jal

3. High-Level Architecture: Datapath & Control Unit

A single-cycle RISC-V processor is built from two cooperating subsystems. The Datapath is the "muscle" — it moves, computes, and stores data. The Control Unit is the "brain" — it reads each instruction and generates the signals that tell the datapath what to do.

[Figure 2 diagram: the Control Unit (Main CU + ALU Control) receives opcode[6:0] from the instruction and drives RegWrite, ALUSrc, MemtoReg, Branch and ALUOp into the datapath, which consists of the Program Counter, Instruction Memory, Register File, Sign Extender, ALU, Data Memory and MUX. Legend: data flow, control signals, instruction bits to Control Unit.]
Figure 2 — High-level view of the processor. The Control Unit reads the opcode and drives control signals (dashed purple) to all datapath components. Data flows left to right (solid blue).

4. The Datapath: Component by Component

4.1 The Program Counter

The Program Counter (PC) is the processor's "bookmark": a 32-bit register that holds the address of the instruction currently being executed. It is implemented as a bank of positive-edge-triggered D flip-flops with an asynchronous active-high reset. On every rising clock edge, it captures the next PC address computed by the datapath.

Normally, PC_next = PC + 4, since instructions are 4 bytes wide (word-aligned). But when a branch or jump is taken, the PC is loaded with a target address instead.

module programCounter(
  input  clk, rst,
  input  [31:0] pc_in,       // Next PC (PC+4, branch target, or jump target)
  output reg [31:0] pc_out   // Current PC
);
  always @(posedge clk or posedge rst) begin
    if (rst) pc_out <= 32'b0;   // Asynchronous reset to address 0
    else     pc_out <= pc_in;
  end
endmodule

4.2 Instruction Memory

The Instruction Memory is a Read-Only Memory (ROM) that stores the program's binary instructions. It is indexed using PC[9:2], the word index within the 1 KB (256-word) address space, because the lower 2 bits of a word-aligned address are always 00. Instructions are read asynchronously (combinationally), so a new PC value produces the corresponding instruction without waiting for a clock edge. The memory contents are loaded at simulation start from a .hex file using Verilog's $readmemh system task.
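A memory with these properties might look like the sketch below. The port names and the file name program.hex are assumptions made for illustration; the project's actual instructionMemory module may differ.

```verilog
// Minimal sketch of a 1 KB word-addressed instruction ROM.
// Port names and the file name "program.hex" are hypothetical.
module instructionMemory(
  input  [31:0] pc,
  output [31:0] instruction
);
  reg [31:0] mem [0:255];                  // 256 words x 4 bytes = 1 KB

  // Load the program once at simulation start.
  // $readmemh expects one hexadecimal 32-bit word per entry.
  initial $readmemh("program.hex", mem);

  // Combinational read: PC[1:0] is always 00, so PC[9:2] is the word index.
  assign instruction = mem[pc[9:2]];
endmodule
```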

4.3 The Register File

The Register File contains 32 general-purpose 32-bit registers, named x0 through x31. It supports two simultaneous asynchronous reads (from rs1 and rs2) and one synchronous write (to rd, gated by the RegWrite control signal). Register x0 is hardwired to zero — any write to it is silently discarded, and any read from it always returns 0.

module registerFile(
  input clk, RegWrite,
  input  [4:0]  rs1, rs2, rd,
  input  [31:0] writeData,
  output [31:0] readData1, readData2
);
  reg [31:0] registers [0:31];

  // Asynchronous read — x0 is hardwired to 0
  assign readData1 = (rs1 != 5'd0) ? registers[rs1] : 32'b0;
  assign readData2 = (rs2 != 5'd0) ? registers[rs2] : 32'b0;

  // Synchronous write — x0 is write-protected
  always @(posedge clk) begin
    if (RegWrite && rd != 5'd0)
      registers[rd] <= writeData;
  end
endmodule

4.4 The Sign Extender

Many instructions embed a small constant (an immediate) directly in the instruction encoding. Because the ALU operates on 32-bit values, this immediate must be sign-extended — its most significant bit is replicated to fill the upper bits. The Sign Extender reads the opcode to determine which format the instruction uses, then reconstructs and sign-extends the appropriate bits.

For example, an I-type instruction stores a 12-bit immediate in instr[31:20]. The Sign Extender extends this to 32 bits by replicating bit 31 (the sign bit) across the upper 20 positions: imm = { {20{instr[31]}}, instr[31:20] }.
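The same approach extends to the other formats. The sketch below is one possible way to rebuild each immediate from the bit layouts in the format table of Section 2; the module name immGen is hypothetical, and the project's Sign Extender may be organised differently.

```verilog
// Sketch of an immediate generator for all RV32I formats.
// The module name is hypothetical; opcodes follow the RV32I encoding.
module immGen(
  input  [31:0] instr,
  output reg [31:0] imm
);
  always @(*) begin
    case (instr[6:0])
      7'b0010011, 7'b0000011, 7'b1100111:   // I-type: arithmetic, loads, jalr
        imm = {{20{instr[31]}}, instr[31:20]};
      7'b0100011:                           // S-type: stores
        imm = {{20{instr[31]}}, instr[31:25], instr[11:7]};
      7'b1100011:                           // B-type: branches (bit 0 is always 0)
        imm = {{19{instr[31]}}, instr[31], instr[7], instr[30:25], instr[11:8], 1'b0};
      7'b0110111, 7'b0010111:               // U-type: lui, auipc (lower 12 bits zero)
        imm = {instr[31:12], 12'b0};
      7'b1101111:                           // J-type: jal (bit 0 is always 0)
        imm = {{11{instr[31]}}, instr[31], instr[19:12], instr[20], instr[30:21], 1'b0};
      default:
        imm = 32'b0;                        // R-type and unknown opcodes: no immediate
    endcase
  end
endmodule
```

Every concatenation adds up to exactly 32 bits, so no implicit width extension is involved.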

4.5 The ALU

The Arithmetic Logic Unit is the computational heart of the processor. It takes two 32-bit operands — always RD1 from the register file on the A port, and either RD2 or the sign-extended immediate (selected by the ALUSrc MUX) on the B port — and performs the operation specified by the 4-bit ALUControl signal. Its two outputs are the 32-bit Result and a 1-bit Zero flag (asserted when Result is zero, used by branch instructions).

ALUControl  Operation
0000        Bitwise AND
0001        Bitwise OR
0010        Addition
0011        Bitwise XOR
0100        Shift Left Logical
0101        Shift Right Logical
0110        Subtraction
0111        Signed Comparison (SLT)
1000        Unsigned Comparison (SLTU)
1101        Shift Right Arithmetic
module alu(
  input  [31:0] A, B,
  input  [3:0]  ALUControl,
  output reg [31:0] Result,
  output Zero
);
  always @(*) begin
    case (ALUControl)
      4'b0000: Result = A & B;
      4'b0001: Result = A | B;
      4'b0010: Result = A + B;
      4'b0011: Result = A ^ B;
      4'b0100: Result = A << B[4:0];
      4'b0101: Result = A >> B[4:0];
      4'b0110: Result = A - B;
      4'b0111: Result = ($signed(A) < $signed(B)) ? 32'd1 : 32'd0;
      4'b1000: Result = (A < B) ? 32'd1 : 32'd0;
      4'b1101: Result = $signed(A) >>> B[4:0];   // shift amount comes from B, not A
      default: Result = 32'h00000000;
    endcase
  end
  assign Zero = (Result == 32'b0);
endmodule

4.6 Data Memory

The Data Memory is a 1 KB RAM used by load (lw, lh, lb) and store (sw, sh, sb) instructions. It accepts the ALU's computed address as its index, uses word alignment (addr[9:2]), reads asynchronously when MemRead is high, and writes synchronously on the rising clock edge when MemWrite is high.
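A simplified sketch of such a memory is shown below. It handles only full-word accesses (lw and sw); byte and halfword accesses would additionally need funct3 and byte-lane selection, which is omitted here. The port names are assumptions.

```verilog
// Simplified sketch of a 1 KB data memory, word accesses only.
// Port names are hypothetical; lb/lh/sb/sh support is deliberately left out.
module dataMemory(
  input         clk, MemRead, MemWrite,
  input  [31:0] addr, writeData,
  output [31:0] readData
);
  reg [31:0] mem [0:255];                  // 256 words x 4 bytes = 1 KB

  // Asynchronous read, driven only while MemRead is high
  assign readData = MemRead ? mem[addr[9:2]] : 32'b0;

  // Synchronous write on the rising clock edge
  always @(posedge clk) begin
    if (MemWrite)
      mem[addr[9:2]] <= writeData;
  end
endmodule
```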

5. The Control Unit

The Control Unit decodes each instruction and generates the binary control signals that orchestrate every component in the datapath. It is split into two hierarchical blocks to keep the logic manageable.

[Figure 3 diagram: the 32-bit instruction feeds opcode [6:0] to the Main Control Unit, and funct3 [14:12] and funct7 [31:25] to the ALU Control. The Main Control Unit outputs RegWrite, ALUSrc, MemtoReg, MemRead / MemWrite, Branch / Jal / Jalr and ALUOp [1:0]. The ALU Control reads ALUOp, funct3 and funct7 and outputs ALUControl [3:0], which selects the operation the ALU executes. ALUOp encoding: 00 = ADD (load/store), 01 = SUB/CMP (branch), 10 = decode via funct3 + funct7 (R/I-type).]
Figure 3 — The two-level Control Unit. The Main CU resolves the broad instruction class from the opcode; the ALU Control then resolves the exact operation from ALUOp + funct3 + funct7.

5.1 Main Control Unit

This block reads only the 7-bit opcode and produces the high-level control signals listed below. It uses a Verilog case statement inside an always @(*) block, with all outputs defaulting to zero for safety.

RegWrite

Enables the register file to write the result into destination register rd at the next clock edge.

ALUSrc

Selects the second ALU operand: 0 → use RD2 from the register file; 1 → use the sign-extended immediate.

MemtoReg

Selects what is written back to rd: 0 → ALU result; 1 → data read from Data Memory (for load instructions).

MemRead

Enables Data Memory to drive its output for load instructions (lw, lh, lb).

MemWrite

Enables Data Memory to write RD2 at the ALU-computed address for store instructions.

Branch

Asserted for B-type instructions. Combined with the ALU Zero flag and funct3 to determine whether to take the branch.

Jal

Asserted for the jal instruction. Causes PC to load PC + imm and writes PC + 4 into rd.

Jalr

Asserted for jalr. Causes PC to load (RD1 + imm) & ~1 (the sum with its least-significant bit cleared, as the RISC-V specification requires), and writes PC + 4 into rd.

ALUOp [1:0]

00 = force ADD (loads/stores), 01 = force SUB/CMP (branches), 10 = decode via funct3/funct7 (R and I-type arithmetic).
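Put together, the signals above could be generated roughly as follows. This is a sketch under assumptions: the module name is hypothetical, and the exact signal values per opcode (in particular for lui and auipc, whose write-back is handled separately) may differ from the project's implementation.

```verilog
// Sketch of a main control unit driven only by the opcode.
// Signal names follow the article; the per-opcode values are an assumption.
module mainControl(
  input  [6:0] opcode,
  output reg RegWrite, ALUSrc, MemtoReg, MemRead, MemWrite,
  output reg Branch, Jal, Jalr,
  output reg [1:0] ALUOp
);
  always @(*) begin
    // Default everything to 0 so an unknown opcode changes no state
    {RegWrite, ALUSrc, MemtoReg, MemRead, MemWrite, Branch, Jal, Jalr} = 8'b0;
    ALUOp = 2'b00;
    case (opcode)
      7'b0110011: begin RegWrite = 1'b1; ALUOp = 2'b10; end                   // R-type
      7'b0010011: begin RegWrite = 1'b1; ALUSrc = 1'b1; ALUOp = 2'b10; end    // I-type arithmetic
      7'b0000011: begin RegWrite = 1'b1; ALUSrc = 1'b1;
                        MemtoReg = 1'b1; MemRead = 1'b1; end                  // loads
      7'b0100011: begin ALUSrc = 1'b1; MemWrite = 1'b1; end                   // stores
      7'b1100011: begin Branch = 1'b1; ALUOp = 2'b01; end                     // branches
      7'b1101111: begin RegWrite = 1'b1; Jal = 1'b1; end                      // jal
      7'b1100111: begin RegWrite = 1'b1; ALUSrc = 1'b1; Jalr = 1'b1; end      // jalr
      7'b0110111, 7'b0010111: RegWrite = 1'b1;                                // lui, auipc
      default: ;                                                              // keep the safe defaults
    endcase
  end
endmodule
```

Assigning the defaults before the case statement also guarantees that no latch is inferred for any output.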

5.2 ALU Control Block

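The ALU Control block refines the coarse ALUOp signal into the precise 4-bit ALUControl code that drives the ALU. When ALUOp = 2'b10 (R or I-type), it reads funct3 and funct7 together. The key distinction: funct7 = 7'b0000000 selects normal operations (add, sll, xor, etc.), while funct7 = 7'b0100000 selects sub or sra. For I-type arithmetic, bits [31:25] are part of the immediate rather than a funct7 field, so they are meaningful only for the shift instructions (slli, srli, srai); an addi must decode as ADD whatever those bits contain.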
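One way to express this decoding is sketched below, using the ALUControl codes from Section 4.5. Two details are assumptions rather than facts about the project: the isRType input, added so that an I-type addi is never mistaken for sub, and the use of funct3 under ALUOp = 01 to pick SLT or SLTU for the ordered branches.

```verilog
// Sketch of the second-level ALU decoder.
// isRType and the branch handling are assumptions, not the project's code.
module aluControl(
  input  [1:0] ALUOp,
  input  [2:0] funct3,
  input  [6:0] funct7,
  input        isRType,            // hypothetical: 1 for opcode 0x33, 0 for 0x13
  output reg [3:0] ALUControl
);
  wire alt = (funct7 == 7'b0100000);   // selects sub / sra

  always @(*) begin
    case (ALUOp)
      2'b00: ALUControl = 4'b0010;                      // loads/stores: ADD
      2'b01: case (funct3[2:1])                         // branches
               2'b10:   ALUControl = 4'b0111;           // blt, bge   -> SLT
               2'b11:   ALUControl = 4'b1000;           // bltu, bgeu -> SLTU
               default: ALUControl = 4'b0110;           // beq, bne   -> SUB
             endcase
      2'b10: case (funct3)                              // R-type and I-type arithmetic
               3'b000:  ALUControl = (isRType && alt) ? 4'b0110 : 4'b0010; // sub, else add/addi
               3'b001:  ALUControl = 4'b0100;           // sll, slli
               3'b010:  ALUControl = 4'b0111;           // slt, slti
               3'b011:  ALUControl = 4'b1000;           // sltu, sltiu
               3'b100:  ALUControl = 4'b0011;           // xor, xori
               3'b101:  ALUControl = alt ? 4'b1101 : 4'b0101; // sra/srai, else srl/srli
               3'b110:  ALUControl = 4'b0001;           // or, ori
               default: ALUControl = 4'b0000;           // 3'b111: and, andi
             endcase
      default: ALUControl = 4'b0010;                    // unused ALUOp: fall back to ADD
    endcase
  end
endmodule
```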

6. Putting It All Together: The Top Module

The Top Module instantiates every sub-module and wires them together to form the complete processor. Its only inputs are clk and rst. Internally, the wiring follows this sequence:

  1. The Program Counter drives the Instruction Memory with the current PC. The PC Adder (a simple combinational adder) computes PC + 4 in parallel.
  2. The 32-bit instruction is sliced into its fields: opcode, funct3, funct7, rs1, rs2, rd — all wired to the Register File, Sign Extender, and both levels of the Control Unit simultaneously.
  3. The Register File asynchronously outputs RD1 and RD2. The Sign Extender produces the sign-extended immediate based on the opcode.
  4. The Main Control Unit and ALU Control decode the instruction and assert the correct signals. The ALU MUX (controlled by ALUSrc) selects between RD2 and the immediate for the ALU's B input.
  5. The ALU computes the result. Its output drives both the Data Memory address and (for non-load instructions) the write-back path.
  6. The Memory MUX (controlled by MemtoReg) selects between the ALU result and the Data Memory read data for write-back to the register file.
  7. Special-case write-back: jal/jalr write PC + 4; lui writes the U-type immediate (imm[31:12] in the upper 20 bits, lower 12 bits zero); auipc writes PC + immediate. A priority MUX at the write-back point handles these cases.
  8. The PC MUX selects the next PC: PC_jalr_target if Jalr, PC_jal_target if Jal, PC + imm if a branch is taken, or PC + 4 otherwise.
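The write-back selection described in steps 6 and 7 can be sketched as a chain of conditional operators. The module name and the isLui and isAuipc flags are hypothetical; the article does not say how the project detects those two instructions, and the priority order here is one plausible choice.

```verilog
// Sketch of a priority write-back MUX.
// isLui / isAuipc are assumed decode flags for opcodes 0x37 and 0x17.
module writeBackMux(
  input         Jal, Jalr, MemtoReg,
  input         isLui, isAuipc,
  input  [31:0] aluResult, memData, pc, pcPlus4, imm,
  output [31:0] writeData
);
  assign writeData = (Jal | Jalr) ? pcPlus4  :   // link address for jal/jalr
                     isLui        ? imm      :   // U-type immediate
                     isAuipc      ? pc + imm :   // PC-relative address
                     MemtoReg     ? memData  :   // load data
                                    aluResult;   // default: ALU result
endmodule
```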

The branch-taken decision combines the Branch signal from the Control Unit with the ALU output and funct3: beq is taken when Zero=1; bne when Zero=0; blt/bltu when the comparison result's LSB is 1; bge/bgeu when it is 0.
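The branch decision and the PC MUX from step 8 might be combined as in the sketch below. It assumes the ALU has already produced a subtraction result (for beq/bne) or a comparison result (for the ordered branches); the module and port names are hypothetical.

```verilog
// Sketch of next-PC selection with the branch-taken decision.
// Names are illustrative; the project may compute the targets elsewhere.
module nextPc(
  input         Branch, Jal, Jalr, Zero,
  input  [2:0]  funct3,
  input  [31:0] pc, imm, rd1, aluResult,
  output [31:0] pcNext
);
  reg taken;
  always @(*) begin
    case (funct3)
      3'b000:         taken = Zero;            // beq
      3'b001:         taken = ~Zero;           // bne
      3'b100, 3'b110: taken = aluResult[0];    // blt, bltu: comparison is true
      3'b101, 3'b111: taken = ~aluResult[0];   // bge, bgeu: comparison is false
      default:        taken = 1'b0;            // reserved funct3 values
    endcase
  end

  // jalr target: register plus immediate with the least-significant bit cleared
  wire [31:0] jalrTarget = (rd1 + imm) & ~32'd1;

  assign pcNext = Jalr              ? jalrTarget :
                  Jal               ? pc + imm   :
                  (Branch && taken) ? pc + imm   :
                                      pc + 32'd4;
endmodule
```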

8. Simulation & Synthesis

The design has been simulated on EDA Playground and synthesised in Vivado. You can run the processor directly in your browser — the testbench, all Verilog modules, and sample .hex programs are pre-loaded:

EDA Playground link: https://edaplayground.com/x/XeZr

To run a program of your own:

  1. Write the program in RISC-V assembly.
  2. Assemble it with the RISC-V toolchain (riscv32-unknown-elf-as) to obtain an object file, then use objcopy to produce a .hex file.
  3. Place the .hex file next to the Verilog sources. The instructionMemory module loads it automatically.
  4. Run the simulation in a tool such as EDA Playground, ModelSim, or Verilator. Inspect the final register values in the testbench output.

The Fibonacci program below is a good end-to-end test because it exercises addi (I-type), add (R-type), and bne (B-type branch), with a final jal (J-type) serving as the halt loop. When correct, register x10 should hold 55 (which is F(10)).

# Fibonacci Program — computes F(10) = 55
# Registers: x1=counter, x2=F(n-2), x3=F(n-1), x4=temp, x5=exit value

addi x1, x0, 11    # x1 = 11  (loop runs until x1 == 2, i.e. 9 iterations after F(0),F(1))
addi x2, x0, 0     # x2 = 0   (F(0))
addi x3, x0, 1     # x3 = 1   (F(1))
addi x5, x0, 2     # x5 = 2   (loop exit condition)

loop:
  add  x4, x2, x3  # x4  = F(n-2) + F(n-1) = next Fibonacci number
  add  x2, x3, x0  # x2  = old x3 (shift window)
  add  x3, x4, x0  # x3  = new Fibonacci number
  addi x1, x1, -1  # x1-- (decrement counter)
  bne  x1, x5, loop # if x1 != 2, continue loop

add  x10, x3, x0   # x10 = F(10)  →  expected: 55

done:
  jal x0, done      # Infinite loop — halts the processor

The testbench dumps all 32 register values at the end of simulation. The expected final state is: x1 = 2, x2 = 34, x3 = 55, x4 = 55, x5 = 2, x10 = 55, and all registers the program never writes still zero.

Where to find the full source: The complete Verilog implementation, including testbench and sample .hex programs (Fibonacci, GCD, Bubble Sort, Sum of N), is available on GitHub at Anish-Rooj-cpu / Single-Cycle-RISCV-Processor. An interactive Digital-software version that lets you visually trace signal values cycle by cycle is also linked from that repository.
