1. From C Code to Hardware: The Big Picture
Before designing any hardware, we need to understand what a processor actually does. When you write and run a C program, a multi-stage pipeline silently transforms your human-readable code into binary instructions that a processor can execute, one by one.
The critical handoff happens when the OS loads the binary into RAM and sets the Program Counter (PC) to the
program's entry point, whose startup code then calls main(). From that moment, the processor takes over: it fetches,
decodes, and executes instructions in a tight loop until the program terminates.
2. The RV32I Instruction Formats
Every instruction processed by our RV32I core is exactly 32 bits long. To handle the wide variety of operations cleanly, RISC-V organises instructions into six fundamental formats. Think of it like a standardised form — the fields are always in the same positions, making hardware decoding simple and fast.
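| Type | 31–25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|---|---|---|---|---|---|---|
| R | funct7 | rs2 | rs1 | funct3 | rd | opcode |
| I | imm[11:5] | imm[4:0] | rs1 | funct3 | rd | opcode |
| S | imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
| B | imm[12,10:5] | rs2 | rs1 | funct3 | imm[4:1,11] | opcode |
| U | imm[31:25] | imm[24:20] | imm[19:15] | imm[14:12] | rd | opcode |
| J | imm[20,10:5] | imm[4:1,11] | imm[19:15] | imm[14:12] | rd | opcode |

Fields wider than one column are shown split at the column boundaries: the I-type imm[11:0] occupies bits 31–20, the U-type imm[31:12] occupies bits 31–12, and the J-type imm[20,10:1,11,19:12] occupies bits 31–12.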
The three register fields appear in fixed bit positions across all formats — rs1 is always at
bits [19:15], rs2 at [24:20], and rd at [11:7]. This means the register file can
begin reading operands before decoding is even complete, which is a key reason RISC-V decodes so efficiently
in hardware.
The immediate field (imm) is a constant value embedded in the instruction. For
types that use it (I, S, B, U, J), the immediate is scattered across different bit ranges and must be
reconstructed and sign-extended to 32 bits by a dedicated Sign Extender block.
| Format | Opcode (hex) | Typical Use | Example Instructions |
|---|---|---|---|
| R-type | 0x33 | Register ↔ register arithmetic/logic | add, sub, and, or, xor, sll, srl, sra, slt |
| I-type | 0x13, 0x03, 0x67 | Immediate arithmetic (0x13), loads (0x03), jalr (0x67) | addi, lw, lh, lb, jalr |
| S-type | 0x23 | Memory store | sw, sh, sb |
| B-type | 0x63 | Conditional branch | beq, bne, blt, bge, bltu, bgeu |
| U-type | 0x37, 0x17 | Large immediates | lui, auipc |
| J-type | 0x6F | Unconditional jump | jal |
3. High-Level Architecture: Datapath & Control Unit
A single-cycle RISC-V processor is built from two cooperating subsystems. The Datapath is the "muscle" — it moves, computes, and stores data. The Control Unit is the "brain" — it reads each instruction and generates the signals that tell the datapath what to do.
4. The Datapath: Component by Component
4.1 The Program Counter
The Program Counter (PC) is the processor's "bookmark" — a 32-bit register that holds the address of the instruction currently being executed. It is implemented as a simple positive-edge-triggered D flip-flop with an asynchronous active-high reset. On every rising clock edge, it captures the next PC address computed by the datapath.
Normally, PC_next = PC + 4, since instructions are 4 bytes wide (word-aligned). But when a
branch or jump is taken, the PC is loaded with a target address instead.
    module programCounter(
        input             clk, rst,
        input      [31:0] pc_in,   // Next PC (PC+4, branch target, or jump target)
        output reg [31:0] pc_out   // Current PC
    );
        always @(posedge clk or posedge rst) begin
            if (rst) pc_out <= 32'b0;  // Asynchronous reset to address 0
            else     pc_out <= pc_in;
        end
    endmodule
4.2 Instruction Memory
The Instruction Memory is a Read-Only Memory (ROM) that stores the program's binary instructions. It is
indexed using PC[9:2], the upper 8 bits of the 10-bit byte address within the 1 KB address space, because the
lower 2 bits of a word-aligned address are always 00. This gives 256 words of 32 bits each. Instructions are read
asynchronously (combinationally), so a new PC value produces the corresponding instruction without waiting for a
clock edge. The memory contents are loaded at simulation start from a .hex file using Verilog's $readmemh
system task.
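The description above can be sketched in a few lines of Verilog. This is a minimal illustration, not necessarily the author's implementation: the port names and the file name `program.hex` are assumptions, and only the module name `instructionMemory` comes from the article.

```verilog
// Sketch of a 1 KB asynchronous-read instruction ROM.
// Assumed names: ports "pc" and "instruction", file "program.hex".
module instructionMemory(
    input  [31:0] pc,           // Current PC (byte address)
    output [31:0] instruction   // Instruction word stored at that address
);
    reg [31:0] mem [0:255];     // 256 words x 4 bytes = 1 KB

    // Load one 32-bit hex word per line at simulation start
    initial $readmemh("program.hex", mem);

    // Combinational read: drop the two byte-offset bits and index by word
    assign instruction = mem[pc[9:2]];
endmodule
```

Because only `pc[9:2]` is used, any PC bits above bit 9 are ignored and addresses wrap around within the 1 KB window.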
4.3 The Register File
The Register File contains 32 general-purpose 32-bit registers, named x0 through
x31. It supports two simultaneous asynchronous reads (from rs1 and rs2)
and one synchronous write (to rd, gated by the RegWrite control signal). Register
x0 is hardwired to zero — any write to it is silently discarded, and any read from it always
returns 0.
    module registerFile(
        input         clk, RegWrite,
        input  [4:0]  rs1, rs2, rd,
        input  [31:0] writeData,
        output [31:0] readData1, readData2
    );
        reg [31:0] registers [0:31];

        // Asynchronous read — x0 is hardwired to 0
        assign readData1 = (rs1 != 5'd0) ? registers[rs1] : 32'b0;
        assign readData2 = (rs2 != 5'd0) ? registers[rs2] : 32'b0;

        // Synchronous write — x0 is write-protected
        always @(posedge clk) begin
            if (RegWrite && rd != 5'd0)
                registers[rd] <= writeData;
        end
    endmodule
4.4 The Sign Extender
Many instructions embed a small constant (an immediate) directly in the instruction encoding. Because the ALU operates on 32-bit values, this immediate must be sign-extended — its most significant bit is replicated to fill the upper bits. The Sign Extender reads the opcode to determine which format the instruction uses, then reconstructs and sign-extends the appropriate bits.
For example, an I-type instruction stores a 12-bit immediate in instr[31:20]. The Sign Extender
extends this to 32 bits by replicating bit 31 (the sign bit) across the upper 20 positions:
imm = { {20{instr[31]}}, instr[31:20] }.
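The same idea extends to the other formats. The sketch below is one possible way to write such a block, using the opcode values and immediate layouts from the tables in Section 2. The module and port names are assumptions, and the U-type case simply places the 20-bit field in the upper bits with zeros below it, which is how the RV32I specification defines it.

```verilog
// Sketch of an immediate generator for all RV32I formats.
// Assumed names: module "signExtender", ports "instr" and "imm".
module signExtender(
    input      [31:0] instr,
    output reg [31:0] imm
);
    wire [6:0] opcode = instr[6:0];

    always @(*) begin
        case (opcode)
            7'h13, 7'h03, 7'h67:   // I-type: arithmetic, loads, jalr
                imm = {{20{instr[31]}}, instr[31:20]};
            7'h23:                 // S-type: stores
                imm = {{20{instr[31]}}, instr[31:25], instr[11:7]};
            7'h63:                 // B-type: imm[12|11|10:5|4:1], bit 0 is always 0
                imm = {{19{instr[31]}}, instr[31], instr[7],
                       instr[30:25], instr[11:8], 1'b0};
            7'h37, 7'h17:          // U-type: upper 20 bits, lower 12 bits zero
                imm = {instr[31:12], 12'b0};
            7'h6F:                 // J-type: imm[20|19:12|11|10:1], bit 0 is always 0
                imm = {{11{instr[31]}}, instr[31], instr[19:12],
                       instr[20], instr[30:21], 1'b0};
            default:               // R-type and unknown opcodes carry no immediate
                imm = 32'b0;
        endcase
    end
endmodule
```

In every sign-extended format the sign bit is `instr[31]`, so the replication logic is shared and only the lower bits need format-specific routing.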
4.5 The ALU
The Arithmetic Logic Unit is the computational heart of the processor. It takes two 32-bit operands — always
RD1 from the register file on the A port, and either RD2 or the sign-extended
immediate (selected by the ALUSrc MUX) on the B port — and performs the operation specified by
the 4-bit ALUControl signal. Its two outputs are the 32-bit Result and a 1-bit
Zero flag (asserted when Result is zero, used by branch instructions).
    module alu(
        input      [31:0] A, B,
        input      [3:0]  ALUControl,
        output reg [31:0] Result,
        output            Zero
    );
        always @(*) begin
            case (ALUControl)
                4'b0000: Result = A & B;                                      // AND
                4'b0001: Result = A | B;                                      // OR
                4'b0010: Result = A + B;                                      // ADD
                4'b0011: Result = A ^ B;                                      // XOR
                4'b0100: Result = A << B[4:0];                                // SLL
                4'b0101: Result = A >> B[4:0];                                // SRL
                4'b0110: Result = A - B;                                      // SUB
                4'b0111: Result = ($signed(A) < $signed(B)) ? 32'd1 : 32'd0;  // SLT (signed)
                4'b1000: Result = (A < B) ? 32'd1 : 32'd0;                    // SLTU (unsigned)
                4'b1101: Result = $signed(A) >>> B[4:0];                      // SRA (shift amount comes from B)
                default: Result = 32'h00000000;
            endcase
        end

        assign Zero = (Result == 32'b0);  // Asserted when Result is zero
    endmodule
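A quick way to gain confidence in a block like this is a small self-checking testbench. The one below is only a sketch: the testbench name `alu_tb` and the helper task `check` are hypothetical, and it assumes the `alu` module above is compiled alongside it. It exercises a handful of operations rather than the full table.

```verilog
`timescale 1ns/1ps
// Minimal self-checking testbench for the alu module (names are illustrative).
module alu_tb;
    reg  [31:0] A, B;
    reg  [3:0]  ALUControl;
    wire [31:0] Result;
    wire        Zero;

    alu dut(.A(A), .B(B), .ALUControl(ALUControl), .Result(Result), .Zero(Zero));

    // Hypothetical helper: apply one operation and compare against the expected value
    task check(input [3:0] ctrl, input [31:0] a, input [31:0] b, input [31:0] expected);
        begin
            ALUControl = ctrl; A = a; B = b;
            #1;  // let the combinational logic settle
            if (Result !== expected)
                $display("FAIL ctrl=%b A=%h B=%h got=%h expected=%h",
                         ctrl, a, b, Result, expected);
        end
    endtask

    initial begin
        check(4'b0010, 32'd7,        32'd5, 32'd12);  // ADD
        check(4'b0110, 32'd5,        32'd5, 32'd0);   // SUB giving zero
        if (Zero !== 1'b1) $display("FAIL Zero flag not set after 5 - 5");
        check(4'b0111, 32'hFFFFFFFF, 32'd1, 32'd1);   // SLT: -1 < 1 when signed
        check(4'b1000, 32'hFFFFFFFF, 32'd1, 32'd0);   // SLTU: 0xFFFFFFFF > 1 when unsigned
        check(4'b0100, 32'd1,        32'd4, 32'd16);  // SLL: 1 << 4
        $display("ALU checks finished");
        $finish;
    end
endmodule
```

The SLT/SLTU pair with the same operands is a useful check because it is the only place where the signed and unsigned interpretations of `A` give different answers.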
4.6 Data Memory
The Data Memory is a 1 KB RAM used by load (lw, lh, lb) and store
(sw, sh, sb) instructions. It accepts the ALU's computed address as its
index, uses word alignment (addr[9:2]), reads asynchronously when MemRead is high,
and writes synchronously on the rising clock edge when MemWrite is high.
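A word-only version of this memory could look like the sketch below. The module and port names are assumptions. Byte and halfword accesses (lb, lh, sb, sh) would additionally need funct3 to select and extend the right bytes, which is deliberately left out here.

```verilog
// Sketch of a 1 KB data memory: asynchronous read, synchronous write.
// Handles full words only; sub-word loads/stores are not modelled.
module dataMemory(
    input         clk,
    input         MemRead, MemWrite,
    input  [31:0] addr,        // Byte address computed by the ALU
    input  [31:0] writeData,   // RD2 from the register file
    output [31:0] readData
);
    reg [31:0] mem [0:255];    // 256 words x 4 bytes = 1 KB

    // Combinational read, driven only when MemRead is high
    assign readData = MemRead ? mem[addr[9:2]] : 32'b0;

    // Write on the rising clock edge when MemWrite is high
    always @(posedge clk) begin
        if (MemWrite)
            mem[addr[9:2]] <= writeData;
    end
endmodule
```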
5. The Control Unit
The Control Unit decodes each instruction and generates the binary control signals that orchestrate every component in the datapath. It is split into two hierarchical blocks to keep the logic manageable.
5.1 Main Control Unit
This block reads only the 7-bit opcode and produces the high-level control signals listed below.
It uses a Verilog case statement inside an always @(*) block, with all outputs
defaulting to zero for safety.
- RegWrite: Enables the register file to write the result into destination register rd at the next clock edge.
- ALUSrc: Selects the second ALU operand: 0 → use RD2 from the register file; 1 → use the sign-extended immediate.
- MemtoReg: Selects what is written back to rd: 0 → ALU result; 1 → data read from Data Memory (for load instructions).
- MemRead: Enables Data Memory to drive its output for load instructions (lw, lh, lb).
- MemWrite: Enables Data Memory to write RD2 at the ALU-computed address for store instructions.
- Branch: Asserted for B-type instructions. Combined with the ALU Zero flag and funct3 to determine whether to take the branch.
- Jal: Asserted for the jal instruction. Causes PC to load PC + imm and writes PC + 4 into rd.
- Jalr: Asserted for jalr. Causes PC to load (RD1 + imm) & ~1 (the sum with its least-significant bit cleared), and writes PC + 4 into rd.
- ALUOp [1:0]: 00 = force ADD (loads/stores), 01 = force SUB/CMP (branches), 10 = decode via funct3/funct7 (R and I-type arithmetic).
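To make the signal list concrete, here is a sketch of how such a decoder might be written, using the opcodes from Section 2 plus 0x67 for jalr. The module name is an assumption. The article does not say how lui and auipc are signalled to the write-back MUX, so this sketch only asserts RegWrite for them and leaves that selection to the top level.

```verilog
// Sketch of an opcode-only main decoder; all outputs default to zero.
// Assumed name: module "mainControl".
module mainControl(
    input      [6:0] opcode,
    output reg       RegWrite, ALUSrc, MemtoReg, MemRead, MemWrite,
    output reg       Branch, Jal, Jalr,
    output reg [1:0] ALUOp
);
    always @(*) begin
        // Safe defaults: nothing is written and nothing changes the PC flow
        {RegWrite, ALUSrc, MemtoReg, MemRead, MemWrite, Branch, Jal, Jalr} = 8'b0;
        ALUOp = 2'b00;

        case (opcode)
            7'h33: begin RegWrite = 1'b1; ALUOp = 2'b10; end                  // R-type
            7'h13: begin RegWrite = 1'b1; ALUSrc = 1'b1; ALUOp = 2'b10; end   // I-type arithmetic
            7'h03: begin RegWrite = 1'b1; ALUSrc = 1'b1;
                         MemtoReg = 1'b1; MemRead = 1'b1; end                 // Loads
            7'h23: begin ALUSrc = 1'b1; MemWrite = 1'b1; end                  // Stores
            7'h63: begin Branch = 1'b1; ALUOp = 2'b01; end                    // Branches
            7'h6F: begin RegWrite = 1'b1; Jal = 1'b1; end                     // jal
            7'h67: begin RegWrite = 1'b1; ALUSrc = 1'b1; Jalr = 1'b1; end     // jalr
            7'h37, 7'h17: begin RegWrite = 1'b1; end                          // lui, auipc
            default: ;                                                        // Unknown opcode: keep defaults
        endcase
    end
endmodule
```

Assigning the defaults before the case statement means every output has a value on every path, which avoids accidentally inferring latches.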
5.2 ALU Control Block
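The ALU Control block refines the coarse ALUOp signal into the precise 4-bit
ALUControl code that drives the ALU. When ALUOp = 2'b10 (R or I-type), it reads
funct3 and funct7 together. The key distinction: funct7 = 7'b0000000
selects normal operations (add, sll, xor, etc.), while funct7 = 7'b0100000 selects
sub or sra. For I-type arithmetic, bits [31:25] hold part of the immediate (except in shift
instructions), so the sub encoding must only be recognised for R-type instructions.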
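One possible decoder for this block is sketched below, using the ALUControl codes from the ALU in Section 4.5. The module name and the `isRType` input are assumptions: the extra input is there so that an addi whose immediate happens to have bit 30 set is not mistaken for sub. The branch mapping (SUB for beq/bne, SLT for blt/bge, SLTU for bltu/bgeu) is one reading of "force SUB/CMP".

```verilog
// Sketch of the second-level ALU decoder.
// Assumed names: module "aluControl", input "isRType" (1 for opcode 0x33).
module aluControl(
    input      [1:0] ALUOp,
    input      [2:0] funct3,
    input      [6:0] funct7,
    input            isRType,
    output reg [3:0] ALUControl
);
    wire alt = (funct7 == 7'b0100000);  // Alternate encoding: sub / sra / srai

    always @(*) begin
        case (ALUOp)
            2'b00: ALUControl = 4'b0010;                      // Loads/stores: ADD
            2'b01:                                            // Branches
                case (funct3)
                    3'b100, 3'b101: ALUControl = 4'b0111;     // blt, bge: SLT
                    3'b110, 3'b111: ALUControl = 4'b1000;     // bltu, bgeu: SLTU
                    default:        ALUControl = 4'b0110;     // beq, bne: SUB
                endcase
            2'b10:                                            // R-type and I-type arithmetic
                case (funct3)
                    3'b000: ALUControl = (isRType && alt) ? 4'b0110 : 4'b0010; // sub, else add/addi
                    3'b001: ALUControl = 4'b0100;             // sll, slli
                    3'b010: ALUControl = 4'b0111;             // slt, slti
                    3'b011: ALUControl = 4'b1000;             // sltu, sltiu
                    3'b100: ALUControl = 4'b0011;             // xor, xori
                    3'b101: ALUControl = alt ? 4'b1101 : 4'b0101; // sra/srai, else srl/srli
                    3'b110: ALUControl = 4'b0001;             // or, ori
                    3'b111: ALUControl = 4'b0000;             // and, andi
                    default: ALUControl = 4'b0010;
                endcase
            default: ALUControl = 4'b0010;                    // Unused ALUOp value: ADD
        endcase
    end
endmodule
```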
6. Putting It All Together: The Top Module
The Top Module instantiates every sub-module and wires them together to form the complete processor. Its only
inputs are clk and rst. Internally, the wiring follows this sequence:
- The Program Counter drives the Instruction Memory with the current PC. The PC Adder (a simple combinational adder) computes PC + 4 in parallel.
- The 32-bit instruction is sliced into its fields: opcode, funct3, funct7, rs1, rs2, rd. These are wired to the Register File, Sign Extender, and both levels of the Control Unit simultaneously.
- The Register File asynchronously outputs RD1 and RD2. The Sign Extender produces the sign-extended immediate based on the opcode.
- The Main Control Unit and ALU Control decode the instruction and assert the correct signals. The ALU MUX (controlled by ALUSrc) selects between RD2 and the immediate for the ALU's B input.
- The ALU computes the result. Its output drives both the Data Memory address and (for non-load instructions) the write-back path.
- The Memory MUX (controlled by MemtoReg) selects between the ALU result and the Data Memory read data for write-back to the register file.
- Special-case write-back: jal/jalr write PC + 4; lui writes the sign-extended immediate; auipc writes PC + immediate. A priority MUX at the write-back point handles these cases.
- The PC MUX selects the next PC: PC_jalr_target if Jalr, PC_jal_target if Jal, PC + imm if a branch is taken, or PC + 4 otherwise.
The branch-taken decision combines the Branch signal from the Control Unit with the ALU output
and funct3: beq is taken when Zero=1; bne when
Zero=0; blt/bltu when the comparison result's LSB is 1;
bge/bgeu when it is 0.
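The branch decision and the PC MUX described above can be sketched together as a small combinational module. The module and signal names here are assumptions, and the sketch presumes the ALU performs SUB for beq/bne and a set-less-than comparison for the other branches, as the rule above implies.

```verilog
// Sketch of next-PC selection: branch-taken logic plus the priority PC MUX.
// Assumed names: module "nextPC" and all port names.
module nextPC(
    input  [31:0] pc, imm, rd1, aluResult,
    input         Zero, Branch, Jal, Jalr,
    input  [2:0]  funct3,
    output [31:0] pc_next
);
    // Per-branch condition derived from the ALU outputs
    reg cond;
    always @(*) begin
        case (funct3)
            3'b000:         cond = Zero;            // beq: operands equal
            3'b001:         cond = ~Zero;           // bne: operands differ
            3'b100, 3'b110: cond = aluResult[0];    // blt, bltu: comparison true
            3'b101, 3'b111: cond = ~aluResult[0];   // bge, bgeu: comparison false
            default:        cond = 1'b0;            // Undefined funct3: never taken
        endcase
    end
    wire branchTaken = Branch & cond;

    wire [31:0] pc_plus4    = pc + 32'd4;
    wire [31:0] pc_target   = pc + imm;               // Shared by branches and jal
    wire [31:0] jalr_target = (rd1 + imm) & ~32'd1;   // Clear the least-significant bit

    // Priority: jalr, then jal, then a taken branch, otherwise sequential
    assign pc_next = Jalr        ? jalr_target :
                     Jal         ? pc_target   :
                     branchTaken ? pc_target   :
                                   pc_plus4;
endmodule
```

Since Jal, Jalr, and Branch come from mutually exclusive opcodes, the priority order never matters for a valid instruction; it only fixes the behaviour if the decoder ever asserts more than one.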
8. Simulation & Synthesis
The design has been simulated on EDA Playground and synthesised in Vivado.
You can run the processor directly in your browser — the testbench, all Verilog modules, and sample
.hex programs are pre-loaded:
EDA Playground link: https://edaplayground.com/x/XeZr
To run a program of your own:
- Write the program in RISC-V assembly.
- Assemble it with the RISC-V toolchain (riscv32-unknown-elf-as) to obtain a binary, then use objcopy to produce a .hex file.
- Place the .hex file next to the Verilog sources. The instructionMemory module loads it automatically.
- Run the simulation in a tool such as EDA Playground, ModelSim, or Verilator. Inspect the final register values in the testbench output.
The Fibonacci program below is an excellent end-to-end test because it exercises addi (I-type),
add (R-type), and bne (B-type branch) — covering the three most important
instruction classes. When correct, register x10 should hold 55 (which is F(10)).
    # Fibonacci Program — computes F(10) = 55
    # Registers: x1=counter, x2=F(n-2), x3=F(n-1), x4=temp, x5=exit value
        addi x1, x0, 11      # x1 = 11 (loop runs until x1 == 2, i.e. 9 iterations after F(0),F(1))
        addi x2, x0, 0       # x2 = 0 (F(0))
        addi x3, x0, 1       # x3 = 1 (F(1))
        addi x5, x0, 2       # x5 = 2 (loop exit condition)
    loop:
        add  x4, x2, x3      # x4 = F(n-2) + F(n-1) = next Fibonacci number
        add  x2, x3, x0      # x2 = old x3 (shift window)
        add  x3, x4, x0      # x3 = new Fibonacci number
        addi x1, x1, -1      # x1-- (decrement counter)
        bne  x1, x5, loop    # if x1 != 2, continue loop
        add  x10, x3, x0     # x10 = F(10) → expected: 55
    done:
        jal  x0, done        # Infinite loop — halts the processor
The testbench dumps all 32 register values at the end of simulation. The expected final state
is: x1 = 2, x2 = 34, x3 = 55, x4 = 55, x5 = 2, and x10 = 55. The program never writes any other register, and x0 always reads as 0.
The full project, including sample .hex programs (Fibonacci, GCD, Bubble Sort, Sum of N), is available on GitHub at Anish-Rooj-cpu / Single-Cycle-RISCV-Processor. An interactive Digital-software version
that lets you visually trace signal values cycle by cycle is also linked from that repository.