Multiplication and Division Unit (M-Extension) Documentation
Overview
This hardware module implements the RISC-V M-Extension, providing support for standard 32-bit integer multiplication, division, and remainder operations.
The core architecture shares a single 32x32-bit unsigned multiplier core to handle all variations of signed and unsigned multiplication (mul, mulh, mulhsu, mulhu). Signed operations are achieved via a lightweight, low-overhead post-correction pipeline rather than relying on multiple hardware multiplier arrays or pre-inversion logic.
Signal Interfaces
Input Ports
dataA(32-bit): Multiplicand / Dividend input bus.dataB(32-bit): Multiplier / Divisor input bus.sel(3-bit): Control bus mapped directly to the instruction'sfunct3fields.
Output Ports
out(32-bit): The final calculated result, routed to the CPU register file.
Control Mapping (sel / funct3)
The module uses the 3-bit sel bus to decode the exact operation and coordinate both the post-processing stages and the final output selection mux.
sel (funct3) |
Instruction | Operation Type | Output Source |
|---|---|---|---|
3'b000 |
mul |
Signed/Unsigned Lower Half | Raw Multiplier [31:0] |
3'b001 |
mulh |
Signed High Half | Stage 2 Corrected [63:32] |
3'b010 |
mulhsu |
Signed/Unsigned High Half | Stage 1 Corrected [63:32] |
3'b011 |
mulhu |
Unsigned High Half | Raw Multiplier [63:32] |
3'b100 |
div |
Signed Division | Division Unit quotient |
3'b101 |
divu |
Unsigned Division | Division Unit quotient |
3'b110 |
rem |
Signed Remainder | Division Unit remainder |
3'b111 |
remu |
Unsigned Remainder | Division Unit remainder |
Multiplication Pipeline & Post-Correction Logic
When processing signed inputs through an unsigned multiplier, the hardware structurally introduces an error term proportional to the inputs due to two's complement representation (\((2^{32} + A) \times B\)). The high-half output bits (mul_high_out) are corrected dynamically across two sequential post-processing stages.
1. Stage 1 Correction (Handling signA)
This stage neutralizes the error introduced when dataA is a negative number. It utilizes a dedicated 32-bit adder to conditionally perform a two's complement subtraction of dataB from the raw upper 32 bits of the multiplier.
- Sign Extraction & Extension:
signAExtis generated by replicating the most significant bit ofdataA(dataA[31]) across a 32-bit bus. - Conditional Logic: The second input to the adder is masked using bitwise operations:
signAExt AND (dataB XOR signAExt). - If
dataAis positive (signAExtis all0s), the second input evaluates to zero, passing the value through with no operation. - If
dataAis negative (signAExtis all1s), the second input evaluates toNOT(dataB). Combined with the Adder's carry-in set todataA[31](which is1), this performs a perfect two's complement subtraction (\(+ \text{ NOT}(dataB) + 1 \equiv -dataB\)).
2. Stage 2 Correction (Handling signB)
This stage mirrors Stage 1 to neutralize the error introduced if dataB is a negative number. It takes the output of Stage 1 and applies the same conditional subtraction logic using dataA as the operand masked by signBExt.
- If
dataBis positive, it passes the Stage 1 result unmodified (ideal formulhsu). - If
dataBis negative, it subtractsdataAfrom the Stage 1 result (completing the calculation formulh).
Output Interconnection & Multiplexing
The module collects all intermediate results and routes them to a final 3-way select multiplexer structure to drive the single 32-bit out bus:
- Lower Multiplication Mux Path: Extracts the raw lower 32 bits
[31:0]directly from the unsigned multiplier core for the standardmulinstruction. - High Multiplication Mux Path: Selects between the processing tiers based on the execution target:
mulhuselects the raw, unprocessed high 32 bits[63:32]directly from the multiplier output.mulhsuselects the output from Stage 1 Correction.mulhselects the output from Stage 2 Correction.- Division/Remainder Path: Selects the output from the independent execution lanes of the Division Unit (
quotientvs.remainderfor signed or unsigned variants).
This shared topology minimizes multiplexing depth and limits propagation delays along the critical path to the register file writeback stage.
Mathematical Proof of Post-Correction Logic
This section provides the algebraic proof and a step-by-step numerical breakdown using generalized boundaries to demonstrate why conditional subtraction across two stages perfectly recovers the signed high-half product from an unsigned hardware multiplier core.
1. General Algebraic Proof
Let \(n = 32\) represent the bit-width of our data ports. In our hardware system, any negative value is represented via two's complement, which adds a base weight of \(2^n\) to wrap around the boundary. We can write the raw unsigned port representations for any input \(X\) as: $\(\text{Port}(X) = 2^n + X \quad (\text{if } X \text{ is negative})\)$ $\(\text{Port}(X) = 0 + X \quad (\text{if } X \text{ is positive})\)$
When the unsigned multiplier processes dataA and dataB, assuming worst-case scenarios where both inputs happen to be negative, it computes the product of these two hardware representations:
$\(\text{Output}_{\text{raw}} = (2^n + A) \times (2^n + B)\)$
Expanding this equation via the distributive property (FOIL) yields: $\(\text{Output}_{\text{raw}} = (2^n \times 2^n) + (2^n \times B) + (2^n \times A) + (A \times B)\)$ $\(\text{Output}_{\text{raw}} = 2^{2n} + (2^n \times B) + (2^n \times A) + (A \times B)\)$
Since our multiplier creates a \(2n\)-bit total output register, any value scaled by \(2^{2n}\) (the \(2n\)-th bit position or higher) naturally overflows outside the physical bus boundaries and is dropped. This eliminates the leading \(2^{2n}\) term entirely, leaving the structural output as: $\(\text{Output}_{\text{raw}} = (2^n \times B) + (2^n \times A) + (A \times B)\)$
Because the remaining error terms are scaled by \(2^n\), they are shifted precisely into the upper half (mul_high_out). To isolate the true signed product \((A \times B)\), we must subtract these errors based on the input sign bits:
- If \(A\) is negative, it brings a \(2^n \times B\) error that must be stripped out by subtracting \(B\) from the high half.
- If \(B\) is negative, it brings a \(2^n \times A\) error that must be stripped out by subtracting \(A\) from the high half.
2. Step-by-Step Verification Example
To verify this math, we can scale down the boundaries to an \(n = 4\) bit system (\(2^4 = 16\)), using the most extreme negative boundary for \(A\) and maximum positive value for \(B\) to simulate a mulhsu path.
1. The Input Setup
-
Hardware View of Port A: The negative number wraps around the 4-bit boundary (\(16\)). $\(\text{Port A Value} = 16 + A = 16 - 8 = \mathbf{8}\)$ (In 4-bit binary, 8 is
1000, which is exactly how -8 is stored). -
Hardware View of Port B: Passes straight through as its unsigned value. $\(\text{Port B Value} = B = \mathbf{15}\)$ (In 4-bit binary, 15 is
1111).
2. The Unsigned Multiplication (FOIL Distribution)
The 4-bit unsigned hardware multiplier takes the raw port values (\(8\) and \(15\)) and multiplies them together: $\(\text{Multiplier Output} = (16 + A) \times B\)$ $\(\text{Multiplier Output} = (16 \times B) + (A \times B)\)$
- The Error Term \((16 \times B)\): This shifts into the upper 4 bits (Overflow Pin). Since \(B = 15\), the error added to the overflow scale is exactly \(15 \times 16\) (or \(15 \ll 4\)).
- The True Product \((A \times B)\): This is our target calculation: \(-8 \times 15 = \mathbf{-120}\). This true product spills past a single 4-bit container.
3. Plugging in the Numbers
Let's look at the raw values generated by this equation: $\(\text{Multiplier Output} = (16 \times 15) + (-8 \times 15)\)$ $\(\text{Multiplier Output} = 240 + (-120)\)$ $\(\text{Multiplier Output} = \mathbf{120}\)$
4. Slicing Across the 4-Bit Output Pins
Because each unit on the Overflow Pin is worth \(16\) in a 4-bit system, we find how the value 120 splits across the physical pins by dividing it by 16: $\(120 \div 16 = 7 \text{ with a remainder of } 8\)$ $\(120 = (\mathbf{7} \times 16) + \mathbf{8}\)$
This maps directly onto the physical pins:
- Raw Overflow Pin (Upper 4 bits): 7 (Binary:
0111) - Raw Primary Pin (Lower 4 bits): 8 (Binary:
1000)
5. Applying the Correction Formula
We want the final 8-bit output to represent our true product, \(-120\).
- Check the Primary Pin: In 8-bit signed two's complement, \(-120\) is
1000_1000. The lower 4 bits are1000(8). The Primary Pin is already completely correct. - Check the Overflow Pin: the upper 4 bits of the true product \(-120\) (
1000_1000) should be \(-8\) (1000). - What we have on the Overflow Pin: 7 (
0111) - What we want on the Overflow Pin: -8 (
1000)
Let's pass the raw overflow pin through our subtractor and apply the rule (Subtract Input B, which is 15): $\(\text{Corrected Overflow Pin} = \text{Raw Overflow Pin} - B\)$ $\(\text{Corrected Overflow Pin} = 7 - 15 = \mathbf{-8}\)$
In a 4-bit hardware container, performing the unsigned subtraction \(7 - 15\) underflows past zero and lands perfectly on 8 (1000), which represents signed \(-8\).
The combined 8-bit output bus reads as 1000 on the high side and 1000 on the low side (1000_1000), which is the exact, flawless two's complement representation of \(-120\).
Architectural Comparison: Why Pre/Post-Inversion Was Rejected
An alternative approach to handling signed math in an unsigned multiplier is Pre/Post-Inversion. This method converts any negative input into a positive magnitude before multiplication, runs them through the core, and conditionally negates the product. This approach was rejected for our implementation due to structural inefficiencies:
- The 64-bit Carry Chain Constraint: When negating a two's complement number (
NOT(X) + 1), the+1carry bit enters at Bit 0. Because a carry can propagate all the way from the lowest bit to the highest bit, the upper and lower 32-bit halves cannot be inverted independently. Slicing them into two separate 32-bit operations breaks the carry chain. Therefore, Pre/Post-Inversion forces the use of a wide, high-latency 64-bit conditional inverter/adder tree on the output bus. - Critical Path Impact on
mul: The standardmulinstruction only requires the lower 32 bits of the product, which are identical for both signed and unsigned operations. Our selected Post-Correction architecture leaves the lower 32 bits completely untouched, meaningmulexperiences zero gate delay from correction logic. Pre-inversion would forcemulto wait for the inputs to clear the pre-processing negation adders, creating unnecessary latency for the most common multiplication operations.
1. Multiplier Unit Test Matrix
| Test Case | Operation | Instruction | Input dataA (Binary) |
Input dataB (Binary) |
Control sel (funct3) |
Expected Output out (Binary) |
Decimal Interpretation |
|---|---|---|---|---|---|---|---|
| 1 | Lower Half | mul |
1101 (\(-3\)) |
0101 (\(5\)) |
3'b000 |
0001 |
Lower 4-bits of \(-15\) |
| 2 | Signed \(\times\) Unsigned High | mulhsu |
1000 (\(-8\)) |
1111 (\(15\)) |
3'b010 |
1000 |
\(-8\) (High half of \(-120\)) |
| 3 | Signed \(\times\) Signed High | mulh |
1101 (\(-3\)) |
1011 (\(-5\)) |
3'b001 |
0000 |
\(0\) (High half of \(+15\)) |
| 4 | Unsigned \(\times\) Unsigned High | mulhu |
1111 (\(15\)) |
1111 (\(15\)) |
3'b011 |
1110 |
\(14\) (High half of \(225\)) |
2. Divider Unit Test Matrix
| Test Case | Operation | Instruction | Input dataA (Binary) |
Input dataB (Binary) |
Control sel (funct3) |
Expected Output out (Binary) |
Decimal Interpretation |
|---|---|---|---|---|---|---|---|
| 1 | Signed Div | div |
1001 (\(-7\)) |
0011 (\(3\)) |
3'b100 |
1110 |
\(-2\) |
| 2 | Signed Rem | rem |
1001 (\(-7\)) |
0011 (\(3\)) |
3'b110 |
1111 |
\(-1\) |
| 3 | Signed Div | div |
1000 (\(-8\)) |
1101 (\(-3\)) |
3'b100 |
0010 |
\(2\) |
| 4 | Signed Rem | rem |
1000 (\(-8\)) |
1101 (\(-3\)) |
3'b110 |
1110 |
\(-2\) |
| 5 | Unsigned Div | divu |
1001 (\(9\)) |
0011 (\(3\)) |
3'b101 |
0011 |
\(3\) |
| 6 | Unsigned Rem | remu |
1001 (\(9\)) |
0011 (\(3\)) |
3'b111 |
0000 |
\(0\) |