Multiplication and Division Unit (M-Extension) Documentation

Overview

This hardware module implements the RISC-V M-Extension, providing support for standard 32-bit integer multiplication, division, and remainder operations.

The core architecture shares a single 32x32-bit unsigned multiplier core to handle all variations of signed and unsigned multiplication (mul, mulh, mulhsu, mulhu). Signed operations are achieved via a lightweight, low-overhead post-correction pipeline rather than relying on multiple hardware multiplier arrays or pre-inversion logic.

Signal Interfaces

Input Ports

dataA (32-bit): Multiplicand / Dividend input bus.
dataB (32-bit): Multiplier / Divisor input bus.
sel (3-bit): Control bus mapped directly to the instruction's funct3 fields.

Output Ports

out (32-bit): The final calculated result, routed to the CPU register file.

Control Mapping (`sel` / `funct3`)

The module uses the 3-bit sel bus to decode the exact operation and coordinate both the post-processing stages and the final output selection mux.

`sel` (`funct3`)	Instruction	Operation Type	Output Source
`3'b000`	`mul`	Signed/Unsigned Lower Half	Raw Multiplier `[31:0]`
`3'b001`	`mulh`	Signed High Half	Stage 2 Corrected `[63:32]`
`3'b010`	`mulhsu`	Signed/Unsigned High Half	Stage 1 Corrected `[63:32]`
`3'b011`	`mulhu`	Unsigned High Half	Raw Multiplier `[63:32]`
`3'b100`	`div`	Signed Division	Division Unit `quotient`
`3'b101`	`divu`	Unsigned Division	Division Unit `quotient`
`3'b110`	`rem`	Signed Remainder	Division Unit `remainder`
`3'b111`	`remu`	Unsigned Remainder	Division Unit `remainder`

Multiplication Pipeline & Post-Correction Logic

When processing signed inputs through an unsigned multiplier, the hardware structurally introduces an error term proportional to the inputs due to two's complement representation ($(2^{32} + A) \times B$). The high-half output bits (mul_high_out) are corrected dynamically across two sequential post-processing stages.

1. Stage 1 Correction (Handling `signA`)

This stage neutralizes the error introduced when dataA is a negative number. It utilizes a dedicated 32-bit adder to conditionally perform a two's complement subtraction of dataB from the raw upper 32 bits of the multiplier.

Sign Extraction & Extension: signAExt is generated by replicating the most significant bit of dataA (dataA[31]) across a 32-bit bus.
Conditional Logic: The second input to the adder is masked using bitwise operations: signAExt AND (dataB XOR signAExt).
If dataA is positive (signAExt is all 0s), the second input evaluates to zero, passing the value through with no operation.
If dataA is negative (signAExt is all 1s), the second input evaluates to NOT(dataB). Combined with the Adder's carry-in set to dataA[31] (which is 1), this performs a perfect two's complement subtraction ($+ \text{ NOT}(dataB) + 1 \equiv -dataB$).

\[\text{Stage 1 Output} = \text{mul\_high\_out} + (\text{signAExt} \cdot (\text{dataB} \oplus \text{signAExt})) + \text{dataA}[31]\]

2. Stage 2 Correction (Handling `signB`)

This stage mirrors Stage 1 to neutralize the error introduced if dataB is a negative number. It takes the output of Stage 1 and applies the same conditional subtraction logic using dataA as the operand masked by signBExt.

If dataB is positive, it passes the Stage 1 result unmodified (ideal for mulhsu).
If dataB is negative, it subtracts dataA from the Stage 1 result (completing the calculation for mulh).

Output Interconnection & Multiplexing

The module collects all intermediate results and routes them to a final 3-way select multiplexer structure to drive the single 32-bit out bus:

Lower Multiplication Mux Path: Extracts the raw lower 32 bits [31:0] directly from the unsigned multiplier core for the standard mul instruction.
High Multiplication Mux Path: Selects between the processing tiers based on the execution target:
mulhu selects the raw, unprocessed high 32 bits [63:32] directly from the multiplier output.
mulhsu selects the output from Stage 1 Correction.
mulh selects the output from Stage 2 Correction.
Division/Remainder Path: Selects the output from the independent execution lanes of the Division Unit (quotient vs. remainder for signed or unsigned variants).

This shared topology minimizes multiplexing depth and limits propagation delays along the critical path to the register file writeback stage.

Mathematical Proof of Post-Correction Logic

This section provides the algebraic proof and a step-by-step numerical breakdown using generalized boundaries to demonstrate why conditional subtraction across two stages perfectly recovers the signed high-half product from an unsigned hardware multiplier core.

1. General Algebraic Proof

Let $n = 32$ represent the bit-width of our data ports. In our hardware system, any negative value is represented via two's complement, which adds a base weight of $2^n$ to wrap around the boundary. We can write the raw unsigned port representations for any input $X$ as: $$\text{Port}(X) = 2^n + X \quad (\text{if } X \text{ is negative})$$ $$\text{Port}(X) = 0 + X \quad (\text{if } X \text{ is positive})$$

When the unsigned multiplier processes dataA and dataB, assuming worst-case scenarios where both inputs happen to be negative, it computes the product of these two hardware representations: $$\text{Output}_{\text{raw}} = (2^n + A) \times (2^n + B)$$

Expanding this equation via the distributive property (FOIL) yields: $$\text{Output}_{\text{raw}} = (2^n \times 2^n) + (2^n \times B) + (2^n \times A) + (A \times B)$$ $$\text{Output}_{\text{raw}} = 2^{2n} + (2^n \times B) + (2^n \times A) + (A \times B)$$

Since our multiplier creates a $2n$-bit total output register, any value scaled by $2^{2n}$ (the $2n$-th bit position or higher) naturally overflows outside the physical bus boundaries and is dropped. This eliminates the leading $2^{2n}$ term entirely, leaving the structural output as: $$\text{Output}_{\text{raw}} = (2^n \times B) + (2^n \times A) + (A \times B)$$

Because the remaining error terms are scaled by $2^n$, they are shifted precisely into the upper half (mul_high_out). To isolate the true signed product $(A \times B)$, we must subtract these errors based on the input sign bits:

If $A$ is negative, it brings a $2^n \times B$ error that must be stripped out by subtracting $B$ from the high half.
If $B$ is negative, it brings a $2^n \times A$ error that must be stripped out by subtracting $A$ from the high half.

2. Step-by-Step Verification Example

To verify this math, we can scale down the boundaries to an $n = 4$ bit system ($2^4 = 16$), using the most extreme negative boundary for $A$ and maximum positive value for $B$ to simulate a mulhsu path.

1. The Input Setup

\[\text{Let } A = -8 \quad (\text{Signed Input})$$ $$\text{Let } B = 15 \quad (\text{Unsigned Input})\]

Hardware View of Port A: The negative number wraps around the 4-bit boundary ($16$). $$\text{Port A Value} = 16 + A = 16 - 8 = \mathbf{8}$$ (In 4-bit binary, 8 is 1000, which is exactly how -8 is stored).
Hardware View of Port B: Passes straight through as its unsigned value. $$\text{Port B Value} = B = \mathbf{15}$$ (In 4-bit binary, 15 is 1111).

2. The Unsigned Multiplication (FOIL Distribution)

The 4-bit unsigned hardware multiplier takes the raw port values ($8$ and $15$) and multiplies them together: $$\text{Multiplier Output} = (16 + A) \times B$$ $$\text{Multiplier Output} = (16 \times B) + (A \times B)$$

The Error Term $(16 \times B)$: This shifts into the upper 4 bits (Overflow Pin). Since $B = 15$, the error added to the overflow scale is exactly $15 \times 16$ (or $15 \ll 4$).
The True Product $(A \times B)$: This is our target calculation: $-8 \times 15 = \mathbf{-120}$. This true product spills past a single 4-bit container.

3. Plugging in the Numbers

Let's look at the raw values generated by this equation: $$\text{Multiplier Output} = (16 \times 15) + (-8 \times 15)$$ $$\text{Multiplier Output} = 240 + (-120)$$ $$\text{Multiplier Output} = \mathbf{120}$$

4. Slicing Across the 4-Bit Output Pins

Because each unit on the Overflow Pin is worth $16$ in a 4-bit system, we find how the value 120 splits across the physical pins by dividing it by 16: $$120 \div 16 = 7 \text{ with a remainder of } 8$$ $$120 = (\mathbf{7} \times 16) + \mathbf{8}$$

This maps directly onto the physical pins:

Raw Overflow Pin (Upper 4 bits): 7 (Binary: 0111)
Raw Primary Pin (Lower 4 bits): 8 (Binary: 1000)

5. Applying the Correction Formula

We want the final 8-bit output to represent our true product, $-120$.

Check the Primary Pin: In 8-bit signed two's complement, $-120$ is 1000_1000. The lower 4 bits are 1000 (8). The Primary Pin is already completely correct.
Check the Overflow Pin: the upper 4 bits of the true product $-120$ (1000_1000) should be $-8$ (1000).
What we have on the Overflow Pin: 7 (0111)
What we want on the Overflow Pin: -8 (1000)

Let's pass the raw overflow pin through our subtractor and apply the rule (Subtract Input B, which is 15): $$\text{Corrected Overflow Pin} = \text{Raw Overflow Pin} - B$$ $$\text{Corrected Overflow Pin} = 7 - 15 = \mathbf{-8}$$

In a 4-bit hardware container, performing the unsigned subtraction $7 - 15$ underflows past zero and lands perfectly on 8 (1000), which represents signed $-8$.

The combined 8-bit output bus reads as 1000 on the high side and 1000 on the low side (1000_1000), which is the exact, flawless two's complement representation of $-120$.

Architectural Comparison: Why Pre/Post-Inversion Was Rejected

An alternative approach to handling signed math in an unsigned multiplier is Pre/Post-Inversion. This method converts any negative input into a positive magnitude before multiplication, runs them through the core, and conditionally negates the product. This approach was rejected for our implementation due to structural inefficiencies:

The 64-bit Carry Chain Constraint: When negating a two's complement number (NOT(X) + 1), the +1 carry bit enters at Bit 0. Because a carry can propagate all the way from the lowest bit to the highest bit, the upper and lower 32-bit halves cannot be inverted independently. Slicing them into two separate 32-bit operations breaks the carry chain. Therefore, Pre/Post-Inversion forces the use of a wide, high-latency 64-bit conditional inverter/adder tree on the output bus.
Critical Path Impact on mul: The standard mul instruction only requires the lower 32 bits of the product, which are identical for both signed and unsigned operations. Our selected Post-Correction architecture leaves the lower 32 bits completely untouched, meaning mul experiences zero gate delay from correction logic. Pre-inversion would force mul to wait for the inputs to clear the pre-processing negation adders, creating unnecessary latency for the most common multiplication operations.

1. Multiplier Unit Test Matrix

Test Case	Operation	Instruction	Input `dataA` (Binary)	Input `dataB` (Binary)	Control `sel` (`funct3`)	Expected Output `out` (Binary)	Decimal Interpretation
1	Lower Half	`mul`	`1101` ($-3$)	`0101` ($5$)	`3'b000`	`0001`	Lower 4-bits of $-15$
2	Signed $\times$ Unsigned High	`mulhsu`	`1000` ($-8$)	`1111` ($15$)	`3'b010`	`1000`	$-8$ (High half of $-120$)
3	Signed $\times$ Signed High	`mulh`	`1101` ($-3$)	`1011` ($-5$)	`3'b001`	`0000`	$0$ (High half of $+15$)
4	Unsigned $\times$ Unsigned High	`mulhu`	`1111` ($15$)	`1111` ($15$)	`3'b011`	`1110`	$14$ (High half of $225$)

2. Divider Unit Test Matrix

Test Case	Operation	Instruction	Input `dataA` (Binary)	Input `dataB` (Binary)	Control `sel` (`funct3`)	Expected Output `out` (Binary)	Decimal Interpretation
1	Signed Div	`div`	`1001` ($-7$)	`0011` ($3$)	`3'b100`	`1110`	$-2$
2	Signed Rem	`rem`	`1001` ($-7$)	`0011` ($3$)	`3'b110`	`1111`	$-1$
3	Signed Div	`div`	`1000` ($-8$)	`1101` ($-3$)	`3'b100`	`0010`	$2$
4	Signed Rem	`rem`	`1000` ($-8$)	`1101` ($-3$)	`3'b110`	`1110`	$-2$
5	Unsigned Div	`divu`	`1001` ($9$)	`0011` ($3$)	`3'b101`	`0011`	$3$
6	Unsigned Rem	`remu`	`1001` ($9$)	`0011` ($3$)	`3'b111`	`0000`	$0$

Test Case	Operation	Instruction	Input `dataA` (Binary)	Input `dataB` (Binary)	Control `sel` (`funct3`)	Expected Output `out` (Binary)	Decimal Interpretation
1	Lower Half	`mul`	`1101` (\(-3\))	`0101` (\(5\))	`3'b000`	`0001`	Lower 4-bits of \(-15\)
2	Signed \(\times\) Unsigned High	`mulhsu`	`1000` (\(-8\))	`1111` (\(15\))	`3'b010`	`1000`	\(-8\) (High half of \(-120\))
3	Signed \(\times\) Signed High	`mulh`	`1101` (\(-3\))	`1011` (\(-5\))	`3'b001`	`0000`	\(0\) (High half of \(+15\))
4	Unsigned \(\times\) Unsigned High	`mulhu`	`1111` (\(15\))	`1111` (\(15\))	`3'b011`	`1110`	\(14\) (High half of \(225\))

Test Case	Operation	Instruction	Input `dataA` (Binary)	Input `dataB` (Binary)	Control `sel` (`funct3`)	Expected Output `out` (Binary)	Decimal Interpretation
1	Signed Div	`div`	`1001` (\(-7\))	`0011` (\(3\))	`3'b100`	`1110`	\(-2\)
2	Signed Rem	`rem`	`1001` (\(-7\))	`0011` (\(3\))	`3'b110`	`1111`	\(-1\)
3	Signed Div	`div`	`1000` (\(-8\))	`1101` (\(-3\))	`3'b100`	`0010`	\(2\)
4	Signed Rem	`rem`	`1000` (\(-8\))	`1101` (\(-3\))	`3'b110`	`1110`	\(-2\)
5	Unsigned Div	`divu`	`1001` (\(9\))	`0011` (\(3\))	`3'b101`	`0011`	\(3\)
6	Unsigned Rem	`remu`	`1001` (\(9\))	`0011` (\(3\))	`3'b111`	`0000`	\(0\)