This exact circuit will be automatically inferred by XST for ternary adders in RTL. Below is an example that calculates sum=x+y+z:
(* shreg_extract = "no" *)  // do not let XST infer shift-register (SRL) primitives
module adder (
    input            clk,
    input      [7:0] x,
    input      [7:0] y,
    input      [7:0] z,
    output reg [7:0] sum
);
    reg [7:0] sum_r, x_r, y_r, z_r;
    always @(posedge clk) begin
        // Stage 1: register the inputs
        x_r   <= x;
        y_r   <= y;
        z_r   <= z;
        // Stage 2: ternary add of the registered operands
        sum_r <= x_r + y_r + z_r;
        // Stage 3: extra output register
        sum   <= sum_r;
    end
endmodule
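To check the module in simulation, a minimal testbench sketch follows. The testbench name `adder_tb`, the clock period, and the stimulus values are assumptions made for illustration and are not part of the original post. It waits three clock edges, one per register stage, before comparing the output.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench for the adder module above (not from the original post).
module adder_tb;
    reg        clk = 1'b0;
    reg  [7:0] x, y, z;
    wire [7:0] sum;

    // Device under test
    adder dut (.clk(clk), .x(x), .y(y), .z(z), .sum(sum));

    // 10 ns clock period (arbitrary choice)
    always #5 clk = ~clk;

    initial begin
        x = 8'd10; y = 8'd20; z = 8'd30;
        // Three rising edges: input registers, sum_r, then sum
        repeat (3) @(posedge clk);
        #1;  // let the non-blocking updates settle before sampling
        if (sum !== 8'd60) $display("FAIL: sum = %0d", sum);
        else               $display("PASS: sum = %0d", sum);
        $finish;
    end
endmodule
```

Any Verilog-2001 simulator should be able to run this together with the adder source.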
The snapshot below shows the technology view of the synthesized netlist targeting a Spartan-6 device:
The MAP utilization shows exactly 8 LUTs used for the logic:
Slice Logic Utilization:
  Number of Slice Registers:                    40 out of  11,440    1%
    Number used as Flip Flops:                  40
    Number used as Latches:                      0
    Number used as Latch-thrus:                  0
    Number used as AND/OR logics:                0
  Number of Slice LUTs:                         12 out of   5,720    1%
    Number used as logic:                        8 out of   5,720    1%
      Number using O6 output only:               2
      Number using O5 output only:               0
      Number using O5 and O6:                    6
      Number used as ROM:                        0
    Number used as Memory:                       0 out of   1,440    0%
    Number used exclusively as route-thrus:      4
      Number with same-slice register load:      4
      Number with same-slice carry load:         0
      Number with other load:                    0
The two snapshots below show the first two bits in a slice and all four bits packed in the same slice:
Nice! But do you really need the added 8 bits of pipeline at the output, given that sum_r is already registered in DFFs?
> sum <= sum_r;
I know registers are "cheap", but why have a latency of 3 cycles when 2 may be OK?
The purpose of this blog post is to show how LUT6_2s are used for ternary adders. I surely hope nobody is going to instantiate the entire module as-is in their design. I threw in an extra pipeline stage for "sum" so that the sum_r registers show up nicely packed in the same slice as the LUTs in FPGA_EDITOR. Otherwise, the tool would pack the sum registers into IOBs.
Thank you for posting this, and in such detail! I was never able to get xst+map to pack all four bits into a single slice; I guess if you give it the space to "sprawl out" it will...
Another question: the carry chain mux input can come from either {A,B,C,D}X or O5 of the *same* LUT that drives the mux select. For a ternary adder the mux input is calculated by the *previous* LUT. Why didn't Xilinx use the previous-LUT O5 instead of the same-LUT O5?
Having to route the previous-LUT O5 out of the slice, through the switchbox, and back in through {A,B,C,D}X is incredibly slow... something like 500ps if I remember my last trce run correctly. It's a very significant part of the critical path for a ternary-adder-cycle-time dominated design.
Yes, I noticed that xst+map don't always pack all bits into a single slice even though they could. If it's really critical to have them packed together, you can use the LUTNM or HLUTNM constraint to force map to pack the two LUTs for the same bit into one LUT6_2.
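As an illustration of the LUTNM approach, here is a hedged sketch that instantiates the two LUTs of one adder bit and tags them with the same group name. It assumes the Xilinx unisim primitives LUT3 and LUT4 are available. The module name `bit_pair`, the instance names, the port names, and the group name "add_bit0" are hypothetical, and the split into a majority LUT and an XOR LUT is one plausible per-bit decomposition rather than the netlist XST actually produced.

```verilog
// Sketch only: ask MAP to place the two LUTs of one adder bit in a single LUT6_2.
// Requires the Xilinx unisim library for the LUT3 and LUT4 primitives.
module bit_pair (          // hypothetical module name
    input  x, y, z,        // one bit of each operand
    input  c_prev,         // carry-save carry from the previous bit
    output c_save,         // majority(x, y, z)
    output prop            // x ^ y ^ z ^ c_prev
);
    // INIT 8'hE8 is the 3-input majority function
    (* LUTNM = "add_bit0" *)
    LUT3 #(.INIT(8'hE8)) lut_maj (.O(c_save), .I0(x), .I1(y), .I2(z));

    // INIT 16'h6996 is the 4-input XOR function
    (* LUTNM = "add_bit0" *)
    LUT4 #(.INIT(16'h6996)) lut_xor (.O(prop), .I0(x), .I1(y), .I2(z), .I3(c_prev));
endmodule
```

The two LUTs use four distinct inputs between them, which is within the five-input limit for using both O5 and O6. Each bit would need its own group name.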
Hopefully I understand your question. To use both the O6 and O5 outputs of a LUT6, there can be at most 5 independent inputs, so a LUT6_2 can't be used to calculate the output for the current bit AND the carry for the previous bit. In addition, the previous LUT's O5 can't go directly to the carry chain mux of the current bit.
*In addition, the previous LUT-O5 can't go directly to the mux for the carry chain of the current bit.*
Yes, correct. I was suggesting that it might have been better if the Spartan-6 slice had been designed so that this was possible.
I see. This is a great suggestion. I will send it in. No promises, but at least it is something to be considered for future generations of FPGAs.
I tried to synthesize your reference design in ISE 13.2 Design Suite on a Virtex-6 target, but I don't get a ternary adder. XST implements two 8-bit adders, using 16 LUTs for logic.
Is there a synthesis switch or something else I am missing to get this working?
On further inspection of the reports it seems that a ternary adder is in fact implemented later in the mapping process. Everything is good.
Yes, XST will infer 16 LUTs (LUT4s and LUT3s). One LUT4 and one LUT3 share the same inputs and can be packed into a LUT6_2. The actual packing happens in map. As Adam mentioned, map may not always pack the 16 LUTs nicely into 8 LUT6_2s.
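The LUT3/LUT4 pairing and the carry-chain discussion above can be sketched as a bit-level behavioral model. This is an assumption about how the logic splits, written for illustration: per bit, a LUT3 forms the majority of the three operand bits (the carry-save carry), a LUT4 XORs the three operand bits with the previous bit's majority, and the carry chain finishes the addition. The module and signal names are hypothetical.

```verilog
// Sketch only: behavioral model of a ternary adder split into LUT3 + LUT4 + carry chain.
module ternary_add_model #(parameter W = 8) (   // hypothetical module name
    input  [W-1:0] x, y, z,
    output [W-1:0] sum
);
    wire [W-1:0] maj;                    // LUT3 per bit: majority(x, y, z)
    wire [W:0]   maj_sh = {maj, 1'b0};   // majority shifted up by one bit position
    wire [W-1:0] prop;                   // LUT4 per bit: x ^ y ^ z ^ previous majority
    wire [W:0]   cy;                     // carry chain
    assign cy[0] = 1'b0;

    genvar i;
    generate
        for (i = 0; i < W; i = i + 1) begin : g_bit
            assign maj[i]  = (x[i] & y[i]) | (x[i] & z[i]) | (y[i] & z[i]);
            assign prop[i] = x[i] ^ y[i] ^ z[i] ^ maj_sh[i];
            // Carry mux: select is the LUT4 output, data input is the
            // previous bit's majority (computed by the previous LUT)
            assign cy[i+1] = prop[i] ? cy[i] : maj_sh[i];
            // Carry XOR produces the sum bit
            assign sum[i]  = prop[i] ^ cy[i];
        end
    endgenerate
endmodule
```

The model relies on the identity x + y + z = (x ^ y ^ z) + 2 * majority(x, y, z), with the final two-operand addition done on the carry chain. Note that the mux data input for bit i comes from the LUT of bit i-1, which is the routing cost raised in the comments above.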
Hello, this is a really nice post. However, I have a question. In your example the sum signal is 8 bits as well as all the inputs, won't this create overflow problems?
...and here the discussion stops, what a pity :(
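On the overflow question above: three 8-bit operands can sum to at most 3 x 255 = 765, which needs 10 bits, so the 8-bit sum in the example wraps modulo 256. A minimal sketch of a widened variant follows. The module name `adder_wide` is hypothetical, and whether the tools still map it to one LUT6_2 per result bit is not something the original post confirms.

```verilog
// Sketch only: same pipeline as the example, with a 10-bit result to avoid overflow.
module adder_wide (            // hypothetical module name
    input            clk,
    input      [7:0] x, y, z,
    output reg [9:0] sum
);
    reg [7:0] x_r, y_r, z_r;
    reg [9:0] sum_r;
    always @(posedge clk) begin
        x_r   <= x;
        y_r   <= y;
        z_r   <= z;
        // Zero-extend each operand to 10 bits before adding
        sum_r <= {2'b00, x_r} + {2'b00, y_r} + {2'b00, z_r};
        sum   <= sum_r;
    end
endmodule
```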
Does anyone know if it's possible to coerce compact ternary addition into a Virtex-5 through XST/Map? It works fine for V6, but change the target to V5 and the size doubles. Is there any coding style, attribute, or tool switch that will work for V5?