
This version is available at https://doi.org/10.14279/depositonce-7061


Lucas, J., Lal, S., & Juurlink, B. (2018). Optimal DC/AC data bus inversion coding. In 2018 Design,

Automation & Test in Europe Conference & Exhibition (DATE). IEEE.

https://doi.org/10.23919/date.2018.8342169


Accepted manuscript (Postprint) | Conference paper

Optimal DC/AC Data Bus Inversion Coding

Jan Lucas, Sohan Lal, Ben Juurlink

Embedded Systems Architecture

TU Berlin

Berlin, Germany

{j.lucas,sohan.lal,b.juurlink}@tu-berlin.de

Abstract—GDDR5 and DDR4 memories use data bus inversion

(DBI) coding to reduce termination power and decrease the

number of output transitions. Two main strategies exist for

encoding data using DBI: DBI DC minimizes the number of

outputs transmitting a zero, while DBI AC minimizes the number

of signal transitions. We show that neither of these strategies is optimal and that interface power can be reduced by up to 6% by taking both the number of zeros and the number of signal transitions into account when encoding the data. We then demonstrate that a hardware implementation of optimal DBI coding is feasible, reduces system power, and requires only insignificant additional die area.

Index Terms—Data bus inversion, DDR4, GDDR5, power

consumption, termination power

I. INTRODUCTION

Up to 50% of the power used by the memory is consumed by the external interconnect [1]. GDDR4/5/5X [2]–[4] as well as DDR4 [5] memories use a pseudo open drain (POD) electrical interface [6]. While the previously used SSTL interfaces terminate to a voltage of 0.5 VDDQ, the POD interface is terminated to VDDQ. In a terminated SSTL interface, DC current is always flowing; transmitting a zero or a one just changes the path of the current flow. In the POD interface, illustrated in Fig. 1, DC current flows through the termination resistors only when a zero is transmitted, while transmitting a one does not cause DC current through the termination. Memory using POD signalling reduces the termination current by employing data bus inversion (DBI) [7]. For every 8 DQ (data) lines, a ninth DBI line is added. Transmitting a zero on this line signals that the 8 DQ lines contain an inverted data byte, while a one on the DBI wire indicates transmission of the non-inverted byte. The simplest DBI scheme is called DBI DC. It counts the number of zeros in each byte and transmits the byte in its non-inverted form if it contains 4 or fewer zeros. If the byte contains 5 or more zeros, the byte is inverted. A byte with 5 zeros contains 3 zeros after inversion; however, the DBI bit contributes an additional zero indicating the inversion. This scheme guarantees that never more than 4 zeros per byte, including the DBI bit, are transmitted.
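The DBI DC rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the encoder used in the paper; the function names `dbi_dc_encode` and `zeros_on_wires` and the representation of a byte as an integer from 0 to 255 are assumptions made for this example.

```python
from typing import Tuple


def dbi_dc_encode(byte: int) -> Tuple[int, int]:
    """Return (dq, dbi) for one byte under the DBI DC rule.

    dbi == 1 means the byte is sent unchanged, dbi == 0 means it is inverted.
    """
    byte &= 0xFF
    zeros = 8 - bin(byte).count("1")
    if zeros >= 5:
        # Five or more zeros: send the inverted byte and flag it with DBI = 0.
        return ~byte & 0xFF, 0
    return byte, 1


def zeros_on_wires(dq: int, dbi: int) -> int:
    """Count zeros over all nine lines (8 DQ lines plus the DBI line)."""
    return (8 - bin(dq).count("1")) + (1 - dbi)
```

For example, `dbi_dc_encode(0b10000110)` sees five zeros and returns the inverted byte `0b01111001` with `dbi == 0`, which puts four zeros on the nine lines.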

Fig. 1. Pseudo open drain (POD) interface (driver and receiver, line terminated to VDDQ)

This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688759 (Project LPGPU2).

In addition to the interface energy consumed by DC termination current, transitions from zero to one or one to zero consume dynamic power by charging and discharging the load capacities. The importance of the load capacities can also be seen in the design of the POD output driver: a regular open drain output would rely solely on the resistor to VDDQ to generate the high output state, but the pseudo open drain output actively drives the output high to recharge the load faster and thus achieve faster signal transitions than the termination pull-up alone could provide. Instead of reducing the number of transmitted zeros, DBI signalling can also be used to reduce the number of signal transitions. In the DBI AC scheme, each transmitted byte is inverted if the inversion reduces the number of signal transitions.
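The DBI AC rule can be sketched as a greedy per-byte decision. The sketch below is a hedged illustration, not the authors' implementation. It assumes that the DBI line counts as a ninth wire when transitions are compared, that all nine wires carry a one before the burst, and that a tie keeps the byte non-inverted. The names `popcount` and `dbi_ac_encode` are hypothetical.

```python
from typing import List, Tuple


def popcount(value: int) -> int:
    return bin(value).count("1")


def dbi_ac_encode(burst: List[int]) -> List[Tuple[int, int]]:
    """Greedy DBI AC: invert a byte if that causes fewer wire transitions.

    Returns one (dq, dbi) pair per byte; dbi == 0 marks an inverted byte.
    """
    prev_dq, prev_dbi = 0xFF, 1  # assumed idle state: all nine wires at one
    encoded = []
    for byte in burst:
        byte &= 0xFF
        inverted_dq = ~byte & 0xFF
        # Transitions on the 8 DQ wires plus a possible toggle of the DBI wire.
        plain = popcount(prev_dq ^ byte) + (prev_dbi != 1)
        inverted = popcount(prev_dq ^ inverted_dq) + (prev_dbi != 0)
        if inverted < plain:
            prev_dq, prev_dbi = inverted_dq, 0
        else:
            prev_dq, prev_dbi = byte, 1
        encoded.append((prev_dq, prev_dbi))
    return encoded
```

Because each decision only looks at the previously transmitted state, the encoder needs no knowledge of later bytes in the burst.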

In this paper, we present a novel DBI encoding scheme. It

finds a minimum energy DBI encoding of a burst, if given the

ratio between the energy for transmitting a zero and the energy

per transition. The paper is organized as follows: We first

provide an overview of related work, then we introduce our

optimal encoding algorithm and a simplified variant. Then in

Section IV we explain how power was modelled and explain a

hardware design that is able to perform the new DBI encoding

at the required data rates. In the next section, we present our

experimental results and finally we conclude our paper.

II. RELATED WORK

Hollis [8] described the DBI DC and DBI AC schemes and recognized that both the number of transmitted zeros and the number of signal transitions are important for the power consumption of the memory interface. The same paper also describes the slight increase in signal transitions with DBI DC and the slight increase in transmitted zeros with DBI AC. Hollis proposes to combine DBI AC and DC by switching between DBI DC and DBI AC encoding modes. The proposed DBI ACDC scheme encodes the first byte of a group of bytes using DBI DC and then encodes the remaining bytes using DBI AC. We found that this scheme indeed provides a slight improvement compared to pure DBI AC. However, the encoding proposed in this paper outperforms the DBI ACDC scheme. In this paper we assume that all lines transmitted ones prior to transmitting the evaluated burst. Due to this boundary condition, DBI AC performs identically to DBI ACDC.

Fig. 2. Optimal DBI encoding as a shortest path problem. [Graph from a Start node to an End node with two nodes per byte, Byte 0 to Byte 7.
Non-inverted bytes: 10001110 10000110 10010110 11101001 01111101 10110111 01010111 11000100
Inverted bytes: 01110001 01111001 01101001 00010110 10000010 01001000 10101000 00111011
Labelled encodings: DC: 26 AC: 42; DC: 27 AC: 28; DC: 28 AC: 24; DC: 29 AC: 23; DC: 43 AC: 22]

Chang et al. [9] propose schemes that aim to reduce both zeros and transitions per burst. However, instead of finding the minimal energy encoding for each burst, they propose heuristic schemes that find good but not necessarily optimal encodings.

In a patent, Hollis [10] proposes a technique to target both signal transitions and zeros. This technique uses additional signal lines and requires a different and more complex decoding process than regular DBI schemes.

Ihm et al. propose an analog circuit for DBI DC encoding [11]. An analog implementation could also reduce the overhead of the technique proposed in this paper, and DBI encoding seems to be well suited for analog implementation, as rare inaccurate encoding decisions are unlikely to cause application errors.

Stan and Burleson [12] provide theoretical background on DBI encoding; however, they only consider the reduction of signal transitions and do not consider the reduction of zeros.

Narayanan et al. [13] describe additional coding schemes

that can reduce the number of signal transitions beyond DBI,

but require an even higher number of lines and more complex

encoding and decoding.

Kim et al. describe DBI DC in GDDR4 and show how it

reduces simultaneous switching output noise [14].

III. OPTIMAL ENCODING

To reduce the power consumption, every burst should be transmitted using as little energy as possible. Each burst of 8 bytes can be encoded using 2^8 = 256 different DBI patterns. A naive algorithm would search through all possible encoding options and pick the cheapest one. However, the cheapest encoding option can be found much more efficiently, as we can reformulate the problem as a shortest path problem on a directed graph with nonnegative weights. This is illustrated in Fig. 2. The topology of the graph only depends on the burst length. Two nodes exist for each byte: one node represents the transmission of the byte in its non-inverted representation, while the other node represents the inverted transmission of this byte. The cost of transmitting each byte depends only on the previous byte and the byte itself. Only two different previous bytes can exist: either the previous byte was inverted or it was not. The weight of each edge represents the cost of encoding a byte given the encoding of the previous byte. The shortest path from the start node to the end node is the encoding with the minimum total energy. Three factors control the weights of the edges: the data that should be transmitted and the coefficients α and β. The α coefficient configures the cost of each signal transition, while the β coefficient sets the cost of each transmitted zero bit. As the shortest path does not change under a uniform scaling of the edge weights, we can freely scale the coefficients as long as the ratio α/β does not change. This allows us to use small integer coefficients without a significant loss of encoding efficiency. Our top example shows the shortest path and edge weights for α = β = 1. This choice of α and β implies that the energy cost of transmitting a zero is identical to the energy cost of a transition. If we vary the coefficients without changing the data, we find 5 other Pareto-optimal encoding options. The DBI DC algorithm finds an encoding with 26 zeros but 42 transitions. The DBI AC algorithm finds the encoding with 22 transitions but 43 zeros. Neither of these two previous algorithms is able to identify the three encodings with a more balanced trade-off between zeros and transitions. If we assume α = β = 1, then the optimal encoding has an energy cost of 28 + 24 = 52, while DBI DC chooses an encoding with a cost of 26 + 42 = 68 and DBI AC selects an encoding with a cost of 43 + 22 = 65.
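The shortest path formulation above can be sketched as a forward sweep over the two nodes per byte, followed by backtracking. The Python below is a minimal sketch under stated assumptions, not the authors' implementation. It counts the DBI line as a ninth wire for both zeros and transitions and starts from all lines at one, as assumed in Section II. The names `step_cost`, `dbi_opt_encode` and `INITIAL_WIRES` are hypothetical, and ties are broken in favour of the non-inverted node.

```python
from typing import List, Tuple

INITIAL_WIRES = (0xFF, 1)  # assumed idle state: all nine lines at one


def popcount(value: int) -> int:
    return bin(value).count("1")


def step_cost(prev: Tuple[int, int], cur: Tuple[int, int],
              alpha: int, beta: int) -> int:
    """Edge weight: alpha per toggling wire plus beta per zero on nine wires."""
    transitions = popcount(prev[0] ^ cur[0]) + (prev[1] != cur[1])
    zeros = (8 - popcount(cur[0])) + (1 - cur[1])
    return alpha * transitions + beta * zeros


def dbi_opt_encode(burst: List[int], alpha: int = 1,
                   beta: int = 1) -> Tuple[int, List[int]]:
    """Return (minimum cost, invert flags) for one burst."""
    best = {0: 0}               # cheapest cost per state (0 plain, 1 inverted)
    wires = {0: INITIAL_WIRES}  # wire values (dq, dbi) belonging to each state
    choices = []                # choices[i][s] = predecessor state of node (i, s)
    for byte in burst:
        options = {0: (byte & 0xFF, 1), 1: (~byte & 0xFF, 0)}
        new_best, pred = {}, {}
        for s, cur in options.items():
            cand = [(best[p] + step_cost(wires[p], cur, alpha, beta), p)
                    for p in best]
            new_best[s], pred[s] = min(cand)
        best, wires = new_best, options
        choices.append(pred)
    # Backtrack from the cheaper of the two end nodes.
    state = min(best, key=best.get)
    total = best[state]
    flags = []
    for pred in reversed(choices):
        flags.append(state)
        state = pred[state]
    return total, flags[::-1]
```

Applied to the eight bytes of Fig. 2, this sketch returns a cost of 52 for α = β = 1, 26 for α = 0, β = 1, and 22 for α = 1, β = 0, in line with the values quoted above. The sweep visits two nodes per byte, so its work grows linearly with the burst length instead of with the 2^8 patterns of the naive search.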

Fig. 3. Energy per burst using different DBI schemes. [Plot: energy per burst versus AC cost and DC cost, each ranging from 0 to 1; recovered legend entries: OPT, RAW]

We simulated the different DBI encoding schemes on 10000 random bursts. We varied the cost α per signal transition from 0 to 1 and set the cost β = 1 − α. The result is shown in Fig. 3. DBI DC behaves identically to optimal DBI (DBI OPT) encoding when the AC cost is 0. This is no surprise, as DBI OPT with α = 0 and β = 1 is identical to DBI DC. DBI DC works almost as well as the optimum encoding until the AC cost reaches 0.15. Similar results can be seen for DBI AC. As expected, DBI AC performs identically to DBI OPT when the DC cost is 0, and the performance stays close until the DC cost reaches 0.15. Both DBI AC and DBI DC perform worse than unencoded (RAW) data when used together with a high DC cost or AC cost, respectively. DBI AC encoding is cheaper than DBI DC encoding starting from α = 0.56. The biggest advantage of optimal DBI encoding is also found at this point, where the average cost per burst is 2 points or 6.75% lower than with DBI AC or DBI DC. The shaded area in Fig. 3 shows the advantage of DBI OPT encoding compared to the best conventional encoding scheme (DBI DC or AC).
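A sweep of this kind can be sketched as follows. This is an illustrative reconstruction, not the simulation used for Fig. 3. It assumes uniformly random bytes, uses far fewer bursts by default, and finds the optimum by exhaustive search over all 256 patterns rather than by the shortest path search. All function names are hypothetical, and the sketch is not expected to reproduce the exact percentages reported above.

```python
import random
from typing import Dict, List, Tuple


def popcount(value: int) -> int:
    return bin(value).count("1")


def counts(burst: List[int], flags: List[int]) -> Tuple[int, int]:
    """Return (zeros, transitions) over nine wires for the given invert flags."""
    prev_dq, prev_dbi = 0xFF, 1  # assumed idle state: all lines at one
    zeros = transitions = 0
    for byte, inv in zip(burst, flags):
        dq, dbi = (~byte & 0xFF, 0) if inv else (byte, 1)
        zeros += (8 - popcount(dq)) + (1 - dbi)
        transitions += popcount(prev_dq ^ dq) + (prev_dbi != dbi)
        prev_dq, prev_dbi = dq, dbi
    return zeros, transitions


def dc_flags(burst: List[int]) -> List[int]:
    """DBI DC: invert every byte with five or more zeros."""
    return [int(8 - popcount(b) >= 5) for b in burst]


def ac_flags(burst: List[int]) -> List[int]:
    """DBI AC: invert a byte if that lowers the transition count so far."""
    flags: List[int] = []
    for i in range(len(burst)):
        keep = counts(burst[:i + 1], flags + [0])[1]
        flip = counts(burst[:i + 1], flags + [1])[1]
        flags.append(int(flip < keep))
    return flags


def sweep(n_bursts: int = 200, steps: int = 11,
          seed: int = 0) -> List[Dict[str, float]]:
    """Average cost per burst of RAW, DC, AC and OPT for alpha from 0 to 1."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_bursts):
        burst = [rng.randrange(256) for _ in range(8)]
        every = [counts(burst, [(p >> i) & 1 for i in range(8)])
                 for p in range(256)]
        stats.append({"RAW": [every[0]],  # pattern 0 never inverts
                      "DC": [counts(burst, dc_flags(burst))],
                      "AC": [counts(burst, ac_flags(burst))],
                      "OPT": every})
    rows = []
    for k in range(steps):
        alpha = k / (steps - 1)
        beta = 1.0 - alpha
        row = {"alpha": alpha}
        for name in ("RAW", "DC", "AC", "OPT"):
            row[name] = sum(min(alpha * t + beta * z for z, t in s[name])
                            for s in stats) / n_bursts
        rows.append(row)
    return rows
```

Each returned row holds one value of α together with the average cost of the four schemes, so the curves of a plot like Fig. 3 can be drawn directly from the list.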

One problem with DBI OPT encoding is the accuracy required for the coefficients. However, as we already saw with DBI AC and DBI DC, the coefficients do not need to be very accurate to still enable almost perfect encoding results. We fixed α = β = 1 and named this encoding scheme DBI OPT (Fixed). Fig. 4 shows the results. The shaded area indicates the small reduction of performance due to the fixed coefficients. The encoding with fixed coefficients performs better than the previous schemes from an AC cost of 0.23 to 0.79. The maximum energy reduction from this encoding is nearly identical at 6.58%.

IV. EXPERIMENTAL SETUP

A. Power Model

We estimated the energy consumption based on a model derived from the CACTI-IO model presented by Jouppi et al. [1], [15]. We unified all load capacities into a single load capacity and reformulated the equations from power to energy per activity.

Fig. 4. Energy per burst for different DBI schemes; the shaded area shows the loss of efficiency from fixed coefficients. [Plot: energy per burst versus AC cost and DC cost, each ranging from 0 to 1; recovered legend entries: OPT, OPT (Fixed)]

E_zero is the energy consumed by transmitting a single zero.

\[ E_{zero} = \frac{V_{DDQ}^2}{(R_{pullup} + R_{pulldown}) \cdot f} \tag{1} \]

E_transition is the energy consumed by a single transition from zero to one or one to zero.

\[ E_{transition} = \frac{1}{2} V_{DDQ} V_{swing} c_{load} \tag{2} \]

V_swing is the signal swing. It is calculated from the output resistance of the pulldown driver (R_pulldown) and the on-die termination resistor (R_pullup).

\[ V_{swing} = V_{DDQ} \frac{R_{pullup}}{R_{pullup} + R_{pulldown}} \tag{3} \]

The total interface energy per burst is calculated as follows:

\[ E_{burst} = n_{zeros} E_{zero} + n_{transitions} E_{transition} \tag{4} \]

c_load is the total load capacity. We tested a wide range of values from 1 pF to 8 pF total load. It should be the sum of the effective capacities of the driver in the CPU or GPU, the capacities of the memory devices connected to the DQ lines, and the capacity of the transmission line connecting memory and CPU/GPU. If a system uses DIMM or similar sockets, the extra load of those should also be considered. Amirkhany et al. state a 1.3 pF load for a GDDR5 output driver [16]. CACTI-IO assumes 2 pF for a DDR4 output driver and 1 pF per memory device [1]. Vuong lists a maximum capacity of 1.3 pF for DDR4 [17]. IBIS files from Micron also list similar values per DDR4 input. DIMM sockets and the PCB trace can add a few additional pF.
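The power model of Eqs. (1) to (4) can be evaluated with a short Python sketch. Several points here are assumptions rather than values from the paper. The symbol f is read as the per-pin data rate in bit/s, so that 1/f is the time one zero is driven. The resistances of 60 Ω and 40 Ω are hypothetical placeholders. The 1.35 V supply corresponds to POD135 and the 3 pF load is one of the tested values. The function names are illustrative.

```python
def v_swing(v_ddq: float, r_pullup: float, r_pulldown: float) -> float:
    """Signal swing in volts, Eq. (3)."""
    return v_ddq * r_pullup / (r_pullup + r_pulldown)


def e_zero(v_ddq: float, r_pullup: float, r_pulldown: float,
           data_rate: float) -> float:
    """Energy in joules for driving one zero for one bit time, Eq. (1).

    Assumption: f in Eq. (1) is the per-pin data rate in bit/s.
    """
    return v_ddq ** 2 / ((r_pullup + r_pulldown) * data_rate)


def e_transition(v_ddq: float, r_pullup: float, r_pulldown: float,
                 c_load: float) -> float:
    """Energy in joules for one signal transition, Eq. (2)."""
    return 0.5 * v_ddq * v_swing(v_ddq, r_pullup, r_pulldown) * c_load


def e_burst(n_zeros: int, n_transitions: int, v_ddq: float, r_pullup: float,
            r_pulldown: float, c_load: float, data_rate: float) -> float:
    """Total interface energy per burst in joules, Eq. (4)."""
    return (n_zeros * e_zero(v_ddq, r_pullup, r_pulldown, data_rate)
            + n_transitions * e_transition(v_ddq, r_pullup, r_pulldown, c_load))


# Example operating point. The resistances are hypothetical placeholders.
V_DDQ = 1.35        # volts (POD135)
R_PULLUP = 60.0     # ohms, hypothetical on-die termination
R_PULLDOWN = 40.0   # ohms, hypothetical driver output resistance
C_LOAD = 3e-12      # farads
DATA_RATE = 12e9    # bit/s per pin

# Energy of a burst with 28 zeros and 24 transitions at this operating point.
EXAMPLE_ENERGY = e_burst(28, 24, V_DDQ, R_PULLUP, R_PULLDOWN, C_LOAD, DATA_RATE)
```

The ratio `e_transition(...) / e_zero(...)` plays the role of α/β in Section III. Under this reading it grows linearly with both the data rate and the load capacity.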

B. Hardware

To validate that the proposed DBI encoding can be done at the required data rates and adds only a small overhead to a CPU or GPU using this scheme, we developed a hardware implementation. Our proposed hardware architecture is shown in Fig. 5. Each byte of the burst is processed by one processing block. Each block receives two minimum costs: cost(i) is the minimum cost of transmitting bytes 0 to i−1 with the last byte transmitted in non-inverted encoding, while cost_inv(i) is the minimum cost of transmitting those bytes with the last byte inverted. If we consider the problem as a shortest path problem, cost(i) is the cost of the shortest path from the start to the node of the ith byte and cost_inv(i) is the cost of the shortest path to the corresponding inverted node. Each of the processing blocks receives the byte itself as well as the exclusive-or of this byte and the previous byte. Within each processing block, two population count units (POPCNT) count the number of set bits in each of the two inputs at the top. dc_cost0 is the DC cost of transmitting the current byte without inversion, i.e., the number of zero bits multiplied by the cost β of each zero. dc_cost1 is the DC cost of transmitting the current byte inverted. In this case the extra zero transmitted on the DBI signal also needs to be considered, which results in the +1 term. Two options also exist for the number of signal transitions: either both the previous byte and the current byte are transmitted in the same way (ac_cost0) or the DBI bit changes between the two bytes (ac_cost1). Now the cost of four different encoding options can be calculated (from top to bottom):

1. Previous byte was not inverted, current byte is also not inverted.
2. Previous byte was inverted, current byte is not inverted.
3. Previous byte was not inverted, current byte is inverted.
4. Previous byte was inverted, current byte is also inverted.

Fig. 5. Hardware architecture of improved DBI encoder. [Block diagram: a chain of processing blocks with inputs Byte(0) to Byte(7) and outputs DBI(0) to DBI(7); the extracted labels also include an ∞ constant, flip-flops (F F) and a Byte(−1) input. Inside each block, two POPCNT units operate on Byte(i−1) ⊕ Byte(i) and on Byte(i), producing ac_cost0 = α·x, ac_cost1 = α·(9 − x), dc_cost0 = β·(8 − x) and dc_cost1 = β·(x + 1), where x is the respective POPCNT output. The sums ac_cost0 + dc_cost0 + cost(i), ac_cost1 + dc_cost0 + cost_inv(i), ac_cost1 + dc_cost1 + cost(i) and ac_cost0 + dc_cost1 + cost_inv(i) feed the outputs cost(i+1) and cost_inv(i+1) and the signals m0 and m1.]

TABLE I
SYNTHESIS RESULTS (32NM)

Scheme                   Area (µm²)   Static Power (µW)   Dynamic Power (µW)   Burst Rate (GHz)   Total (µW)   Energy per Burst (pJ)
DBI DC                   275          105                 111                  1.5                216          0.14
DBI AC                   578          170                 250                  1.5                420          0.28
DBI OPT (Fixed Coeff.)   3807         257                 2233                 1.5                2490         1.66
DBI OPT (3-Bit Coeff.)   16584        5200                3600                 0.5                8800         17.6

Fig. 6. Mapping of shortest path search to signals. [Graph fragment: nodes cost(i) and cost_inv(i) connect to nodes cost(i+1) and cost_inv(i+1) via four edges with weights ac_cost0 + dc_cost0, ac_cost0 + dc_cost1, ac_cost1 + dc_cost0 and ac_cost1 + dc_cost1]

The relationship to the graph is also shown in Fig. 6. To

calculate the cost of reaching a node via one edge, we need

to consider the cost of the edge as well as the minimum cost

of reaching the source node of the edge. Two edges lead to

each node and we compare their cost and store which of the

edges provided the cheapest path. The cheapest path is then

forwarded to the next block.
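The arithmetic of a single processing block can be modelled in software as shown below. This is a behavioural sketch under assumptions, not the VHDL design. The signal names follow the labels of Fig. 5, with the underscore spelling assumed. The meaning given to the select bits m0 and m1 is a guess made for this sketch, and ties are resolved in favour of the non-inverted predecessor.

```python
from typing import List, Tuple


def popcount(value: int) -> int:
    return bin(value).count("1")


def processing_block(prev_byte: int, byte: int, cost: float, cost_inv: float,
                     alpha: int = 1, beta: int = 1
                     ) -> Tuple[float, float, int, int]:
    """One stage of the chain: returns (cost_next, cost_inv_next, m0, m1).

    m0 (m1) is 1 if the cheaper path into the non-inverted (inverted) node
    of this byte comes from the inverted node of the previous byte.
    """
    x_ac = popcount((prev_byte ^ byte) & 0xFF)  # POPCNT of Byte(i-1) xor Byte(i)
    x_dc = popcount(byte & 0xFF)                # POPCNT of Byte(i)
    ac_cost0 = alpha * x_ac        # same inversion state as the previous byte
    ac_cost1 = alpha * (9 - x_ac)  # state changes, so the DBI line toggles too
    dc_cost0 = beta * (8 - x_dc)   # zeros of the non-inverted byte
    dc_cost1 = beta * (x_dc + 1)   # zeros of the inverted byte plus the DBI zero
    plain_from_plain = cost + ac_cost0 + dc_cost0
    plain_from_inv = cost_inv + ac_cost1 + dc_cost0
    inv_from_plain = cost + ac_cost1 + dc_cost1
    inv_from_inv = cost_inv + ac_cost0 + dc_cost1
    m0 = int(plain_from_inv < plain_from_plain)
    m1 = int(inv_from_inv < inv_from_plain)
    return (min(plain_from_plain, plain_from_inv),
            min(inv_from_plain, inv_from_inv), m0, m1)


def chain_cost(burst: List[int], alpha: int = 1, beta: int = 1) -> float:
    """Run all blocks in sequence and return the cheaper end-node cost."""
    cost, cost_inv = 0, float("inf")  # no inverted node exists before the burst
    prev = 0xFF                       # assumed idle state: all lines at one
    for byte in burst:
        cost, cost_inv, _, _ = processing_block(prev, byte, cost, cost_inv,
                                                alpha, beta)
        prev = byte
    return min(cost, cost_inv)
```

As an example with α = β = 1, `processing_block(0b10010110, 0b11101001, 19, 21)` yields `(26, 27, 1, 0)`: the non-inverted node is reached more cheaply from the inverted predecessor, while the inverted node is reached more cheaply from the non-inverted one.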

At the last block, we compare which of the two end nodes provides the shortest overall path. This path is backtracked to find the DBI pattern using the muxes below the blocks. This is the same technique that is used in Dijkstra’s algorithm to reconstruct the shortest path.

Fig. 7. Interface energy per burst normalized to unencoded transmission for various DBI encoding schemes. [Plot: normalized energy versus data rate in Gbps, 0 to 20; recovered legend entries: OPT, OPT (Fixed)]

Fig. 8. Energy per burst using optimized encoding, including encoding energy, normalized to best of DBI DC or AC. [Plot: normalized energy versus data rate in Gbps, 0 to 20; one curve per load: 1 pF, 2 pF, 3 pF, 4 pF, 6 pF, 8 pF]

We described our designs in VHDL and synthesized them using Synopsys Design Compiler Ultra K-2015.06-SP4 together with the Synopsys 32nm generic libraries in order to estimate the required die area, power and throughput. We synthesized two variants of our proposed design: one design uses configurable 3-bit coefficients for α and β, while the other design fixes α = β = 1. The fixed coefficients remove multipliers from the design and reduce the bit width of the data path. We added 8 pipeline stages to the output of our design and used the retime option of the synthesis tool to move the registers to an appropriate location. Current GDDR5X uses a data rate of up to 12 Gbps per pin. Our design encodes 8 bytes per clock cycle, thus a clock frequency of 1.5 GHz is required to meet the required throughput using a single encoding unit. Whether this design adds latency depends on the design of the memory controller; often it should be possible to perform the encoding in parallel with other memory controller tasks. If extra latency is added, this can still be acceptable for GPUs: GPUs already have memory subsystems with hundreds of cycles of latency and their performance is relatively insensitive to additional latency [18].

V. RESULTS

Table I shows the results of our synthesis. DBI DC, DBI

AC and DBI OPT with fixed coefficients could meet the 1.5

GHz timing, equivalent to a data rate of 12 Gbps. DBI OPT

with 3-Bit configurable coefficients was significantly slower

and could only run at 500 MHz (equivalent to 4 Gbps). It also

required 4.5x more area than the design with fixed coefficients

and used 10.6x more energy per encoded burst than the design

with fixed coefficients. Due to the lower frequency, 3 units

are required to reach the same throughput, increasing the area

requirements even further.
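The relations between the columns of Table I and the throughput figures quoted above can be checked with a few lines of Python. The helper names are hypothetical. The input values are copied from Table I and from the 12 Gbps per-pin data rate stated in Section IV.

```python
import math

# (scheme, total power in µW, burst rate in GHz), copied from Table I
TABLE_I = [
    ("DBI DC", 216.0, 1.5),
    ("DBI AC", 420.0, 1.5),
    ("DBI OPT (Fixed Coeff.)", 2490.0, 1.5),
    ("DBI OPT (3-Bit Coeff.)", 8800.0, 0.5),
]


def energy_per_burst_pj(total_uw: float, burst_rate_ghz: float) -> float:
    """µW divided by GHz gives femtojoules; divide by 1000 for picojoules."""
    return total_uw / burst_rate_ghz / 1000.0


def required_burst_rate_ghz(pin_rate_gbps: float, burst_length: int = 8) -> float:
    """Each pin carries one bit per byte of the burst, so bursts/s = bit rate / 8."""
    return pin_rate_gbps / burst_length


def units_needed(target_ghz: float, unit_ghz: float) -> int:
    """Number of parallel encoders needed to reach the target burst rate."""
    return math.ceil(target_ghz / unit_ghz)


ENERGIES_PJ = {name: energy_per_burst_pj(total, rate)
               for name, total, rate in TABLE_I}
```

With these inputs, `required_burst_rate_ghz(12.0)` gives 1.5 and `units_needed(1.5, 0.5)` gives 3. The entries of `ENERGIES_PJ` agree with the last column of Table I after rounding to two decimal places.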

Fig. 7 displays the interface energy per burst normalized to

the cost of transmitting the data without any DBI encoding

using POD135 (used by GDDR5X) and 3 pF load. However,

results for DDR4 with POD12 are almost identical. DBI DC

performs better than DBI OPT (Fixed) until 3.8 Gbps. DBI AC

would require a significantly higher frequency than 20 Gbps

to perform better than this scheme. The maximum gain from

this optimized encoding can be found around 14 Gbps.

The previous Fig. 7 does not include the energy required

for encoding. If we also consider the energy for encoding,

the picture changes. DBI OPT encoding with configurable

coefficients encodes the data only slightly better than the

fixed coefficient version, however, it uses significantly more

energy for encoding each burst. For this reason it always

consumes more power than the DBI DC and DBI AC schemes.

However, further optimization of the hardware might change

this. We used a relatively old 32nm process node for estimating the power consumption, and an optimized implementation in a more recent process could provide a significant power reduction that could make configurable coefficients beneficial.

Fig. 8 shows the energy per burst for DBI OPT with fixed

coefficients normalized to the best conventional DBI encoding.

Higher capacitive load reduces the data rate at which the highest reduction of energy is achieved. At 3 to 8 pF load, the energy is reduced by 5 to 6% at the operating points with the highest gains.

VI. CONCLUSIONS

A novel DBI encoding scheme was presented that reduces the link power consumption by up to 6%. It has been shown that the problem of finding a DBI encoding with the smallest link energy is equivalent to finding the shortest path in a graph. We presented a hardware design that performs the encoding at the required data rates using insignificant extra area and energy. Additional optimizations to reduce the hardware overhead, including a partially analog implementation, are possible. A design with fixed coefficients provides a very good trade-off between the energy required for encoding and the saved link energy. It can be used without changing existing DDR4, GDDR5 and GDDR5X memories to reduce the interface energy during writes and could be integrated into future memories to also reduce read interface energy.

REFERENCES

[1] N. P. Jouppi, A. B. Kahng, N. Muralimanohar, and V. Srinivas, “CACTI-IO: CACTI with off-chip power-area-timing models,” IEEE Transactions on VLSI Systems, 2015.

[2] JEDEC Standard, “Graphics Double Data Rate (GDDR4) SGRAM Standard,” SDRAM3.11.5.8, May, 2006.

[3] JEDEC Standard, “Graphics Double Data Rate (GDDR5) SGRAM Standard,” JESD212C, February, 2016.

[4] JEDEC Standard, “Graphics Double Data Rate (GDDR5X) SGRAM Standard,” JESD232A, August, 2016.

[5] JEDEC Standard, “DDR4 SDRAM Standard,” JESD79-4B, June, 2017.

[6] JEDEC Standard, “POD15 - 1.5 V PSEUDO OPEN DRAIN I/O,” JESD8-20A, 2009.

[7] S. J. Bae, Y. S. Sohn, K. I. Park, K. H. Kim, D. H. Chung, J. G. Kim, S. H. Kim et al., “A 60nm 6Gb/s/pin GDDR5 graphics DRAM with multifaceted clocking and ISI/SSN-reduction techniques,” in IEEE ISSCC Digest of Technical Papers, 2008.

[8] T. M. Hollis, “Data bus inversion in high-speed memory applications,” IEEE Transactions on Circuits and Systems II: Express Briefs, 2009.

[9] N. Chang, K. Kim, and J. Cho, “Bus encoding for low-power high-performance memory systems,” in Design Automation Conference (DAC). ACM, 2000.

[10] T. M. Hollis, “Devices and methods for facilitating data inversion to limit both instantaneous current and signal transitions,” 2016, US Patent 9,270,417.

[11] J. D. Ihm, S. J. Bae, K. I. Park, H. Y. Song, W. J. Lee, H. J. Kim, K. H. Kim et al., “An 80nm 4Gb/s/pin 32b 512Mb GDDR4 graphics DRAM with low-power and low-noise data-bus inversion,” in IEEE ISSCC Digest of Technical Papers, 2007.

[12] M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Transactions on VLSI Systems, 1995.

[13] U. Narayanan, K.-S. Chung, and T. Kim, “Enhanced bus invert encodings for low-power,” in IEEE International Symposium on Circuits and Systems (ISCAS), 2002.

[14] J.-H. Kim, W. Kim, D. Oh, R. Schmitt, J. Feng, C. Yuan, L. Luo, and J. Wilson, “Performance impact of simultaneous switching output noise on graphic memory systems,” in Electrical Performance of Electronic Packaging. IEEE, 2007.

[15] N. P. Jouppi, A. B. Kahng, N. Muralimanohar, and V. Srinivas, CACTI-IO Technical Report. Department of Computer Science and Engineering, University of California, San Diego, 2012.

[16] A. Amirkhany, J. Wei, N. K. Mishra, J. Shen, W. T. Beyene, C. Chen, T. Chin, D. Dressler, C. Huang, V. P. Gadde et al., “A 12.8-Gb/s/link tri-modal single-ended memory interface,” IEEE Journal of Solid-State Circuits, 2012.

[17] H. Vuong, “Mobile memory technology roadmap,” in JEDEC’s Mobile Forum, 2013.

[18] M. Andersch, J. Lucas, M. Álvarez Mesa, and B. Juurlink, “On latency in GPU throughput microarchitectures,” in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2015.