Embedded Systems

Fixed-Point DSP and Algorithm Implementation

By RC Cofer and Ben Harding and Avnet, October 25, 2006

This article provides a primer on the use of fixed-point arithmetic in DSP algorithms. It covers concepts such as two's complement representation, dynamic range, overflow, truncation, and saturation. The article also introduces key filtering concepts.

Signed number representation
In many DSP algorithms it is necessary to represent both positive and negative numbers, also called signed numbers. There are three conventional methods of representing signed fixed point values: sign and magnitude, one's complement, and two's complement. In a 16 bit word, all three of these formats use the MSB to indicate sign, leaving (16-1) or 15 bits to represent the numeric magnitude value. Sign and magnitude encoding simply uses the MSB to represent sign, with 0 indicating a positive number and 1 indicating a negative number. The remaining 15 bits represent the magnitude of the value.

One's complement representation follows the sign and magnitude format for positive numbers, with a 0 in the MSB position indicating a positive number. Negative numbers are represented by a 1 in the MSB location. The one's complement name comes from the fact that to obtain a negative number's representation you subtract the positive number's representation from a 16 bit all ones number. A common positive to negative conversion shortcut simply inverts each bit in the equivalent magnitude positive number representation. Note that with one's complement the addition of opposite signed numbers is not straightforward, and that one's complement encoding has two representations for 0 (+0 and -0).
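The one's complement rules above can be checked with a minimal Python sketch. The helper names `ones_complement_negate` and `ones_complement_value` are illustrative only and do not come from the article; a 16 bit word is assumed.

```python
BITS = 16
MASK = (1 << BITS) - 1  # 0xFFFF, the 16 bit all ones word


def ones_complement_negate(word):
    """Negate a 16 bit one's complement word by inverting every bit."""
    return ~word & MASK


def ones_complement_value(word):
    """Interpret a 16 bit one's complement word as a Python integer."""
    if word & 0x8000:              # MSB set: the word encodes a negative value
        return -(~word & MASK)
    return word


five = 0x0005
minus_five = ones_complement_negate(five)

print(hex(minus_five))             # 0xfffa
# Inverting the bits gives the same word as subtracting from all ones.
print(minus_five == MASK - five)   # True
# Both 0x0000 (+0) and 0xFFFF (-0) decode to zero.
print(ones_complement_value(0x0000), ones_complement_value(0xFFFF))  # 0 0
```

The last line shows the redundant zero: two distinct bit patterns map to the same numeric value.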

Two's complement representation again follows the sign and magnitude format for positive numbers. The two's complement name comes from the fact that to obtain the negative representation of a number you subtract the positive number's representation from a 17 bit number with a leading (MSB) value of 1 followed by 16 zeros. The shortcut for converting a two's complement number from positive to negative is to invert each bit in the equivalent magnitude positive number representation and then add one to the result.
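As a small illustration, the Python sketch below compares the "invert and add one" shortcut with the subtraction from 2^16 described above. The function names are hypothetical and a 16 bit word is assumed.

```python
BITS = 16
MASK = (1 << BITS) - 1   # 0xFFFF
MODULUS = 1 << BITS      # the 17 bit value: a 1 followed by 16 zeros


def twos_complement_negate(word):
    """Negate a 16 bit two's complement word: invert every bit, then add one."""
    return (~word + 1) & MASK


def twos_complement_value(word):
    """Interpret a 16 bit two's complement word as a Python integer."""
    return word - MODULUS if word & 0x8000 else word


five = 0x0005
minus_five = twos_complement_negate(five)

print(hex(minus_five))                          # 0xfffb
# The shortcut agrees with subtracting from the 17 bit modulus.
print(minus_five == (MODULUS - five) & MASK)    # True
print(twos_complement_value(minus_five))        # -5
```

Note that negating 0x8000 (-32,768) returns 0x8000 again, because +32,768 has no 16 bit two's complement representation.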

Two's complement encoding has only one representation for zero within the data range (rather than the redundant +0 and -0 of one's complement and sign magnitude encoding). It has the additional benefit of allowing a single hardware implementation of mixed positive and negative number addition. A significant implementation note is that analog-to-digital converter (ADC) output data may not be provided in two's complement format, requiring user conversion of negative numbers from sign magnitude into two's complement encoding.
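The conversion mentioned above might look like the following Python sketch. It assumes a hypothetical converter that delivers 16 bit sign magnitude words; the function name is illustrative and is not taken from any vendor API.

```python
def sign_magnitude_to_twos_complement(word):
    """Convert a 16 bit sign magnitude word to 16 bit two's complement."""
    magnitude = word & 0x7FFF          # lower 15 bits hold the magnitude
    if word & 0x8000:                  # MSB set: negative value
        return (-magnitude) & 0xFFFF   # wrap the negative value into 16 bits
    return magnitude


minus_five = sign_magnitude_to_twos_complement(0x8005)   # -5 in sign magnitude
print(hex(minus_five))                                   # 0xfffb

# Sign magnitude "negative zero" collapses to the single two's complement zero.
print(sign_magnitude_to_twos_complement(0x8000))         # 0

# One plain adder handles mixed signs: (-5) + 8 = 3.
print((minus_five + 0x0008) & 0xFFFF)                    # 3
```

The final line illustrates the single-adder benefit: the same unsigned addition, with the carry out discarded, gives the correct signed result.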

In a 16 bit system with two's complement representation, the range of integer numbers which can be represented is -2^(16-1) to 2^(16-1)-1, or -32,768 to 32,767. It is important to note that in fractional two's complement mode the value -1 is represented while +1 is not.
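The 16 bit two's complement range, -32,768 to 32,767, can be confirmed with a short Python sketch. The helper `to_signed` is a hypothetical name used only for this illustration.

```python
BITS = 16
MIN_INT = -(1 << (BITS - 1))      # -2^15
MAX_INT = (1 << (BITS - 1)) - 1   #  2^15 - 1


def to_signed(word, bits=BITS):
    """Reinterpret an unsigned bit pattern as a two's complement integer."""
    word &= (1 << bits) - 1
    return word - (1 << bits) if word >> (bits - 1) else word


print(MIN_INT, MAX_INT)                                          # -32768 32767
print(to_signed(0x7FFF), to_signed(0x8000), to_signed(0xFFFF))   # 32767 -32768 -1

# Read as fractions with 15 fractional bits, the extremes are -1 and just under +1.
print(to_signed(0x8000) / 2**15)   # -1.0
print(to_signed(0x7FFF) / 2**15)   # 0.999969482421875
```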

Detailed discussion of each of these formats can be found in DSP texts. This paper will deal primarily with two's complement data encoding.

Numeric Operations
Problems arise when two integer fixed length binary values are multiplied together. The result of multiplying two 16 bit integer binary values is a 32 bit integer binary value. Yet this result must ultimately be stored in a fixed length 16 bit word. The least significant bits cannot simply be truncated off the end of the number since they represent the magnitude of the number, an essential part of its representation.

With fixed point representation the answer to this problem is the scaling of the numerical values in the system so that they are fractions between the values of -1 and 1. Note that a previously mentioned two's complement limitation applies here. Since positive 1 is not represented, the high end of the range only goes up to (1-ε), where ε is the smallest number which can be represented by the number of bits in the system. Thus, the maximum positive value for a 16 bit word is the binary fractional value of 0.999969482 and not actually 1. For simplicity this range is typically referred to as -1 to 1; however, the designer should maintain awareness of this exception.
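A minimal Python sketch of the fractional format follows, assuming 15 fractional bits (Q15). The helpers `float_to_q15` and `q15_to_float` are illustrative names, and clamping at the range limits is an assumption of this sketch rather than something the article prescribes.

```python
FRAC_BITS = 15
SCALE = 1 << FRAC_BITS     # 32768
EPSILON = 1 / SCALE        # smallest representable step, 2^-15


def float_to_q15(x):
    """Quantize a float to a Q15 integer, clamping to the representable range."""
    word = round(x * SCALE)
    return max(-SCALE, min(SCALE - 1, word))   # assumption: clamp, do not wrap


def q15_to_float(word):
    """Convert a signed Q15 integer back to its fractional value."""
    return word / SCALE


print(f"{EPSILON:.9f}")                            # 0.000030518
print(float_to_q15(1.0))                           # 32767, since +1 is not representable
print(f"{q15_to_float(float_to_q15(1.0)):.9f}")    # 0.999969482
print(q15_to_float(float_to_q15(-1.0)))            # -1.0
```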

Scaling down to the -1 to 1 range requires representing all of the numbers including input data and algorithm coefficients as fractions. Numbers can be normalized to fractional representations by moving the implied radix point position to the left in the word. Moving the radix point to the left one place in a binary word is the equivalent of dividing by 2, while moving to the right one place is the equivalent of multiplying by 2. As long as all of the values are equally scaled the operational results are equivalent.
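To make the implied radix point concrete, the Python sketch below reads one bit pattern with three different radix point positions. The `interpret` helper is hypothetical; the hardware stores only the integer word, and the radix position is bookkeeping done by the programmer.

```python
def interpret(word, frac_bits):
    """Value of an integer word when frac_bits bits lie right of the radix point."""
    return word / (1 << frac_bits)


word = 0x2000   # 8192 as a plain integer

# Each move of the radix point one place to the left halves the value.
for frac_bits in (13, 14, 15):
    print(frac_bits, interpret(word, frac_bits))
# 13 1.0
# 14 0.5
# 15 0.25

# With equal scaling, integer addition and fractional addition agree.
a, b = 3000, 5000
print(interpret(a + b, 15) == interpret(a, 15) + interpret(b, 15))   # True
```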

Since two fractions multiplied always result in a fraction (a value less than one), and since the magnitude of a fractional number is not significantly represented by its LSBs, the LSBs can be truncated, allowing storage of the result in the required fixed word length. This resolves the magnitude growth problem of multiply operations with fixed point integer representation. An example of multiplying two maximum value binary fractions together is:

    0^111 1111 1111 1111                        (7FFF)_H       [Q15]
  × 0^111 1111 1111 1111                        (7FFF)_H       [Q15]
  = 00^11 1111 1111 1111 0000 0000 0000 0001    (3FFF 0001)_H  [Q30]
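The multiplication can be reproduced with a short Python sketch. The function name `q15_mul` is illustrative, and truncation is done here with a 15 place right shift; this is one way to return to Q15, not necessarily how a particular DSP does it. The corner case of -1 times -1, which does not fit in Q15, is deliberately not handled.

```python
def q15_mul(a, b):
    """Multiply two signed Q15 integers and truncate the Q30 product to Q15."""
    product_q30 = a * b        # 32 bit result with 30 fractional bits
    return product_q30 >> 15   # drop the 15 least significant bits


a = b = 0x7FFF                 # largest positive Q15 value

print(hex(a * b))              # 0x3fff0001, the full Q30 product
print(hex(q15_mul(a, b)))      # 0x7ffe, after truncation to Q15
print(q15_mul(a, b) / 2**15)   # 0.99993896484375
```

The truncated result differs from the exact product by less than one Q15 step, which is the precision traded for a fixed word length.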

However, even though multiplying two fractions cannot result in a value greater than one, adding two fractions can. Note that two binary numbers must have the same radix point location in order to be added. This is shown in the following example:

    0^111 1111 1111 1111    (7FFF)_H  [Q15]
  + 0^000 0000 0000 0101    (0005)_H  [Q15]
  = 1^000 0000 0000 0100    (8004)_H  [Q15]

If an operation's result falls outside the representable range, an overflow condition occurs and the algorithm's results are invalid. In the example above, the sum of two positive values carries into the sign bit and reads back as a negative number. Overflow conditions tend to wreak havoc in a system. The solution to this problem is to require input data values to be small enough to avoid any overflow conditions (or saturation if operating with saturation mode arithmetic).
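The difference between wraparound overflow and saturation can be sketched in Python as follows. The helpers `add_wrap` and `add_sat` are hypothetical names; they model the two behaviors on 16 bit signed values and are not tied to any specific processor.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap16(value):
    """Keep the low 16 bits and reinterpret them as a two's complement value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def add_wrap(a, b):
    """Addition that silently overflows, as plain two's complement hardware does."""
    return wrap16(a + b)


def add_sat(a, b):
    """Addition that clamps the result to the 16 bit range (saturation mode)."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


a, b = 0x7FFF, 0x0005

print(hex(add_wrap(a, b) & 0xFFFF))   # 0x8004
print(add_wrap(a, b) / 2**15)         # -0.9998779296875, sign flipped by overflow
print(add_sat(a, b) / 2**15)          # 0.999969482421875, clamped at the maximum
```

Saturation still produces an inexact result, but the error is bounded, whereas the wrapped result has the wrong sign.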

Thus, normalizing values to the -1 to 1 range adds several significant burdens to the fixed point algorithm implementer (programmer). The programmer must review the system implementation and scale the maximum absolute value input magnitude range so that overflows associated with additions or MACs are avoided where necessary. The programmer must also keep track of the radix point since the hardware is not aware of the radix point location.
