Implementing IEEE 754

IEEE 754 stands for IEEE Standard for Floating-Point Arithmetic, which is used to express floating point values. It's name may seems daunting, but actually its just basic scientific notation with base 2 and some bits operations.

We all know that, in computer everything stored in bits, and bits are limited. Normally, we could use 16/32/64/128 or more bits to express a number. Say we decide to use 32 bits, then what? How can we use this 32 bits to express numbers?

Then comes the IEEE 754, which is a standard to express floating numbers. We could follow this standard, and implement the encoding and decoding from raw bits to target numbers. In this article, let's try to follow this standard and build a 32 bits floating point encoder and decoder by hand.

Before we get to the actual implementation, let consider the classic question 0.1 + 0.2 = 0.30000000000000004. Why this could happen? Is this a flaw? Actually no, this is just the way it works. Image how many number beween 0 and 1? Yes, Infinite. But how many value we can express in only 32 bits? Its only Math.pow(2, 32). So when we try to use this limited resources to express infinite values, we just unable to do it precisely. So floating point values in computers is just an approximation to the actual value. That is why the classic question happens. We will write the encoding/decoding code soon, then you will know what is the actual meaning of approximation.

IEEE 754 use scientific notation to express numbers. Basic formula looks like below.

sign * 2^exponent * (1+mantissa)

You can see there are 3 key parts in this formula.

Sign is a 1 bit value, which is used to indicate if the number is positive.

Exponent is a variable to indicate range. Note that it is 2 based. Let's see some range to get feel of it.

exponent value
...
-8       1/256
-7       1/128
-6       1/64
-5       1/32
-4       1/16
-3       1/8
-2       1/4
-1       1/2
 0       1
 1       2
 2       4
 3       8
 4       16
 5       32
 6       64
 7       128
 8       256
...

As you can see, negative exponent can be used to express small number, and positive exponent can be used to express big number.

Mantissa is a variable to express precision. Before, we have used exponent to fit a number in a range, and then we can use this mantissa to express the precision.

Now we know the formula. Let's take a example 12.23 to see how to express it in IEEE 754 format in 32 bits step by step.

According to the standard, we need to use 1 bit to express sign, 8 bits to express exponent, and 23 bits to express mantissa.

const EXP_BITS = 8;
const MANTISSA_BITS = 23;
const NON_SIGN_BITS = EXP_BITS + MANTISSA_BITS;

First, the most easy part, we check if the number is positive, and store result in variable sign.

const sign = Math.sign(n) === -1 ? 1 : 0;

Next, we need to search which range this number should fit in. Look at the above table, we know that 12.23 is between 8 to 16, so exponent range is 3 and 4. We choose the lower one as the value of exponent. This step can be calculated using log operation.

const exponent = Math.floor(Math.log2(n));

Next, we need to calculate mantissa. We already know that the value is in range [8-16], precision is just a point in this range. We can think mantissa is a percentage above lower bound. For example, this point could be in 8 * percentage. So, how to calculate this percentage. This is easy. We know the lower and upper bound, we know actual value, we can just calculate the proportion as the percentage.

const lower = 2 ** exponent;
const upper = 2 ** (exponent + 1);
const percentage = (n - lower) / (upper - lower);

But percentage is not the actually value, we need to express it in binary format. Because we know we should use exactly 23 bits to express this part. So using same logic of proportion, we can convert percentage into a value.

const mantissa = MAX_MANTISSA_VALUE * percentage;

This MAX_MANTISSA_VALUE is just the maximum value can be expressed in 23 bits.

const MAX_MANTISSA_VALUE = 2 ** MANTISSA_BITS;

Last thing about encoding is that, exponent can be negative, but there is no sign bit in exponent part. So we need to add a bias to exponent, to make it like from [-n, n] to [0, 2n], so it can be expressed in positive interge. This is just a small trick, later we will subtrack this bias in decoding part.

const BIAS = Math.pow(2, EXP_BITS) / 2 - 1;

So, finally, we can encode all the above values in a variable in 32 bits.

(sign << NON_SIGN_BITS) | ((exponent + BIAS) << MANTISSA_BITS) | mantissa;

And the whole encoding code.

const EXP_BITS = 8;
const MANTISSA_BITS = 23;
const NON_SIGN_BITS = EXP_BITS + MANTISSA_BITS;
const BIAS = Math.pow(2, EXP_BITS) / 2 - 1;
const MAX_MANTISSA_VALUE = 2 ** MANTISSA_BITS;

const encode = n => {
    const sign = Math.sign(n) === -1 ? 1 : 0;
    n = Math.abs(n);
    const exponent = Math.floor(Math.log2(n));
    const lower = 2 ** exponent;
    const upper = 2 ** (exponent + 1);
    const percentage = (n - lower) / (upper - lower);
    const mantissa = MAX_MANTISSA_VALUE * percentage;
    return (sign << NON_SIGN_BITS) | ((exponent + BIAS) << MANTISSA_BITS) | mantissa;
};

Now let's do the decoding part.

First, we test the first bit to know the sign value.

const sign = (n >>> NON_SIGN_BITS);

Then we get the next 8 bits to know the exponent value.

const exponent = (n >>> MANTISSA_BITS) & ~(1 << EXP_BITS);

And then we get the least 23 bits to know mantissa value.

const mantissa = (n & ~(~0 << MANTISSA_BITS));

With mantissa, we can calculate percentage backwards.

const percentage = mantissa / MAX_MANTISSA_VALUE;

And then calculate final result.

(-1) ** sign * (1 + percentage) * 2 ** (exponent - BIAS);

Let's see the whole decoding part.

const decode = n => {
    const sign = (n >>> NON_SIGN_BITS);
    const exponent = (n >>> MANTISSA_BITS) & ~(1 << EXP_BITS);
    const mantissa = (n & ~(~0 << MANTISSA_BITS));
    const percentage = mantissa / MAX_MANTISSA_VALUE;
    return (-1) ** sign * (1 + percentage) * 2 ** (exponent - BIAS);
};

Finally, we can test our encoding and decoding process.

const original = 0.1;
const encoded = encode(original);
const decoded = decode(encoded);
console.log({ original, encoded, decoded });
// { original: 0.1, encoded: 1036831948, decoded: 0.09999999403953552 }

Don't be surprised when you see the 09999999403953552. This is just the approximation I said in the beginning.

Lastly, this is just a toy code to show the idea of IEEE 754. Use it with causion.

yaox023's blog

yaox023's blog

Implementing IEEE 754