## Floating Point Numbers

The IEEE standard makes computers work with floating point numbers: the point floats around within the number representation. It is conceptually close to the second variant from our introductory thought experiment. However, we write down the format with an explicit exponent, and we let the exponent determine where the decimal point is. The point does not float around in the bit representation, but it floats around in the real representation. Let's derive this format step by step.

We start our introduction to machine data representations with a simple observation: we need normalisation. It is obvious that $a.bcd \cdot 10^{4} = ab.cd \cdot 10^{3}$. Whenever we write down a number, we have some degree of freedom. Let's exploit this degree of freedom (well, let's actually remove it) and always move the point such that exactly one digit stands to its left and this leading digit is 1. This makes the notation unique, as we work in a binary world where the only non-zero digit is 1. So let $ab.cd \cdot 2^{4}|_{2}$ be a number that is not normalised. The same number $a.bcd \cdot 2^{5}|_{2} = 1.bcd \cdot 2^{5}|_{2}$ is normalised.
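As a concrete instance of this degree of freedom, take the value $5.25$ (chosen here purely for illustration; it does not appear in the original text). The exponents are written in decimal for readability. All three forms denote the same number, but only the last one is normalised:

$$5.25|_{10} = 101.01|_{2} \cdot 2^{0} = 10.101|_{2} \cdot 2^{1} = 1.0101|_{2} \cdot 2^{2}$$

Each step moves the point one position to the left and increases the exponent by one to compensate.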

Definition 4.2 (Normalised number representation) If a number is written down as
$$\hat{x}=(-1)^{\hat{s}}\left(1+\sum_{i=1}^{S-1} \hat{x}_{i} 2^{-i}\right) \cdot 2^{E},$$
it is in its normalised representation.
Normalisation is nice to realise (in hardware or software): it requires bit shifts, which is something computers are really good at.

Consider a representation with four zeroes to the left of the leftmost digit 1. If we assume that the one and only bit left of the decimal point should be a one, then we have to move all the digits four positions to the left. This is called a bit shift. Before the shift, we had an exponent of $4|_{10} = 100|_{2}$, which we have to reduce by four in return.

I emphasise that the discussion of the exponent above is “wrong”. We’ll discuss that in a minute. However, the principal idea holds: hardware shifts the bits around, a process we call normalisation.

## The IEEE Format

Once we write down any (non-integer) number as
$$\hat{x}=(-1)^{\hat{s}}\left(1+\sum_{i=1}^{S-1} \hat{x}_{i} 2^{-i}\right) \cdot 2^{E},$$
the bit sequence is a unique representation of the number $x$. It consists of $S$ significant bits, of which one bit is sacrificed for the sign bit $\hat{s}$. The $\hat{x}_{i}$ are the bits after the decimal point. The significand (also called fraction or mantissa) is supplemented by $E$ exponent bits for the exponent.

Problems result from the fact that we have to squeeze this representation into a finite number of bits. Let's assume that we have 24 bits for the significand plus sign.

One bit is “lost” for the sign. This leaves us with 23 bits. We get one back, as we do not have to store the leading 1 bit of the normalised representation. We know it has to be there, as we have defined normalisation that way, so there is no point in storing it. As a consequence, we know exactly how many bits we have available to store the significand. For a 32-bit floating point number, we use 23 bits for all the significant bits to the right of the decimal point. This is C’s `float`. For a 64-bit number, a C `double`, we use 52 bits (Table 4.1).

Definition 4.3 (Truncation as rounding) If we take a normalised floating point number of any bit count and squeeze it into the (IEEE) floating point number, we might throw bits away. We truncate the representation such that it fits into our predefined number of bits. Effectively, this is rounding or chopping off. The other way round, we might have to add bits as we move the significand to the left. In this case, we add either 1s or 0s. Both of them might be the wrong thing to add (compared to the exact math), so we effectively add garbage.


