What Every Data Scientist Should Know About Floating-Point Arithmetic

In 1991 David Goldberg at Xerox PARC published the seminal paper on floating point arithmetic titled “What every computer scientist should know about floating-point arithmetic.” This paper was especially influential in the 1990’s and early 2000’s when limitations in computer hardware drove people to operate in a regime that most exposes them to the limitations of floating point arithmetic, specifically using 32 bit floats for storing and calculating floating point numbers. With modern 64 bit architecture and low storage and RAM costs, the incentive to use 32 bit floats is no longer there and computer scientist generally use 64 bit double precision floats, which greatly reduces them to the issues of floating point arithmetic.

However, 64 bit floats do not remove the issues, it just proverbially kicks the can down the road waiting for usage patterns to evolve to once again cause floating point issues to become prominent again. With Big Data and extremely large data sets, both data scientists and computer scientists need to be concerned about the limitations of floating point arithmetic even with 64 bit floating point variables. Issues like roundoff error and cancellation can make Big Data calculations to be erroneous to the point of being useless.

In this article I cover some best practices that data and computer scientists should use when working with large data sets. The content of this article is very much based of the concepts that David Goldberg wrote about in his 1991 paper. There is no way I can replicate the thoroughness of that paper. However, my goal here is to give the reader a working understanding of the issues with floating point arithmetic and discuss how these learnings can be applied to the Big Data use case.

Floating Point Representation

The core cause for any issue with floating point arithmetic is how floating point numbers get represented in a finite amount of memory within your computer. Per the IEEE 754 standard, a floating point number is represented with 4 basic parts:

\pm C\times \beta ^{E}

Where \pm indicates the sign of the number, C is the coefficient known as the significand (it used to be called the mantissa), \beta is the base the number is expressed in, and E is an exponent applied to the base. In most floating point implementations, \beta is set to base 2, and C is always a fractional number in the “ones” place the base. That means in base 2, the C is always a number of the form 1.F where F is the fractional part of the number. For example:

\begin{aligned}2 & = 1.0\cdot 2^{1}\\10 &= 1.25 \cdot 2^{3}\\22&= 1.375 \cdot 2^{4}\\10000& = 1.220703125\cdot 2^{13}\\0.1& = 1.6\cdot 2^{-4}\end{aligned}

Amongst these examples, the last one, 0.1, presents a particular challenge to being represented by the IEEE standard. While the calculation looks to be precise, the significand cannot be represented precisely in binary. Specifically, \left ( 1.6 \right )_{10} = \left ( 1.100110011001100 \overline{1100} \right )_{2} (if you are unfamiliar with binary fractions, read this). Note the infinitely repeating \overline{1100} digits. Herein lies the problem; per the IEEE 754 standard, an infinite number of digits are not available to represent an infinitely repeating digit sequence.

There are two main types of floating point containers (variable types) in the IEEE standard. One is known as the single precision floating point type, commonly known as the float, is contained in 32 bits of memory. The other is known as the double precision floating point type, commonly known as the double, is contained in 64 bits of memory. Because each floating point type is a different size, they each allocated a different number of bits.

Floating Point TypeTotal Size (bits)Significand Bits
Includes Sign
Decimal Significant DigitsExponent BitsExponent Range
float32247.228-126 to +127
double645315.9511-1022 to +1023
long double12811334.0215−16382 to +16383

In order to represent the \left ( 1.100110011001100 \overline{1100} \right )_{2} in a float, the binary fraction must fit into 24 bits. In the IEEE 754 standard, the 1 value of the binary fraction assumed, so we have 23 bits for the fractional part of the value. The last binary digit of the fraction gets rounded, so the original value of 1.6 is approximated as \left ( 1.10011001100110011001101 \right )_{2} in a float. The significand’s representation of 1.6 ends up containing the value of 1.600000023841858, which when multiplied by 2^{-4} creates the value of 0.10000000149011612.

Most computer languages will round a value to the significant decimal digits of the significand before displaying the value. This means for a float, the value of 0.10000000149011612 will get rounded to 7 significant digits, or 0.1000000. This value rounding then creates the illusion that 0.1 is precisely represented by a float. It is in fact not precisely resented. This floating point representational error is the root cause for most issues data and computer scientists have with floating point calculations.

Floating Point Issues

Due to representational errors in floating point numbers, there are three main issues that data scientists using big data platforms. These are aliasing, roundoff error, and cancellation.


When a floating point value is cast from float to double, the representation of the value is not actually changed. Consider the representation of 0.1. The float‘s 23 bit significand is \left ( 1.10011001100110011001101 \right )_{2}. When casting to a double, enough 0’s are added to the right of the significand value to pad out the 53 bit significand of a double. This means the signficand’s value is still effectively \left ( 1.10011001100110011001101 \right )_{2}. Ideally the significand for 0.1 in a double would be \left ( 1.10011001100110011001100110011001100110011001100110011 \right )_{2}, which has its own representational error but with an error that is much smaller than the float significand. However, casting the float‘s significant to a has the effect of persisting the representational error of the float. Furthermore, since the significant digits of a double are about 15, no rounding will occur when presenting the number is it did for the original float. This means the 0.10000000149011612 value will be used in any calculation and presentation.

The aliasing effect can easily be demonstrated in Python using the NumPy library to control the floating type used.

>>> import numpy as np
>>> val = np.float32(0.1)
>>> val
>>> np.float64(val)

The best way to mitigate aliasing is to consistently use the same floating point type through your entire software system.

Roundoff Error

When two floating point numbers are operated on, their representation is adjusted such that their exponents match the higher value of the two. This also requires the significand to be adjusted such that the original value is still represented. When this occurs, since the exponent of the floating point is usually increase, the significand gets right shifted. The problem arises is when the significand starts to loose bits of the right side due to the limited number of bits in the significand. This causes the significant digits of the significand to be reduced and the value represented rounded off as a result, hence the reason it is called round off error.

As an example, consider adding 100000000 and 10. Here is how each are represented in a 32 bit float:

\begin{array}{rll}100000000 & = \left ( 1.4901161193847656 \right )_{10} \cdot 2^{26} & = \left ( 1.011111010111100001 \right )_{2} \cdot 2^{26}\\10 &= \left ( 1.25 \right )_{10} \cdot 2^{3} &= \left ( 1.01 \right )_{2} \cdot 2^{3}\end{array}

In order to add the two numbers, the representation of 10 gets adjusted to use 26 as its exponent, which means right shifting the significand 23 bits:

\begin{array}{rll}10 &= \left ( 0.0000001490116119384765625\right )_{10} \cdot 2^{26} &= \left (0.0000000000000000000000101\right )_{2} \cdot 2^{26}\end{array}

Adding the two binary signficands yields:

\begin{array}{cl}& \left ( 1.011111010111100001 \right )_{2}\\ + & \underline{\left ( 0.0000000000000000000000101 \right )_{2}} \\ & \left ( 1.0111110101111000010000101 \right )_{2} \\ \end{array}

However, since the 32 bit float only has 23 bits right of the decimal point for the significand, the addition results become:

\begin{array}{cl}& \left ( 1.011111010111100001 \right )_{2}\\ + & \underline{\left ( 0.0000000000000000000000101 \right )_{2}} \\ & \left ( 1.01111101011110000100001 \right )_{2} \\ \end{array}

This causes the last 01 to be dropped, effectively turning the second number’s 10 value into 8. This behavior can be demonstrated in Python:

>>> import numpy as np
>>> values = np.full( 100, 10, dtype=np.float32 )
>>> sum = np.float32(100000000)
>>> for val in values:
...     sum += val
>>> sum

It can generally be stated that roundoff error will occur when the difference in order of magnitude between the two operands is close to or larger than the decimal significant digits of the floating point type being used, or if one or both operands are imprecisely represented and there is any order or magnitude difference.

If you are using a float, almost any data set can easily be affected by roundoff error as the float has only 7 significant digits in the significand. It is not unreasonable for most data sets to have a range of values to varies over 7 or more orders of magnitude. If you are using a double, the range afforded is now 15 orders of magnitude. That means you won’t start having problems until you add 10 to 10,000,000,000,000,000.  In a real world example, if the US Government was using a double to calculate the national debt daily, at the current rate of debt accumulation, it will take over 10,000 years before the daily debt calculations will experience significant round off error. A double will safely handle the order of magnitude range for most data sets.

However, even double types are susceptible to roundoff error caused by imprecise representations. This is because as one of the operand’s significand is right shifted prior to the operation, it will lose bits from its already imprecise representation of the value. The greater the order of magnitude difference between the two operands, the greater the error. Roundoff error can generally be mitigated by using Kahan Summation Algorithm to accumulate numbers. This algorithm works by determining the error of each summation, then separately accumulating this error to add back into the summation later when it is significant enough to not experience roundoff error. In Python, the algorithm can be implemented as follows:

>>> import numpy as np
>>> values = np.full( 100, 10, dtype=np.float32 )
>>> sum = np.float32(100000000)
>>> error = np.float32(0)
>>> for val in values:
...     adjusted_val = val - error
...     temp_sum = sum + adjusted_val
...     error = (temp_sum - sum) - adjusted_val
...     sum = temp_sum
>>> sum

In Python, you do not need to implement the Kahan algorithm yourself. Python has implements a safe summation algorithm for you in the math.fsum() function:

>>> import numpy as np
>>> from math import fsum
>>> values = [np.float64(100000000000000000)]
>>> values.extend(np.full( 100, 10, dtype=np.float64 ))
>>> sum(values)
>>> fsum(values)

Notice how even 64 bit floating point values have roundoff error and how math.fsum() corrects it.


Cancellation, all referred to as loss of significance, is different from roundoff error in that it involves two operands very close in value where one or both are imprecisely represented. When one of these numbers are subtracted from the other, the number of digits in the remainder’s significand might be few, resulting in only a few significant bits from the significand determining the value of the difference.

Consider for example finding the difference between 0.1111112 and 0.1111111. In binary, the resulting operation for a 32 bit float would be:

\begin{array}{cl}& \left ( 1.11000111000111001000101\right)_{2} \cdot 2^{-4}\\ -& \underline{\left ( 1.11000111000111000110111 \right )_{2} \cdot 2^{-4}} \\ & \left ( 0.00000000000000000001110 \right )_{2} \cdot 2^{-4} \\ \end{array}

As can seen in this example, two numbers with 23 bits of information in the significand are reduced down to 4 significant bits of information. The resulting number after left shifting the significand to bring it to standard form will very precisely represent \left ( 1.75 \right )_{10} \cdot 2^{-24}, which itself approximates 0.00000010430813. In this example, over 4% of error was introduced due to the cancellation, much higher than what the significant digits of the float type would suggest is possible.

Note that the Kahan Summation Algorithm described above will not correct cancellation. The Kahan algorithm allows you to recapture information that would have been lost when right shifting the significand of the smaller operand. However, cancellation is not about information lost. Instead, cancellation represents the lack of information in the first place. The imprecise representation of the numbers prevents us from knowing precisely the difference between them.

The above example can easily be demonstrated in Python:

>>> import numpy as np
>>> np.float32(0.1111112) - np.float32(0.1111111)

The classic example of cancellation-prone algorithms is finding roots to the quadratic question. Consider an equation of the form:


The two roots to this equation are:

x_{1}=\frac{ - b -\displaystyle\sqrt {b^2 - 4ac} }{2a} , x_{2}=\frac{ - b +\displaystyle\sqrt {b^2 - 4ac} }{2a}

Now, consider the case where b\gg 4ac, then \sqrt {b^2 - 4ac}\approx b, making the x_{2} root prone to cancellation error. Indeed, if you use a=0.1, b=100, and c=0.1, you get a vastly different answer between 32 bit and 64 bit float:

>>> a=np.float32(0.1) 
>>> b=np.float32(100)
>>> c=np.float32(0.1)
>>> x2=(-b+np.float32(sqrt(b*b-np.float32(4.0)*a*c)))/(np.float32(2.0)*a)
>>> x2
>>> a=np.float64(0.1)
>>> b=np.float64(100)
>>> c=np.float64(0.1)
>>> x2=(-b+np.float64(sqrt(b*b-np.float64(4.0)*a*c)))/(np.float64(2.0)*a)
>>> x2

Using Wolfram Alpha to calculate the right answer with many digits of precision, we get x_{2}=-0.00100000100000200000500001400004200013200042900143.... With this level of precision, you can see that even the 64 bit float has error well within it’s significant digits of 15.92, albeit to a lesser degree than the float 32.

We can fix the cancellation problem by adjusting the x_{2} formula to avoid the unstable subtraction:

\begin{array}{cl}x_{2} & =\frac{ - b +\displaystyle\sqrt {b^2 - 4ac} }{2a} \\ & =\frac{ - b +\displaystyle\sqrt {b^2 - 4ac} }{2a} \cdot \frac{-b-\displaystyle\sqrt {b^2 - 4ac} }{-b-\displaystyle\sqrt {b^2 - 4ac} } \\& =\frac{2c}{-b -\displaystyle\sqrt {b^2 - 4ac} }\end{array}

If you use the updated formula with a 32 bit float, you get a better answer:

>>> a=np.float32(0.1)
>>> b=np.float32(100)
>>> c=np.float32(0.1)
>>> x2a = np.float32(2)*c/(-b-np.float32(sqrt(b*b-np.float32(4)*a*c)))
>>> x2a

Best Practices for Data Science

So the discussion above begets the question: How does any of this pertain to Big Data and Data Science? If the type of data science you are doing involves floating point calculations of any sort, then the issues I describe above will impact you. Below is the top situations in data science when the floating point imprecise representation will create problems for you.

Do Not Look for Floating Point Equality

To demonstrate this issue, run the following code in python:

>>> import numpy as np
>>> sum = np.float64(0)
>>> while sum != np.float64(10.1):
...     sum += np.float64(0.1)
...     print(sum)

As the numbers scroll by, you will see that the value 10.1 does get printed at some point. Note that this roundoff error even occurred in 64 bit floating point values. So why didn’t the loop break? When a floating point comparison operation occurs, the actual bits in both operands are compared. Due to round off error, ever so slight in this case, adding 0.1, which is imprecisely represented, to itself 11 times will not be precisely equal to the floating point representation of 10.1. Since 0.1 and 10.0 are two orders of magnitude apart, 0.1 ‘s significant will get right shifted in the addition, and this the resulting bits will not be precisely the same as directly composing the value 10.1.  Recall that when the language (python) prints out a floating point, it only prints out the significant digits even though the representation would have more digits beyond the significant digits. For this reason, the loop illustrated above will never pass the stoping condition and become an infinite loop.

The right way to test for floating point equality is to use error bounds. The size of the error bounds should beat the significant digits you care about. For the above example, that would look something like:

>> import numpy as np
>>> sum = np.float64(0)
>>> while not ( sum < 10.11 and sum >10.09):
...     sum += np.float64(0.1)
...     print(sum)

Picking the error bounds takes some judgement. You need to consider the error you expect to accumulate and the significant digits at which a difference from your comparison value is meaningful for your situation.

Favor Integer Values Over Floating Point Values When Possible

The idea of using integers rather than floats since integers do not have a imprecise representation seems like a no brainer, but it is surprising common to use floating point values when could also be used. The most common situation where I have seen this issue arise is storing monetary values. Frequently the value $1.10 will be contained in a floating point variable, because it is a floating point value. However, we know that it will be imprecisely represented. You can completely avoid the issues of floating point representation (at least in the scope of comparison, summation and subtraction operations) the value gets reinterpreted as an integer count of cents. So $1.10 should actually represented instead as 110¢. Now, if you do comparison against of summation of these refactored values, you will not get the set of problems described above. Refactoring the units of a value to turn the value into an integer is not always possible, but when it is, integers should be favored.

Do Not Store Real Numbers as 32 bit Floating Point Values in Databases and Row Files

Avoid storing values as 32 bit floats in databases, row files, or any other form of  storage.  Even if you use a 64 bit float in your code, you will be exposed to the aliasing problem. In SQL, this means using a DOUBLE PRECISION (or it’s equivalent, such as REAL in MySQL) instead of a FLOAT.  Do not specify the the two digit configuration for the type – the ( M, D ) part –  as that will force the number to be rounded off regardless of the precision of the underlying data store.

In my career I have seen it frequently argued that one should store floating point values in 32 bit floats is to save on space. Back when hard drives storage was measured in megabytes and floppy disks were still heavily used, this argument was meaningful. Today, when hard drive storages is measured in terabytes and with the existence of distributed file systems, there is no real argument for saving space at this scale. Even if you are required doubling your disk footprint to accommodate the doubling in precision, more hard drives are far cheaper than the time and effort you will need to put in to mitigate the effects of floating point aliasing.

A potentially valid argument against using 64 bit floating point variables in storage is that doubling the size will take longer to deserialize the information off disk, thus slowing down your Big Data job. My recommendation in that case is that the storage files should be compressed. Deserializing compressed data from disk is frequently faster than reading the same data uncompressed from disk. Disk drive read times are so slow and CPUs are so fast that the reduce data read off disk due to compression and the fast decompressing of the data by the CPU will be frequently faster in aggregate than reading the full sized, uncompressed data from disk.

Always Use Double Precision Floating Point Values or Better

Similar to not using 32 bit floating point types to store values, do not use 32 bit floating point types to do calculations in. In Big Data applications, only having 7 significant digits of precision is in the floating point value is rather significant and will needless expose you to round off error and significant cancellation problems. Similar to the arguments presented above, RAM is so cheap that there is really no reason to use the 32 bit float anymore.

Another consideration is a concept not yet discussed: overflow and underflow. Since the exponent of a floating number has a finite size, they have a maximum and minim value that can be represented. Simply put, 64 bit floating point variable can represent much larger number than 32 bit floating points (about 1.8 \times 10^{308} versus about 3.4 \times 10^{38}) You are less likely to have overflow or underflow problems with a double.

Use Kahan Algorithm Avoid Roundoff Error When Reducing Large Data Sets

One of the great things about platforms like Hadoop is that you can take very, very large data sets and reduce them down to interesting insights. In doing so, you will frequently need to accumulate values, sometimes floating point values. Piping a large floating point data set through a summation type reducer will expose you to roundoff error due to the accumulated value will quick grow orders of magnitude larger than the individual values being accumulated. That is the point where round off error occurs, especially if the data sets has a large range in the orders of magnitude of individual values.

A quick rule of thumb is that if the order of magnitude range of your data set (that is, the order of magnitude of the largest absolute value minus the order of magnitude of the smallest absolute value) plus the order of magnitude of the number of values in the data set approaches or exceed the significant digits of the floating point variable type you are using, you will be significantly affected by roundoff error. In order to mitigate that, your reducer should us the Kahan algorithm to accumulate discrete values without being strongly affected by roundoff error. If you are using SQL to do your analysis, create a UDAF for the Kahan algorithm rather than using the SUM() function in SQL.

Rework Algorithms to Avoid Cancellation

In Data Science, you are frequently dealing with mathematical formulas which will transforms input values into some result. But, as demonstrated above with the the simple algorithm of quadratic root finding, it is not that hard to have a formula that is prone to cancellation error. The quadratic root finding formulas above demonstrate this concept. Anything that calculates the Euclidean distance between points, such as the nearest neighbor algorithm or clustering algorithms that use Euclidean distance, will likely be exposed to cancellation when points are near each other in space. The list goes on. The point here is that you should understand where the formula might be sensitive to cancellation error and take steps to mitigate it.

Reworking an algorithm to avoid subtraction all together is an approach for avoid cancellation. However, avoiding the substation may not be workable in all cases. Another strategy might be to re-order the operations such that you are less likely to subtract two numbers that are close in value or when the subtraction occurs, the difference in the two numbers doesn’t occur on the edge of the significant digits of the floating point type. This approach can be demonstrated with the formula a^{2} - b^{2}. Suppose a=900000.2 and b=900000.1. The difference in these two numbers is prone to cancellation with 32 bit floating points due to the difference being at the seventh significant digit. However, the difference in a^{2} - b^{2}=900000.2^{2}-900000.1^{2}=810000360000.04-810000180000.01 has much more cancellation due to large value at the significant digits. It end up being better to have the subtraction occur at a smaller magnitude by refactoring the formula to a^{2} - b^{2}=\left ( a+b \right ) \cdot \left ( a-b \right ). This can be demonstrated in Python:

>>> import numpy as np
>>> a = np.float32(900000.2)
>>> b = np.float32(900000.1)
>>> a*a - b*b
>>> (a+b)*(a-b)

The right answer is 180000.03. You can see that the second method produced less error due to cancellation.

Cancellation error is not to be confused with the concept of condition in data science. An algorithm (or matrix) is said to be ill conditioned if for a small change in input there is a large change in output. This means that the algorithm is prone to amplifying error regardless of the source of error. With ill conditioned problems, you should be just as concerned with measurement error as roundoff error or cancellation error. Being ill conditioned isn’t a necessarily property of floating point representation, but the nature of the algorithm itself.


As any sort of scientist, it is important to understand the limitations of the to tools you are using to solve problems. In the field of data science, the most fundamental tool is floating point numbers. Used improperly or without regard for their limitations, floating point numbers can create significant error in calculations. However, understanding and mitigating the limitations of floating point numbers will allow the data scientist to operate on large swaths of data relatively assured that the result produced is accurate.

One thought on “What Every Data Scientist Should Know About Floating-Point Arithmetic”

Leave a Reply