Comparing floating point values

So, floating-point numbers. They’re a pain.

Because, as you probably know by now, floating-point values aren’t what they seem. Most decimal fractions, 1.1 included, have no exact binary representation, so what gets stored is an approximation of the number. It’s a really good approximation, but it’s still an approximation.

Take for example:

#include <stdio.h>

int main(int argc,char **argv)
{
 double a=1.1,b=2.2,c=3.3,d;

 d=a+b;

 if (d==c)
 printf("True\n");
 else
 printf("False\n");

 return 0;
}

This should print True, because obviously 1.1+2.2==3.3.

Instead, the program prints False.

The approximations are great for many things, but exact equality tests aren’t one of them.
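To see the approximation for yourself, print both values with 17 significant digits, which is enough to tell any two distinct doubles apart. This is only a minimal sketch; show17 and print_both are names I made up for illustration, not library functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: render x with 17 significant digits. */
static void show17(double x, char *buf, size_t n)
{
    assert(buf != NULL && n >= 32);   /* caller must supply enough room */
    snprintf(buf, n, "%.17g", x);
}

/* Print both sides of the comparison and whether their digits match. */
static void print_both(double sum, double literal)
{
    char s[32], l[32];

    show17(sum, s, sizeof s);
    show17(literal, l, sizeof l);
    printf("sum:     %s\nliteral: %s\n%s\n", s, l,
           strcmp(s, l) == 0 ? "same digits" : "different digits");
}
```

Calling print_both(1.1 + 2.2, 3.3) should show 3.3000000000000003 for the sum and 3.2999999999999998 for the literal, so the two sides land on either side of the real 3.3.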

So, how do you fix it?

Simple: you use ints.

Let’s copy the raw bits over to ints and see what we get…

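#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(int argc,char **argv)
{
  double a=1.1,b=2.2,c=3.3,d;
  int64_t left,right;

  d=a+b;

  /* Copy the raw bits of each double into a same-sized integer */
  memcpy(&left,&d,sizeof(int64_t));
  memcpy(&right,&c,sizeof(int64_t));

  /* PRId64 is the portable printf format for int64_t */
  printf("Left: %" PRId64 ", Right: %" PRId64 "\n",left,right);

  return 0;
}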

Left: 4614613358185178727, Right: 4614613358185178726

Well, look at that. They’re one apart.
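One apart in these integers means the two doubles are immediate neighbors: nothing representable lies between them. Here is a small sketch to check that, assuming a 64-bit IEEE 754 double. The name step_bits is made up for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(int64_t), "expects a 64-bit double");

/* Hypothetical helper: move the bit pattern of x by `steps` and return
   the resulting double. For positive finite x, +1 is the next double up
   and -1 is the next double down. */
static double step_bits(double x, int64_t steps)
{
    int64_t bits;

    memcpy(&bits, &x, sizeof bits);   /* double -> raw bits */
    bits += steps;
    memcpy(&x, &bits, sizeof x);      /* raw bits -> double */
    return x;
}
```

With this, step_bits(3.3, 1) == 1.1 + 2.2 comes out true, which matches the Left and Right values printed above: the sum is simply the next double after 3.3.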

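Turns out that this is all effectively a rounding error. For finite IEEE 754 doubles of the same sign, adjacent values have adjacent bit patterns, so a difference of one means the two results are next-door neighbors. So if you are comparing two floats, copy their bits into ints and then see if they’re no more than one apart.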

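#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(int argc,char **argv)
{
  double a=1.1,b=2.2,c=3.3,d;
  int64_t left,right,diff;

  d=a+b;

  memcpy(&left,&d,sizeof(int64_t));
  memcpy(&right,&c,sizeof(int64_t));

  /* abs() takes an int and would truncate a 64-bit difference,
     so compare the difference directly */
  diff=left-right;

  if (diff>-2 && diff<2)
    printf("True\n");
  else
    printf("False\n");

  return 0;
}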

And there we are, correct answer.
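The program above only handles this one pair of positive values. If you want something reusable, here is a rough sketch under a few assumptions of mine: doubles are 64-bit IEEE 754, NaN never equals anything, and +0.0 and -0.0 count as the same value. The names ordered_bits and nearly_equal_ulps are made up for this post.

```c
#include <assert.h>    /* static_assert (C11) */
#include <math.h>      /* isnan */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>    /* memcpy */

static_assert(sizeof(double) == sizeof(int64_t), "expects a 64-bit double");

/* Map the bits of x onto an integer scale that rises with x.
   Negative doubles are stored sign-magnitude, so they are flipped
   below zero; -0.0 and +0.0 both map to 0. */
static int64_t ordered_bits(double x)
{
    int64_t bits;

    memcpy(&bits, &x, sizeof bits);
    return bits < 0 ? INT64_MIN - bits : bits;
}

/* True when a and b are at most max_ulps representable steps apart. */
static bool nearly_equal_ulps(double a, double b, uint64_t max_ulps)
{
    int64_t ia, ib;
    uint64_t dist;

    if (isnan(a) || isnan(b))
        return false;

    ia = ordered_bits(a);
    ib = ordered_bits(b);

    /* Subtract as unsigned so far-apart values cannot overflow */
    dist = ia > ib ? (uint64_t)ia - (uint64_t)ib
                   : (uint64_t)ib - (uint64_t)ia;
    return dist <= max_ulps;
}
```

nearly_equal_ulps(1.1 + 2.2, 3.3, 1) returns true, while a tolerance of 0 returns false, the same as ==. One thing to keep in mind: on this scale infinity sits one step past the largest finite double, so decide whether that suits your use before relying on it.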

As a note, I was mucking around in JavaScript when this bit me for the hundredth time, and I decided to fix it once and for all. So here's the JS code which does the same:

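function float_equal(left,right)
{
    var ary1,ary2,iarray1,iarray2;
    var comp;

    if (left==right)
        return true;

    // Quick rejection of values that differ by 1 or more (this also
    // rules out adjacent doubles at magnitudes of 2^52 and up, where
    // the spacing between doubles is at least 1)
    if (Math.abs(left-right)>=1)
        return false;

    ary1=new Float64Array(1);
    ary1[0]=left;
    ary2=new Float64Array(1);
    ary2[0]=right;

    // View the same bytes as 64-bit integers
    iarray1=new BigInt64Array(ary1.buffer);
    iarray2=new BigInt64Array(ary2.buffer);

    // comp is a BigInt; comparing it against plain numbers is allowed
    comp=iarray1[0]-iarray2[0];

    if (comp > -2 && comp < 2)
        return true;

    return false;
}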

It's not even close to the efficiency of the C version, but it does work.