Subject: floating-point formats not according to spec
Date: Mon, 2 Feb 2015 15:14:53 +0000
To: bug-Sereal-Encoder [...] rt.cpan.org
From: Zefram <zefram [...] fysh.org>

The Sereal format spec clearly says that floating-point values
are to be stored in IEEE formats. However, the spec does poorly at
specifying which formats these are, and the implementation actually uses
not-necessarily-IEEE native formats.

First, the spec. The tag table documents tags named "FLOAT", "DOUBLE",
and "LONG_DOUBLE", describing the type-specific data for each as
"<IEEE-FLOAT>", "<IEEE-DOUBLE>", and "<IEEE-LONG-DOUBLE>" respectively.
This, along with the general note about using IEEE formats and the
other general note about numeric quantities being little-endian, is the
entire specification for the floating-point representations. This is
a problem, because the terms "float" and "long double" don't have any
defined meaning in the context of IEEE 754, so the intended referents
are unclear. The terms "float", "double", and "long double" are
actually C type names, and IEEE 754 does not specify any particular
mapping of its formats onto C types.

IEEE 754 defines four binary floating-point formats for data interchange.
Each has an explicit name based on its bit length, and a common
name based on multiples of the historical default size of 32 bits.
The four formats are binary16 ("half precision", 5 exponent bits, 10
fractional significand bits), binary32 ("single precision", 8 exponent
bits, 23 fractional significand bits), binary64 ("double precision", 11
exponent bits, 52 fractional significand bits), and binary128 ("quadruple
precision", 15 exponent bits, 112 fractional significand bits). The spec
should refer to these formats by either of their names, not by C types.
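
For reference, the arithmetic behind those four formats can be sketched
as follows. This is an illustrative Python snippet, not anything from
Sereal; the names IEEE_BINARY_FORMATS, total_bits, and exponent_bias are
made up for the example. It only restates the field widths listed above
and applies the usual IEEE 754 bias rule of 2**(w-1) - 1 for a w-bit
exponent field.

```python
# Field widths of the four IEEE 754 binary interchange formats named above.
# Each entry maps a format name to (exponent bits, fractional significand bits).
IEEE_BINARY_FORMATS = {
    "binary16": (5, 10),
    "binary32": (8, 23),
    "binary64": (11, 52),
    "binary128": (15, 112),
}


def total_bits(name):
    """Sign bit + exponent bits + fractional significand bits."""
    exponent_bits, fraction_bits = IEEE_BINARY_FORMATS[name]
    return 1 + exponent_bits + fraction_bits


def exponent_bias(name):
    """Bias applied to the stored exponent field."""
    exponent_bits, _ = IEEE_BINARY_FORMATS[name]
    return 2 ** (exponent_bits - 1) - 1


for name in IEEE_BINARY_FORMATS:
    print(name, total_bits(name), exponent_bias(name))
```

Each total comes out equal to the number in the format's name, which is
the sense in which the names are unambiguous where "float" and "long
double" are not.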

Next, the implementation. srl_encoder.c (and matching srl_decoder.c
of Sereal::Decoder) does not make any effort to use IEEE formats in
serialised data. Instead it uses the native C floating-point types,
associating float, double, and long double each with the similarly-named
tag. It doesn't even canonicalise endianness, so even where the native
types are IEEE, a big-endian system serialises contrary to the spec's
statement that numeric data are little-endian. (With the varint clause
explicitly specifying little-endian, and all other numeric quantities
being single bytes, the floating-point formats seem to be the only data
to which the general endianness statement applies.)
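
To illustrate the endianness point, here is a small Python sketch. It
uses the standard struct module, whose "<d" and ">d" codes pack IEEE 754
binary64 in little-endian and big-endian byte order respectively. The
big-endian packing is only a stand-in for what a big-endian IEEE host
would emit by copying a native double; none of this is Sereal code, and
decode_le_binary64 is a hypothetical helper name.

```python
import struct

value = 1.5

# What the spec appears to call for: IEEE 754 binary64, little-endian.
little = struct.pack("<d", value)

# Stand-in for a big-endian IEEE host copying its native bytes unchanged.
big = struct.pack(">d", value)


def decode_le_binary64(data):
    """Decode 8 bytes as little-endian binary64, as a spec-following reader would."""
    return struct.unpack("<d", data)[0]


print(little.hex(), decode_le_binary64(little))
print(big.hex(), decode_le_binary64(big))
```

The first line round-trips. The second decodes without any error but
yields a tiny subnormal number instead of 1.5, which is the kind of
silent corruption at issue.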

Hosts with different endianness or otherwise differing floating-point
formats of the same size will see corrupted numeric data when they
try to exchange serialised NVs. Hosts whose floating-point formats
for particular C types have different sizes will see worse corruption,
by virtue of losing synchronisation between encoder and decoder.
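
The loss of synchronisation can be shown with a toy tagged stream. The
Python sketch below is deliberately not Sereal: the tag values 0xF0 and
0xF1, the function names, and the two-item stream are all invented for
the illustration. The payload sizes of 16 and 12 bytes are just example
values for two hosts that disagree about the size of long double.

```python
# Hypothetical one-byte tags for this toy stream; NOT Sereal's real tag values.
TAG_LONG_DOUBLE = 0xF0
TAG_SMALL_INT = 0xF1


def encode(items, long_double_size):
    """Toy encoder: each item is a tag byte followed by a fixed-size payload."""
    out = bytearray()
    for kind, payload in items:
        if kind == "long_double":
            # Stand-in for copying sizeof(long double) native bytes.
            out.append(TAG_LONG_DOUBLE)
            out += payload.ljust(long_double_size, b"\x00")
        else:
            out.append(TAG_SMALL_INT)
            out.append(payload)
    return bytes(out)


def decode_tags(data, long_double_size):
    """Toy decoder: return the tag bytes it believes it saw, in order."""
    tags, pos = [], 0
    while pos < len(data):
        tag = data[pos]
        tags.append(tag)
        if tag == TAG_LONG_DOUBLE:
            pos += 1 + long_double_size
        elif tag == TAG_SMALL_INT:
            pos += 2
        else:
            break  # not a known tag: the decoder has lost synchronisation
    return tags


items = [("long_double", b"\x01" * 10), ("small_int", 7)]
stream = encode(items, long_double_size=16)

print(decode_tags(stream, long_double_size=16))  # same size on both sides
print(decode_tags(stream, long_double_size=12))  # decoder assumes a smaller type
```

With matching sizes the decoder sees both tags. With the smaller size it
resumes reading four bytes early, lands inside the payload, and treats a
padding byte as the next tag.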

I'm not presently affected by this, but foreseeably might become affected.
We configure our perls to use long double for NV, and currently for
us that's the decidedly non-IEEE x87 80-bit format. (Not only is
the format not one of the lengths specified by IEEE, but by using an
explicit integral significand bit it doesn't even follow IEEE's rules
for constructing floating-point formats.) It is foreseeable that this
format will eventually be supplanted, one way or another, by the IEEE
quad-precision format, which is already used for the long double type on
some platforms. In switching over we would face a compatibility problem.
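
For anyone unfamiliar with the x87 layout, the explicit integer bit can
be seen in a short Python sketch. It assumes the usual 10-byte
little-endian encoding (64-bit significand, then a 16-bit word holding
the sign and a 15-bit exponent biased by 16383). The function name
decode_x87_extended is hypothetical, and the sketch deliberately ignores
infinities and NaNs.

```python
from fractions import Fraction

X87_EXPONENT_BIAS = 16383  # 15-bit exponent field, same bias as binary128


def decode_x87_extended(data):
    """Decode 10 little-endian bytes of x87 extended precision.

    Returns (integer_bit, value). Infinities and NaNs are not handled.
    """
    if len(data) != 10:
        raise ValueError("expected exactly 10 bytes")
    significand = int.from_bytes(data[:8], "little")  # all 64 bits are stored
    sign_and_exponent = int.from_bytes(data[8:], "little")
    sign = -1 if sign_and_exponent >> 15 else 1
    exponent = sign_and_exponent & 0x7FFF
    if exponent == 0x7FFF:
        raise ValueError("infinity or NaN: not handled in this sketch")
    # Bit 63 is the explicit integer bit; IEEE interchange formats imply it.
    integer_bit = significand >> 63
    # An exponent field of 0 (zero, denormals) scales like a field value of 1.
    unbiased = max(exponent, 1) - X87_EXPONENT_BIAS
    # Fraction is used because the range exceeds that of a Python float.
    value = sign * Fraction(significand) * Fraction(2) ** (unbiased - 63)
    return integer_bit, value


# 1.0: exponent field 0x3FFF, significand 0x8000000000000000
print(decode_x87_extended(bytes.fromhex("0000000000000080ff3f")))
```

In binary64 or binary128 the leading 1 of a normal number is implied by
the exponent field and never stored; here it occupies bit 63 of the
significand, which is why the format cannot be described by IEEE's
exponent-width and fraction-width parameters alone.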

-zefram