Adding two IEEE754 floating-point representations and interpreting the result.
This isn't for any class or homework. As part of my personal study, I'm trying to better understand the IEEE754 representation of decimal floating-point numbers in binary. I'd like to add two numbers: $1.111$ and $2.222$, then compare the result by converting the IEEE754 representation of the sum back to decimal.
Per this online tool:
- $1.111 = 00111111100011100011010100111111$
- $2.222 = 01000000000011100011010100111111$
Summing these two together using signed binary addition, I get:
$0111 1111 1001 1100 0110 1010 0111 1110$
In hexadecimal, this is:
$7F9C6A7E$
And according to this other version of the tool, that corresponds to $NaN$.
What's going on here?
binary floating-point
You can't expect doing integer addition on floating-point representations to give meaningful results.
– Henning Makholm
Nov 25 at 1:01
How would I go about trying to do what I want to do here?
– AleksandrH
Nov 25 at 1:06
I have no idea what it is you want to do. Use floating-point addition rather than integer?
– Henning Makholm
Nov 25 at 1:07
Yes, I was under the impression that once I have the two floating-point numbers represented as binary strings, I could simply add them together bit by bit and then translate the resulting 32-bit string to decimal floating point. The IEEE754 standard defines conversions in both directions (binary to decimal and decimal to binary).
– AleksandrH
Nov 25 at 1:12
You have to adjust them so they have the same exponent before you add the mantissas. You ought to read about how the IEEE754 representation is actually constructed.
– saulspatz
Nov 25 at 1:12
asked Nov 25 at 0:53
AleksandrH
1 Answer
You cannot expect to use integer binary addition on two floating-point representations and get a meaningful result.
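As a quick check of what integer addition does to the two bit patterns from the question, here is a minimal Python sketch. The helper name `bits_to_float` is made up for illustration; it only uses the standard `struct` module to reinterpret a 32-bit integer as a single-precision value.

```python
import math
import struct

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit integer as an IEEE-754 single-precision value (hypothetical helper)."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

a = 0x3F8E353F  # bit pattern quoted in the question for 1.111
b = 0x400E353F  # bit pattern quoted in the question for 2.222

# Plain integer addition of the two 32-bit patterns.
int_sum = (a + b) & 0xFFFFFFFF
exponent_field = (int_sum >> 23) & 0xFF
fraction_field = int_sum & 0x7FFFFF

print(hex(int_sum))                        # 0x7f9c6a7e
print(exponent_field, fraction_field != 0)
print(math.isnan(bits_to_float(int_sum)))
```

The two exponent fields (127 and 128) get added as integers, together with a carry out of the fraction fields, which fills the exponent field with ones. An all-ones exponent with a nonzero fraction is how single precision encodes NaN.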
First, $1.111$ cannot be represented exactly in binary floating point. Your 00111111100011100011010100111111 is actually the IEEE-754 single precision representation of the number
$$ 1.11099994182586669921875 $$
which is the closest representable number to $1.111$. This breaks up as
    0     01111111         00011100011010100111111
    sign  biased exponent  fractional part of mantissa
and stands for the number
$$ 1.00011100011010100111111_2 \times 2^{127-127} $$
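The field split above can be reproduced with a short Python sketch. `decode_single` is an illustrative helper, not part of any library, and it assumes a normal number (it does not handle zeros, subnormals, infinities or NaNs). Using `Fraction` keeps the decoded value exact.

```python
from fractions import Fraction

def decode_single(bits: str):
    """Split a 32-character bit string into sign, biased exponent, fraction and exact value.

    Illustrative only: assumes a normal number with an implicit leading 1.
    """
    sign = int(bits[0], 2)
    biased_exp = int(bits[1:9], 2)
    fraction = int(bits[9:], 2)
    significand = Fraction(2**23 + fraction, 2**23)   # 1.fraction in binary
    value = (-1) ** sign * significand * Fraction(2) ** (biased_exp - 127)
    return sign, biased_exp, fraction, value

sign, biased_exp, fraction, value = decode_single("00111111100011100011010100111111")
print(sign, biased_exp, fraction)   # 0 127 931135
print(value, float(value))
```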
The representation of $2.222$ is twice that, with the same mantissa but the exponent one higher. When we add them we must position the mantissas correctly with respect to each other:
        1.00011100011010100111111
    +  10.0011100011010100111111
    ----------------------------
    =  11.01010101001111110111101
       11.0101010100111111011110   <-- rounded to 1+23 bits mantissa using round-to-even

    0     10000000    10101010100111111011110
    sign  biased exp  fractional mantissa
And the representation 01000000010101010100111111011110 corresponds to the number
$$ 3.332999706268310546875 $$
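The align, add and round steps can be sketched with plain integer arithmetic in Python. The function `add_positive_normals` is a hypothetical helper written for this example. It assumes both inputs are positive normal numbers and ignores overflow to infinity, so it is a sketch of the idea rather than a full IEEE-754 adder. The last lines cross-check it against the rounding that `struct` performs when packing a Python float into single precision.

```python
import struct

def float_to_bits(x: float) -> int:
    """Round a Python float to single precision and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def add_positive_normals(a_bits: int, b_bits: int) -> int:
    """Add two positive normal singles given as bit patterns (illustrative sketch)."""
    exp_a, exp_b = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    # Restore the implicit leading 1 to get 24-bit integer significands.
    sig_a = (a_bits & 0x7FFFFF) | 0x800000
    sig_b = (b_bits & 0x7FFFFF) | 0x800000
    # Align both significands to the smaller exponent so no bits are lost.
    exp = min(exp_a, exp_b)
    total = (sig_a << (exp_a - exp)) + (sig_b << (exp_b - exp))
    # Normalise back to 24 bits, rounding half to even.
    shift = total.bit_length() - 24
    if shift > 0:
        remainder = total & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        total >>= shift
        if remainder > half or (remainder == half and total & 1):
            total += 1
            if total == 1 << 24:      # rounding carried into a new bit
                total >>= 1
                shift += 1
        exp += shift
    return (exp << 23) | (total & 0x7FFFFF)

a, b = 0x3F8E353F, 0x400E353F
manual = add_positive_normals(a, b)
hardware = float_to_bits(bits_to_float(a) + bits_to_float(b))
print(hex(manual), hex(hardware))   # 0x40554fde 0x40554fde
```

In this case the exact sum needs 25 significant bits and the dropped bit is exactly half a unit in the last place, so the tie is resolved toward the even neighbour.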
Note that this is not the closest representable number to $3.333$, which would be the next one,
$$ 3.3329999446868896484375 $$
but the round-to-even rule led to rounding down the full result of the addition, which compounded the error inherent in the two inputs each being slightly smaller than $1.111$ and $2.222$.
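To compare the computed sum with the single-precision value closest to $3.333$, a small Python sketch like the following can be used. The helper names are made up for illustration. `Decimal` is used only to print the exact decimal expansion of each value.

```python
import struct
from decimal import Decimal

def float_to_bits(x: float) -> int:
    """Round a Python float to single precision and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Single-precision sum of the two inputs from the question.
sum_bits = float_to_bits(bits_to_float(0x3F8E353F) + bits_to_float(0x400E353F))
# Single-precision value obtained by rounding the decimal literal 3.333 directly.
nearest_bits = float_to_bits(3.333)

print(hex(sum_bits), Decimal(bits_to_float(sum_bits)))
print(hex(nearest_bits), Decimal(bits_to_float(nearest_bits)))
print(nearest_bits - sum_bits)   # 1, i.e. the two patterns are adjacent
```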
I followed this well until we got to the $10.00...$ part. Why did the decimal point move one place to the right?
– AleksandrH
Nov 25 at 1:43
@AleksandrH: Because the second addend has a biased exponent of 10000000, so it represents the number $1.\langle\mathit{mantissa}\rangle_2 \times 2^{128-127}$ -- in other words the binary point is shifted one position to the right.
– Henning Makholm
Nov 25 at 1:46
Yeah, I don't understand. Sorry for wasting your time.
– AleksandrH
Nov 25 at 14:06
@AleksandrH: The job of the exponent is to encode where the binary point is. That's what makes the representation "floating point" -- you can move the point! In the $2.222$ representation the exponent is $1$ (after we subtract the fixed bias), meaning that the point is after one of the explicitly represented mantissa bits.
– Henning Makholm
Nov 25 at 14:15
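One way to see the exponent moving the binary point is to print the 24 significand bits with the point placed according to the unbiased exponent. This is a minimal Python sketch; `binary_with_point` is a hypothetical helper that only handles normal numbers with unbiased exponents from 0 to 23.

```python
def binary_with_point(bits: int) -> str:
    """Render a single-precision pattern as a binary numeral (illustrative, exponents 0..23 only)."""
    exp = ((bits >> 23) & 0xFF) - 127                     # remove the bias
    sig = format((bits & 0x7FFFFF) | 0x800000, "024b")    # implicit 1 plus 23 fraction bits
    return sig[: 1 + exp] + "." + sig[1 + exp :]

one = 0x3F8E353F             # pattern for 1.111
two = one + (1 << 23)        # same fraction bits, exponent field one higher

print(hex(two))                  # 0x400e353f
print(binary_with_point(one))    # 1.00011100011010100111111
print(binary_with_point(two))    # 10.0011100011010100111111
```

The digits are identical in both lines; only the position of the point differs, which is why the two numbers have to be lined up at the point before their mantissas are added.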
edited Nov 25 at 1:39
answered Nov 25 at 1:23
Henning Makholm