Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix retransmission and send offset overflow #65

Closed
wants to merge 1 commit into from

Conversation

pothos
Copy link
Contributor

@pothos pothos commented Oct 28, 2017

Resolves #62

Duplicate ACKs should not replace the retransmission timer
but otherwise, as normal ACKs, set the keep-alive timer.

During the retransmission, higher sequence numbers than the
last one sent could be acknowledged. This led to an overflow
in the offset calculation, which resulted in no further data
being sent. Now the maximum is used to correctly set it to 0.

@whitequark
Copy link
Contributor

Thanks. Can you add tests?

Sorry, something went wrong.

@pothos pothos force-pushed the retransmission-and-overflow branch from 0280597 to 700ddd0 Compare October 29, 2017 13:10
@pothos
Copy link
Contributor Author

pothos commented Oct 29, 2017

Good idea, there was one more problem I could spot through the tests: When it continued to send data, it was using a wrong (too low) sequence number.
Also in one debug print there was an overflow.
So I decided to just update the value correctly at the place where the local sequence number is forwarded. Is it ok now?

Sorry, something went wrong.

src/wire/tcp.rs Outdated
@@ -10,7 +10,7 @@ use super::ip::checksum;
///
/// A sequence number is a monotonically advancing integer modulo 2<sup>32</sup>.
/// Sequence numbers do not have a discontiguity when compared pairwise across a signed overflow.
#[derive(Debug, PartialEq, Eq, Clone, Copy, Default)]
#[derive(Debug, PartialEq, Eq, Ord, Clone, Copy, Default)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No no no, this is very intentional! Sequence numbers are not totally ordered! They wrap!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. Then by just looking at the two numbers there is no way to distinguish between the case of normal operation with local_seq_no just wrapped before the max statement and the retransmission case. There is also a wrapped case for retransmission where the minimum would have to be taken. So first a way to find out if the value of remote_last_seq was coming from a retransmission is needed, then it could be updated directly and there is no min/max involved.

@whitequark
Copy link
Contributor

I'll need to take a closer look at these changes. I'm not yet completely convinced they're correct. But, they do seem to go in the right direction.

@pothos pothos force-pushed the retransmission-and-overflow branch from 700ddd0 to 2a69903 Compare November 2, 2017 05:13
@pothos
Copy link
Contributor Author

pothos commented Nov 2, 2017

This is roughly how I would solve it now. Instead of the casts it could use PartialOrd::partial_cmp.
Just considering the two values local_seq_no (newly set) and remote_last_seq and the two cases normal vs retransmission there are four combinations because of wrapping (i.e. ):

(N1) Normal with local_seq_no < remote_last_seq and (N2) normal with remote_last_seq < local_seq_no. (Wrapping happened already for remote_last_seq in N2.)

(R1) Retransmission with remote_last_seq < local_seq_no and (R2) retransmission with local_seq_no < remote_last_seq. (Wrapping happend just now for local_seq_no in R2)

Looking at old_local_seq_no helps to distinguish them, because in the wrapped cases it is guaranteed to be in the interval of the other two numbers while in the non-wrapped cases the opposite holds.

It looks ugly rightnow and needs to presented in a better way. I think it is correct, would you agree that handling these cases is an acceptable approach?

@pothos pothos force-pushed the retransmission-and-overflow branch 2 times, most recently from 530489d to 8196b7c Compare November 5, 2017 07:49
@pothos
Copy link
Contributor Author

pothos commented Nov 5, 2017

Maybe this is now a bit more readable? I have included the wrapping cases in the test. In wire.rs I had to make the two wrapping subtractions explicit because otherwise sending gave panics in debug mode.

@pothos pothos force-pushed the retransmission-and-overflow branch 2 times, most recently from d31d8e7 to 3888f11 Compare November 6, 2017 11:14
(self.local_seq_no < old_local_seq_no && old_local_seq_no < self.remote_last_seq) {
// These two cases describe retransmission where remote_last_seq lags behind local_seq_no.
// The value for the next sequence number to be used is updated.
self.remote_last_seq = self.local_seq_no;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove all empty cases and the unreachable which I added to see if something was missed. Then if would just be the following:

if (self.remote_last_seq < self.local_seq_no &&
   !(self.remote_last_seq < old_local_seq_no && old_local_seq_no < self.local_seq_no)) ||
  (self.local_seq_no < old_local_seq_no && old_local_seq_no < self.remote_last_seq) {
    self.remote_last_seq = self.local_seq_no;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sure that you have discovered a real issue, but I do not seem to understand your explanation. Do you think you could expand on it or rephrase it in a different way?

@pothos pothos force-pushed the retransmission-and-overflow branch from 3888f11 to aaf9ca5 Compare November 10, 2017 04:55
(State::Established, TcpControl::None) => {
self.timer.set_for_idle(timestamp, self.keep_alive);
match self.timer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems obviously correct and OK to merge. Do you think you can extract this part and a dedicated test, and submit a PR for it?

Also, would you mind rewriting the condition a bit? I would add a Timer::is_retransmit() -> bool function, and write the condition as if !timer.is_retransmit() || ack_len != 0 { self.timer.set_for_idle(...) }

src/wire/tcp.rs Outdated
@@ -51,13 +51,13 @@ impl ops::Sub for SeqNumber {
type Output = usize;

fn sub(self, rhs: SeqNumber) -> usize {
(self.0 - rhs.0) as usize
self.0.wrapping_sub(rhs.0) as usize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for this?

@pothos pothos force-pushed the retransmission-and-overflow branch from aaf9ca5 to bf1ffd7 Compare November 13, 2017 09:06
@pothos
Copy link
Contributor Author

pothos commented Nov 13, 2017

Hello,
thanks a lot for the feedback. I've split this commit into two again, first for the regular overflow (test case in socket/tcp.rs because it is just a test for normal send operations which depends on the explicit wrapping), then for the special retransmission wrong-seq-number/overflow bug when the value is not updated.
Regards,
Kai

// ACKs are received, both for regular and retransmitted data.
for local_seq_start in vec![LOCAL_SEQ, TcpSeqNumber(i32::MAX - 7),
TcpSeqNumber(i32::MAX - 6), TcpSeqNumber(i32::MAX - 5), TcpSeqNumber(i32::MAX - 4),
TcpSeqNumber(i32::MAX - 3), TcpSeqNumber(i32::MAX - 2)] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how wrapping is relevant here. TCP sequence numbers are compared modulo 2³². Specifically, TcpSeqNumber(i32::MAX - 6) - TcpSeqNumber(i32::MAX - 3) equals TcpSeqNumber(-6) - TcpSeqNumber(-3).

Can you show me a test case that results in an error only when the sequence number wraps? If one exists, it means there is a bug somewhere else apart from the retransmission code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overflow assertions are fixed in the other commit with an extra test: 3480238 (but are not shown by github anymore since the master merge).
Here the test also includes wrapping and non-wrapping cases to cover the constellations of local_seq_no, old_local_seq_no and remote_last_seq.
When the ACK arrives after the first packet retransmitted for MAX-7 we have: old_local_seq_no < remote_last_seq < local_seq_no
For MAX-6 local_seq_no already wraps to: local_seq_no < old_local_seq_no < remote_last_seq
For MAX-3 local_seq_no and remote_last_seq now wrapped to: remote_last_seq < local_seq_no < old_local_seq_no
In all these cases the value of remote_last_seq will be updated.
The last send macro sends an ACK for a normal transmission and the idea was that also the constellations are permuted for this condition where no update should happen. But I just realized that tx_buffer.len() == 0 is not enough to ensure that this is the case, so I can add some more data to be transfered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now saw the picked commit, thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded the test coverage for normal transmission where remote_last_seq should not be touched.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The example project mentioned in the issue can also be used to reproduce this, but it does not check the contents sent as the test cases do: https://github.com/pothos/taptest)

@pothos pothos force-pushed the retransmission-and-overflow branch from 40133c9 to d51f854 Compare November 14, 2017 07:24
During the retransmission, higher sequence numbers than the
last one calculated after (re)sending could be acknowledged.
This led to an overflow in the offset calculation, which
resulted in no further data being sent.
Also, the wrong sequence number would have been used for
following packets.
Now the sequence number to be used for sending is updated
accordingly when the local sequence number is changed.
@whitequark
Copy link
Contributor

You have correctly identified the root cause of the problem however your patch is overly complicated. There is no need to test for wraparound since that is already handled by the TcpSeqNumber newtype. You can verify that your testcase works on master. I've merged a simpler version of both your fix and your test as b247f64.

@whitequark whitequark closed this Dec 22, 2017
@pothos pothos deleted the retransmission-and-overflow branch January 2, 2018 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants