Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change: [Network] lower TCP connect() timeout to 3s #9112

Merged
merged 1 commit into from Apr 27, 2021

Conversation

TrueBrain
Copy link
Member

@TrueBrain TrueBrain commented Apr 26, 2021

CI will fail till #9116 lands and this is rebased.

Motivation / Problem

OpenTTD currently has the issue that if you have IPv4 and IPv6, but your IPv6 is broken (which happens more than we would like), you cannot really use the Content Service. It starts a connection to it over IPv6, wait till it fails, and tries IPv4 next. The problem: the timeout is 30s. And no sane user is going to look at an empty window for 30s in the hope something shows up :)

Additionally, in an upcoming PR we add STUN support. STUN is a bit of a trick, but basically it tries to connect a client and a server together via various of methods. Two of those are via connect(). Having a 30s timeout there too means that is takes up to a minute before a fallback is tried that is very likely to succeed (but expensive for us in terms of cost). This means the timeout for STUN requests has to be reduced, as otherwise we will get a lot of reports :D

(edit: where I write 30s, it turns out it is this: 20s on Linux, 24s on Windows, and 75s on BSD / MacOS. Timeouts can never exceed that value, so 20s is the max, basically).

Description

Currently we use default OS timeout for TCP connections, which
is around 30s. 99% of the users will never notice this, but there
are a few cases where this is an issue:

- If you have a broken IPv6 connection, using Content Service is
  first tried over IPv6. Only after 30s it times out and tries
  IPv4. Nobody is waiting for that 30s.
- Upcoming STUN support has several methods of establishing a
  connection between client and server. This requires feedback
  from connect() to know if any method worked (they have to be
  tried one by one). With 30s, this would take a very long time.

What is good to mention, is that there is no good value here. Any
value will have edge-cases where the experience is suboptimal. But
with 3s we support most of the stable connections, and if it fails,
the user can just retry. On the other side of the spectrum, with 30s,
it means the user has no possibility to use the service. So worst case
we annoy a few users with them having the retry vs annoying a few
users which have no means of resolving the situation.

It really is a balance.

Limitations

In this PR I purpose to change the default timeout to 3s, but we can tune it per TCP connection we want to setup. For example, I now also set the timeout to 3s for game servers, but we could leave that on 30s.

My reasoning why 3s is fine:

  • I doubt a user will wait more than 3s on joining a game server in this day and age. He will get annoyed by that time, and cancel anyway (which you can't atm, but that is a completely different issue I plan to resolve).
  • 3s means the average latency between you and destination at time of connection is 1500ms. That is a lot of latency.
  • I rather annoy a few users with them having to retry, than give no options to a few users because they have to sit out the 30s timeout.

Arguments against it could be:

  • OS does 30s, who are we to disagree?
  • My connection stalls from time to time, so I am expecting this to fail from time to time (to which I say: I rather annoy you once in a while than give no option to those that will have to wait for the timeout)

The main issue is and remains: 99% of the users connect far before any timeout, so it really is only a problem for a select few, on both sides of the isle. This isn't perfect, but neither is 30s, and anything in between. We can always up the limit if it turns out to be way too low for too many people.

Checklist for review

Some things are not automated, and forgotten often. This list is a reminder for the reviewers.

  • The bug fix is important enough to be backported? (label: 'backport requested')
  • This PR affects the save game format? (label 'savegame upgrade')
  • This PR affects the GS/AI API? (label 'needs review: Script API')
    • ai_changelog.hpp, gs_changelog.hpp need updating.
    • The compatibility wrappers (compat_*.nut) need updating.
  • This PR affects the NewGRF API? (label 'needs review: NewGRF')

@TrueBrain
Copy link
Member Author

A good comment made by @PeterN that for the IPv6 / IPv4 issue, we could always just initiate both connection at (nearly) the same time, and wait to see which connect. That is a bunch of work, but for sure possible.
For STUN that doesn't really work, so we need to do something there anyway.

Either way, I think this solution is simpler, and works as well. But I understand setting timeouts on anything "that might eventually work without it" is always though.

@TrueBrain
Copy link
Member Author

Some additional info:

  • We start to warn a client is lagging if it is 2 seconds behind (so once in-game, we consider a 2 second latency "too much").
  • We drop a client after 16 seconds.
  • We give a client 3 seconds after connect to announce itself.

@TrueBrain TrueBrain force-pushed the network-connect branch 2 times, most recently from c411636 to d38d7bc Compare April 26, 2021 09:55
@nielsmh
Copy link
Contributor

nielsmh commented Apr 26, 2021

Effectively, the only plausible scenarios this PR would block off is an astronaut on the Moon trying to connect to a game being played on Earth.

Currently we use default OS timeout for TCP connections, which
is around 30s. 99% of the users will never notice this, but there
are a few cases where this is an issue:

- If you have a broken IPv6 connection, using Content Service is
  first tried over IPv6. Only after 30s it times out and tries
  IPv4. Nobody is waiting for that 30s.
- Upcoming STUN support has several methods of establishing a
  connection between client and server. This requires feedback
  from connect() to know if any method worked (they have to be
  tried one by one). With 30s, this would take a very long time.

What is good to mention, is that there is no good value here. Any
value will have edge-cases where the experience is suboptimal. But
with 3s we support most of the stable connections, and if it fails,
the user can just retry. On the other side of the spectrum, with 30s,
it means the user has no possibility to use the service. So worst case
we annoy a few users with them having the retry vs annoying a few
users which have no means of resolving the situation.
@TrueBrain TrueBrain merged commit 8fa53f5 into OpenTTD:master Apr 27, 2021
@TrueBrain TrueBrain deleted the network-connect branch April 27, 2021 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants