Change: [Network] lower TCP connect() timeout to 3s #9112

TrueBrain · 2021-04-26T09:46:45Z

CI will fail till #9116 lands and this is rebased.

Motivation / Problem

OpenTTD currently has the issue that if you have IPv4 and IPv6, but your IPv6 is broken (which happens more often than we would like), you cannot really use the Content Service. It starts a connection over IPv6, waits until that fails, and only then tries IPv4. The problem: the timeout is 30s. And no sane user is going to look at an empty window for 30s in the hope something shows up :)

Additionally, in an upcoming PR we add STUN support. STUN is a bit of a trick, but basically it tries to connect a client and a server together via various methods. Two of those are via connect(). Having a 30s timeout there too means that it takes up to a minute before a fallback is tried that is very likely to succeed (but expensive for us in terms of cost). This means the timeout for STUN requests has to be reduced, as otherwise we will get a lot of reports :D

(edit: where I write 30s, it turns out to be: 20s on Linux, 24s on Windows, and 75s on BSD / macOS. A timeout we set can never exceed the OS value, so 20s is basically the max.)

Description

Currently we use the default OS timeout for TCP connections, which
is around 30s. 99% of the users will never notice this, but there
are a few cases where this is an issue:

- If you have a broken IPv6 connection, using Content Service is
  first tried over IPv6. Only after 30s does it time out and try
  IPv4. Nobody is waiting for those 30s.
- Upcoming STUN support has several methods of establishing a
  connection between client and server. This requires feedback
  from connect() to know if any method worked (they have to be
  tried one by one). With 30s, this would take a very long time.
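
The sequential fallback described in the two points above can be sketched roughly as follows. This is a minimal Python illustration, not the OpenTTD implementation (which is C++); the function name `connect_with_fallback` and the constant `CONNECT_TIMEOUT_S` are hypothetical, and the address list is assumed to be already resolved and ordered (for example IPv6 first, then IPv4).

```python
import socket

CONNECT_TIMEOUT_S = 3.0  # hypothetical constant; 3s is the value proposed in this PR


def connect_with_fallback(addresses, timeout=CONNECT_TIMEOUT_S):
    """Try each (family, sockaddr) pair in order and return the first connected socket.

    Every attempt is bounded by `timeout` seconds instead of the OS default,
    so a broken first address delays the next attempt by at most `timeout`.
    """
    last_error = None
    for family, sockaddr in addresses:
        sock = None
        try:
            sock = socket.socket(family, socket.SOCK_STREAM)
            sock.settimeout(timeout)  # bounds the connect() call below
            sock.connect(sockaddr)
        except OSError as exc:  # refused, unreachable, or timed out
            last_error = exc
            if sock is not None:
                sock.close()
            continue
        sock.settimeout(None)  # back to blocking mode for normal use
        return sock
    if last_error is not None:
        raise last_error
    raise OSError("no addresses to try")
```

With two addresses and a dead first one, the worst-case wait before the second attempt starts is the timeout itself: 3s instead of the OS default.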

It is worth mentioning that there is no good value here. Any
value will have edge cases where the experience is suboptimal. But
with 3s we support most of the stable connections, and if it fails,
the user can just retry. On the other side of the spectrum, with 30s,
the user effectively cannot use the service at all. So worst case
we annoy a few users by making them retry, versus annoying a few
users who have no means of resolving the situation.

It really is a balance.

Limitations

In this PR I propose to change the default timeout to 3s, but we can tune it per TCP connection we want to set up. For example, I now also set the timeout to 3s for game servers, but we could leave that at 30s.

My reasoning why 3s is fine:

- I doubt a user will wait more than 3s when joining a game server in this day and age. They will get annoyed by that time, and cancel anyway (which you can't atm, but that is a completely different issue I plan to resolve).
- A 3s timeout allows a round trip of up to 3s, which means an average one-way latency of 1500ms between you and the destination at the time of connecting. That is a lot of latency.
- I would rather annoy a few users by making them retry than give no options to a few users because they have to sit out the 30s timeout.

Arguments against it could be:

- The OS does 30s, who are we to disagree?
- My connection stalls from time to time, so I am expecting this to fail from time to time (to which I say: I would rather annoy you once in a while than give no option to those who will have to wait for the timeout).

The main issue is and remains: 99% of the users connect far before any timeout, so it really is only a problem for a select few, on both sides of the aisle. This isn't perfect, but neither is 30s, nor anything in between. We can always raise the limit if it turns out to be way too low for too many people.

Checklist for review

Some things are not automated, and forgotten often. This list is a reminder for the reviewers.

The bug fix is important enough to be backported? (label: 'backport requested')
This PR affects the save game format? (label 'savegame upgrade')
This PR affects the GS/AI API? (label 'needs review: Script API')
- ai_changelog.hpp, gs_changelog.hpp need updating.
- The compatibility wrappers (compat_*.nut) need updating.
This PR affects the NewGRF API? (label 'needs review: NewGRF')
- newgrf_debug_data.h may need updating.
- PR must be added to API tracker

TrueBrain · 2021-04-26T09:47:54Z

A good comment was made by @PeterN: for the IPv6 / IPv4 issue, we could always just initiate both connections at (nearly) the same time, and wait to see which one connects. That is a bunch of work, but for sure possible.
For STUN that doesn't really work, so we need to do something there anyway.
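
For illustration, the "initiate both and see which one connects" idea could look roughly like the sketch below. It is a minimal Python example using `asyncio`, not a proposal for the actual C++ code; `connect_race` and `_connect_one` are hypothetical names, and the addresses are assumed to be already resolved. It starts all attempts at once, which is simpler than the staggered approach of Happy Eyeballs (RFC 8305).

```python
import asyncio
import socket


async def _connect_one(family, sockaddr):
    """Connect a single non-blocking socket; close it on failure or cancellation."""
    loop = asyncio.get_running_loop()
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        await loop.sock_connect(sock, sockaddr)
    except BaseException:
        sock.close()
        raise
    return sock


async def connect_race(addresses, timeout=3.0):
    """Start all connection attempts at once and return the first that succeeds."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    pending = {asyncio.ensure_future(_connect_one(f, a)) for f, a in addresses}
    winner = None
    try:
        while pending and winner is None:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            done, pending = await asyncio.wait(
                pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is not None:
                    continue  # this attempt failed; keep waiting for the others
                if winner is None:
                    winner = task.result()
                else:
                    task.result().close()  # a second success; keep only one
    finally:
        # Abort the losers (and anything still pending at the deadline).
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)
    if winner is None:
        raise OSError("no connection attempt succeeded within the timeout")
    return winner
```

The overall wait is bounded by a single timeout for all attempts together, so a broken IPv6 route costs nothing as long as IPv4 answers. The returned socket is left in non-blocking mode.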

Either way, I think this solution is simpler, and works as well. But I understand that setting timeouts on anything "that might eventually work without it" is always tough.

TrueBrain · 2021-04-26T09:51:11Z

Some additional info:

- We start to warn that a client is lagging if it is 2 seconds behind (so once in-game, we consider a 2 second latency "too much").
- We drop a client after 16 seconds.
- We give a client 3 seconds after connecting to announce itself.

nielsmh · 2021-04-26T10:32:27Z

Effectively, the only plausible scenario this PR would block off is an astronaut on the Moon trying to connect to a game being played on Earth.

TrueBrain force-pushed the network-connect branch 2 times, most recently from c411636 to d38d7bc, April 26, 2021 09:55

rubidium42 mentioned this pull request Apr 27, 2021

Fix: [Network] Error handling on Windows broken #9116

Merged

TrueBrain force-pushed the network-connect branch 2 times, most recently from 2ce49ca to 0a2a094, April 27, 2021 10:49

TrueBrain force-pushed the network-connect branch from 0a2a094 to deccd49, April 27, 2021 17:54

rubidium42 approved these changes Apr 27, 2021

TrueBrain merged commit 8fa53f5 into OpenTTD:master Apr 27, 2021

TrueBrain deleted the network-connect branch April 27, 2021 18:18

LordAro mentioned this pull request May 6, 2021

empty window Check Online Content #9198

Closed
