Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: If a thread is blocking on UDP send, it can block the entire main loop #9001

Closed
wants to merge 2 commits into from

Conversation

nielsmh
Copy link
Contributor

@nielsmh nielsmh commented Apr 10, 2021

Motivation / Problem

Under some rare configurations, UDP sends can block. Since we hold a mutex on some objects for this, it can cause blocking of the main loop which will eventually cause clients to drop.

Description

For example, the advertise to master server thread takes the UDP lock and tries to send on a socket.

/* Send the packet */
Packet p(PACKET_UDP_SERVER_REGISTER);
/* Packet is: WELCOME_MESSAGE, Version, server_port */
p.Send_string(NETWORK_MASTER_SERVER_WELCOME_MESSAGE);
p.Send_uint8 (NETWORK_MASTER_SERVER_VERSION);
p.Send_uint16(_settings_client.network.server_port);
p.Send_uint64(_session_key);
std::lock_guard<std::mutex> lock(_network_udp_mutex);
if (_udp_master_socket != nullptr) _udp_master_socket->SendPacket(&p, &out_addr, true);
}

If this send blocks, it will cause the main loop to block in an unexpected way.

Limitations

This may cause some other UDP responses (prospective clients' info requests) to be delayed/lost, but probably preferable over lose active clients.

Checklist for review

Some things are not automated, and forgotten often. This list is a reminder for the reviewers.

  • The bug fix is important enough to be backported? (label: 'backport requested')
  • This PR affects the save game format? (label 'savegame upgrade')
  • This PR affects the GS/AI API? (label 'needs review: Script API')
    • ai_changelog.hpp, gs_changelog.hpp need updating.
    • The compatibility wrappers (compat_*.nut) need updating.
  • This PR affects the NewGRF API? (label 'needs review: NewGRF')

…n loop

Under some rare configurations, UDP sends can block. Since we hold a mutex on some objects for this, it can cause blocking of the main loop which will eventually cause clients to drop.
@TrueBrain
Copy link
Member

TrueBrain commented Apr 10, 2021

To put IRC here too:

If I read this right, this basically only fixes the cases where send() from the advertisement thread stalls. It does not help for when send() by the main thread stalls.

For a server, many of the incoming UDP requests cause an outgoing UDP packet, so there are many places that these ReceivePackets will cause a send().

Do I understand that correct?

(I have nothing against this PR, to be clear :D Just want to set expectations :P)

@nielsmh
Copy link
Contributor Author

nielsmh commented Apr 10, 2021

Yes a server will respond to many incoming UDP requests with another UDP packet back, and those responses are in fact done on the same thread receiving the UDP request packet. However those happen on the _udp_server_socket rather than the _udp_master_socket, the former gets lots of traffic back and forth while the latter does very little.

@nielsmh
Copy link
Contributor Author

nielsmh commented Apr 10, 2021

Another, possibly better, solution might be to protect the _udp_network_socket and _udp_master_socket objects by each their own mutex, so the regular server UDP functions can keep running if the master server functions block, and vice versa.

@Berbe
Copy link
Contributor

Berbe commented Apr 11, 2021

I can confirm this was induced by the containerised environment I was running, adding delays to domain name resolution with using IPv6 resolvers.

This, however, triggered interesting behaviour from the game, showing that difficulties impacting communication with the master server had effects on all networking, in this case blocking responses to client traffic, leading to clients disconnection.

This can be reproduced with TrafficControl with the use of the netem queueing discipline:

tc qdisc add dev <interface> root netem delay 2s

This is very easy to experiment in a Docker environment with custom network (not the default bridge), as all DNS traffic is forwarded through the local service discovery proxy on 127.0.0.11.
Use tc on lo and here is your sluggish name resolution.

Of course, way more interesting effects with surgical precision can be simulated with other queing disciplines, like HTB, for instance to simulate slow traffic from/to specific IP addresses

I also can confirm #8878 was actually impacted by the slow DNS traffic too.

@PeterN
Copy link
Member

PeterN commented Apr 11, 2021

Does this PR alleviate the client connection issues while this slow DNS request is occurring?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants