
Connection reset by peer / timeout on core device #456

Closed
cjbe opened this issue May 30, 2016 · 31 comments

@cjbe
Contributor

cjbe commented May 30, 2016

Using release 1.0 I see sporadic "connection reset by peer" or "socket.timeout: timed out" errors on the master to core device connection. The frequency of these errors depends on the experiment, but is typically between 1 in 5 and 1 in 1000.

Using the trivial experiment below I get a "connection reset by peer" error after 200-500 runs.
Two typical packet dumps of this behaviour are here and here. There is nothing in the core device log.

This does not seem to be the same issue as #398: there are no jumbo frames.

from artiq.experiment import *
import time

class CrashScan2(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("scheduler")

    @kernel
    def runScan(self):
        delay(1*ms)

    def run(self):
        self.runScan()
        self.scheduler.submit(self.scheduler.pipeline_name, self.scheduler.expid, 
                self.scheduler.priority, time.time(), False)
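A minimal sketch of looping this reproduction case outside the master (assuming artiq_run is on PATH, a device_db for the KC705 is in the working directory, and the experiment above is saved as crash_scan2.py, a placeholder name). Each iteration starts a fresh artiq_run process, and therefore a fresh connection to the core device; the scheduler resubmission above may behave differently outside the master.

# Sketch only: repeatedly run the experiment and stop at the first failure.
# "crash_scan2.py" is a placeholder filename.
import subprocess

for i in range(1000):
    result = subprocess.run(["artiq_run", "crash_scan2.py"])
    if result.returncode != 0:
        print("run {} failed with exit code {}".format(i, result.returncode))
        break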
@sbourdeauducq
Member

Reuploading files as dump2.pcap expires in 7 days.

dumps.zip

@whitequark
Contributor

So that's interesting. I'm not sure if I know enough about TCP internals to exactly pinpoint the problem, but the overall pattern is clear (and identical in both pcap files).

Here's the normal operation:
[screenshot: packet capture of a normal run]

ARTIQ comm sends the signature (14 bytes) to the core device right away and it's also ACKed right away.

Here's the bug:
[screenshot: packet capture of a failing run]

The first four packets are identical. The fifth is where the differences start. The host PC doesn't get the ACK, gets impatient, and 200 ms after sending the signature it retransmits it. Instead of an ACK, the core device replies with a keep-alive packet (as I understand it, the "keep-alive" mark is something Wireshark infers rather than an actual flag; it determines it by seeing that the sequence number does not advance). This happens again at 400 ms and 800 ms, with no advance in sequence number, after which the core device decides it has had enough and drops the connection.

This looks rather like the TCP state machine in lwip got confused and/or corrupted. I don't really know how to proceed from here; even if I find a way to reliably reproduce this, we have no JTAG for the OR1K core, and I don't see myself successfully debugging this by randomly inserting printf's inside the lwip codebase.
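For scanning a capture for this pattern, a rough sketch (illustrative only, not the tooling used here): it assumes scapy is installed, the capture is saved as dump.pcap, and the core device comms run on TCP port 1381. It flags payload retransmissions and RSTs on that port, which is the sequence described above.

# Rough sketch: flag retransmitted payload segments and RSTs in a capture.
from scapy.all import rdpcap, IP, TCP

CORE_PORT = 1381               # assumption: core device comms port
packets = rdpcap("dump.pcap")
highest_seq = {}               # (src, sport, dst, dport) -> highest byte seen

for i, pkt in enumerate(packets):
    if IP not in pkt or TCP not in pkt:
        continue
    tcp = pkt[TCP]
    if CORE_PORT not in (tcp.sport, tcp.dport):
        continue
    if tcp.flags & 0x04:       # RST
        print("packet %d: RST from %s" % (i, pkt[IP].src))
    payload = len(bytes(tcp.payload))
    if payload == 0:
        continue
    flow = (pkt[IP].src, tcp.sport, pkt[IP].dst, tcp.dport)
    end = tcp.seq + payload
    if flow in highest_seq and end <= highest_seq[flow]:
        print("packet %d: retransmission from %s (seq %d)" % (i, pkt[IP].src, tcp.seq))
    highest_seq[flow] = max(highest_seq.get(flow, 0), end)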

@sbourdeauducq
Member

I guess this could be a bug in lwip keepalive. Try disabling it.

@whitequark
Contributor

@cjbe Do you have any examples that trigger this bug more often? So far it has taken me 27700 runs to reproduce it once.

@cjbe
Contributor Author

cjbe commented Jun 19, 2016

@whitequark This example was the fastest way I found to trigger the bug. It reliably took less than 5 minutes to trigger. When I have the hardware powered up on Monday I will confirm exactly which bitstream I had on it.

@cjbe
Contributor Author

cjbe commented Jun 20, 2016

@whitequark I am using the nist_clock bitstream.

I just tried to trigger this bug from a few different machines - there seems to be a lot of variability in the triggering rate:

  1. dump.pcap : Linux machine, quite a lot of network fabric between master and KC705 (this is the same setup as used for the packet dumps I previously uploaded). Bug triggered every 100 - 1000 runs.
  2. dump_windows.pcap : Windows machine, 1 switch between master and KC705. Bug triggered every ~10 runs.
  3. dump_vm.pcap (does not include capture of bug) : Linux VM running on the above Windows machine. Bug triggered every O(10e4) runs.

dumps.zip

whitequark added a commit that referenced this issue Jun 20, 2016
This sometimes results in an RST sent by lwip after a retransmission,
although it is not clear exactly why. See #456.
@whitequark
Contributor

Fixed in 5c54a6a.

@cjbe
Contributor Author

cjbe commented Jun 27, 2016

Using release 1.1 nist_clock binaries I still see this problem with the same frequency. Attached is a packet dump that shows a similar signature to the previous packet dumps.

dump_extract.zip

@whitequark whitequark reopened this Jun 27, 2016
@whitequark
Contributor

@cjbe I am completely unable to reproduce this bug in 1.1 after several hours of looping the testcase.

@sbourdeauducq
Member

@cjbe Are you able to send us the network equipment that causes the problem?

@sbourdeauducq sbourdeauducq modified the milestones: 1.2, 1.1 Jul 1, 2016
@sbourdeauducq sbourdeauducq removed this from the 1.2 milestone Jul 8, 2016
@r-srinivas

r-srinivas commented Jul 8, 2016

I tried this on our set up using 1.1 on Windows, for about 2000 runs and didn't get an error. I did eventually run into an error but I think that's due to some permission issues on our end.

ERROR:worker(16409,connection_reset_test.py):root:Terminating with exception (ParentActionError: PermissionError: [WinError 5] Access is denied: 'c:\\artiq-magtrap\\tmpxwd0szhj' -> 'last_rid.pyon')
  File "C:\Anaconda3\envs\artiq-2016-06-24\lib\site-packages\artiq\master\worker.py", line 217, in _handle_worker_requests
    data = func(*obj["args"], **obj["kwargs"])
  File "C:\Anaconda3\envs\artiq-2016-06-24\lib\site-packages\artiq\master\scheduler.py", line 419, in submit
    return pipeline.pool.submit(expid, priority, due_date, flush, pipeline_name)
  File "C:\Anaconda3\envs\artiq-2016-06-24\lib\site-packages\artiq\master\scheduler.py", line 132, in submit
    rid = self.ridc.get()
  File "C:\Anaconda3\envs\artiq-2016-06-24\lib\site-packages\artiq\master\worker_db.py", line 26, in get
    self._update_cache(rid)
  File "C:\Anaconda3\envs\artiq-2016-06-24\lib\site-packages\artiq\master\worker_db.py", line 48, in _update_cache
    os.replace(tmpname, self.cache_filename)
ParentActionError: PermissionError: [WinError 5] Access is denied: 'c:\\artiq-magtrap\\tmpxwd0szhj' -> 'last_rid.pyon'

For reference, the FPGA and the computer are connected to the network using this switch:
https://www.netgear.com/support/product/GS110TP.aspx?cid=wmt_netgear_organic

@jordens
Member

jordens commented Aug 7, 2016

@cjbe This is a shot in the dark and just to exclude keepalive in the current lwip version: if you are still seeing this and have a bit of time, could you try building, flashing, and testing a runtime for your artiq version, with keepalive disabled (along the lines of 0db6ef0)?

@cjbe
Contributor Author

cjbe commented Aug 10, 2016

@jordens disabling keepalive does not seem to change the frequency of the problem.
@whitequark , @sbourdeauducq :
I am using a different master computer since my tests on 27th June - this seems to have reduced the rate of the problems dramatically. (I don't know why this should be - this is quite worrying to me).
I tested a few combinations of networking hardware:

  1. Direct 1000M link between KC705 and master : no crashes in ~500k runs
  2. One D-Link DGS-108 switch between KC705 and master, switch also connected to department network : crashes once every ~50k runs
  3. Via department network : similar to (2)

@jordens
Member

jordens commented Aug 10, 2016

If you could get your hands on the actual traffic on the coredevice port of the switch (and not just on the host side), that could be helpful. Your D-Link switch does not give you a port mirroring feature, but your upstream switch should (modulo administrative issues), or you could get a slightly more powerful switch. If you mirror the kc705 port traffic, then get another machine (or another network interface on the original host) and dump the traffic.
Alternatively, and depending on how comfortable you are with brctl and friends, you could use your Linux machine as the switch and bridge two interfaces.
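If it helps, a sketch of dumping the mirrored traffic from a second interface with scapy (interface name, core device address, and output filename are placeholders for your setup; plain tcpdump would work just as well):

# Sketch: capture traffic seen on the mirror port and write it to a pcap.
# MONITOR_IFACE and CORE_DEVICE_IP are placeholders; needs root privileges.
from scapy.all import sniff, wrpcap

MONITOR_IFACE = "eth1"            # NIC plugged into the switch's mirror port
CORE_DEVICE_IP = "192.168.1.52"   # KC705 address from your device_db

packets = sniff(iface=MONITOR_IFACE,
                filter="tcp and host " + CORE_DEVICE_IP,   # BPF filter
                timeout=3600)                               # up to one hour
wrpcap("coredevice_mirror.pcap", packets)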

@sbourdeauducq
Member

Does it still happen with the PC->switch->KC705 configuration? (like 2, but with the department network not connected to the switch)

@sbourdeauducq
Member

@cjbe ping

@whitequark
Contributor

I suspect this might benefit from migrating to a different TCP/IP stack too.

@jordens
Member

jordens commented Oct 24, 2016

@whitequark this one looks similar to me. @jbqubit can reproduce it. There is a cheap Netgear switch involved. Joe, is that on a Linux VM as well?
coredevice_joe_phaser_062aca2.pcap.zip

@jbqubit
Contributor

jbqubit commented Oct 24, 2016

I'm running on bare metal. 14.04.1-Ubuntu. Ethernet interface on PC is Intel I219-LM with driver e1000e operating at 1 Gbit. Router is Netgear GS608.

@sbourdeauducq
Member

Can this be narrowed down to the switch or the network card?

@jbqubit
Contributor

jbqubit commented Oct 24, 2016

I'm doing tests of the phaser branch.
https://github.com/m-labs/artiq/blob/phaser/README_PHASER.rst

Per @jordens' suggestion I tried the following. With a direct Ethernet connection between KC705 and PC (no switch) I ran dac_setup.py 20 times and saw no errors. Upon returning to the switch (Netgear GS608) I see errors about 1 in 4 times running dac_setup.py.

To reproduce... I returned to direct connection and ran dac_setup.py another 10 times. No errors. Reconnected to switch... errors appear again.

@jordens
Member

jordens commented Oct 24, 2016

@jbqubit Does reducing the MTU on the linux side of things change the behavior? (ip l s dev eth0 mtu 1500 vs 9000 and then rerun your test case)

@jbqubit
Contributor

jbqubit commented Oct 24, 2016

Reducing the MTU does not appear to help.

@jbqubit
Contributor

jbqubit commented Oct 31, 2016

I replaced Netgear GS608 with Netgear ProSafe GS110TP. The problem is gone.

@sbourdeauducq
Member

Troublesome Netgear GS608 switch received, thanks @jbqubit

@whitequark
Contributor

@jordens if you have an easy way to reproduce this bug, can you please check whether it's still present with smoltcp?

@jordens
Member

jordens commented Jan 24, 2017

I can only try to reproduce #647 which may or may not be the same.

@whitequark
Contributor

@jordens That's useful too.

@sbourdeauducq
Member

@whitequark I'll be in Hong Kong soon, do you want me to connect the KC705(s) through the problematic switch?


@whitequark
Contributor

@sbourdeauducq sure, let's test this.


@sbourdeauducq
Member

Switch installed. Unit tests still pass and 700 runs of @cjbe's experiment went through without problem.

