
Connection reset by peer / timeout on core device #456

Closed
cjbe opened this issue May 30, 2016 · 31 comments

@cjbe
Contributor

cjbe commented May 30, 2016

Using release 1.0 I see sporadic "connection reset by peer" or "socket.timeout: timed out" errors on the master to core device connection. The frequency of these errors depends on the experiment, but is typically between 1 in 5 and 1 in 1000.

Using the trivial experiment below I get a "connection reset by peer" error after 200-500 runs.
Two typical packet dumps of this behaviour are here and here. There is nothing in the core device log.

This does not seem to be the same issue as #398: there are no jumbo frames.

from artiq.experiment import *
import time

class CrashScan2(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("scheduler")

    @kernel
    def runScan(self):
        delay(1*ms)

    def run(self):
        self.runScan()
        self.scheduler.submit(self.scheduler.pipeline_name, self.scheduler.expid, 
                self.scheduler.priority, time.time(), False)
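A minimal sketch of looping this reproduction case outside the master (assuming artiq_run is on PATH, a device_db for the KC705 is in the working directory, and the experiment above is saved as crash_scan2.py, a placeholder name). Each iteration starts a fresh artiq_run process, and therefore a fresh connection to the core device; the scheduler resubmission above may behave differently outside the master.

# Sketch only: repeatedly run the experiment and stop at the first failure.
# "crash_scan2.py" is a placeholder filename.
import subprocess

for i in range(1000):
    result = subprocess.run(["artiq_run", "crash_scan2.py"])
    if result.returncode != 0:
        print("run {} failed with exit code {}".format(i, result.returncode))
        break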
@sbourdeauducq
Member

Reuploading files as dump2.pcap expires in 7 days.

dumps.zip

@whitequark
Contributor

So that's interesting. I'm not sure if I know enough about TCP internals to exactly pinpoint the problem, but the overall pattern is clear (and identical in both pcap files).

Here's the normal operation:
[screenshot: packet capture of a normal run]

ARTIQ comm sends the signature (14 bytes) to the core device right away and it's also ACKed right away.

Here's the bug:
[screenshot: packet capture of a failing run]

The first four packets are identical. The fifth is where the differences start. The host PC doesn't get the ACK, gets impatient, and 200 ms after sending the signature it retransmits it. Instead of an ACK, the core device replies with a keep-alive packet (as I understand it, the "keep-alive" mark is something Wireshark infers rather than an actual flag; it determines it by seeing that the sequence number does not advance). This happens again at 400 ms and 800 ms, with no advance in sequence number, after which the core device decides it has had enough and drops the connection.

This looks rather like the TCP state machine in lwip got confused and/or corrupted. I don't really know how to proceed from here; even if I find a way to reliably reproduce this, we have no JTAG for the OR1K core, and I don't see myself successfully debugging this by randomly inserting printf's inside the lwip codebase.
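For scanning a capture for this pattern, a rough sketch (illustrative only, not the tooling used here): it assumes scapy is installed, the capture is saved as dump.pcap, and the core device comms run on TCP port 1381. It flags payload retransmissions and RSTs on that port, which is the sequence described above.

# Rough sketch: flag retransmitted payload segments and RSTs in a capture.
from scapy.all import rdpcap, IP, TCP

CORE_PORT = 1381               # assumption: core device comms port
packets = rdpcap("dump.pcap")
highest_seq = {}               # (src, sport, dst, dport) -> highest byte seen

for i, pkt in enumerate(packets):
    if IP not in pkt or TCP not in pkt:
        continue
    tcp = pkt[TCP]
    if CORE_PORT not in (tcp.sport, tcp.dport):
        continue
    if tcp.flags & 0x04:       # RST
        print("packet %d: RST from %s" % (i, pkt[IP].src))
    payload = len(bytes(tcp.payload))
    if payload == 0:
        continue
    flow = (pkt[IP].src, tcp.sport, pkt[IP].dst, tcp.dport)
    end = tcp.seq + payload
    if flow in highest_seq and end <= highest_seq[flow]:
        print("packet %d: retransmission from %s (seq %d)" % (i, pkt[IP].src, tcp.seq))
    highest_seq[flow] = max(highest_seq.get(flow, 0), end)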

@sbourdeauducq
Member

I guess this could be a bug in lwip keepalive. Try disabling it.

@whitequark
Contributor

@cjbe Do you have any examples that trigger this bug more often? So far it has taken me 27700 runs to reproduce it once.

@cjbe
Contributor Author

cjbe commented Jun 19, 2016

@whitequark This example was the fastest way I found to trigger the bug. It reliably took less than 5 minutes to trigger. When I have the hardware powered up on Monday I will confirm exactly which bitstream I had on it.

@cjbe
Contributor Author

cjbe commented Jun 20, 2016

@whitequark I am using the nist_clock bitstream.

I just tried to trigger this bug from a few different machines - there seems to be a lot of variability in the triggering rate:

  1. dump.pcap : Linux machine, quite a lot of network fabric between master and KC705 (this is the same setup as used for the packet dumps I previously uploaded). Bug triggered every 100 - 1000 runs.
  2. dump_windows.pcap : Windows machine, 1 switch between master and KC705. Bug triggered every ~10 runs.
  3. dump_vm.pcap (does not include capture of bug) : Linux VM running on the above Windows machine. Bug triggered every O(10e4) runs.

dumps.zip

whitequark added a commit that referenced this issue Jun 20, 2016
This sometimes results in an RST sent by lwip after a retransmission,
although it is not clear exactly why. See #456.
@whitequark
Contributor

Fixed in 5c54a6a.

@cjbe
Contributor Author

cjbe commented Jun 27, 2016

Using release 1.1 nist_clock binaries I still see this problem with the same frequency. Attached is a packet dump that shows a similar signature to the previous packet dumps.

dump_extract.zip

@whitequark whitequark reopened this Jun 27, 2016
@whitequark
Contributor

@cjbe I am completely unable to reproduce this bug in 1.1 after several hours of looping the testcase.

@sbourdeauducq
Member

@cjbe Are you able to send us the network equipment that causes the problem?

@sbourdeauducq sbourdeauducq modified the milestones: 1.2, 1.1 Jul 1, 2016
@sbourdeauducq sbourdeauducq removed this from the 1.2 milestone Jul 8, 2016
@r-srinivas

r-srinivas commented Jul 8, 2016

I tried this on our set up using 1.1 on Windows, for about 2000 runs and didn't get an error. I did eventually run into an error but I think that's due to some permission issues on our end.

ERROR:worker(16409,connection_reset_test.py):root:Terminating with exception (ParentActionError: PermissionError: [WinError 5] Access is denied: 'c:\\artiq-magtrap\\tmpxwd0szhj' -> 'last_rid.pyon')
  File "C:\Anaconda3\envs\artiq-2016-06-24\lib\site-packages\artiq\master\worker.py", line 217, in _handle_worker_requests
    data = func(*obj["args"], **obj["kwargs"])
  File "C:\Anaconda3\envs\artiq-2016-06-24\lib\site-packages\artiq\master\scheduler.py", line 419, in submit
    return pipeline.pool.submit(expid, priority, due_date, flush, pipeline_name)
  File "C:\Anaconda3\envs\artiq-2016-06-24\lib\site-packages\artiq\master\scheduler.py", line 132, in submit
    rid = self.ridc.get()
  File "C:\Anaconda3\envs\artiq-2016-06-24\lib\site-packages\artiq\master\worker_db.py", line 26, in get
    self._update_cache(rid)
  File "C:\Anaconda3\envs\artiq-2016-06-24\lib\site-packages\artiq\master\worker_db.py", line 48, in _update_cache
    os.replace(tmpname, self.cache_filename)
ParentActionError: PermissionError: [WinError 5] Access is denied: 'c:\\artiq-magtrap\\tmpxwd0szhj' -> 'last_rid.pyon'

For reference, the FPGA and the computer are connected to the network using this switch:
https://www.netgear.com/support/product/GS110TP.aspx?cid=wmt_netgear_organic

@jordens
Member

jordens commented Aug 7, 2016

@cjbe This is a shot in the dark and just to exclude keepalive in the current lwip version: if you are still seeing this and have a bit of time, could you try building, flashing, and testing a runtime for your artiq version, with keepalive disabled (along the lines of 0db6ef0)?

@cjbe
Contributor Author

cjbe commented Aug 10, 2016

@jordens disabling keepalive does not seem to change the frequency of the problem.
@whitequark , @sbourdeauducq :
I am using a different master computer since my tests on 27th June - this seems to have reduced the rate of the problems dramatically. (I don't know why this should be - this is quite worrying to me).
I tested a few combinations of networking hardware:

  1. Direct 1000M link between KC705 and master : no crashes in ~500k runs
  2. One D-Link DGS-108 switch between KC705 and master, switch also connected to department network : crashes once every ~50k runs
  3. Via department network : similar to (2)

@jordens
Member

jordens commented Aug 10, 2016

If you could get your hands on the actual traffic on the coredevice port of the switch (and not just on the host side), that could be helpful. Your D-Link switch does not give you a port mirroring feature, but your upstream switch should (modulo administrative issues), or you could get a slightly more powerful switch. If you mirror the kc705 port traffic, then get another machine (or another network interface on the original host) and dump the traffic.
Alternatively, and depending on how comfortable you are with brctl and friends, you could use your Linux machine as the switch and bridge two interfaces.
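If it helps, a sketch of dumping the mirrored traffic from a second interface with scapy (interface name, core device address, and output filename are placeholders for your setup; plain tcpdump would work just as well):

# Sketch: capture traffic seen on the mirror port and write it to a pcap.
# MONITOR_IFACE and CORE_DEVICE_IP are placeholders; needs root privileges.
from scapy.all import sniff, wrpcap

MONITOR_IFACE = "eth1"            # NIC plugged into the switch's mirror port
CORE_DEVICE_IP = "192.168.1.52"   # KC705 address from your device_db

packets = sniff(iface=MONITOR_IFACE,
                filter="tcp and host " + CORE_DEVICE_IP,   # BPF filter
                timeout=3600)                               # up to one hour
wrpcap("coredevice_mirror.pcap", packets)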

@sbourdeauducq
Member

Does it still happen with the PC->switch->KC705 configuration? (like 2, but with the department network not connected to the switch)

@sbourdeauducq
Member

@cjbe ping

@whitequark
Contributor

I suspect this might benefit from migrating to a different TCP/IP stack too.

@jordens
Member

jordens commented Oct 24, 2016

@whitequark this one looks similar to me. @jbqubit can reproduce it. There is a cheap Netgear switch involved. Joe, is that on a Linux VM as well?
coredevice_joe_phaser_062aca2.pcap.zip

@jbqubit
Contributor

jbqubit commented Oct 24, 2016

I'm running on bare metal. 14.04.1-Ubuntu. Ethernet interface on PC is Intel I219-LM with driver e1000e operating at 1 Gbit. Router is Netgear GS608.

@sbourdeauducq
Member

Can this be narrowed down to the switch or the network card?

@jbqubit
Contributor

jbqubit commented Oct 24, 2016

I'm doing tests of the phaser branch.
https://github.com/m-labs/artiq/blob/phaser/README_PHASER.rst

Per @jordens' suggestion I tried the following. With a direct Ethernet connection between KC705 and PC (no switch) I ran dac_setup.py 20 times and saw no errors. Upon returning to the switch (Netgear GS608) I see errors about 1 in 4 times running dac_setup.py.

To reproduce... I returned to direct connection and ran dac_setup.py another 10 times. No errors. Reconnected to switch... errors appear again.

@jordens
Member

jordens commented Oct 24, 2016

@jbqubit Does reducing the MTU on the linux side of things change the behavior? (ip l s dev eth0 mtu 1500 vs 9000 and then rerun your test case)

@jbqubit
Contributor

jbqubit commented Oct 24, 2016

Reducing the MTU does not appear to help.

@jbqubit
Contributor

jbqubit commented Oct 31, 2016

I replaced Netgear GS608 with Netgear ProSafe GS110TP. The problem is gone.

@sbourdeauducq
Member

Troublesome Netgear GS608 switch received, thanks @jbqubit

@whitequark
Contributor

@jordens if you have an easy way to reproduce this bug, can you please check whether it's still present with smoltcp?

@jordens
Member

jordens commented Jan 24, 2017

I can only try to reproduce #647 which may or may not be the same.

@whitequark
Contributor

@jordens That's useful too.

@sbourdeauducq
Member

@whitequark I'll be in Hong Kong soon, do you want me to connect the KC705(s) through the problematic switch?


@whitequark
Contributor

@sbourdeauducq sure, let's test this.


@sbourdeauducq
Member

Switch installed. Unit tests still pass and 700 runs of @cjbe's experiment went through without problem.

