Connection reset by peer / timeout on core device #456
Comments
Reuploading files as dump2.pcap expires in 7 days. |
So that's interesting. I'm not sure I know enough about TCP internals to pinpoint the problem exactly, but the overall pattern is clear (and identical in both pcap files). ARTIQ comm sends the signature (14 bytes) to the core device right away, and it is also ACKed right away. The first four packets are identical; the fifth is where the differences start.
The host PC doesn't get the ACK, gets impatient, and 200 ms after sending the signature retransmits it. Instead of an ACK, the core device replies with a keep-alive packet (as I understand it, the "keep-alive" mark is not a flag but something Wireshark infers from the lack of sequence number advance). This happens again after 400 ms and 800 ms, with no advance in sequence number, after which the core device decides it has had enough of this bullshit and drops the connection.
This looks rather like the TCP state machine in lwip got confused and/or corrupted. I don't really know how to proceed from here: even if I find a way to reliably reproduce this, we have no JTAG for the OR1K core, and I don't really see myself successfully debugging this by randomly inserting printfs inside the lwip codebase. |
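The 200 ms, 400 ms, 800 ms pattern described above is what a retransmission timer that doubles after every unanswered attempt would produce. A minimal sketch of that schedule follows, assuming the three figures are the waits between successive retransmissions and that the timer simply doubles; `backoff_waits_ms` is an illustrative name, not something from ARTIQ, lwip, or the host's TCP stack.

```python
from itertools import accumulate


def backoff_waits_ms(initial_rto_ms, retransmissions):
    """Return the wait before each retransmission when the timeout doubles."""
    waits = []
    rto = initial_rto_ms
    for _ in range(retransmissions):
        waits.append(rto)
        rto *= 2  # exponential backoff: double after every unanswered attempt
    return waits


waits = backoff_waits_ms(200, 3)
print(waits)                    # [200, 400, 800]
print(list(accumulate(waits)))  # [200, 600, 1400] ms since the first send
```

Comparing these offsets against the timestamps in a dump is a quick way to tell ordinary host-side retransmissions from anything less expected.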
I guess this could be a bug in lwip keepalive. Try disabling it. |
@cjbe Do you have any examples that trigger this bug more often? It currently took me 27700 runs to reproduce it once. |
@whitequark This example was the fastest way to trigger the bug that I found. It reliably took less than 5 minutes to trigger. When I have the hardware powered up on Monday I will confirm exactly which bitstream I had on it. |
@whitequark I am using the nist_clock bitstream. I just tried to trigger this bug from a few different machines - there seems to be a lot of variability in the triggering rate:
|
This sometimes results in an RST sent by lwip after a retransmission, although it is not clear exactly why. See #456.
Fixed in 5c54a6a. |
Using release 1.1 nist_clock binaries I still see this problem with the same frequency. Attached is a packet dump that shows a similar signature to the previous packet dumps. |
@cjbe I am completely unable to reproduce this bug in 1.1 after several hours of looping the testcase. |
@cjbe Are you able to send us network equipment that causes the problem? |
I tried this on our setup using 1.1 on Windows for about 2000 runs and didn't get an error. I did eventually run into an error, but I think that's due to some permission issues on our end.
For reference the FPGA and the computer are connected to the network using this switch, |
@jordens disabling keepalive does not seem to change the frequency of the problem
|
If you could get your hands on the actual traffic on the coredevice port of the switch (and not just on the host side), that could be helpful. Your D-Link switch does not give you a port mirroring feature, but your upstream switch should (modulo administrative issues), or you could get a slightly more powerful switch. Mirror the KC705 port traffic, then use another machine (or another network interface on the original host) to dump it. |
Does it still happen with the PC->switch->KC705 configuration? (like 2, but with the department network not connected to the switch) |
@cjbe ping |
I suspect this might benefit from migrating to a different TCP/IP stack too. |
@whitequark this one looks similar to me. @jbqubit can reproduce it. There is a cheap Netgear switch involved. Joe, is that on a Linux VM as well? |
I'm running on bare metal. 14.04.1-Ubuntu. Ethernet interface on PC is Intel I219-LM with driver e1000e operating at 1 Gbit. Router is Netgear GS608. |
Can this be narrowed down to the switch or the network card? |
I'm doing tests of the phaser branch. Per @jordens' suggestion I tried the following. With a direct Ethernet connection between KC705 and PC (no switch) I ran dac_setup.py 20 times and saw no errors. Upon returning to the switch (Netgear GS608) I see errors about 1 in 4 times running dac_setup.py. To reproduce... I returned to the direct connection and ran dac_setup.py another 10 times. No errors. Reconnected to the switch... errors appear again. |
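The direct-link versus switch comparison above can be automated with a small loop that reruns a command and counts failures. This is a minimal sketch, not part of ARTIQ: `count_failures` is a hypothetical helper, and it assumes the command exits with a non-zero status when the connection is reset or times out. The stand-in command below only exits with status 1 so that the block runs anywhere; in practice it would be replaced by whatever invocation runs dac_setup.py on the setup under test.

```python
import subprocess
import sys


def count_failures(cmd, runs, timeout=None):
    """Run `cmd` `runs` times; count non-zero exits and timeouts as failures."""
    failures = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            # A hung run is treated the same as a failed one.
            failures += 1
    return failures


# Stand-in for the real experiment command (assumption: failure => exit != 0).
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(count_failures(always_fails, runs=3), "failures out of 3 runs")
```

Running the same loop once per network configuration gives directly comparable failure counts.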
@jbqubit Does reducing the MTU on the linux side of things change the behavior? ( |
Reducing the MTU size doesn't appear to improve the problem. |
I replaced Netgear GS608 with Netgear ProSafe GS110TP. The problem is gone. |
Troublesome Netgear GS608 switch received, thanks @jbqubit |
@jordens if you have an easy way to reproduce this bug, can you please check whether it's still present with smoltcp? |
I can only try to reproduce #647 which may or may not be the same. |
@jordens That's useful too. |
@whitequark I'll be in Hong Kong soon, do you want me to connect the KC705(s) through the problematic switch? |
@sbourdeauducq sure, let's test this. |
Switch installed. Unit tests still pass and 700 runs of @cjbe's experiment went through without problem. |
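As a rough check on how much weight 700 clean runs carry, one can compute the chance of seeing no failure at all if the originally reported failure rate still applied. The sketch below assumes independent runs with a fixed per-run failure probability, which is an assumption and not something established in this thread; `prob_no_failure` is an illustrative name. The rates of 1 in 200 and 1 in 500 come from the "200-500 runs" figure in the original report.

```python
def prob_no_failure(failure_rate, runs):
    """Probability that `runs` independent runs all succeed."""
    return (1.0 - failure_rate) ** runs


for one_in in (200, 500):
    p = prob_no_failure(1.0 / one_in, 700)
    print(f"1 in {one_in}: P(no failure in 700 runs) = {p:.3f}")
```

Under these assumptions a clean streak is weaker evidence at the 1-in-500 end of the range than at the 1-in-200 end, so a longer run count gives a more convincing negative result.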
Using release 1.0 I see sporadic "connection reset by peer" or "socket.timeout: timed out" errors on the master to core device connection. The frequency of these errors depends on the experiment, but is typically between 1 in 5 and 1 in 1000.
Using the trivial experiment below I get a "connection reset by peer" error after 200-500 runs.
Two typical packet dumps of this behaviour are here and here. There is nothing in the core device log.
This does not seem to be the same issue as #398: there are no jumbo frames.