Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARP requests not coming through UDP "flood" #319

Closed
birkenfeld opened this issue Dec 20, 2019 · 10 comments · Fixed by #368 or #369
Closed

ARP requests not coming through UDP "flood" #319

birkenfeld opened this issue Dec 20, 2019 · 10 comments · Fixed by #368 or #369
Labels

Comments

@birkenfeld
Copy link

I'm using smoltcp as the network stack on a STM32 embedded controller via https://github.com/stm32-rs/stm32-eth (cc @astro )

I've written a small test firmware that sends out UDP packets as fast as possible, saturating the 100 MBit link, and the device stops sending after 60 seconds, which is the timeout for the receiver's IP address in the ARP neighbor cache.

It looks like when the buffers are full (they are pretty small of course), smoltcp can't find anywhere to queue the ARP request and a kind of a deadlock is the result: no packets can go out anymore. Once I send a packet back to the device, this fills the cache and resolves the deadlock until the timeout hits again.

Do you think this is an issue with smoltcp, with the stm32-eth driver or is it to be expected and won't fix?

@whitequark
Copy link
Contributor

It looks like when the buffers are full (they are pretty small of course), smoltcp can't find anywhere to queue the ARP request and a kind of a deadlock is the result: no packets can go out anymore.

Wait. Which buffers get full? The ARP request is not queued anywhere; it is generated on the fly instead of the outgoing packet (the UDP packet in this case) when there is no cache entry. Do you have a log at TRACE verbosity?

@birkenfeld
Copy link
Author

Oh, that's a good point! I'll see if I can extract a trace log, but probably after Christmas. Thanks for the heads up, I hadn't looked deep enough in the code.

@birkenfeld
Copy link
Author

Ok, I have some more details now. First, I reduced the ENTRY_LIFETIME in the neighbor cache to 5s, to speed up reproduction.

I disabled the trace messages for UDP "buffer to send ... octets" and "sending ... octets", to be able to keep the data rate at the full 100 MBit/s.

When switching the log level to trace, the problem went away - the log looks like this

[TRACE] [0]: adding
[DEBUG] address 192.168.1.1 not in neighbor cache, sending ARP request
[TRACE] #0: neighbor 192.168.1.1 missing, silencing until t+3.000s
[TRACE] filled 192.168.1.1 => 68-05-ca-3a-c7-43 (was empty)
[TRACE] #0:192.168.1.222:54321:192.168.1.1:54321: receiving 3 octets
[TRACE] #0: neighbor 192.168.1.1 discovered, unsilencing
[DEBUG] address 192.168.1.1 not in neighbor cache, sending ARP request
[TRACE] #0: neighbor 192.168.1.1 missing, silencing until t+3.000s
[TRACE] #0: neighbor 192.168.1.1 discovered, unsilencing

with the last 3 lines repeating in 5s intervals.

After some trial and error, I removed the net_debug call here, and suddenly could reproduce the problem again!

I then replaced the debug call by a simple (not-optimized-out) delay loop, and by tuning the length of the loop I could make the problem appear and disappear.

That leads me to believe that the problem must be on the STM32 side, and there's some race condition that prevents sending that ARP packet or swallows it.

@whitequark
Copy link
Contributor

Sounds about right. Closing as not an issue in smoltcp.

@dnadlinger
Copy link
Contributor

Which STM32 driver are you using?

@birkenfeld
Copy link
Author

https://github.com/stm32-rs/stm32-eth - is there another one I could try?

@dnadlinger
Copy link
Contributor

@birkenfeld: There is a basic version for STM32H7… at https://github.com/quartiq/stabilizer/blob/master/src/eth.rs, which might be similar. We've had some fun with memory coherency, but off the top of my head, that shouldn't actually be a problem for the Cortex-M4 core in the STM32F4 series.

@whitequark
Copy link
Contributor

Reopening since this was narrowed down to an issue in the ARP code in stm32-rs/stm32-eth#10.

@thalesfragoso
Copy link
Contributor

I think I found another problem that might be the cause of the missing ARP.

During socket egress, there is a call to lookup_hardware_addr, which does the following:

smoltcp/src/iface/ethernet.rs

Lines 1593 to 1623 in cc85616

match self.neighbor_cache.lookup(&dst_addr, timestamp) {
NeighborAnswer::Found(hardware_addr) =>
return Ok((hardware_addr, tx_token)),
NeighborAnswer::RateLimited =>
return Err(Error::Unaddressable),
NeighborAnswer::NotFound => (),
}
match (src_addr, dst_addr) {
#[cfg(feature = "proto-ipv4")]
(&IpAddress::Ipv4(src_addr), IpAddress::Ipv4(dst_addr)) => {
net_debug!("address {} not in neighbor cache, sending ARP request",
dst_addr);
let arp_repr = ArpRepr::EthernetIpv4 {
operation: ArpOperation::Request,
source_hardware_addr: self.ethernet_addr,
source_protocol_addr: src_addr,
target_hardware_addr: EthernetAddress::BROADCAST,
target_protocol_addr: dst_addr,
};
self.dispatch_ethernet(tx_token, timestamp, arp_repr.buffer_len(), |mut frame| {
frame.set_dst_addr(EthernetAddress::BROADCAST);
frame.set_ethertype(EthernetProtocol::Arp);
arp_repr.emit(&mut ArpPacket::new_unchecked(frame.payload_mut()))
})?;
Err(Error::Unaddressable)
}

On the first call to poll just after the address expired, the call in self.neighbor_cache.lookup will take the last branch on the method, increasing the silent_until timestamp and returning NotFound:

pub(crate) fn lookup(&mut self, protocol_addr: &IpAddress, timestamp: Instant) -> Answer {
match self.lookup_pure(protocol_addr, timestamp) {
Some(hardware_addr) =>
Answer::Found(hardware_addr),
None if timestamp < self.silent_until =>
Answer::RateLimited,
None => {
self.silent_until = timestamp + Self::SILENT_TIME;
Answer::NotFound
}
}
}

This will cause lookup_hardware_addr to craft an ARP packet and try to send it via the device, but this sending can fail with Error::Exhausted (which I think it's happening in the case of this issue), the error will bubble up to socket_egress:

match (device_result, socket_result) {
(Err(Error::Exhausted), _) => break, // nowhere to transmit
(Ok(()), Err(Error::Exhausted)) => (), // nothing to transmit
(Err(Error::Unaddressable), _) => {
// `NeighborCache` already takes care of rate limiting the neighbor discovery
// requests from the socket. However, without an additional rate limiting
// mechanism, we would spin on every socket that has yet to discover its
// neighboor.
socket.meta_mut().neighbor_missing(timestamp,
neighbor_addr.expect("non-IP response packet"));
break
}
(Err(err), _) | (_, Err(err)) => {
net_debug!("{}: cannot dispatch egress packet: {}",
socket.meta().handle, err);
return Err(err)
}
(Ok(()), Ok(())) => emitted_any = true
}

which will just hit a break, however, if the next call to poll happens before one second has passed (very likely), self.neighbor_cache.lookup will now return RateLimited, which will cause lookup_hardware_addr to return Error::Unaddressable which will cause socket_egress to mark the NeighborState as Waiting (for 3 seconds) for an ARP response without ever having sent a single ARP request.

The underlying problem is that we set the silent_until timestamp in self.neighbor_cache.lookup before we know if we really managed to send the ARP request.

@whitequark is my understanding of this correct ? If so, do you have a preferred way to address this ?

@birkenfeld
Copy link
Author

Thanks everyone for figuring this out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 participants