ARP requests not coming through UDP "flood" #319

birkenfeld · 2019-12-20T10:06:26Z

I'm using smoltcp as the network stack on a STM32 embedded controller via https://github.com/stm32-rs/stm32-eth (cc @astro )

I've written a small test firmware that sends out UDP packets as fast as possible, saturating the 100 MBit link, and the device stops sending after 60 seconds, which is the timeout for the receiver's IP address in the ARP neighbor cache.

It looks like when the buffers are full (they are pretty small of course), smoltcp can't find anywhere to queue the ARP request and a kind of a deadlock is the result: no packets can go out anymore. Once I send a packet back to the device, this fills the cache and resolves the deadlock until the timeout hits again.

Do you think this is an issue with smoltcp, with the stm32-eth driver or is it to be expected and won't fix?

whitequark · 2019-12-20T10:28:57Z

It looks like when the buffers are full (they are pretty small of course), smoltcp can't find anywhere to queue the ARP request and a kind of a deadlock is the result: no packets can go out anymore.

Wait. Which buffers get full? The ARP request is not queued anywhere; it is generated on the fly instead of the outgoing packet (the UDP packet in this case) when there is no cache entry. Do you have a log at TRACE verbosity?

birkenfeld · 2019-12-20T13:09:03Z

Oh, that's a good point! I'll see if I can extract a trace log, but probably after Christmas. Thanks for the heads up, I hadn't looked deep enough in the code.

birkenfeld · 2020-01-03T08:23:12Z

Ok, I have some more details now. First, I reduced the ENTRY_LIFETIME in the neighbor cache to 5s, to speed up reproduction.

I disabled the trace messages for UDP "buffer to send ... octets" and "sending ... octets", to be able to keep the data rate at the full 100 MBit/s.

When switching the log level to trace, the problem went away - the log looks like this

[TRACE] [0]: adding
[DEBUG] address 192.168.1.1 not in neighbor cache, sending ARP request
[TRACE] #0: neighbor 192.168.1.1 missing, silencing until t+3.000s
[TRACE] filled 192.168.1.1 => 68-05-ca-3a-c7-43 (was empty)
[TRACE] #0:192.168.1.222:54321:192.168.1.1:54321: receiving 3 octets
[TRACE] #0: neighbor 192.168.1.1 discovered, unsilencing
[DEBUG] address 192.168.1.1 not in neighbor cache, sending ARP request
[TRACE] #0: neighbor 192.168.1.1 missing, silencing until t+3.000s
[TRACE] #0: neighbor 192.168.1.1 discovered, unsilencing

with the last 3 lines repeating in 5s intervals.

After some trial and error, I removed the net_debug call here, and suddenly could reproduce the problem again!

I then replaced the debug call by a simple (not-optimized-out) delay loop, and by tuning the length of the loop I could make the problem appear and disappear.

That leads me to believe that the problem must be on the STM32 side, and there's some race condition that prevents sending that ARP packet or swallows it.

whitequark · 2020-01-03T08:30:14Z

Sounds about right. Closing as not an issue in smoltcp.

dnadlinger · 2020-01-03T08:44:58Z

Which STM32 driver are you using?

birkenfeld · 2020-01-03T12:24:20Z

https://github.com/stm32-rs/stm32-eth - is there another one I could try?

dnadlinger · 2020-01-03T12:41:08Z

@birkenfeld: There is a basic version for STM32H7… at https://github.com/quartiq/stabilizer/blob/master/src/eth.rs, which might be similar. We've had some fun with memory coherency, but off the top of my head, that shouldn't actually be a problem for the Cortex-M4 core in the STM32F4 series.

whitequark · 2020-08-09T21:49:17Z

Reopening since this was narrowed down to an issue in the ARP code in stm32-rs/stm32-eth#10.

thalesfragoso · 2020-08-11T16:16:59Z

I think I found another problem that might be the cause of the missing ARP.

During socket egress, there is a call to lookup_hardware_addr, which does the following:

smoltcp/src/iface/ethernet.rs

Lines 1593 to 1623 in cc85616

    
           match self.neighbor_cache.lookup(&dst_addr, timestamp) { 
        
               NeighborAnswer::Found(hardware_addr) => 
        
                   return Ok((hardware_addr, tx_token)), 
        
               NeighborAnswer::RateLimited => 
        
                   return Err(Error::Unaddressable), 
        
               NeighborAnswer::NotFound => (), 
        
           } 
        
           match (src_addr, dst_addr) { 
        
               #[cfg(feature = "proto-ipv4")] 
        
               (&IpAddress::Ipv4(src_addr), IpAddress::Ipv4(dst_addr)) => { 
        
                   net_debug!("address {} not in neighbor cache, sending ARP request", 
        
                              dst_addr); 
        
                   let arp_repr = ArpRepr::EthernetIpv4 { 
        
                       operation: ArpOperation::Request, 
        
                       source_hardware_addr: self.ethernet_addr, 
        
                       source_protocol_addr: src_addr, 
        
                       target_hardware_addr: EthernetAddress::BROADCAST, 
        
                       target_protocol_addr: dst_addr, 
        
                   }; 
        
                   self.dispatch_ethernet(tx_token, timestamp, arp_repr.buffer_len(), |mut frame| { 
        
                       frame.set_dst_addr(EthernetAddress::BROADCAST); 
        
                       frame.set_ethertype(EthernetProtocol::Arp); 
        
                       arp_repr.emit(&mut ArpPacket::new_unchecked(frame.payload_mut())) 
        
                   })?; 
        
                   Err(Error::Unaddressable) 
        
               }

On the first call to poll just after the address expired, the call in self.neighbor_cache.lookup will take the last branch on the method, increasing the silent_until timestamp and returning NotFound:

smoltcp/src/iface/neighbor.rs

Lines 175 to 186 in cc85616

    
           pub(crate) fn lookup(&mut self, protocol_addr: &IpAddress, timestamp: Instant) -> Answer { 
        
               match self.lookup_pure(protocol_addr, timestamp) { 
        
                   Some(hardware_addr) => 
        
                       Answer::Found(hardware_addr), 
        
                   None if timestamp < self.silent_until => 
        
                       Answer::RateLimited, 
        
                   None => { 
        
                       self.silent_until = timestamp + Self::SILENT_TIME; 
        
                       Answer::NotFound 
        
                   } 
        
               } 
        
           }

This will cause lookup_hardware_addr to craft an ARP packet and try to send it via the device, but this sending can fail with Error::Exhausted (which I think it's happening in the case of this issue), the error will bubble up to socket_egress:

smoltcp/src/iface/ethernet.rs

Lines 630 to 648 in cc85616

    
           match (device_result, socket_result) { 
        
               (Err(Error::Exhausted), _) => break,     // nowhere to transmit 
        
               (Ok(()), Err(Error::Exhausted)) => (),   // nothing to transmit 
        
               (Err(Error::Unaddressable), _) => { 
        
                   // `NeighborCache` already takes care of rate limiting the neighbor discovery 
        
                   // requests from the socket. However, without an additional rate limiting 
        
                   // mechanism, we would spin on every socket that has yet to discover its 
        
                   // neighboor. 
        
                   socket.meta_mut().neighbor_missing(timestamp, 
        
                       neighbor_addr.expect("non-IP response packet")); 
        
                   break 
        
               } 
        
               (Err(err), _) | (_, Err(err)) => { 
        
                   net_debug!("{}: cannot dispatch egress packet: {}", 
        
                              socket.meta().handle, err); 
        
                   return Err(err) 
        
               } 
        
               (Ok(()), Ok(())) => emitted_any = true 
        
           }

which will just hit a break, however, if the next call to poll happens before one second has passed (very likely), self.neighbor_cache.lookup will now return RateLimited, which will cause lookup_hardware_addr to return Error::Unaddressable which will cause socket_egress to mark the NeighborState as Waiting (for 3 seconds) for an ARP response without ever having sent a single ARP request.

The underlying problem is that we set the silent_until timestamp in self.neighbor_cache.lookup before we know if we really managed to send the ARP request.

@whitequark is my understanding of this correct ? If so, do you have a preferred way to address this ?

birkenfeld · 2020-08-17T10:08:25Z

Thanks everyone for figuring this out!

whitequark added the kind/question label Jan 3, 2020

whitequark closed this as completed Jan 3, 2020

birkenfeld mentioned this issue Jan 31, 2020

Trying to achieve max bandwidth results in (some) packets not being sent stm32-rs/stm32-eth#10

Closed

whitequark reopened this Aug 9, 2020

whitequark added kind/bug and removed kind/question labels Aug 9, 2020

thalesfragoso mentioned this issue Aug 11, 2020

Allow for ARP retry during egress #368

Merged

thalesfragoso mentioned this issue Aug 11, 2020

Add reproducer example for #10 stm32-rs/stm32-eth#21

Closed

whitequark closed this as completed in #368 Aug 11, 2020

whitequark reopened this Aug 11, 2020

thalesfragoso mentioned this issue Aug 12, 2020

Only limit the neighbor cache rate after sending a request packet #369

Merged

whitequark closed this as completed in #369 Aug 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARP requests not coming through UDP "flood" #319

ARP requests not coming through UDP "flood" #319

birkenfeld commented Dec 20, 2019

whitequark commented Dec 20, 2019

birkenfeld commented Dec 20, 2019

birkenfeld commented Jan 3, 2020

whitequark commented Jan 3, 2020

dnadlinger commented Jan 3, 2020

birkenfeld commented Jan 3, 2020

dnadlinger commented Jan 3, 2020

whitequark commented Aug 9, 2020

thalesfragoso commented Aug 11, 2020

birkenfeld commented Aug 17, 2020

ARP requests not coming through UDP "flood" #319

ARP requests not coming through UDP "flood" #319

Comments

birkenfeld commented Dec 20, 2019

whitequark commented Dec 20, 2019

birkenfeld commented Dec 20, 2019

birkenfeld commented Jan 3, 2020

whitequark commented Jan 3, 2020

dnadlinger commented Jan 3, 2020

birkenfeld commented Jan 3, 2020

dnadlinger commented Jan 3, 2020

whitequark commented Aug 9, 2020

thalesfragoso commented Aug 11, 2020

birkenfeld commented Aug 17, 2020