GlasgowEmbedded · Apr 5, 2019
Showing with 128 additions and 3 deletions.

+128 −3 software/glasgow/gateware/fx2_crossbar.py
diff --git a/software/glasgow/gateware/fx2_crossbar.py b/software/glasgow/gateware/fx2_crossbar.py
@@ -1,6 +1,131 @@
-# Synchronous FIFO timings reference:
-# http://www.cypress.com/file/138911/download#page=53
-
+# Overview
+# --------
+#
+# The FX2 has four FIFOs, but only one at a time may be selected to be read (for OUT FIFOs) or
+# written (for IN FIFOs). To achieve high transfer rates when more than one stream of data is used
+# (bidirectional channel, two one-directional channels, and so on), the FPGA has its own FIFOs that
+# mirror the FX2 FIFOs. The FX2 crossbar switch is the gateware that coordinates transfers between
+# these FIFOs, up to eight total depending on application requirements.
+#
+# Timing issues
+# -------------
+#
+# The FX2 can work in asynchronous or synchronous FIFO mode. The asynchronous mode is a bit of
+# a relic; the maximum throughput is much lower as well, so it's not useful. The synchronous mode
+# can source or accept a clock, and timings change based on this.
+#
+# Using fast parallel synchronous interfaces on an FPGA is a bit tricky. It is never safe to
+# drive a bus with combinatorial logic directly, or drive combinatorial logic from a bus, because
+# the timing relationship is neither defined (the inferred logic depth may vary, and placement
+# further affects it) nor easily enforced (it's quite nontrivial to define the right timing
+# constraints, if the toolchain allows it at all).
+#
+# The way to make things work is use a register placed inside the FPGA I/O buffer and clocked by
+# a global clock network; this makes timings consistent regardless of inferred logic or placement,
+# and the I/O buffer is qualified by only two main properties: clock-to-output delay, and input
+# capture window. Unfortunately, this adds pipelining, which complicates feedback. For example, if
+# the FPGA is asserting a write strobe and waiting for a full flag to go high, it will observe
+# the flag as high one cycle late, by which point the FIFO has overflowed, and it would take
+# another cycle for the write strobe to deassert; if the flag went high for just one cycle, then
+# a spurious write will also happen after the overflow.
+#
+# Worse yet, the combination of the FX2 and iCE40 FPGA creates another hazard. The input capture
+# window of the FPGA is long before the signals output by the FX2 are valid, and to counteract
+# this, we have to add a delay--in practice this means using DDR inputs and capturing on negative
+# clock edge. However, doing that alone would effectively halve our maximum frequency, so it's
+# necessary to re-register the input in fabric. That adds another cycle of latency.
+#
+# The FX2 has a way to compensate for one cycle of latency, the INFM1 and OEP1 FIFO configuration
+# bits. Unfortunately, this is not enough. Not only there are three cycles of latency total, but
+# this feature does not help avoiding FIFO overflows at all. For IN FIFOs, if the full flag goes
+# high one cycle before the full condition, and the FPGA-side FIFO is empty, the FX2-side FIFO
+# looks full (so if the crossbar switches to a different FIFO, it wouldn't try to fill it again),
+# but the packet in that FIFO is incomplete and not sent (so it'll never become non-full again).
+# For OUT FIFOs, the empty flag and the data are aligned in time, but when the FPGA-side FIFO
+# becomes full and the FPGA deasserts the read strobe, it's too late, as up to one more byte is
+# already in the FPGA input register. Similarly, if the empty flag is asserted for just one cycle,
+# and the crossbar switches to another FIFO pair, the tail end of the read strobe would cause
+# a spurious read.
+#
+# NOTE
+# ----
+#
+# Everything below describes a correct implementation, but the actual code here is not yet that
+# implementation. Keep this in mind.
+#
+# Handling pipelining
+# -------------------
+#
+# This unintentional pipelining is handled in two ways, different for IN and OUT FIFOs. The core
+# of the difference is that the FPGA controls the FX2-side IN FIFO, but the host controls
+# the OUT FIFO.
+#
+# For IN FIFOs, the solution is to track the FIFO level on the FPGA using a counter. This creates
+# a "perfect" full flag on the FPGA, and simplifies other things as well, such as ZLP generation.
+# (More on that later.)
+#
+# The host may explicitly purge the FX2-side FIFOs in some circumstances, e.g. changing the USB
+# configuration or interface altsetting, which would require resetting the IN level counter, but
+# this requires resetting the FPGA-side FIFO contents anyway, so it already has to be coordinated
+# via some out-of-band mechanism.
+#
+# For OUT FIFOs, the solution is to use an overflow buffer--a very small additional FIFO in front
+# of the normal large FPGA-side FIFO to absorb any writes that may happen after the strobe was
+# deasserted. (A naive approach would be to compare the FPGA-side FIFO level to get an "almost
+# full" marker, but this does not work if that FIFO is used to bridge clock domains, and in any
+# case it would result in more logic.)
+#
+# Moreover, for correct results, the FIFO address (the index of the FX2 FIFO in use) and read
+# strobe must be synchronized to the data valid flag (i.e. inverse of empty flag) and the data;
+# that is, the FIFO address and read strobe must be delayed by 3 cycles and used to select and
+# enable writes to the FPGA-side FIFO. Essentially, the FPGA-side FIFO should be driven by
+# the control signals as seen by the FX2, because only then the FX2 outputs are meaningful.
+#
+# Once the control signals that indicate FX2's state are appropriately received, generated or
+# regenerated, the purpose of the rest of the crossbar is only to provide stimulus to the FX2,
+# i.e. switch between addresses and generate read, write and packet end strobes.
+#
+# Handling packetization
+# ----------------------
+#
+# There is one more concern that needs to be handled by the crossbar. The FIFOs provided on
+# the FPGA are a byte-oriented abstraction; they have no inherent packet boundaries. However, USB
+# is a packet-oriented bus. Therefore, for IN FIFOs, the crossbar has to insert packet boundaries,
+# and because bulk endpoints place no particular requirements on when the host controller will poll
+# them, the choices made during packetization have a major impact on performance. (For OUT FIFOs,
+# the host inserts packet boundaries, and since no particular guarantees are provided by the FX2
+# as to behavior of the empty flag between packets, it doesn't make sense to expose a packet-
+# oriented interface to the rest of FPGA gateware, as it would be very asymmetric.)
+#
+# To provide control over IN packet boundaries, the crossbar uses a flush flag. If it has been
+# asserted, and the FX2-side FIFO has an incomplete packet in it, and the FPGA-side FIFO is empty,
+# the FX2 is instructed to send the incomplete packet as-is.
+#
+# To achieve the highest throughput, it is necessary to send long packets, since the FX2 only has
+# up to 4 buffers per packet (in 2-endpoint mode; 2 buffers in 4-endpoint mode), and the longer
+# the packets are, the higher is the FX2-side buffer utilization. However, this is only taking
+# the FX2 and USB protocol into account. If we consider the host controller and OS as well, it
+# becomes apparent that it is necessary to send maximum length packets.
+#
+# To understand the reason for this, consider that an application has to proivde the OS with
+# a buffer to fill with data read from the USB device. This buffer has to be a multiple of
+# the maximum packet size; if more data is returned, the extra data is discarded and an error
+# is indicated. However, what happens if less data is returned? In that case, the OS returns
+# the buffer to the application immediately. This can dramatically reduce performance: if
+# the application queues 10 8192-byte buffers, and the device returns 512 byte maximum-length
+# packets, then 160 packets can be received. However, if the device returns 511 byte packets,
+# then only 10 packets will be received!
+#
+# Unfortunately, if a device returns (for example) a single maximum-length packet and then stops,
+# then the OS will hold onto the buffer, assuming that there is more data to come; this will appear
+# as a hang. To indicate to the OS that there really is no more data, a zero-length packet needs
+# to be generated. This is where the IN FIFO level counter comes in handy as well.
+#
+# Addendum: FX2 Synchronous FIFO timings summary
+# ----------------------------------------------
+#
+# Based on: http://www.cypress.com/file/138911/download#page=53
+#
 # All timings in ns referenced to positive edge of non-inverted IFCLK.
 # "Int" means IFCLK sourced by FX2, "Ext" means IFCLK sourced by FPGA.
 #