Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CDC issue in async FIFO: ensure that Gray code pointers are registered before resynchronisation. #38

Closed
wants to merge 2 commits into from

Conversation

Wren6991
Copy link
Contributor

@Wren6991 Wren6991 commented Mar 3, 2019

The AsyncFIFO class is subtly different from the Clifford Cummings paper referenced in the source code. Here is a diagram showing the difference:

In the current nMigen code, there is a path that goes from:

  • produce counter register in the write domain
  • -> GrayEncoder
  • -> MultiReg in the read domain.

The problem is that there is a narrow window after a write clock edge where all of the Gray pointer bits may toggle, due to static hazards in the encoder, since all encoder inputs can toggle simultaneously. If you sample this with an asynchronous clock, it's possible to see a completely wrong value for the produce pointer, which can cause irreversible corruption/loss of data.

I've only shown this for the produce pointer, but the same is true in the other direction too. The fix is to drive the Gray encoder from the binary counter register's input, rather than its output, and register the encoder output before resynchronising. From a synchronous point of view this behaves the same as the old Gray count, so the full/empty logic stays the same.

This bug will only be observed with a low probability, and may not be observed at all, e.g. if your two asynchronous clocks share some fixed rational relationship with a common clock source through PLLs, which affects the distribution of the relative phase of one observed at the rising edge of the other. However it is a legitimate bug, with fatal consequences.

This code is not ready to merge:

  • So far I've only run the FIFO smoke test on it. I haven't yet figured out how to run the formal test :( and it also needs soak testing on an FPGA board with two completely separate oscillators.
  • The pointer MultiRegs should ideally be resettable (although that is a different conversation!)
  • I added a helper class, GrayBinCounter, which is currently only used by AsyncFIFO. It contains some extra detail, like putting a no_retiming attribute on the Gray register.
    • Currently I'm passing the domain name into the __init__ of this class, like MultiReg, but I saw references elsewhere to the DomainRenamer class, which seems cleaner. However it seems you can only call this on Fragments, not Modules, and I currently don't know what a Fragment is :)
    • If you would prefer not to have an extra helper class, I can give an alternative patch that just modifies AsyncFIFO. It just looks a bit copy-pastey.

I'm really new to nMigen, so this code definitely needs a close review.

…tered before resynchronisation.

Feeding a binary counter into a Gray encoder produces static hazards on
the encoder outputs, since each XOR gate in the encoder has multiple inputs,
and potentially ALL of these may transition simultaneously, e.g.
when the binary counter rolls over.

Registering the Gray output ensures that only 1 bit of the domain-crossing
path changes at a time. This has been done in a way that does not affect the
timing or operation of the FIFO.
@Wren6991
Copy link
Contributor Author

Wren6991 commented Mar 3, 2019

Yep looks like I've broken something :D Will update once I've got formal running on my own machine, and fixed the issue.

@Wren6991
Copy link
Contributor Author

Wren6991 commented Mar 3, 2019

image

The reset values look nonsensical to me, e.g. 010 (bin) != 010 (gray). Maybe the fact it passed before was a lucky break due to the bin and gray pointers being driven by the same register, so their initial values were correct. I need to dig into this formal stuff more. Note: it fails during BMC.

@codecov
Copy link

codecov bot commented Mar 3, 2019

Codecov Report

Merging #38 into master will increase coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
+ Coverage   85.52%   85.57%   +0.05%     
==========================================
  Files          27       27              
  Lines        3918     3932      +14     
  Branches      765      765              
==========================================
+ Hits         3351     3365      +14     
  Misses        478      478              
  Partials       89       89
Impacted Files Coverage Δ
nmigen/lib/fifo.py 98.78% <100%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 342bdbe...6b30637. Read the comment docs.

@Wren6991
Copy link
Contributor Author

Wren6991 commented Mar 3, 2019

I've added an additional commit which asserts FIFO resets whilst $initstate is true, in FIFOContractSpec. On my machine I can now run coverage run -m unittest discover and all tests pass.

Since this fix is outside the scope of this PR, I am happy to open a separate one if you would prefer.

@whitequark
Copy link
Contributor

First, thanks for finding the bug! This is indeed very important, and I'm not sure how I missed it. It is definitely not present in Migen, it's an nMigen bug.

I've added an additional commit which asserts FIFO resets whilst $initstate is true, in FIFOContractSpec. On my machine I can now run coverage run -m unittest discover and all tests pass.

Unfortunately you've significantly weakened the contract by adding a reset. Before your changes, the FIFO contract would test an arbitrary initial state (any block RAM contents and any pointers). After your changes the initial state is always defined, so e.g. a FIFO that would pass just two elements through it and then hang would pass it.

@whitequark
Copy link
Contributor

Can you check if the following diff also fixes the bug, please?

diff --git a/nmigen/lib/fifo.py b/nmigen/lib/fifo.py
index 3af5c51..d4f16e1 100644
--- a/nmigen/lib/fifo.py
+++ b/nmigen/lib/fifo.py
@@ -304,10 +304,8 @@ class AsyncFIFO(FIFOInterface):
             GrayEncoder(self._ctr_bits)
         produce_cdc = m.submodules.produce_cdc = \
             MultiReg(produce_w_gry, produce_r_gry, odomain="read")
-        m.d.comb += [
-            produce_enc.i.eq(produce_w_bin),
-            produce_w_gry.eq(produce_enc.o),
-        ]
+        m.d.comb  += produce_enc.i.eq(produce_w_bin)
+        m.d.write += produce_w_gry.eq(produce_enc.o)
 
         consume_r_bin = Signal(self._ctr_bits)
         consume_r_gry = Signal(self._ctr_bits)
@@ -316,17 +314,15 @@ class AsyncFIFO(FIFOInterface):
             GrayEncoder(self._ctr_bits)
         consume_cdc = m.submodules.consume_cdc = \
             MultiReg(consume_r_gry, consume_w_gry, odomain="write")
-        m.d.comb += [
-            consume_enc.i.eq(consume_r_bin),
-            consume_r_gry.eq(consume_enc.o),
-        ]
+        m.d.comb  += consume_enc.i.eq(consume_r_bin)
+        m.d.read  += consume_r_gry.eq(consume_enc.o)
 
         m.d.comb += [
             self.writable.eq(
-                (produce_w_gry[-1]  == consume_w_gry[-1]) |
-                (produce_w_gry[-2]  == consume_w_gry[-2]) |
-                (produce_w_gry[:-2] != consume_w_gry[:-2])),
-            self.readable.eq(consume_r_gry != produce_r_gry)
+                (produce_enc.o[-1]  == consume_w_gry[-1]) |
+                (produce_enc.o[-2]  == consume_w_gry[-2]) |
+                (produce_enc.o[:-2] != consume_w_gry[:-2])),
+            self.readable.eq(consume_enc.o != produce_r_gry)
         ]
 
         do_write = self.writable & self.we
diff --git a/nmigen/test/test_lib_fifo.py b/nmigen/test/test_lib_fifo.py
index c7f4c28..e70d7be 100644
--- a/nmigen/test/test_lib_fifo.py
+++ b/nmigen/test/test_lib_fifo.py
@@ -262,8 +262,8 @@ class FIFOFormalCase(FHDLTestCase):
         # self.assertFormal(FIFOModelEquivalenceSpec(fifo, rdomain="read", wdomain="write"),
         #                   mode="bmc", depth=fifo.depth * 3 + 1)
         self.assertFormal(FIFOContractSpec(fifo, rdomain="read", wdomain="write",
-                                           bound=fifo.depth * 4 + 1),
-                          mode="hybrid", depth=fifo.depth * 4 + 1)
+                                           bound=(fifo.depth + 1) * 4),
+                          mode="hybrid", depth=(fifo.depth + 1) * 4)
 
     def test_async(self):
         self.check_async_fifo(AsyncFIFO(width=8, depth=4))

@Wren6991
Copy link
Contributor Author

Wren6991 commented Mar 3, 2019

I'm not sure how I missed it

Possibly distracted by CC's name!

Unfortunately you've significantly weakened the contract by adding a reset. Before your changes, the FIFO contract would test an arbitrary initial state (any block RAM contents and any pointers). After your changes the initial state is always defined, so e.g. a FIFO that would pass just two elements through it and then hang would pass it.

Ah okay, I see the problem.

Wouldn't you normally test this with a separate induction pass? The failure I saw was due to the design starting in an unreachable state. You are talking about reachable states, which should be testable by induction after a sufficiently long BMC. I haven't used SymbiYosys before, but have used yosys-smtbmc with a different solver backend, and always performed BMC followed by induction.

@whitequark
Copy link
Contributor

Wouldn't you normally test this with a separate induction pass?

Yes, but I never got induction to work on all the FIFO designs. This oddball BMC is the best I could do.

@Wren6991
Copy link
Contributor Author

Wren6991 commented Mar 3, 2019

That diff fixes the bug, but does introduce an extra cycle of latency in each direction. Immediate problem is defused :)

Yes, but I never got induction to work on all the FIFO designs. This oddball BMC is the best I could do.

Sounds like your experience with formal has been similar to mine... if this would be a valuable thing then I can have a play with it. It would be nice at some point in the future to win back that additional latency cycle, without breaking the test!

Thank you for taking a look at my patch.

@whitequark
Copy link
Contributor

That diff fixes the bug, but does introduce an extra cycle of latency in each direction. Immediate problem is defused :)

Oh, I see! I will do something about it. Thanks for pointing out!

  • Currently I'm passing the domain name into the __init__ of this class, like MultiReg, but I saw references elsewhere to the DomainRenamer class, which seems cleaner. However it seems you can only call this on Fragments, not Modules, and I currently don't know what a Fragment is :)

This is an open issue, #29. For now I will do the copy-pastey thing.

@Wren6991
Copy link
Contributor Author

Wren6991 commented Mar 3, 2019

This is an open issue, #29. For now I will do the copy-pastey thing.

Ah okay. I had a quick look at that area of code (Module etc) and decided I needed to learn some GoF patterns first!

If it's any help at all, this is what my "flat" patch looked like: link to commit

But this had the same issue with the FIFO contract test.

@whitequark
Copy link
Contributor

I had a quick look at that area of code (Module etc) and decided I needed to learn some GoF patterns first!

I'm not really a huge fan of GoF. That code doesn't need patterns so much as one horrible hack, and I can't decide exactly how horrible it's going to be. So it's not done just yet.

@whitequark
Copy link
Contributor

Fixed by 4027317, without any added latency. I've solved this properly by adding appropriate Assumes. You are right that this pattern should be extracted into a GrayCounter; it was in Migen, but I did not understand exactly why.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants