
UART DDR design state #1301

Closed
acomodi opened this issue Feb 11, 2020 · 28 comments


acomodi commented Feb 11, 2020

This issue is to track the state of the UART DDR design that is currently under PR: #1294.

Although CI passes, the implementation still does not work properly on hardware.

How to verify it works on HW

  1. First, set up litex:
git clone https://github.com/enjoy-digital/litex
cd litex
git checkout 16d1972
sudo ./setup.py install
cd ..
  2. Program the FPGA with the obtained bitstream.
  3. Start the litex server (ttyUSBX is where the Arty device is connected through UART):
lxserver --uart --uart-port=/dev/ttyUSBX
  4. Run the test script:
./test_sdram.py

The memory calibration should find some correct delay and bitslip values, such as the following (obtained with a working 50 MHz UART DDR design):

FPGA ID:  Minimal Arty DDR3 Design for tests with Project X-Ray 2020-02-03 11:30:24
Release reset
Bring CKE high
Load Mode Register 2, CWL=5
Load Mode Register 3
Load Mode Register 1
Load Mode Register 0, CL=6, BL=8
ZQ Calibration
bitslip 0: |..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|31|
bitslip 1: |..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|

It is also possible to manually read/write memory after calibration has completed (the script selects the first valid bitslip and delay values).

Current state

There are several possible issues with the SymbiFlow implementation that affect the results. The main ones are listed below:

Inverted clock for ISERDESEs

The design is mainly driven by two clocks: a fast one (200 MHz) for the IOSERDESes, and a slow one (50 MHz) for the system logic.
In addition, the ISERDESEs make use of an inverted fast clock, which is fed into the CLKB port.
There are two ways in which this inverted clock is handled:

  1. In the Yosys+Vivado flow, the fast clock is inverted through logic and routed to a BUFGCTRL, whose output drives the CLKB port.
  2. In the full Vivado flow, the inverted and non-inverted fast clocks are the same signal, but the inverter on the CLKB port is enabled.
  3. In the Yosys+VtR flow, a BUFG must be manually inserted on the logic-inverted clock to obtain the inverted clock. There is still no information on how to enable the inverter on the CLKB port of the ISERDESE, which would avoid inverting the fast clock through logic.
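For reference, the per-pin inversion used in option 2 corresponds to the IS_CLKB_INVERTED parameter of the ISERDESE2 primitive; in a plain Vivado flow it can be set as a cell property. A sketch (the cell path below is hypothetical):

```tcl
# Enable the local inverter on the CLKB pin of an ISERDESE2 instance,
# so CLKB can be driven by the same non-inverted fast clock as CLK.
# "path/to/iserdes_inst" is a hypothetical cell name.
set_property IS_CLKB_INVERTED 1 [get_cells path/to/iserdes_inst]
```

The open question above is how to express this same inversion bit in the Yosys+VtR flow.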

Poor placement

Inspecting the .dcp file generated after fasm2bels shows that the placement chosen by VPR seems off: it spreads all the blocks in a column pattern instead of producing a more compact result:

[Screenshot: column-style placement after fasm2bels]

This can result in critical blocks being placed far away from each other (such as the white path in the image, corresponding to the clk200 reset that drives the reset signal of the IDELAYCTRL).

Timing violations

Extracting the timing summary from Vivado after the fasm2bels step shows several endpoints that fail setup/hold timing analysis (an example from the timing summary):

Clock                 WNS(ns)      TNS(ns)  TNS Failing Endpoints  TNS Total Endpoints      WHS(ns)      THS(ns)  THS Failing Endpoints  THS Total Endpoints     WPWS(ns)     TPWS(ns)  TPWS Failing Endpoints  TPWS Total Endpoints
-----                 -------      -------  ---------------------  -------------------      -------      -------  ---------------------  -------------------     --------     --------  ----------------------  --------------------
clk100                                                                                                                                                              3.000        0.000                       0                     3
  builder_pll_fb                                                                                                                                                    8.751        0.000                       0                     2
  main_clkout0        -16.445   -10445.183                   1821                 6387        0.138        0.000                      0                 6387        8.750        0.000                       0                  2736
  main_clkout1                                                                                                                                                      2.845        0.000                       0                    80
  main_clkout2                                                                                                                                                      2.845        0.000                       0                     5
  main_clkout3         -1.283       -5.834                      6                   10        0.268        0.000                      0                   10        0.264        0.000                       0                    12

Below is a tarball containing the various timing reports, both from Vivado and VPR. The commit hashes of the tools used are the following:

  1. VPR: this revision contains the memory clean-up changes and is yet to be merged into master+wip (WIP: Update to VPR (including router refactor), SymbiFlow/vtr-verilog-to-routing#393). The commit hash is SymbiFlow/vtr-verilog-to-routing@21df3a5.
  2. SymbiFlow arch-defs: the results were gathered using the open DDR UART PR (Add ddr test #1294). The commit hash is d8e3bb0.
  3. Yosys: SymbiFlow/yosys@8fe9c84

The tarball containing reports is here

Notes

  1. Regarding the timing analysis comparison between Vivado and VPR, note that VPR calculates hold and setup timing starting from the output pin of a BUFGCTRL, while Vivado calculates timing starting from the IPAD clock (the E3 pin in this case).

acomodi commented Feb 11, 2020

@litghost FYI


acomodi commented Feb 11, 2020

Clock Pessimism

Vivado's timing analysis uses a value called clock pessimism, which is absent from VPR's timing analysis.
During the arrival time and required time calculations, the circuitry common to the two clock paths ends up with different delay values. As reported in the Notes above, Vivado calculates timing for each path starting from the IPAD E3 pin, which can introduce differences along the shared portion of the paths, up to the point where they diverge.

Clock pessimism removal is used to cancel this difference, which is fictitious rather than real.

This can be seen in the following example:

Min Delay Paths
--------------------------------------------------------------------------------------
Slack (VIOLATED) :        -0.612ns  (arrival time - required time)
  Source:                 CLBLM_L_X32Y107_SLICE_X50Y107_D_FDRE/C
                            (rising edge-triggered cell FDRE clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Destination:            BRAM_L_X30Y95_RAMB36_X1Y19_RAMB36E1/DIBDI[12]
                            (rising edge-triggered cell RAMB36E1 clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Path Group:             main_clkout0
  Path Type:              Hold (Min at Slow Process Corner)
  Requirement:            0.000ns  (main_clkout0 rise@0.000ns - main_clkout0 rise@0.000ns)
  Data Path Delay:        1.070ns  (logic 0.418ns (39.075%)  route 0.652ns (60.925%))
  Logic Levels:           0  
  Clock Path Skew:        1.276ns (DCD - SCD - CPR)
    Destination Clock Delay (DCD):    11.729ns
    Source Clock Delay      (SCD):    9.645ns
    Clock Pessimism Removal (CPR):    0.808ns

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock main_clkout0 rise edge)
                                                      0.000     0.000 r  
    E3                                                0.000     0.000 r  clk100 (IN)
                         net (fo=0)                   0.000     0.000    clk100
    E3                                                                r  RIOB33_X43Y75_IOB_X1Y76_IBUF/I
    E3                   IBUF (Prop_ibuf_I_O)         1.418     1.418 r  RIOB33_X43Y75_IOB_X1Y76_IBUF/O
                         net (fo=1, routed)           3.255     4.674    RIOB33_X43Y75_IOB_X1Y76_I
    BUFGCTRL_X0Y16                                                    r  BUFGCLK_BUFG_TOP_R_X60Y53_BUFGCTRL_X0Y16_BUFGCTRL/I0
    BUFGCTRL_X0Y16       BUFGCTRL (Prop_bufgctrl_I0_O)
                                                      0.091     4.765 r  BUFGCLK_BUFG_TOP_R_X60Y53_BUFGCTRL_X0Y16_BUFGCTRL/O
                         net (fo=1, routed)           0.642     5.407    main_pll_clkin
    BUFHCE_X1Y6                                                       r  CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_BUFHCE/I
    BUFHCE_X1Y6          BUFHCE (Prop_bufhce_I_O)     0.081     5.488 r  CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_BUFHCE/O
                         net (fo=1, routed)           0.796     6.284    CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_O
    PLLE2_ADV_X1Y0                                                    r  PLLE2_ADVCMT_TOP_L_UPPER_T_X106Y44_PLLE2_ADV_X1Y0_PLLE2_ADV/CLKIN1
    PLLE2_ADV_X1Y0       PLLE2_ADV (Prop_plle2_adv_CLKIN1_CLKOUT0)
                                                      0.083     6.367 r  PLLE2_ADVCMT_TOP_L_UPPER_T_X106Y44_PLLE2_ADV_X1Y0_PLLE2_ADV/CLKOUT0
                         net (fo=1, routed)           1.576     7.943    main_clkout0
    BUFGCTRL_X0Y0                                                     r  CLK_BUFG_BOT_R_X60Y48_BUFGCTRL_X0Y0_BUFGCTRL/I0
    BUFGCTRL_X0Y0        BUFGCTRL (Prop_bufgctrl_I0_O)
                                                      0.091     8.034 r  CLK_BUFG_BOT_R_X60Y48_BUFGCTRL_X0Y0_BUFGCTRL/O
                         net (fo=5, routed)           0.803     8.837    sys_clk__main_clkout_buf0
    BUFHCE_X1Y30                                                      r  CLK_HROW_TOP_R_X60Y130_BUFHCE_X1Y30_BUFHCE/I
    BUFHCE_X1Y30         BUFHCE (Prop_bufhce_I_O)     0.081     8.918 r  CLK_HROW_TOP_R_X60Y130_BUFHCE_X1Y30_BUFHCE/O
                         net (fo=355, routed)         0.727     9.645    CLK_HROW_TOP_R_X60Y130_BUFHCE_X1Y30_O
    SLICE_X50Y107        FDRE                                         r  CLBLM_L_X32Y107_SLICE_X50Y107_D_FDRE/C
  -------------------------------------------------------------------    -------------------
    SLICE_X50Y107        FDRE (Prop_fdre_C_Q)         0.418    10.063 r  CLBLM_L_X32Y107_SLICE_X50Y107_D_FDRE/Q
                         net (fo=18, routed)          0.652    10.715    main_uart_wishbone_dat_w[6]...
    RAMB36_X1Y19         RAMB36E1                                     r  BRAM_L_X30Y95_RAMB36_X1Y19_RAMB36E1/DIBDI[12]
  -------------------------------------------------------------------    -------------------

                         (clock main_clkout0 rise edge)
                                                      0.000     0.000 r  
    E3                                                0.000     0.000 r  clk100 (IN)
                         net (fo=0)                   0.000     0.000    clk100
    E3                                                                r  RIOB33_X43Y75_IOB_X1Y76_IBUF/I
    E3                   IBUF (Prop_ibuf_I_O)         1.489     1.489 r  RIOB33_X43Y75_IOB_X1Y76_IBUF/O
                         net (fo=1, routed)           3.780     5.269    RIOB33_X43Y75_IOB_X1Y76_I
    BUFGCTRL_X0Y16                                                    r  BUFGCLK_BUFG_TOP_R_X60Y53_BUFGCTRL_X0Y16_BUFGCTRL/I0
    BUFGCTRL_X0Y16       BUFGCTRL (Prop_bufgctrl_I0_O)
                                                      0.096     5.365 r  BUFGCLK_BUFG_TOP_R_X60Y53_BUFGCTRL_X0Y16_BUFGCTRL/O
                         net (fo=1, routed)           0.675     6.041    main_pll_clkin
    BUFHCE_X1Y6                                                       r  CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_BUFHCE/I
    BUFHCE_X1Y6          BUFHCE (Prop_bufhce_I_O)     0.127     6.168 r  CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_BUFHCE/O
                         net (fo=1, routed)           0.835     7.003    CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_O
    PLLE2_ADV_X1Y0                                                    r  PLLE2_ADVCMT_TOP_L_UPPER_T_X106Y44_PLLE2_ADV_X1Y0_PLLE2_ADV/CLKIN1
    PLLE2_ADV_X1Y0       PLLE2_ADV (Prop_plle2_adv_CLKIN1_CLKOUT0)
                                                      0.088     7.091 r  PLLE2_ADVCMT_TOP_L_UPPER_T_X106Y44_PLLE2_ADV_X1Y0_PLLE2_ADV/CLKOUT0
                         net (fo=1, routed)           1.655     8.746    main_clkout0
    BUFGCTRL_X0Y0                                                     r  CLK_BUFG_BOT_R_X60Y48_BUFGCTRL_X0Y0_BUFGCTRL/I0
    BUFGCTRL_X0Y0        BUFGCTRL (Prop_bufgctrl_I0_O)
                                                      0.096     8.842 r  CLK_BUFG_BOT_R_X60Y48_BUFGCTRL_X0Y0_BUFGCTRL/O
                         net (fo=5, routed)           0.661     9.503    sys_clk__main_clkout_buf0
    BUFHCE_X0Y20                                                      r  CLK_HROW_TOP_R_X60Y78_BUFHCE_X0Y20_BUFHCE/I
    BUFHCE_X0Y20         BUFHCE (Prop_bufhce_I_O)     0.127     9.630 r  CLK_HROW_TOP_R_X60Y78_BUFHCE_X0Y20_BUFHCE/O
                         net (fo=3, routed)           1.160    10.790    CLK_HROW_TOP_R_X60Y78_BUFHCE_X0Y20_O
    BUFHCE_X1Y17                                                      r  CLK_HROW_TOP_R_X60Y78_BUFHCE_X1Y17_BUFHCE/I
    BUFHCE_X1Y17         BUFHCE (Prop_bufhce_I_O)     0.127    10.917 r  CLK_HROW_TOP_R_X60Y78_BUFHCE_X1Y17_BUFHCE/O
                         net (fo=1233, routed)        0.812    11.729    CLK_HROW_TOP_R_X60Y78_BUFHCE_X1Y17_O
    RAMB36_X1Y19         RAMB36E1                                     r  BRAM_L_X30Y95_RAMB36_X1Y19_RAMB36E1/CLKBWRCLK
                         clock pessimism             -0.808    10.922    
    RAMB36_X1Y19         RAMB36E1 (Hold_ramb36e1_CLKBWRCLK_DIBDI[12])
                                                      0.405    11.327    BRAM_L_X30Y95_RAMB36_X1Y19_RAMB36E1
  -------------------------------------------------------------------
                         required time                        -11.327    
                         arrival time                          10.715    
  -------------------------------------------------------------------
                         slack                                 -0.612    

The last point common to both clock paths is the BUFGCTRL_X0Y0 output: the arrival time calculation reports a delay of 8.034 ns there, while for the same route the required time calculation reports 8.842 ns.

Clock pessimism removal cancels exactly this difference (8.842 - 8.034 = 0.808 ns, matching the reported CPR).
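The skew and slack figures in the report can be reproduced from the formulas Vivado prints (DCD - SCD - CPR for clock path skew, arrival minus required for hold slack). A quick check with the numbers above:

```python
# Values taken from the Vivado hold report above (all in ns).
dcd = 11.729   # Destination Clock Delay
scd = 9.645    # Source Clock Delay
cpr = 0.808    # Clock Pessimism Removal

# Clock path skew as Vivado computes it: DCD - SCD - CPR.
skew = round(dcd - scd - cpr, 3)
print(skew)    # 1.276, matching the reported Clock Path Skew

# Hold slack: arrival time minus required time.
arrival = 10.715
required = 11.327
slack = round(arrival - required, 3)
print(slack)   # -0.612, matching the reported (VIOLATED) slack
```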

This should not affect VPR timing analysis, as timing is already calculated starting from the BUFGCTRL output. In fact, the same path is reported by VPR as follows:

#Path 2
Startpoint: $auto$simplemap.cc:442:simplemap_dffe$25765.Q[0] (FDRE_ZINI at (108,45) clocked by sys_clk__main_clkout_buf0)
Endpoint  : mem.0.0.0.DIBDI[6] (RAMB18E1_VPR at (100,59) clocked by sys_clk__main_clkout_buf0)
Path Type : hold

Point                                                                                           Incr      Path
--------------------------------------------------------------------------------------------------------------
clock sys_clk__main_clkout_buf0 (rise edge)                                                    0.000     0.000
clock source latency                                                                           0.000     0.000
BUFG_1.O[0] (BUFGCTRL_VPR at (79,111))                                                         0.000     0.000
| (intra 'BLK-TL-BUFGCTRL' routing)                                                            0.000     0.000
| (inter-block routing)                                                                        1.027     1.027
| (intra 'BLK-TL-SLICEM' routing)                                                              0.000     1.027
$auto$simplemap.cc:442:simplemap_dffe$25765.C[0] (FDRE_ZINI at (108,45))                       0.000     1.027
| (primitive 'FDRE_ZINI' Tcq_min)                                                              0.118     1.145
$auto$simplemap.cc:442:simplemap_dffe$25765.Q[0] (FDRE_ZINI at (108,45)) [clock-to-output]     0.000     1.145
| (intra 'BLK-TL-SLICEM' routing)                                                              0.000     1.145
| (inter-block routing)                                                                        0.762     1.907
| (intra 'BLK-TL-BRAM_L' routing)                                                              0.000     1.907
mem.0.0.0.DIBDI[6] (RAMB18E1_VPR at (100,59))                                                  0.000     1.907
data arrival time                                                                                        1.907

clock sys_clk__main_clkout_buf0 (rise edge)                                                    0.000     0.000
clock source latency                                                                           0.000     0.000
BUFG_1.O[0] (BUFGCTRL_VPR at (79,111))                                                         0.000     0.000
| (intra 'BLK-TL-BUFGCTRL' routing)                                                            0.000     0.000
| (inter-block routing)                                                                        2.083     2.083
| (intra 'BLK-TL-BRAM_L' routing)                                                              0.000     2.083
mem.0.0.0.CLKBWRCLK[0] (RAMB18E1_VPR at (100,59))                                              0.000     2.083
clock uncertainty                                                                              0.000     2.083
cell hold time                                                                                -0.241     1.842
data required time                                                                                       1.842
--------------------------------------------------------------------------------------------------------------
data required time                                                                                      -1.842
data arrival time                                                                                        1.907
--------------------------------------------------------------------------------------------------------------
slack (MET)                                                                                              0.065


acomodi commented Feb 11, 2020

BUFHCE incorrect routing

[Screenshot: BUFHCE to BUFHCE connection]

[Screenshot: zoom in on the BUFHCEs]

There is a situation in which a clock is routed incorrectly between BUFHCEs.
As shown in the image, the white horizontal line is a BUFHCE to BUFHCE connection for a clock-crossing net.

The routing is incorrect because the input of the BUFHCE on the right should be driven from a BUFG. Since the clock is the same, the BUFHCE on the right should get the clock from the same pip as the left BUFHCE. Instead, VPR chooses a longer route through the whole horizontal clock row to reach the left CMT and come back to the BUFHCE on the right.

This adds a huge delay, which is not actually captured by VPR's timing analysis:

Vivado report

Slack (VIOLATED) :        -0.450ns  (arrival time - required time)
  Source:                 CLBLM_L_X32Y49_SLICE_X50Y49_D_FDRE/C
                            (rising edge-triggered cell FDRE clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Destination:            $auto$simplemap.cc:420:simplemap_dff$18568CLBLM_L_X32Y53_SLICE_X50Y53_D5_FDRE/D
                            (rising edge-triggered cell FDRE clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Path Group:             main_clkout0
  Path Type:              Hold (Min at Slow Process Corner)
  Requirement:            0.000ns  (main_clkout0 rise@0.000ns - main_clkout0 rise@0.000ns)
  Data Path Delay:        1.179ns  (logic 0.418ns (35.453%)  route 0.761ns (64.547%))
  Logic Levels:           0  
  Clock Path Skew:        1.393ns (DCD - SCD - CPR)
    Destination Clock Delay (DCD):    11.688ns
    Source Clock Delay      (SCD):    9.487ns
    Clock Pessimism Removal (CPR):    0.808ns

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock main_clkout0 rise edge)
                                                      0.000     0.000 r  
    E3                                                0.000     0.000 r  clk100 (IN)
                         net (fo=0)                   0.000     0.000    clk100
    E3                                                                r  RIOB33_X43Y75_IOB_X1Y76_IBUF/I
    E3                   IBUF (Prop_ibuf_I_O)         1.418     1.418 r  RIOB33_X43Y75_IOB_X1Y76_IBUF/O
                         net (fo=1, routed)           3.255     4.674    RIOB33_X43Y75_IOB_X1Y76_I
    BUFGCTRL_X0Y16                                                    r  BUFGCLK_BUFG_TOP_R_X60Y53_BUFGCTRL_X0Y16_BUFGCTRL/I0
    BUFGCTRL_X0Y16       BUFGCTRL (Prop_bufgctrl_I0_O)
                                                      0.091     4.765 r  BUFGCLK_BUFG_TOP_R_X60Y53_BUFGCTRL_X0Y16_BUFGCTRL/O
                         net (fo=1, routed)           0.642     5.407    main_pll_clkin
    BUFHCE_X1Y6                                                       r  CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_BUFHCE/I
    BUFHCE_X1Y6          BUFHCE (Prop_bufhce_I_O)     0.081     5.488 r  CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_BUFHCE/O
                         net (fo=1, routed)           0.796     6.284    CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_O
    PLLE2_ADV_X1Y0                                                    r  PLLE2_ADVCMT_TOP_L_UPPER_T_X106Y44_PLLE2_ADV_X1Y0_PLLE2_ADV/CLKIN1
    PLLE2_ADV_X1Y0       PLLE2_ADV (Prop_plle2_adv_CLKIN1_CLKOUT0)
                                                      0.083     6.367 r  PLLE2_ADVCMT_TOP_L_UPPER_T_X106Y44_PLLE2_ADV_X1Y0_PLLE2_ADV/CLKOUT0
                         net (fo=1, routed)           1.576     7.943    main_clkout0
    BUFGCTRL_X0Y0                                                     r  CLK_BUFG_BOT_R_X60Y48_BUFGCTRL_X0Y0_BUFGCTRL/I0
    BUFGCTRL_X0Y0        BUFGCTRL (Prop_bufgctrl_I0_O)
                                                      0.091     8.034 r  CLK_BUFG_BOT_R_X60Y48_BUFGCTRL_X0Y0_BUFGCTRL/O
                         net (fo=5, routed)           0.639     8.673    sys_clk__main_clkout_buf0
    BUFHCE_X1Y4                                                       r  CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y4_BUFHCE/I
    BUFHCE_X1Y4          BUFHCE (Prop_bufhce_I_O)     0.081     8.754 r  CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y4_BUFHCE/O
                         net (fo=870, routed)         0.733     9.487    CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y4_O
    SLICE_X50Y49         FDRE                                         r  CLBLM_L_X32Y49_SLICE_X50Y49_D_FDRE/C
  -------------------------------------------------------------------    -------------------
    SLICE_X50Y49         FDRE (Prop_fdre_C_Q)         0.418     9.905 r  CLBLM_L_X32Y49_SLICE_X50Y49_D_FDRE/Q
                         net (fo=12, routed)          0.761    10.666    main_uart_wishbone_dat_w[12]__builder_basesoc_shared_dat_w[12]__builder_rhs_array_muxed37[12]__builder_rhs_array_muxed45[12]__main_bus_wishbone_dat_w[12]__main_interface0_wb_sdram_dat_w[12]__main_interface1_wb_sdram_dat_w[12]__main_sram_bus_dat_w[12]__main_sram_dat_w[12]__main_uart_data[12]
    SLICE_X50Y53         FDRE                                         r  $auto$simplemap.cc:420:simplemap_dff$18568CLBLM_L_X32Y53_SLICE_X50Y53_D5_FDRE/D
  -------------------------------------------------------------------    -------------------

                         (clock main_clkout0 rise edge)
                                                      0.000     0.000 r  
    E3                                                0.000     0.000 r  clk100 (IN)
                         net (fo=0)                   0.000     0.000    clk100
    E3                                                                r  RIOB33_X43Y75_IOB_X1Y76_IBUF/I
    E3                   IBUF (Prop_ibuf_I_O)         1.489     1.489 r  RIOB33_X43Y75_IOB_X1Y76_IBUF/O
                         net (fo=1, routed)           3.780     5.269    RIOB33_X43Y75_IOB_X1Y76_I
    BUFGCTRL_X0Y16                                                    r  BUFGCLK_BUFG_TOP_R_X60Y53_BUFGCTRL_X0Y16_BUFGCTRL/I0
    BUFGCTRL_X0Y16       BUFGCTRL (Prop_bufgctrl_I0_O)
                                                      0.096     5.365 r  BUFGCLK_BUFG_TOP_R_X60Y53_BUFGCTRL_X0Y16_BUFGCTRL/O
                         net (fo=1, routed)           0.675     6.041    main_pll_clkin
    BUFHCE_X1Y6                                                       r  CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_BUFHCE/I
    BUFHCE_X1Y6          BUFHCE (Prop_bufhce_I_O)     0.127     6.168 r  CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_BUFHCE/O
                         net (fo=1, routed)           0.835     7.003    CLK_HROW_BOT_R_X60Y26_BUFHCE_X1Y6_O
    PLLE2_ADV_X1Y0                                                    r  PLLE2_ADVCMT_TOP_L_UPPER_T_X106Y44_PLLE2_ADV_X1Y0_PLLE2_ADV/CLKIN1
    PLLE2_ADV_X1Y0       PLLE2_ADV (Prop_plle2_adv_CLKIN1_CLKOUT0)
                                                      0.088     7.091 r  PLLE2_ADVCMT_TOP_L_UPPER_T_X106Y44_PLLE2_ADV_X1Y0_PLLE2_ADV/CLKOUT0
                         net (fo=1, routed)           1.655     8.746    main_clkout0
    BUFGCTRL_X0Y0                                                     r  CLK_BUFG_BOT_R_X60Y48_BUFGCTRL_X0Y0_BUFGCTRL/I0
    BUFGCTRL_X0Y0        BUFGCTRL (Prop_bufgctrl_I0_O)
                                                      0.096     8.842 r  CLK_BUFG_BOT_R_X60Y48_BUFGCTRL_X0Y0_BUFGCTRL/O
                         net (fo=5, routed)           0.661     9.503    sys_clk__main_clkout_buf0
    BUFHCE_X0Y20                                                      r  CLK_HROW_TOP_R_X60Y78_BUFHCE_X0Y20_BUFHCE/I
    BUFHCE_X0Y20         BUFHCE (Prop_bufhce_I_O)     0.127     9.630 r  CLK_HROW_TOP_R_X60Y78_BUFHCE_X0Y20_BUFHCE/O
                         net (fo=3, routed)           1.160    10.790    CLK_HROW_TOP_R_X60Y78_BUFHCE_X0Y20_O
    BUFHCE_X1Y17                                                      r  CLK_HROW_TOP_R_X60Y78_BUFHCE_X1Y17_BUFHCE/I
    BUFHCE_X1Y17         BUFHCE (Prop_bufhce_I_O)     0.127    10.917 r  CLK_HROW_TOP_R_X60Y78_BUFHCE_X1Y17_BUFHCE/O
                         net (fo=1233, routed)        0.771    11.688    CLK_HROW_TOP_R_X60Y78_BUFHCE_X1Y17_O
    SLICE_X50Y53         FDRE                                         r  $auto$simplemap.cc:420:simplemap_dff$18568CLBLM_L_X32Y53_SLICE_X50Y53_D5_FDRE/C
                         clock pessimism             -0.808    10.880    
    SLICE_X50Y53         FDRE (Hold_fdre_C_D)         0.236    11.116    $auto$simplemap.cc:420:simplemap_dff$18568CLBLM_L_X32Y53_SLICE_X50Y53_D5_FDRE
  -------------------------------------------------------------------
                         required time                        -11.116    
                         arrival time                          10.666    
  -------------------------------------------------------------------
                         slack                                 -0.450    

VPR report

#Path 1
Startpoint: $auto$simplemap.cc:442:simplemap_dffe$25771.Q[0] (FDRE_ZINI at (108,108) clocked by sys_clk__main_clkout_buf0)
Endpoint  : $auto$simplemap.cc:420:simplemap_dff$18568.D[0] (FDRE_ZINI at (108,102) clocked by sys_clk__main_clkout_buf0)
Path Type : hold

Point                                                                                            Incr      Path
---------------------------------------------------------------------------------------------------------------
clock sys_clk__main_clkout_buf0 (rise edge)                                                     0.000     0.000
clock source latency                                                                            0.000     0.000
BUFG_1.O[0] (BUFGCTRL_VPR at (79,111))                                                          0.000     0.000
| (intra 'BLK-TL-BUFGCTRL' routing)                                                             0.000     0.000
| (inter-block routing)                                                                         1.027     1.027
| (intra 'BLK-TL-SLICEM' routing)                                                               0.000     1.027
$auto$simplemap.cc:442:simplemap_dffe$25771.C[0] (FDRE_ZINI at (108,108))                       0.000     1.027
| (primitive 'FDRE_ZINI' Tcq_min)                                                               0.118     1.145
$auto$simplemap.cc:442:simplemap_dffe$25771.Q[0] (FDRE_ZINI at (108,108)) [clock-to-output]     0.000     1.145
| (intra 'BLK-TL-SLICEM' routing)                                                               0.000     1.145
| (inter-block routing)                                                                         0.891     2.036
| (intra 'BLK-TL-SLICEM' routing)                                                               0.000     2.036
$auto$simplemap.cc:420:simplemap_dff$18568.D[0] (FDRE_ZINI at (108,102))                        0.000     2.036
data arrival time                                                                                         2.036

clock sys_clk__main_clkout_buf0 (rise edge)                                                     0.000     0.000
clock source latency                                                                            0.000     0.000
BUFG_1.O[0] (BUFGCTRL_VPR at (79,111))                                                          0.000     0.000
| (intra 'BLK-TL-BUFGCTRL' routing)                                                             0.000     0.000
| (inter-block routing)                                                                         2.041     2.041
| (intra 'BLK-TL-SLICEM' routing)                                                               0.000     2.041
$auto$simplemap.cc:420:simplemap_dff$18568.C[0] (FDRE_ZINI at (108,102))                        0.000     2.041
clock uncertainty                                                                               0.000     2.041
cell hold time                                                                                  0.194     2.235
data required time                                                                                        2.235
---------------------------------------------------------------------------------------------------------------
data required time                                                                                       -2.235
data arrival time                                                                                         2.036
---------------------------------------------------------------------------------------------------------------
slack (VIOLATED)                                                                                         -0.199

Timing delay to route to the FDRE/C endpoint in required time calculation:

  • Vivado: 11.688 - 8.746 = 2.942 ns. 11.688 is the delay to get to the FDRE/C, while 8.746 is the delay to get to the BUFGCTRL.
  • VPR: 2.041 ns

The same nets are traversed in the required-time calculation by both VPR and Vivado, but VPR may rely on a timing model that does not capture all of the delays in the path.
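For reference, the hold-check arithmetic in the VPR report above works out as follows (a minimal sketch; all numbers are copied verbatim from the report):

```python
# Hold-slack arithmetic from the VPR report above.
# For a hold check: slack = data arrival time - data required time,
# where required = capture clock arrival + cell hold time.
clock_arrival = 2.041    # ns, BUFG -> FDRE/C capture clock path
cell_hold_time = 0.194   # ns, FDRE hold requirement
data_arrival = 2.036     # ns, launch clock + data path delay

data_required = clock_arrival + cell_hold_time
slack = data_arrival - data_required
print(f"required = {data_required:.3f} ns, slack = {slack:.3f} ns")
```

This reproduces the reported required time of 2.235 ns and the -0.199 ns (VIOLATED) slack.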

@litghost
Contributor

The Vivado timing report shows a WNS of -16 ns; which net is that?
The Vivado timing report is showing no hold violations right now; what changed?

@acomodi
Contributor Author

acomodi commented Feb 11, 2020

@litghost I forgot to mention that to output more detailed timing reports for VPR I added the following parameters:

--full_stats on --timing_report_npaths 10000 --timing_report_detail aggregated --timing_report_skew on

@litghost
Contributor

litghost commented Feb 11, 2020

I believe the Vivado timing analysis requires additional constraints; can you please provide the constraints you used?

@acomodi
Contributor Author

acomodi commented Feb 11, 2020

@litghost just the clock creation.

create_clock clk100 -period 10.0

Another thing to add: there should be false paths set for the reset flip-flops, because they are asynchronous resets. The XDC file from the minitest uses these constraints:

set_false_path -quiet -to [get_nets -quiet -filter {mr_ff == TRUE}]

set_false_path -quiet -to [get_pins -quiet -filter {REF_PIN_NAME == PRE} -of [get_cells -quiet -filter {ars_ff1 == TRUE || ars_ff2 == TRUE}]]

set_max_delay 2 -quiet -from [get_pins -quiet -filter {REF_PIN_NAME == Q} -of [get_cells -quiet -filter {ars_ff1 == TRUE}]] -to [get_pins -quiet -filter {REF_PIN_NAME == D} -of [get_cells -quiet -filter {ars_ff2 == TRUE}]]

ars_ff1 and ars_ff2 mark the flip-flops used by the asynchronous reset signals.

@litghost
Contributor

One thing we should probably add to add_vivado_target is support for custom XDC (if we don't have it already).

@litghost
Contributor

As a random thought, did we try increasing the criticality of nets touching clocks yet? I expect it might have a positive effect on #1301 (comment)

@litghost
Contributor

I believe the problem is that the ISERDES and OSERDES are missing all of their timing models, both in prjxray-db and in symbiflow-arch-defs. The bad timing paths stem from these missing arcs.

@litghost
Contributor

I believe the problem is that the ISERDES and OSERDES are missing all of their timing models, both in prjxray-db and in symbiflow-arch-defs. The bad timing paths stem from these missing arcs.

I've added I/OSERDES timing data to the prjxray-db 007 fuzzer in f4pga/prjxray#1229

@litghost
Contributor

I've identified the same timing arc in each analysis, and VPR's model is wrong. Reports are attached:
worst_path.zip

@litghost
Contributor

litghost commented Feb 13, 2020

Basic analysis looks like an under-predict of the interconnect delay. The tcl function write_timing_info can be used on the output of fasm2bels to generate a JSON suitable for use with https://github.com/SymbiFlow/prjxray/blob/master/minitests/timing/create_timing_worksheet_db.py that can be used to examine how the interconnect delays were computed. I've started a run of write_timing_info, but it will take a while to run. I expect in the morning I can post the JSON output from the design, and we can compare Vivado interconnect timing model with the model we implemented in VPR and see if the problem doesn't jump out.


@acomodi
Contributor Author

acomodi commented Feb 13, 2020

By running VPR with some parameters I could get better timing results out of Vivado (still violating the setup)

  1. Packer:
  • --allow_unrelated_clustering off: this is to pack nets with different scopes into different clusters
  • --cluster_seed_type timing: this should enable the packer to choose the timing critical primitives first
  • --alpha_clustering 1.0: this optimizes solely in timing instead of area
  • --connection_driven_clustering off: this actually didn't seem to have a huge impact.
  2. Placer:
  • --timing_tradeoff 1.0: this optimizes entirely on timing.

With these settings I got the worst setup path to be -6.383 ns
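The flags above can be collected into a single option map (a sketch; the flag names and values are taken verbatim from the list, and dict insertion order is preserved in Python 3.7+):

```python
# VPR packer/placer tuning flags from the list above, assembled into a
# command-line fragment.
vpr_tuning = {
    "--allow_unrelated_clustering": "off",    # packer
    "--cluster_seed_type": "timing",          # packer
    "--alpha_clustering": "1.0",              # packer
    "--connection_driven_clustering": "off",  # packer
    "--timing_tradeoff": "1.0",               # placer
}
extra_args = [tok for kv in vpr_tuning.items() for tok in kv]
print(" ".join(extra_args))
```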

@litghost
Contributor

By running VPR with some parameters I could get better timing results out of Vivado (still violating the setup)

  1. Packer:
  • --allow_unrelated_clustering off: this is to pack nets with different scopes into different clusters
  • --cluster_seed_type timing: this should enable the packer to choose the timing critical primitives first
  • --alpha_clustering 1.0: this optimizes solely in timing instead of area
  • --connection_driven_clustering off: this actually didn't seem to have a huge impact.
  2. Placer:
  • --timing_tradeoff 1.0: this optimizes entirely on timing.

With these settings I got the worst setup path to be -6.383 ns

There is absolutely zero point in tuning parameters until the timing model is fixed.

@litghost
Contributor

timing.zip

@litghost
Contributor

Initial fixes to create_timing_worksheet_db.py are here: f4pga/prjxray#1235

Initial timing worksheet from updated scripts: timing_ws.zip

@litghost
Contributor

Turns out trying to have ~6500 timing calculations in 1 spreadsheet doesn't work great. I've added filtering to the timing worksheet generator, and here is the filtered spreadsheet:
filtered_timing_ws.zip

@litghost
Contributor

I may have identified the problem, and I'm working on testing a solution.

@litghost
Contributor

Using the new timing data from f4pga/prjxray#1238 and this commit, I believe we have a design that closes timing.

I do not have access to an Arty right now, so I have attached the output from the P&R. @acomodi / @mithro if you could test the bitstream to see if it is working, that would be awesome.

ddr_uart_arty.zip

Vivado is reporting 4 very small hold violations, but I hope the hold violations are small enough to be within timing model margins:

Slack (VIOLATED) :        -0.034ns  (arrival time - required time)
  Source:                 CLBLM_L_X22Y98_SLICE_X34Y98_B_FDRE/C
                            (rising edge-triggered cell FDRE clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Destination:            CLBLM_L_X22Y103_SLICE_X34Y103_RAM64X1D_CD/DP/WADR1
                            (rising edge-triggered cell RAMD64E clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Path Group:             main_clkout0
  Path Type:              Hold (Min at Fast Process Corner)
  Requirement:            0.000ns  (main_clkout0 rise@0.000ns - main_clkout0 rise@0.000ns)
  Data Path Delay:        0.631ns  (logic 0.164ns (25.997%)  route 0.467ns (74.003%))
  Logic Levels:           0  
  Clock Path Skew:        0.355ns (DCD - SCD - CPR)
--
Slack (VIOLATED) :        -0.034ns  (arrival time - required time)
  Source:                 CLBLM_L_X22Y98_SLICE_X34Y98_B_FDRE/C
                            (rising edge-triggered cell FDRE clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Destination:            CLBLM_L_X22Y103_SLICE_X34Y103_RAM64X1D_CD/SP/WADR1
                            (rising edge-triggered cell RAMD64E clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Path Group:             main_clkout0
  Path Type:              Hold (Min at Fast Process Corner)
  Requirement:            0.000ns  (main_clkout0 rise@0.000ns - main_clkout0 rise@0.000ns)
  Data Path Delay:        0.631ns  (logic 0.164ns (25.997%)  route 0.467ns (74.003%))
  Logic Levels:           0  
  Clock Path Skew:        0.355ns (DCD - SCD - CPR)
--
Slack (VIOLATED) :        -0.026ns  (arrival time - required time)
  Source:                 CLBLM_R_X25Y99_SLICE_X38Y99_A_FDRE/C
                            (rising edge-triggered cell FDRE clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Destination:            CLBLM_L_X22Y102_SLICE_X34Y102_RAM64X1D_CD/DP/WADR2
                            (rising edge-triggered cell RAMD64E clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Path Group:             main_clkout0
  Path Type:              Hold (Min at Fast Process Corner)
  Requirement:            0.000ns  (main_clkout0 rise@0.000ns - main_clkout0 rise@0.000ns)
  Data Path Delay:        0.583ns  (logic 0.164ns (28.122%)  route 0.419ns (71.878%))
  Logic Levels:           0  
  Clock Path Skew:        0.355ns (DCD - SCD - CPR)
--
Slack (VIOLATED) :        -0.026ns  (arrival time - required time)
  Source:                 CLBLM_R_X25Y99_SLICE_X38Y99_A_FDRE/C
                            (rising edge-triggered cell FDRE clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Destination:            CLBLM_L_X22Y102_SLICE_X34Y102_RAM64X1D_CD/SP/WADR2
                            (rising edge-triggered cell RAMD64E clocked by main_clkout0  {rise@0.000ns fall@10.000ns period=20.000ns})
  Path Group:             main_clkout0
  Path Type:              Hold (Min at Fast Process Corner)
  Requirement:            0.000ns  (main_clkout0 rise@0.000ns - main_clkout0 rise@0.000ns)
  Data Path Delay:        0.583ns  (logic 0.164ns (28.122%)  route 0.419ns (71.878%))
  Logic Levels:           0  
  Clock Path Skew:        0.355ns (DCD - SCD - CPR)
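As a sanity check, the data-path breakdown in the reports above can be reproduced (the percentages only match approximately, since the report rounds the delays to three decimals):

```python
# Cross-check the first violated hold path: logic + route delay should sum
# to the reported 0.631 ns data path delay.
logic, route = 0.164, 0.467   # ns, from the report above
data_path = logic + route
logic_pct = 100.0 * logic / data_path
print(f"data path = {data_path:.3f} ns, logic = {logic_pct:.2f}%")
```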

@kgugala
Contributor

kgugala commented Feb 15, 2020

@litghost @mithro @acomodi I have tested the design. Unfortunately the bitstream does not work

@acomodi
Contributor Author

acomodi commented Feb 15, 2020

@litghost @kgugala @mithro I have noticed some wrong PULLTYPE settings in the FASM: a normal Yosys+Vivado flow sets some PULLTYPE features to NONE for the DDR-related IOBs, while SymbiFlow still does not support those.

By manually updating the wrong PULLTYPES from PULLDOWN to NONE in the top.fasm in the zip archive, and by regenerating the bitstream with fasm2frames and frames2bit, the DDR test now works on HW as expected!
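The manual edit can be sketched as a small filter over the FASM text (the feature name below is a hypothetical example, not taken from the actual top.fasm; the assumption is that dropping a feature line leaves it at its default):

```python
# A sketch of the manual FASM edit described above: drop the spurious
# PULLTYPE.PULLDOWN features so the affected pins fall back to the default
# (no pull). Real feature names come from the actual top.fasm.
def strip_pulldowns(fasm_text: str) -> str:
    kept = [
        line for line in fasm_text.splitlines()
        if ".PULLTYPE.PULLDOWN" not in line
    ]
    return "\n".join(kept) + "\n"

example = (
    "RIOB33_X43Y99.IOB_Y0.PULLTYPE.PULLDOWN\n"  # hypothetical feature
    "RIOB33_X43Y99.IOB_Y0.SLEW.FAST\n"
)
print(strip_pulldowns(example))
```

The filtered FASM would then be rebuilt into a bitstream with fasm2frames and frames2bit, as described above.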

@kgugala
Contributor

kgugala commented Feb 15, 2020

And here are the bitstream and fasm files from @acomodi

ddr-bitstream.zip

@acomodi
Contributor Author

acomodi commented Feb 15, 2020

Correction

By manually updating the wrong PULLTYPES from PULLDOWN to NONE in the top.fasm in the zip archive, and by regenerating the bitstream with fasm2frames and frames2bit, the DDR test now works on HW as expected!

This was actually a wrong statement. I have double checked, and most probably the bitstream I used to perform the test was a Vivado-generated one. By repeating the manual procedure and replacing the PULLTYPEs, the bitstream unfortunately does not work yet.

@acomodi
Contributor Author

acomodi commented Feb 15, 2020

Now that timing is met, I have performed a series of tests to understand what the issue may be.

To do so, I first made sure that fasm2bels works correctly and produces working bitstreams, following this flow:

  • Yosys+Vivado --> fasm2bels --> Vivado (on the fasm2bels generated netlist, without any constraints)

This test verifies that fasm2bels produces valid netlists that Vivado is able to implement into working bitstreams.

Then I proceeded with going through the same flow, starting from VPR generated FASMs:

  • SymbiFlow+VPR -> fasm2bels -> Vivado (by using only the generated netlist without any sort of constraints, to let the tool P&R everything as it should)

I have run a series of symbiflow tests with this approach and, for now, the only one not working on HW is still the DDR one, even though timing was met.

This suggests that there may be something wrong with the architecture model we currently have.

To speed up this kind of testing I have also created an automated script: f2b_vivado

@mithro
Contributor

mithro commented Feb 15, 2020

I created a diagram at https://docs.google.com/drawings/d/1NJlN-cPLNx4nULHiL4938RD-H14izpayqtNSQ4XRjfA/edit which I believe should match what you are attempting @acomodi ?


@acomodi
Contributor Author

acomodi commented Feb 17, 2020

@mithro Yes, this diagram describes the tests I have performed.
To be more precise though, there should be another kind of test regarding the VPR flow.
In fact, fasm2bels generates a .tcl file to constrain all the BELs and the routing, as well as constraints in the verilog design. The test I have performed does not make use of any of these constraints, and everything is left to Vivado. The reason behind this is to verify that at least the design produced by fasm2bels is working.

There are some problems when adopting this flow, though, which I have worked around in f2b_vivado:

  • Non-existent IS_C_INVERTED parameter when using LDCE flip flops: this is a result of the Vivado -> fasm2bels -> Vivado flow. It is due to fasm2bels adding a non-existent parameter to this kind of FF. This should be solved by: fasm2bels: fix LDCE is_c_inverted #1313
  • BUFHCE wrong placement: fasm2bels also outputs BUFHCE instances, which get misplaced by Vivado. This is because BUFHCE resources are usually inferred by the tools, both VPR and Vivado, and explicitly instantiating them may result in an impossible placement solution. To avoid this there are some possibilities:
  1. add parameters to fasm2bels to avoid instantiating BUFHCEs;
  2. remove BUFHCEs afterwards from the generated verilog.
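Option 2 could look roughly like this (a sketch; it assumes single-line BUFHCE instantiations with `.I` listed before `.O`, which real fasm2bels output may not match):

```python
# Strip explicit BUFHCE instantiations from generated Verilog, replacing
# each buffer with a plain assignment from its input net to its output net.
import re

def remove_bufhce(verilog: str) -> str:
    # Matches e.g.: BUFHCE buf0 (.I(net_a), .O(net_b), .CE(1'b1));
    pattern = re.compile(
        r"^\s*BUFHCE\b.*?\.I\((?P<i>[^)]+)\).*?\.O\((?P<o>[^)]+)\).*?;\s*$",
        re.MULTILINE,
    )
    def repl(m):
        return f"  assign {m.group('o').strip()} = {m.group('i').strip()};"
    return pattern.sub(repl, verilog)

example = "  BUFHCE buf0 (.I(clk_in), .O(clk_out), .CE(1'b1));"
print(remove_bufhce(example))
```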

@litghost
Contributor

litghost commented Mar 2, 2020

I believe all remaining issues have been resolved, and the UART DDR is working consistently on hardware.

litghost closed this as completed Mar 2, 2020