Instability due to small changes to memory initialization #1776

Open · acomodi opened this issue Nov 16, 2020 · 8 comments

acomodi (Contributor) commented Nov 16, 2020

While testing the RR graph base costs fixes I came across a possible instability issue that causes run-time and QoR to change drastically from one build to another.

With the newly added litex design autogeneration, the litex-generated design files are identical from one build to the next, except for the mem.init files. The difference in the mem.init files comes from the LiteX BIOS, which embeds the timestamp of the design generation. This is expected to happen.

The small changes in the mem.init files, though, cause a large difference between the final outputs of the two runs.

In fact, the difference in the generated eblif files is huge, even though they should differ only in specific BRAM init value fields. This leads VPR to produce different outputs at every step, with effectively non-deterministic behaviour.
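
As a first check that the eblifs really differ beyond the init values, something like the following can mask the init parameters before diffing (a minimal sketch: the ".param INIT" matching and the file paths are assumptions about the eblif layout):

import difflib

def masked_lines(path):
    # Replace BRAM init parameter lines with a fixed placeholder so that
    # only non-init differences survive the diff.
    with open(path) as f:
        return [
            ".param INIT <masked>\n" if line.lstrip().startswith(".param INIT") else line
            for line in f
        ]

diff = difflib.unified_diff(
    masked_lines("control/top.eblif"),  # hypothetical paths
    masked_lines("test/top.eblif"),
    fromfile="control", tofile="test",
)
print("".join(diff) or "eblifs identical once init values are masked")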

An example is the minilitex test:

Difference in litex-generated files:

diff -h arty_soc/gateware/ bup/arty_soc/gateware/
diff -h arty_soc/gateware/mem_2.init bup/arty_soc/gateware/mem_2.init
37,38d36
< 37
< 3a
40c38,40
< 38
---
> 3a
> 31
> 33
diff -h arty_soc/gateware/mem.init bup/arty_soc/gateware/mem.init
4436,4437c4436,4437
< 37303a36
< a32333a
---
> 32303a36
> a37313a
5378c5378
< 7057a674
---
> 9ddab42b
diff -h arty_soc/gateware/top.v bup/arty_soc/gateware/top.v
2c2
< // Auto-generated by Migen (a5cc037) & LiteX (004924a3) on 2020-11-16 16:07:32
---
> // Auto-generated by Migen (a5cc037) & LiteX (004924a3) on 2020-11-16 16:02:17
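
Incidentally, the changed words in mem.init look like the ASCII bytes of the generation timestamp, stored little-endian; a quick sanity check (the byte-order interpretation here is an assumption):

for word in ("37303a36", "32303a36"):
    # Interpret the 32-bit init word as four little-endian ASCII bytes.
    print(word, "->", bytes.fromhex(word)[::-1].decode("ascii"))
# 37303a36 -> 6:07  (the 16:07:32 build)
# 32303a36 -> 6:02  (the 16:02:17 build)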

Difference in resulting routing run-time and QoR:

  • Control run:
    • routing runtime: 62.37 seconds
    • CPD: 18.962 ns
  • Test run:
    • routing runtime: 94.08 seconds
    • CPD: 17.587 ns

NOTE: this test was performed using the RR graph base costs fixes, but I am running experiments with the master+wip version of the SymbiFlow baseline, and I expect to see similar behavior.

mithro (Contributor) commented Nov 16, 2020

@acomodi There are a couple of things going wrong here. I would not expect a huge change in routing time from memory init changes?

mithro (Contributor) commented Nov 16, 2020

@acomodi This kind of sounds like the BRAM is being converted into flops?

acomodi (Contributor, Author) commented Nov 16, 2020

@mithro Hmm, I doubt this is the issue. I am pretty sure that, if a BRAM were inferred as logic, a much higher number of SLICEMs would be used (the memory would be inferred as distributed RAM instead), which is not the case here:

Pack log:

Physical Tile BLK-TL-CLBLM_L:
Block Utilization: 0.62 Logical Block: BLK-TL-SLICEL
Block Utilization: 0.01 Logical Block: BLK-TL-SLICEM
Physical Tile BLK-TL-CLBLM_R:
Block Utilization: 0.37 Logical Block: BLK-TL-SLICEL
Block Utilization: 0.00 Logical Block: BLK-TL-SLICEM

I have compared the two different resource utilizations from the pack.log and they are exactly the same in both runs.

This means that the circuit is the same and it gets implemented on the same number/types of resources.

I think this might be an issue with the initial placement, and with placement in general, specific to the BRAMs.
Basically, even though the same packed clusters are produced (at least that is the assumption), their ordering differs, causing changes in the initial placement.

Given that we currently support only BRAM_Ls, some BRAMs may end up placed far from the core logic of the design, which would explain the reported differences in CPD and run-time. This isn't actually specific to BRAMs; it applies to all tiles. But given the huge supply of CLBs, the placer should be able to optimize their placement regardless, while the scarcity of BRAM sites can lead to bad placements (with consequently bad routing results).
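
As a toy illustration of the ordering effect (not actual VPR code): even with a fixed RNG seed, a placement loop that walks the clusters in netlist order produces a different result when the same clusters arrive in a different order.

import random

def initial_place(clusters, free_sites, seed=1):
    # Assign each cluster a random free site; the iteration order of
    # `clusters` decides which cluster receives which random draw.
    rng = random.Random(seed)
    sites = sorted(free_sites)
    placement = {}
    for c in clusters:
        placement[c] = sites.pop(rng.randrange(len(sites)))
    return placement

sites = [(x, y) for x in range(4) for y in range(4)]
a = initial_place(["bram0", "bram1", "slice0"], sites)
b = initial_place(["bram1", "bram0", "slice0"], sites)  # same clusters, reordered
print(a != b)  # True: same seed and same clusters, different placement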

mithro (Contributor) commented Nov 16, 2020

@acomodi - So this has nothing to do with the BRAM contents then and should happen with different seeds?

acomodi (Contributor, Author) commented Nov 16, 2020

@mithro At the moment these are only theories. I am performing additional tests and also trying different seeds to see if this might be the real issue here. I'll post additional data soon.

acomodi (Contributor, Author) commented Nov 16, 2020

Update

It seems the issue is indeed the initial placement. Using the same packer output and changing the placement seed to 1000, I got the following routing iterations:

default seed:

## Initializing router criticalities took 0.03 seconds (max_rss 3353.3 MiB, delta_rss +0.0 MiB)
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter   Time    pres  BBs    Heap  Re-Rtd  Re-Rtd Overused RR Nodes      Wirelength      CPD       sTNS       sWNS       hTNS       hWNS Est Succ
      (sec)     fac Updt    push    Nets   Conns                                       (ns)       (ns)       (ns)       (ns)       (ns)     Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 108: 6 timing startpoints were not constrained during timing analysis
Warning 109: 1521 timing endpoints were not constrained during timing analysis
   1   17.0     0.0    0 2.2e+08    7670   25926   12853 ( 0.439%)  332480 ( 5.2%)   18.939     -70.37     -2.273      0.000      0.000      N/A
   2    5.0     2.8    2 4.9e+07    5694   18419    6520 ( 0.223%)  363815 ( 5.7%)   18.424     -52.74     -1.758      0.000      0.000      N/A
   3    4.1     3.4    4 3.7e+07    4113   13278    4585 ( 0.157%)  382077 ( 5.9%)   18.440     -63.06     -1.774      0.000      0.000      N/A
   4    3.6     4.1    4 3.2e+07    3018   10305    2985 ( 0.102%)  396871 ( 6.2%)   18.349     -49.51     -1.683      0.000      0.000      N/A
   5    3.3     4.9    9 2.8e+07    2131    7832    1730 ( 0.059%)  410556 ( 6.4%)   18.377     -53.28     -1.711      0.000      0.000      N/A
   6    3.1     5.9    4 2.4e+07    1396    5555     947 ( 0.032%)  421353 ( 6.6%)   18.409     -52.52     -1.743      0.000      0.000      N/A
   7    1.9     7.0    6 1.5e+07     819    3461     437 ( 0.015%)  428970 ( 6.7%)   18.409     -54.91     -1.743      0.000      0.000      N/A
   8    1.6     8.4    5 1.1e+07     437    1848     190 ( 0.006%)  433440 ( 6.7%)   18.406     -58.14     -1.740      0.000      0.000      N/A
   9    0.5    10.1    8 3655223     202     735      65 ( 0.002%)  435479 ( 6.8%)   18.389     -57.54     -1.723      0.000      0.000      N/A
  10    0.8    12.2    3 4537900      76     281      19 ( 0.001%)  436226 ( 6.8%)   18.389     -58.77     -1.723      0.000      0.000       14
  11    0.1    14.6    0  577913      23      78       4 ( 0.000%)  436748 ( 6.8%)   18.389     -59.10     -1.723      0.000      0.000       13
  12    0.0    17.5    0  154050       5      11       1 ( 0.000%)  436723 ( 6.8%)   18.389     -59.10     -1.723      0.000      0.000       13
  13    0.0    21.0    0   19600       1       5       0 ( 0.000%)  436780 ( 6.8%)   18.389     -59.85     -1.723      0.000      0.000       12

custom seed (1000):

## Initializing router criticalities took 0.03 seconds (max_rss 3353.0 MiB, delta_rss +0.0 MiB)
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter   Time    pres  BBs    Heap  Re-Rtd  Re-Rtd Overused RR Nodes      Wirelength      CPD       sTNS       sWNS       hTNS       hWNS Est Succ
      (sec)     fac Updt    push    Nets   Conns                                       (ns)       (ns)       (ns)       (ns)       (ns)     Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 108: 6 timing startpoints were not constrained during timing analysis
Warning 109: 1521 timing endpoints were not constrained during timing analysis
   1   19.2     0.0    0 2.3e+08    7670   25926   13086 ( 0.447%)  340954 ( 5.3%)   19.869     -132.1     -3.203      0.000      0.000      N/A
   2    5.3     2.8    4 4.5e+07    5679   18278    6717 ( 0.229%)  371999 ( 5.8%)   19.854     -130.3     -3.188      0.000      0.000      N/A
   3    4.3     3.4    7 3.4e+07    4149   13441    4758 ( 0.162%)  391802 ( 6.1%)   19.774     -138.9     -3.108      0.000      0.000      N/A
   4    4.1     4.1    4 3.2e+07    3005   10744    3210 ( 0.110%)  407592 ( 6.3%)   19.761     -137.1     -3.095      0.000      0.000      N/A
   5    3.8     4.9    2 2.8e+07    2148    8274    1896 ( 0.065%)  423111 ( 6.6%)   19.839     -143.1     -3.173      0.000      0.000      N/A
   6    2.7     5.9    7 2.0e+07    1420    5994    1071 ( 0.037%)  432306 ( 6.7%)   19.858     -154.0     -3.192      0.000      0.000      N/A
   7    1.8     7.0   11 1.3e+07     890    3846     509 ( 0.017%)  440328 ( 6.9%)   19.879     -157.9     -3.213      0.000      0.000      N/A
   8    1.2     8.4    8 8812194     450    1885     224 ( 0.008%)  445947 ( 6.9%)   19.923     -163.2     -3.257      0.000      0.000      N/A
   9    1.0    10.1    4 6420327     212     870      78 ( 0.003%)  448470 ( 7.0%)   19.911     -162.1     -3.245      0.000      0.000      N/A
  10    0.3    12.2    4 2119048      82     279      31 ( 0.001%)  449305 ( 7.0%)   19.897     -161.4     -3.231      0.000      0.000       15
  11    0.2    14.6    2 1247043      40     131      12 ( 0.000%)  449878 ( 7.0%)   19.911     -163.3     -3.245      0.000      0.000       14
  12    0.1    17.5    1  798032      17      39       6 ( 0.000%)  450097 ( 7.0%)   19.911     -163.3     -3.245      0.000      0.000       14
  13    0.2    21.0    1  962631      11      23       4 ( 0.000%)  450117 ( 7.0%)   19.911     -163.3     -3.245      0.000      0.000       14
  14    0.2    25.2    0  744835       5       5       2 ( 0.000%)  450255 ( 7.0%)   19.911     -163.3     -3.245      0.000      0.000       14
  15    0.0    30.3    0  104536       2       4       0 ( 0.000%)  450311 ( 7.0%)   19.911     -163.3     -3.245      0.000      0.000       15

This is using the current symbiflow-arch-defs master (1d92154) and its conda VTR package.
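
For reference, a seed comparison like the one above boils down to rerunning place-and-route on the same packer output with a different --seed (a sketch: the architecture/eblif/net file names are placeholders, and the exact option set depends on how the flow wraps VPR):

import subprocess

for seed in (1, 1000):  # 1 is VPR's default placement seed
    subprocess.run(
        [
            "vpr", "arch.xml", "top.eblif",
            "--net_file", "top.net",   # reuse the same packing result
            "--place", "--route",
            "--seed", str(seed),
        ],
        check=True,
    )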

I have also double-checked the packer resource utilization between the two different runs, and it actually changes from run to run:

Control run:

Resource usage...
        Netlist
                1114    blocks of type: BLK-TL-SLICEL
        Architecture
                2150    blocks of type: BLK-TL-CLBLL_L
                1200    blocks of type: BLK-TL-CLBLL_R
                1800    blocks of type: BLK-TL-CLBLM_L
                3000    blocks of type: BLK-TL-CLBLM_R
        Netlist
                15      blocks of type: BLK-TL-SLICEM
        Architecture
                1800    blocks of type: BLK-TL-CLBLM_L
                3000    blocks of type: BLK-TL-CLBLM_R
        Netlist
                25      blocks of type: BLK-TL-BRAM_L
        Architecture
                55      blocks of type: BLK-TL-BRAM_L
        Netlist
                8       blocks of type: BLK-TL-IOPAD
        Architecture
                6       blocks of type: BLK-TL-LIOPAD_SING
                4       blocks of type: BLK-TL-RIOPAD_SING
                72      blocks of type: BLK-TL-LIOPAD_M
                48      blocks of type: BLK-TL-RIOPAD_M
                72      blocks of type: BLK-TL-LIOPAD_S
                48      blocks of type: BLK-TL-RIOPAD_S
        Netlist
                0       blocks of type: BLK-TL-IOPAD_M
        Architecture
                72      blocks of type: BLK-TL-LIOPAD_M
                48      blocks of type: BLK-TL-RIOPAD_M
        Netlist
                0       blocks of type: BLK-TL-IOPAD_S
        Architecture
                72      blocks of type: BLK-TL-LIOPAD_S
                48      blocks of type: BLK-TL-RIOPAD_S
        Netlist
                2       blocks of type: BLK-TL-BUFGCTRL
        Architecture
                16      blocks of type: BLK-TL-CLK_BUFG_BOT_R
                16      blocks of type: BLK-TL-CLK_BUFG_TOP_R
        Netlist
                1       blocks of type: BLK-TL-PLLE2_ADV
        Architecture
                2       blocks of type: BLK-TL-CMT_TOP_L_UPPER_T
                3       blocks of type: BLK-TL-CMT_TOP_R_UPPER_T
        Netlist
                0       blocks of type: BLK-TL-HCLK_IOI3
        Architecture
                5       blocks of type: BLK-TL-HCLK_IOI3
        Netlist
                1       blocks of type: SYN-VCC
        Architecture
                1       blocks of type: SYN-VCC
        Netlist
                1       blocks of type: SYN-GND
        Architecture
                1       blocks of type: SYN-GND

Test run:

Resource usage...
        Netlist
                1099    blocks of type: BLK-TL-SLICEL
        Architecture
                2150    blocks of type: BLK-TL-CLBLL_L
                1200    blocks of type: BLK-TL-CLBLL_R
                1800    blocks of type: BLK-TL-CLBLM_L
                3000    blocks of type: BLK-TL-CLBLM_R
        Netlist
                15      blocks of type: BLK-TL-SLICEM
        Architecture
                1800    blocks of type: BLK-TL-CLBLM_L
                3000    blocks of type: BLK-TL-CLBLM_R
        Netlist
                25      blocks of type: BLK-TL-BRAM_L
        Architecture
                55      blocks of type: BLK-TL-BRAM_L
        Netlist
                8       blocks of type: BLK-TL-IOPAD
        Architecture
                6       blocks of type: BLK-TL-LIOPAD_SING
                4       blocks of type: BLK-TL-RIOPAD_SING
                72      blocks of type: BLK-TL-LIOPAD_M
                48      blocks of type: BLK-TL-RIOPAD_M
                72      blocks of type: BLK-TL-LIOPAD_S
                48      blocks of type: BLK-TL-RIOPAD_S
        Netlist
                0       blocks of type: BLK-TL-IOPAD_M
        Architecture
                72      blocks of type: BLK-TL-LIOPAD_M
                48      blocks of type: BLK-TL-RIOPAD_M
        Netlist
                0       blocks of type: BLK-TL-IOPAD_S
        Architecture
                72      blocks of type: BLK-TL-LIOPAD_S
                48      blocks of type: BLK-TL-RIOPAD_S
        Netlist
                2       blocks of type: BLK-TL-BUFGCTRL
        Architecture
                16      blocks of type: BLK-TL-CLK_BUFG_BOT_R
                16      blocks of type: BLK-TL-CLK_BUFG_TOP_R
        Netlist
                1       blocks of type: BLK-TL-PLLE2_ADV
        Architecture
                2       blocks of type: BLK-TL-CMT_TOP_L_UPPER_T
                3       blocks of type: BLK-TL-CMT_TOP_R_UPPER_T
        Netlist
                0       blocks of type: BLK-TL-HCLK_IOI3
        Architecture
                5       blocks of type: BLK-TL-HCLK_IOI3
        Netlist
                1       blocks of type: SYN-VCC
        Architecture
                1       blocks of type: SYN-VCC
        Netlist
                1       blocks of type: SYN-GND
        Architecture
                1       blocks of type: SYN-GND

There is a variation in the SLICEL count (1114 vs. 1099).
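
For comparing runs, the netlist block counts can be pulled out of the two pack logs mechanically (a sketch: the log paths are hypothetical, and the parsing assumes the "blocks of type:" layout shown above):

import re

def netlist_counts(path):
    # Collect "<count> blocks of type: <name>" entries from Netlist sections.
    counts, in_netlist = {}, False
    for line in open(path):
        if line.strip() == "Netlist":
            in_netlist = True
        elif line.strip() == "Architecture":
            in_netlist = False
        elif in_netlist:
            m = re.match(r"\s*(\d+)\s+blocks of type: (\S+)", line)
            if m:
                counts[m.group(2)] = int(m.group(1))
    return counts

control = netlist_counts("control/pack.log")
test = netlist_counts("test/pack.log")
for block in sorted(control.keys() | test.keys()):
    if control.get(block) != test.get(block):
        print(block, control.get(block), "->", test.get(block))
# e.g. BLK-TL-SLICEL 1114 -> 1099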

litghost (Contributor) commented:
The variation in the SLICEL count might also be a packing issue, rather than a placer issue. Still worth investigating.

acomodi (Contributor, Author) commented Nov 16, 2020

@litghost Indeed. What bugs me, though, is the huge difference in the eblifs:

eblifs.tar.gz

A memory initialization change should not alter the synthesized output this much.
I'll need to reduce the test case to better understand what is happening (see the sketch at the end of this comment).

In any case, for now this issue affects all the litex test results from CI, so two different CI runs cannot be compared at the moment.
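
A possible starting point for the reduced test: flip a single word in mem.init, rerun synthesis only, and measure how far the eblif diverges from a control copy (everything here — the paths and the build command — is a placeholder for what the flow actually uses):

import difflib
import subprocess

# Flip one init word (the one seen in the diff above) and rebuild.
lines = open("mem.init").read().splitlines()
lines[4435] = "32303a36"                           # line 4436, 1-indexed
open("mem.init", "w").write("\n".join(lines) + "\n")
subprocess.run(["make", "top.eblif"], check=True)  # placeholder build step

changed = list(difflib.unified_diff(
    open("top.control.eblif").readlines(),
    open("top.eblif").readlines(),
))
print(len(changed), "diff lines in the eblif after a one-word init change")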
