Quicklogic: Router run time very high #1486

Open
tpagarani opened this issue May 14, 2020 · 8 comments

Comments

@tpagarani
Contributor

The router run time for the QuickLogic S3 device seems to be significantly higher than on an iCE40 device of roughly similar size. Please find attached the VPR logs for 3 different designs.

S3 log files
vpr_stdout.log
vpr_stdout.log
vpr_stdout.log

iCE40 was run from the master branch.

Ice40 log files
vpr_stdout.log
vpr_stdout.log
vpr_stdout.log

Looking at the logs, it appears that for the S3 device the router starts with Net Criticality = 1 for all connections, which results in a very high TNS.

@tpagarani
Contributor Author

By setting --max_criticality to 0, all these designs route very fast. Attached is the log for one of the designs with that setting:
vpr_stdout.log
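
For context on why capping criticality helps so dramatically: with --max_criticality 0 every connection's criticality is clamped to zero, so the router cost becomes essentially congestion/wirelength driven and the (apparently broken) delay estimates stop mattering. Below is a simplified sketch of that clamping behaviour, not the exact VPR implementation; the inputs are placeholders.

```python
# Simplified approximation of how a VPR-style router maps slack to a
# per-connection criticality; NOT the exact VPR code.
def connection_criticality(slack, critical_path_delay,
                           max_criticality=0.99, criticality_exp=1.0):
    if critical_path_delay <= 0:
        return max_criticality
    crit = max(0.0, 1.0 - slack / critical_path_delay)    # 1 at zero slack
    return min(crit ** criticality_exp, max_criticality)  # clamp to the cap

# Default cap (0.99): zero-slack connections are treated as fully critical.
print(connection_criticality(slack=0.0, critical_path_delay=5e-9))   # 0.99
# With --max_criticality 0 every connection is clamped to 0, so routing
# is driven purely by the congestion/wirelength term.
print(connection_criticality(slack=0.0, critical_path_delay=5e-9,
                             max_criticality=0.0))                    # 0.0
```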

@litghost
Contributor

This points to a bad timing model. Has the current timing model been validated?

@mkurc-ant
Collaborator

mkurc-ant commented May 29, 2020

One thing that may also affect the runtime is the number of RR graph nodes. The graph for EOS S3 has roughly 6-7 times more nodes than the graph for the iCE40 device.

This is mostly because, when building the graph for EOS S3, certain connection rules are enforced that do not necessarily have to be obeyed. For example, when connecting two parallel CHANX nodes, an intermediate CHANY node is inserted, which costs one extra node and one extra edge in the graph. The function responsible for this can be found here in the code: https://github.com/antmicro/symbiflow-arch-defs/blob/fba277748427f7338128e586db61d14607d0da35/quicklogic/utils/routing_import.py#L144

I'd first suggest replacing the content of that function with a single call to add_edge (found here: https://github.com/antmicro/symbiflow-arch-defs/blob/fba277748427f7338128e586db61d14607d0da35/quicklogic/utils/routing_import.py#L118) and then comparing the RR node and edge counts plus runtime.
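
For illustration only, a minimal sketch of that simplification, assuming an add_edge(src, dst, switch) style helper; the actual function names and signatures in routing_import.py may differ, so treat everything below as a placeholder.

```python
# Placeholder sketch only -- the real routing_import.py helpers almost
# certainly have different names/signatures.
def connect_direct(graph, src_node_id, dst_node_id, switch_id):
    """Connect two parallel CHANX nodes with a single edge instead of going
    through an intermediate CHANY node, saving one node and one edge per
    connection."""
    graph.add_edge(src_node_id, dst_node_id, switch_id)
```

Comparing the resulting RR node/edge counts and router runtime against the current graph would show how much of the slowdown is simply graph size.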


@tpagarani
Contributor Author

@mkurc-ant, I remember you mentioning that within the switchbox each N-input mux is represented as N+2 nodes and N+1 edges to model variable load timing. Could this also be contributing to the extra node count?

@mkurc-ant
Collaborator

@tpagarani That's true. You should only need N nodes and N edges per mux. Actually, since we assume all driver resistances to be 1 ohm, we could "integrate" these resistances into the switches that model sinks. That would require refactoring the VPR database generation script (switch types would have to be modified) and the routing import script (different graph topology).

@tpagarani
Contributor Author

@mkurc-ant, I am thinking that if we don't model the input-load-dependent delay for each mux and instead use a single (worst-case) delay through the mux, then we can represent the whole switchbox with far fewer nodes and edges. For example, STAGE1 has 3 switches, each consisting of eight 6-input muxes. Since all 8 muxes share the same inputs coming from STAGE0, we can directly connect the output nodes of STAGE0 to the output nodes of STAGE1. Is that possible?
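
Just to put rough numbers on that idea, here is a back-of-the-envelope count using only the figures quoted in this thread (3 switches, eight 6-input muxes each, N+2 nodes / N+1 edges per mux today); the exact split of shared vs. per-mux nodes in the real graph may differ.

```python
# Back-of-the-envelope only; real per-mux node sharing may differ.
switches, muxes_per_switch, n_inputs = 3, 8, 6

# Current expansion: roughly N+2 nodes and N+1 edges per N-input mux.
nodes_current = switches * muxes_per_switch * (n_inputs + 2)
edges_current = switches * muxes_per_switch * (n_inputs + 1)

# Single worst-case delay, STAGE0 outputs wired straight to STAGE1 outputs:
# no extra per-mux nodes, one edge per mux input.
nodes_direct = 0
edges_direct = switches * muxes_per_switch * n_inputs

print(nodes_current, edges_current)   # 192 168
print(nodes_direct, edges_direct)     # 0 144
```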

@mkurc-ant
Collaborator

@tpagarani So if I understand you correctly, you want to do it as in the "B" part of the picture below:

[attached image: Untitled Diagram]

If you don't need the delay to vary with the number of active inputs, then that is doable. Otherwise, you should double-check whether you can build the timing model with that topology using the available VPR constructs (switches).

@vaughnbetz

vaughnbetz commented Jun 1, 2020

Representation B is more efficient and should still be able to model load-dependent delays by using the Cinternal keyword when creating the switches used in the relevant parts of the rr-graph. See https://docs.verilogtorouting.org/en/latest/arch/reference/#switches

When you have a Cinternal specified on the switches used to go from stage N to stage N+1 in figure B (for example):

  • Turning on a switch from stage N to N+1 will expose 1 Cinternal load to the driving node (in stage N).
  • That will increase the delay of that node by [Rswitch(stageN) + 0.5 * Rwire(stageN)] * Cinternal(stageN+1).
  • This allows fanout-dependent load modeling by picking your Rswitch, Rwire and Cinternal values (you likely have more degrees of freedom than you need, but can pick any values that lead to the right delays). A small worked example is sketched below.
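
A tiny worked example of that incremental delay, with made-up R and C values (not taken from any real EOS S3 timing data):

```python
# Illustrative values only, not real EOS S3 timing data.
R_switch_stageN = 100.0    # ohm, switch driving the stage-N node
R_wire_stageN   = 50.0     # ohm, stage-N wire resistance
C_internal_next = 2e-15    # F, Cinternal on the stage-N -> stage-N+1 switch

# Each enabled fanout switch exposes one Cinternal load to the driving node,
# adding roughly this much Elmore delay:
delta_t = (R_switch_stageN + 0.5 * R_wire_stageN) * C_internal_next
print(f"extra delay per enabled fanout: {delta_t:.2e} s")   # 2.50e-13 s
```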
