Quicklogic: Router run time very high #1486

Open
tpagarani opened this issue May 14, 2020 · 8 comments

Comments

@tpagarani
Contributor

The router run time for the QuickLogic S3 device seems to be significantly higher than on an iCE40 device of roughly similar size. Please find attached the VPR logs for 3 different designs.

S3 log files
vpr_stdout.log
vpr_stdout.log
vpr_stdout.log

iCE40 was run from the master branch.

Ice40 log files
vpr_stdout.log
vpr_stdout.log
vpr_stdout.log

Looking at the logs, it appears that for the S3 device the router starts with Net Criticality = 1 for all connections, which results in a very high TNS.

@tpagarani
Contributor Author

By setting --max_criticality to 0, all these designs route very fast. Attached is the log for one of the designs with that setting:
vpr_stdout.log
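
For context on why capping criticality helps so dramatically: with --max_criticality 0 every connection's criticality is clamped to zero, so the router cost becomes essentially congestion/wirelength driven and the (apparently broken) delay estimates stop mattering. Below is a simplified sketch of that clamping behaviour, not the exact VPR implementation; the inputs are placeholders.

```python
# Simplified approximation of how a VPR-style router maps slack to a
# per-connection criticality; NOT the exact VPR code.
def connection_criticality(slack, critical_path_delay,
                           max_criticality=0.99, criticality_exp=1.0):
    if critical_path_delay <= 0:
        return max_criticality
    crit = max(0.0, 1.0 - slack / critical_path_delay)    # 1 at zero slack
    return min(crit ** criticality_exp, max_criticality)  # clamp to the cap

# Default cap (0.99): zero-slack connections are treated as fully critical.
print(connection_criticality(slack=0.0, critical_path_delay=5e-9))   # 0.99
# With --max_criticality 0 every connection is clamped to 0, so routing
# is driven purely by the congestion/wirelength term.
print(connection_criticality(slack=0.0, critical_path_delay=5e-9,
                             max_criticality=0.0))                    # 0.0
```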

@litghost
Contributor

This points to a bad timing model. Has the current timing model been validated?

@mkurc-ant
Collaborator

mkurc-ant commented May 29, 2020

One thing that may also affect the runtime is the number of RR graph nodes. The graph for EOS S3 has roughly 6-7 times more nodes than the graph for the iCE40 device.

This is mostly because, when building the graph for EOS S3, certain connection rules are enforced that do not necessarily have to be obeyed. For example, when connecting two parallel CHANX nodes, an intermediate CHANY node is inserted, which costs one extra node and one extra edge in the graph. The function responsible for this can be found here in the code: https://github.com/antmicro/symbiflow-arch-defs/blob/fba277748427f7338128e586db61d14607d0da35/quicklogic/utils/routing_import.py#L144

I'd first suggest replacing the content of that function with a single call to add_edge (found here: https://github.com/antmicro/symbiflow-arch-defs/blob/fba277748427f7338128e586db61d14607d0da35/quicklogic/utils/routing_import.py#L118) and then comparing the RR node and edge counts plus runtime.
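
For illustration only, a minimal sketch of that simplification, assuming an add_edge(src, dst, switch) style helper; the actual function names and signatures in routing_import.py may differ, so treat everything below as a placeholder.

```python
# Placeholder sketch only -- the real routing_import.py helpers almost
# certainly have different names/signatures.
def connect_direct(graph, src_node_id, dst_node_id, switch_id):
    """Connect two parallel CHANX nodes with a single edge instead of going
    through an intermediate CHANY node, saving one node and one edge per
    connection."""
    graph.add_edge(src_node_id, dst_node_id, switch_id)
```

Comparing the resulting RR node/edge counts and router runtime against the current graph would show how much of the slowdown is simply graph size.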


@tpagarani
Contributor Author

@mkurc-ant, I remember you mentioning that within the switchbox each N-input mux is represented as N+2 nodes and N+1 edges to model variable load timing. Could this also be contributing to the extra node count?

@mkurc-ant
Collaborator

@tpagarani That's true. You should only need N nodes and N edges per mux. Actually, since we assume all driver resistances to be 1 ohm, we could "integrate" these resistances into the switches that model sinks. That would require refactoring the VPR database generation script (switch types would have to be modified) and the routing import script (different graph topology).

@tpagarani
Contributor Author

@mkurc-ant, I am thinking that if we don't model the input-load-dependent delay for each mux and instead use a single (worst-case) delay through the mux, then we can represent the whole switchbox with far fewer nodes and edges. For example, STAGE1 has 3 switches, each consisting of eight 6-input muxes. Since all 8 muxes share the same inputs coming from STAGE0, we can directly connect the output nodes of STAGE0 to the output nodes of STAGE1. Is that possible?
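
Just to put rough numbers on that idea, here is a back-of-the-envelope count using only the figures quoted in this thread (3 switches, eight 6-input muxes each, N+2 nodes / N+1 edges per mux today); the exact split of shared vs. per-mux nodes in the real graph may differ.

```python
# Back-of-the-envelope only; real per-mux node sharing may differ.
switches, muxes_per_switch, n_inputs = 3, 8, 6

# Current expansion: roughly N+2 nodes and N+1 edges per N-input mux.
nodes_current = switches * muxes_per_switch * (n_inputs + 2)
edges_current = switches * muxes_per_switch * (n_inputs + 1)

# Single worst-case delay, STAGE0 outputs wired straight to STAGE1 outputs:
# no extra per-mux nodes, one edge per mux input.
nodes_direct = 0
edges_direct = switches * muxes_per_switch * n_inputs

print(nodes_current, edges_current)   # 192 168
print(nodes_direct, edges_direct)     # 0 144
```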

@mkurc-ant
Collaborator

@tpagarani So if I understand you correctly, you want to do it as in the "B" part of the picture below:

[attached image: Untitled Diagram]

If you don't need the delay to vary with the number of active inputs, then that is doable. Otherwise, you should double-check whether you can build the timing model with that topology using the available VPR constructs (switches).

@vaughnbetz

vaughnbetz commented Jun 1, 2020

Representation B is more efficient and should still be able to model load-dependent delays by using the Cinternal keyword when creating the switches used in the relevant parts of the rr-graph. See https://docs.verilogtorouting.org/en/latest/arch/reference/#switches

When you have a Cinternal specified on the switches used to go from stage N to stage N+1 in figure B (for example):

  • Turning on a switch from stage N to N+1 will expose 1 Cinternal load to the driving node (in stage N).
  • That will increase the delay of that node by [Rswitch(stageN) + 0.5 * Rwire(stageN)] * Cinternal(stageN+1).
  • This allows fanout-dependent load modeling by picking your Rswitch, Rwire and Cinternal values (you likely have more degrees of freedom than you need, but can pick any values that lead to the right delays). A small worked example is sketched below.
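
A tiny worked example of that incremental delay, with made-up R and C values (not taken from any real EOS S3 timing data):

```python
# Illustrative values only, not real EOS S3 timing data.
R_switch_stageN = 100.0    # ohm, switch driving the stage-N node
R_wire_stageN   = 50.0     # ohm, stage-N wire resistance
C_internal_next = 2e-15    # F, Cinternal on the stage-N -> stage-N+1 switch

# Each enabled fanout switch exposes one Cinternal load to the driving node,
# adding roughly this much Elmore delay:
delta_t = (R_switch_stageN + 0.5 * R_wire_stageN) * C_internal_next
print(f"extra delay per enabled fanout: {delta_t:.2e} s")   # 2.50e-13 s
```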
