
Bottleneck in 072 and 074 fuzzers #1214

acomodi opened this issue Jan 28, 2020 · 9 comments
acomodi commented Jan 28, 2020

The data dumping process for fuzzers 072 and 074 takes up a large share of the run-time, especially for big parts (e.g. the Artix-7 200T).

For fuzzer 074, the run-time to dump the data is split between the tiles job and the nodes job:

Vivado start time: 10:48:25
Tiles Job start time: 10:48:34
Tiles Job end time / Nodes Job start time: 11:32:58
Nodes Job end time: 11:36:14
Vivado end time / reduction start time: 11:36:14

The above numbers are for the zynq7010 part.

This is an issue, as it prevents scaling to bigger parts.
We need a more efficient way to dump all the data required by the reduction step.

@litghost

Looks like the nodes job time is pretty small compared to the tiles job time, so the tiles job is where we should look for issues.

acomodi commented Jan 28, 2020

@litghost Yep, my thought is that we may be producing too much data in the tiles.json5 temporary files. We could use an ROI to shrink the data produced, since tiles with the same name are ~99% identical, apart from some missing wires.

We could extract the tiles from only 1 or 2 clock regions, keeping the run-time constant as the part changes. I need to verify whether this is doable though.

@litghost

> We could extract the tiles from only 1 or 2 clock regions, keeping the run-time constant as the part changes. I need to verify whether this is doable though.

This is a fragile solution, for a number of reasons. There are "weird" tiles around the following areas:

  • Hard blocks
  • Top and bottom of the grid
  • Near the clock regions

As a result, identifying all the "weird" stuff would be a fairly manual process.

My baseline assumption right now is that we are doing something "expensive" in the tiles loop, e.g. a linear lookup, that needs to be fixed.

I suggest bisecting the work that jobtiles.tcl does until the runtime drops.

As a concrete example: if jobtiles.tcl only outputs the wires in the tiles, does it still take as long? Etc.

acomodi commented Jan 28, 2020

@litghost Right, I'll profile the run-time in more detail to find exactly where the bottleneck is.

acomodi commented Jan 28, 2020

@litghost I think I have identified the problem and where the bottleneck is.

After disabling the pip loop that extracts all pips related to a tile, the run-time dropped from ~44 minutes to ~8 minutes for the zynq7010.

Moreover, I am inclined to think that the issue lies in the INT tiles. They are the most common tiles, and each of them contains hundreds of pips, resulting in INT.json5 files of more than 100k lines.

@litghost

> After disabling the pip loop that extracts all pips related to a tile, the run-time dropped from ~44 minutes to ~8 minutes for the zynq7010.

Try dropping anything that uses lookup_speed_model_index or its children. If that speeds things up, that code is important, but also recent; it might need a refactor.

acomodi commented Jan 28, 2020

@litghost That was the right call; the run-time is now ~13 minutes for the tiles job:

 Tiles Job start time:
 2020-01-28 17:47:59.772677
 Tiles Job end time
 2020-01-28 18:00:31.769572

@litghost

> That was the right call; the run-time is now ~13 minutes for the tiles job.

OK, so rather than writing out the full timing info, just write the speed index. Then merge all the tile jsons (i.e. merging the speed indices), and finally create a tcl script to back-annotate the speed indices with the timing data originally dumped by the tcl script.
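
A minimal sketch of that merge step in Python, assuming each per-tile dump stores a speed index per pip (the file layout and the `pips`/`speed_index` key names are illustrative, not the actual fuzzer format):

```python
# Sketch only: collapse per-pip timing data down to unique speed indices, so
# the heavy timing information is back-annotated once per index instead of
# being dumped for every pip of every tile.
import glob
import json
from collections import defaultdict

def merge_speed_indices(dump_dir):
    """Map each speed index to the set of (tile_type, pip_name) pairs using it."""
    users = defaultdict(set)
    for path in glob.glob('{}/*.json5'.format(dump_dir)):
        with open(path) as f:
            tile = json.load(f)  # real dumps are json5; plain json keeps the sketch short
        for pip in tile.get('pips', []):
            users[pip['speed_index']].add((tile['type'], pip['name']))
    return users

if __name__ == '__main__':
    merged = merge_speed_indices('tiles')
    # The back-annotation tcl script then needs only one timing query per
    # unique speed index.
    print('{} unique speed indices to back-annotate'.format(len(merged)))
```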

marzoul commented Apr 24, 2022

Hi, the last comments may be a bit old, but the issue is still real ;-)
This is especially true for 074; I have not looked at the results of 072. These observations come from experimenting with Virtex-7, on the 330T chip, the smallest of that series.

Disk usage of 074 is 82 GB, nearly all of it in the 174k very tiny json5 files. I ran an experiment: concatenate all of these and compress with lz4 at its fastest setting => the result is a single 4 GB file (to be compared with 40+ GB of file contents and 82 GB of actual disk usage). So roughly a 20x reduction in actual disk usage.
Given the very low CPU usage during most of 074 (1-3%, with peaks at 8% of one CPU), I think that at least one of the issues is disk access. Yes, I have a spinning HDD, so this is exacerbated, but at least the issue is revealed ;-)

Looking casually into the Python code, it looks like these json files are accessed in bulk, with processing interleaved, so packed + compressed storage could make sense for CPU usage too. Everything would fit cached in RAM as well :-)
Perhaps use one compressed file per type of FPGA element (slice, SDP, PIP, etc.) if that better fits how the code accesses the data; no problem either way.
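
A rough sketch of the packed + compressed storage idea, assuming the python-lz4 package (lz4.frame); the record framing and paths are made up for the example:

```python
# Sketch only: pack the ~174k tiny json5 dumps into one lz4-compressed stream,
# so the reduction step does a single sequential read instead of hundreds of
# thousands of small-file accesses.
import glob
import json
import lz4.frame

def pack(dump_dir, out_path):
    with lz4.frame.open(out_path, mode='wb') as out:
        for path in sorted(glob.glob('{}/*.json5'.format(dump_dir))):
            with open(path, 'rb') as f:
                data = f.read()
            # Length-prefixed records so the archive can be unpacked later.
            header = json.dumps({'name': path, 'size': len(data)}) + '\n'
            out.write(header.encode('utf-8'))
            out.write(data)

def unpack(in_path):
    """Yield (name, raw bytes) records back out of the packed archive."""
    with lz4.frame.open(in_path, mode='rb') as f:
        while True:
            line = f.readline()
            if not line:
                break
            meta = json.loads(line)
            yield meta['name'], f.read(meta['size'])
```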

Another scalability issue: I monitored RAM usage => up to 66.5 GB of virtual memory.
To my eyes, given the raw number of FPGA elements and configuration bits, this is excessive. Casually looking into the Python code again, I think the issue lies in how the database representation is implemented in Python.
Don't hesitate to tell me if I'm wrong, but generic maps indexed by strings are very costly in Python in terms of RAM usage (and, indirectly, speed too). Converting these computations to C++ could be appropriate, though of course only after evaluating packed + compressed disk storage.
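
To illustrate the point about string-keyed maps, a tiny sketch (the record layout is invented for the example; exact sizes vary by Python version, but the dict container is consistently several times larger than a plain tuple, before even counting the key strings repeated per record):

```python
# Sketch only: the same per-pip record stored as a dict with string keys
# versus a compact tuple.  Every dict instance carries its own hash table,
# which adds up quickly over millions of records.
import sys

as_dict = {'tile': 'INT_L_X10Y20', 'pip': 'SOME_PIP_NAME', 'speed_index': 1234}
as_tuple = ('INT_L_X10Y20', 'SOME_PIP_NAME', 1234)

print(sys.getsizeof(as_dict))   # container overhead of the dict
print(sys.getsizeof(as_tuple))  # container overhead of the tuple
```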

EDIT: Fuzzer 074 took 20 days to finish xD
There was a bit of swapping involved, hence my focus on 074.

What do you think of these observations?
