Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve rendering of large viewport areas #7962

Merged
merged 2 commits into from Dec 20, 2020

Conversation

ldpl
Copy link
Contributor

@ldpl ldpl commented Jan 25, 2020

My take on sprite sorter optimization, I estimate it to be about 10x faster than the current one. And while it's still not ideal it's decently fast and I'm tired of this shit anyway :P

It's made to perfectly replicate the current sorting algorithm (including weird reversing of moved sprites) so there shouldn't be any (new) glitches.

Also, there is definitely some room for further optimizations. At the very least map of sets can probably be replaced with some more specialized structure. For example, kd tree should be faster but its current implementation of node removal is very inefficient and needs to be optimized. It's also possible that deviating from the original algorithm can be beneficial to performance (std::sort does it instantly but I couldn't find a good ordering relation).

If someone is interested here is my testing code and over a dozen sorter implementations:
sprite-sorter.zip

@ldpl
Copy link
Contributor Author

ldpl commented Jan 26, 2020

Apparently iterating 10 times more elements with foward_list is still faster than bothering maps. So it's about 20x faster than original now.

@ldpl ldpl force-pushed the sprite-sorter-optimization branch from 7d7c0ce to 406958d Compare January 26, 2020 11:20
@nielsmh
Copy link
Contributor

nielsmh commented Jan 29, 2020

I tried this with one case that broke in the previous attempt at improving the sprite sorter, and did not get any flickering. So far it seems good.

@JGRennison
Copy link
Contributor

The ordering value seems to be inconsistently typed between int32 and uint32.
sprite_order is typed as indexed by int32 but next_order and ParentSpriteToDraw::order are typed as uint32 with values starting from UINT32_MAX.

@LordAro
Copy link
Member

LordAro commented Jan 30, 2020

@ldpl You mentioned on IRC that you'd found some other optimisations as well. Have you made any further progress with that, or would you consider it a separate change?

@ldpl
Copy link
Contributor Author

ldpl commented Jan 30, 2020

@LordAro well, I have a faster sorter, I just don't like it xD So I'm still looking for better solutions.

@JGRennison
Copy link
Contributor

JGRennison commented Feb 1, 2020

Have you done any measurements within OpenTTD itself, or just within the test harness in the zip?

I did some simple measurements by means of adding a line to the FPS window for the sprite sorter and got worse performance with this PR than the original algorithm (3x worse in the most demanding case).
The test vector in the zip seems to contain 27940 entries/sprites. This is nearly two orders of magnitude larger than the sprite sorter would typically be expected to do in one go.

@nielsmh
Copy link
Contributor

nielsmh commented Feb 1, 2020

If this does perform better on very large sprite counts, but worse on small (normal) counts, would it be worth to have both the original and this implementation, and select which to use based on number of sprites to sort?

@ldpl
Copy link
Contributor Author

ldpl commented Feb 1, 2020

While we indeed can use some other sprite sorter for smaller cases question is whether it's really an issue. As sprite sorter doesn't always need to be fast, even if it sorts sprites in 3 microseconds instead of 1 it doesn't matter much as long it's fast enough to compose the frame. In fact it's usually called with so few sprites to sort that it raises a question of measuring that performance as, for example, PerformanceTimer for such small times may as well measure random noise (especially on linux). And, btw, that test case I'm using is dumped from some average reddit game. But ofc, it's for extreme case of map unzoom on 4k where you don't need any measurements to spot a difference between 1s and 100ms lag xD

@JGRennison
Copy link
Contributor

JGRennison commented Feb 1, 2020

The number of sprites which will need to be sorted at a time is effectively bounded by the redraw region chunking in ViewportDrawChk. Not having to sort too many sprites at a time is one of the main reasons why the chunking is done that way.
As a result of that, the expected load is a larger number of sorts, of quantities of sprites of 1 - 3 figures.
Optimising for a single sort of a 5 figure number of sprites is not really representative, and makes the new algorithms look artificially better than the original. There isn't any need to add code to handle these sorts of sizes which will never be encountered.

@ldpl
Copy link
Contributor Author

ldpl commented Feb 1, 2020

@JGRennison there is very much need, see #6566, it takes several seconds just to sort sprites after zooming out on 4k.

ldpl added a commit to ldpl/OpenTTD that referenced this pull request Feb 14, 2020
ldpl added a commit to ldpl/OpenTTD that referenced this pull request Feb 14, 2020
@ldpl
Copy link
Contributor Author

ldpl commented Feb 14, 2020

Even though #7968 significantly reduced the need for a faster sprite sorter if we have one we can get rid of viewport subdivision completely and further improve fullscreen rendering time on 4k by about 50%.

And while that subdivision was initially added for a different reason it's no longer relevant:

<frosch123> hmm, i wondered what that weird comment for ViewportDrawChk meant... but git blame attributes it to me :/
<frosch123> oh, i just changed the formatting
<frosch123> it's a ludde comment
<frosch123> _dp_: found it. in the past the sprites were not stored in vectors
<frosch123> instead they were linked lists in a statically allocated bugger
<frosch123> *buffer
<frosch123> so the "overflow" refers to dos memory limits
<frosch123> wait what... that buffer was even on the stack
<frosch123> 32k buffer on the stack?
<frosch123> _dp_: april 2008
<frosch123> 4703eb2 removes all the custom-memory-allocator stuff for viewports

And for anyone who wants to mess with sprite sorter in future, I leave this as a source of inspiration, assistance, and a warning xD
sprite-sorter.zip

original     1264.0      search - for, order - quadratic
original_sse 1046.0      search - for, order - quadratic (sse)
opt1         1395.3      search - for, order - for
opt2         1721.5      search - for, order - MemMoveT
opt3         1492.8      search - for, order - for
kdtree1       190.8      search - kdtree, order - for
kdtree2       137.9      search - kdtree, order - vector+deque+map
kdtree3       143.8      search - kdtree, order - vector+deque+map
map1          159.7      x,y - map<set<pair>>; order - set+copy
map2          152.6      x,y - map<set<pair>>; order - set+copy
map3          191.8      x,y - map<set<pair>>; order - set+copy
map4          112.2      x,y - map<set<pair>>; order - vector+map
map4_clean    110.0      x,y - map<set<pair>>; order - vector+stack+map (sse, no debug)
map4_clean2   109.1      x,y - map<set<pair>>; order - vector+stack+map (sse, no debug)
map4_fancy    109.2      x,y - map<set<pair>>; order - vector+stack+map (sse, no debug, comments)
map5         1169.1      x,y - map<set<pair>>; order - vector+stack+map
map6          953.8      x,y - map<set<pair>>; order - vector+stack+map
map7          115.7      x,y - map<multimap>; order - vector+stack+map
map8          106.1      x,y - map<set<pair>>; order - vector+vector
map9          119.5      x,y,z - map<map<set<pair>>>; order - vector+vector
set1          304.8 FAIL x,y,z - set<>; order - vector+vector
list1          50.4      x+y+z - forward_list; order - vector+stack+map
list2          51.0      x+y+z - forward_list; order - vector+stack+map
list2_fancy    48.5      x+y+z - forward_list; order - vector+stack+map
list3          33.8      x+y+z - forward_list; order - vector+vector
list4          34.1 FAIL forward_list+vector
list5          18.4      list3 with moving 1 optimization
list6          20.6      list5 with coordinate normalization
list7          18.0      list5 extra optimized
list7_fancy    15.4      list7 with comments
list8_fancy    13.0      list7 with stack and comments (SSE)
list9_fancy    12.6      list8_fancy with preceding_prev optimization (SSE)
vector1        42.9      x,y,z - vector, order - vector+vector
vector2        34.6      x,y,z - vector, order - vector+vector
listmap1       45.5      list3+map8 (We're gonna combine!)
segtree1       55.4      x,y,z - segtree(min each), order - vector+stack+map
segtree2      359.4      some segtree
segtree3       53.2      some segtree
segtree4       96.1      some segtree
segtree5       54.3      some segtree
segtree6       42.2      some segtree
segtree7       27.1      some segtree
segtree8       25.8      some segtree
segtree9       26.7      x,y,z,order - segtree(min each)
segtree10      23.1      x,y,z - segtree(min each), order - vector+vector
segtree11       9.7 FAIL x,y,z - segtree(min each), order - vector+vector
fenwick1       59.2      x,y,z - fenwick<~list(n,p,u,i,y,s)>, order - vector+stack+map
fenwick2       32.8      x,y,z - fenwick<~list(n,p,u,i)>, order - vector+vector
fenwick3       33.2      x,y,z - fenwick<~forward_list(n,u,i)>, order - vector+vector
qsort           3.7 FAIL qsort by center mass

@ldpl ldpl changed the title Sprite sorter optimization Improve rendering of large viewport areas Feb 14, 2020
ldpl added a commit to ldpl/OpenTTD that referenced this pull request Feb 14, 2020
ldpl added a commit to ldpl/OpenTTD that referenced this pull request Feb 14, 2020
@TrueBrain TrueBrain added candidate: needs considering This Pull Request needs more opinions size: large This Pull Request is large in size; review will take a while labels Dec 14, 2020
@TrueBrain
Copy link
Member

TrueBrain commented Dec 18, 2020

I like patches like this, but they are always a bit tricky. Like you say it is bug-compatible with the current, but knowing that for sure when looking at the code ... will take a bit of time to figure out :D Having them both side-by-side with an assert between them might help proof that, as a random thought in my mind ;)

Additionally, you say 20 times faster, which means absolutely 0 to me. A function that only took 1ms in a 1000ms run, making 20 times faster is not impressive at all ;) So I miss a broader context here. How fast (or slow, pick your side) is the current sorter (in ms), and how much % of total calculation is this? And that for different games, screensizes, systems, OSes, etc etc :)

What I am aiming at .. rewriting code that has been stable for years now, always have to be approached with caution. And if your code is only a 0.1% FPS increase in 99% of the cases, it might not be worth it. But I know you a bit longer, so I suspect the gain from this PR is a lot bigger .. but I fail to read that here :D This lack of facts has been hinted in earlier comments, but I do not see a follow-up :) (it might have been on IRC, but ... it has been a year since :P Sorry if it has!)

Would you be up to provide some more numbers on this? I know that is not an easy task, but it would help out if it is worth spending the time reviewing this PR at this moment in time :) (As you know, we are a bit strained in review capacity atm)

Again, to be clear, I love patches like this :) But I have to be a bit cautious here :D Tnx!

@ldpl
Copy link
Contributor Author

ldpl commented Dec 18, 2020

@TrueBrain well, tbh I'm kind of done with this thing and don't want spend any more time on it. Also I probably wouldn't have done it at all had I known about #7968 beforehand. But still, there are some numbers and testing code one post above yours ;) And it does check that sprite order is the same as in the original sorter (you can even uncomment random_shuffle there, just ignore run times in that case as these sorters rely on initial array being somewhat sorted for performance). Admittedly it only uses one test case but it's some work to get more as they need to be saved from the real game and I don't think they'll make any difference anyway.

P.S. old vs new:

original_sse 1046.0      search - for, order - quadratic (sse)
list9_fancy    12.6      list8_fancy with preceding_prev optimization (SSE)

@TrueBrain
Copy link
Member

@TrueBrain well, tbh I'm kind of done with this thing and don't want spend any more time on it. Also I probably wouldn't have done it at all had I known about #7968 beforehand.

Does this mean we can better close this PR? Or am I misreading it? (again, I don't mind these PRs, just trying to get up-to-date on current state of some PRs, like this one).

But still, there are some numbers and testing code one post above yours ;)

There are numbers, that is for sure. But that was not what I was asking for ;) I am missing context of those numbers. They don't even tell me what they mean :P Jiffies? Iterations? But mostly, I miss how this fits in OpenTTD as a whole. Is this only 0.00001% of the time spent on drawing? Or is this 90%? As that matters .. are we optimizing something very small, or something very prominent. That is what I am after :)

And it does check that sprite order is the same as in the original sorter (you can even uncomment random_shuffle there, just ignore run times in that case as these sorters rely on initial array being somewhat sorted for performance). Admittedly it only uses one test case but it's some work to get more as they need to be saved from the real game and I don't think they'll make any difference anyway.

Sweet! And sorry, I did not check out the zips, as before I delve into such PRs, I like to know if it is worth considering to start with :D Matter of prioritization ;)

P.S. old vs new:

original_sse 1046.0      search - for, order - quadratic (sse)
list9_fancy    12.6      list8_fancy with preceding_prev optimization (SSE)

Still means little to me :) 1046 oranges? :D Guessing by the numbers it is either iterations or jiffies or something in between, and clearly it is faster. But making 0.1% of your application 100x faster still is only a ~0.1% increase in speed :D

PS: I know, this is a 10+ month old PR, and I know you did your work here (I know your work long enough to know this :D), so I am not debating any of that :) But I am trying to judge if this is worth our time here and now :) Tnx for the the fast reply, much appreciated :)

@ldpl
Copy link
Contributor Author

ldpl commented Dec 18, 2020

@TrueBrain well, tbh I'm kind of done with this thing and don't want spend any more time on it. Also I probably wouldn't have done it at all had I known about #7968 beforehand.

Does this mean we can better close this PR? Or am I misreading it? (again, I don't mind these PRs, just trying to get up-to-date on current state of some PRs, like this one).

Well, I understand your concerns but I guess you'll have to either close it or find someone else to do the testing as I'm sick tired of these sorters so I'm ok fixing small stuff if needed but not doing anything substatial in any foreseeable future. Also I did enough testing to convince myself to include it into cmclient so whether it makes into vanilla or not doesn't matter much to me as I'm not even gonna use it anyway.

There are numbers, that is for sure. But that was not what I was asking for ;) I am missing context of those numbers. They don't even tell me what they mean :P Jiffies? Iterations?

Some kind of processor cycles as I understand them, you can check cycle.h for more details. I just googled some code that seemed the most appropriate for precise performance measurement. But I only used it fore relative comparison, in absolute values without #7968 it was already pretty clear with frames taking seconds to render solely because of the sprite sorter.

@TrueBrain
Copy link
Member

This was not easy to balance (cost vs reward), for various of reasons. Read-back can be found here:

https://webster.openttdcoop.org/index.php?channel=openttd&date=1608249600#1608318064

@ldpl : can you rebase this please? We will merge after :) Cheers!

@TrueBrain TrueBrain removed the candidate: needs considering This Pull Request needs more opinions label Dec 18, 2020
@TrueBrain TrueBrain added the candidate: yes This Pull Request is a candidate for being merged label Dec 18, 2020
ldpl added a commit to ldpl/OpenTTD that referenced this pull request Dec 18, 2020
ldpl added a commit to ldpl/OpenTTD that referenced this pull request Dec 18, 2020
ldpl added a commit to ldpl/OpenTTD that referenced this pull request Dec 18, 2020
ldpl added a commit to ldpl/OpenTTD that referenced this pull request Dec 18, 2020
Copy link
Member

@TrueBrain TrueBrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed a few spelling typos and a missing space; nothing code related :)

@TrueBrain TrueBrain merged commit e82333c into OpenTTD:master Dec 20, 2020
@ldpl ldpl deleted the sprite-sorter-optimization branch December 20, 2020 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
candidate: yes This Pull Request is a candidate for being merged size: large This Pull Request is large in size; review will take a while
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants