Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use intrinsics for bit twiddling? #7042

Closed
btzy opened this issue Jan 12, 2019 · 12 comments
Closed

Use intrinsics for bit twiddling? #7042

btzy opened this issue Jan 12, 2019 · 12 comments
Labels
stale Stale issues

Comments

@btzy
Copy link
Contributor

btzy commented Jan 12, 2019

I noticed that in bitmath_func.hpp there is some code at the bottom that uses compiler intrinsics. However, there are other functions in that file that are not using intrinsics but could be made faster with intrinsics, e.g. FindFirstBit, FindLastBit, and CountBits. Using intrinsics would lead to an algorithmic speedup for these three functions. Things like ROL and ROR would also benefit from intrinsics, but there will not be any improvement in time complexity.

Would intrinsics be preferred here? I can modify this file to use compiler intrinsics where available if the consensus is to use intrinsics.

@LordAro
Copy link
Member

LordAro commented Jan 12, 2019

Remember that it's got to be supported by all platforms, but I don't see why not otherwise

@Eddi-z
Copy link
Contributor

Eddi-z commented Jan 12, 2019

We have lots of legacy code that never was updated with newer features, so if you have some improvements that

  • make it clearer/simpler
  • make it not worse (profiling)
  • make it not incompatible (platform testing)

then go ahead

@btzy
Copy link
Contributor Author

btzy commented Jan 12, 2019

make it clearer/simpler

Hmm... well I would need three times the existing amount of code (one for MSVC, one for GCC/Clang, and one for fallback/unrecognized compiler)

make it not worse (profiling)

Definitely won't be worse

make it not incompatible (platform testing)

Won't be incompatible, because I can keep the existing code as a fallback for unrecognized compilers.

@Eddi-z
Copy link
Contributor

Eddi-z commented Jan 12, 2019

well, you don't necessarily need to fulfill all of these, just you will have more arguing to do. when you deviate from one, you must argue that it is better on another one

@nielsmh
Copy link
Contributor

nielsmh commented Jan 12, 2019

Before embarking on using intrinsics, I would look at the code generated by common compilers to check if they might recognize the pattern and generate identical code.

@fsimonis
Copy link
Contributor

@btzy There are SIMD wrapper libraries to simplify this task. They use intrinsics if possible and provide fallbacks when necessary.
The C++11 library Vc forms the base of a standards proposal which has good chances of getting into C++20. So it would be a good idea to pick that one.

@btzy
Copy link
Contributor Author

btzy commented Jan 23, 2019

@fsimonis I don't think we are looking at SIMD intrinsics like SSE/AVX over here. Instead, we are looking at things like popcount and findfirstset/findlastset, which have existed since very long ago (Intel i386 and ARM both have such instructions), and will operate on a single 32 or 64-bit integer.

@fsimonis
Copy link
Contributor

fsimonis commented Jan 23, 2019 via email

@nielsmh
Copy link
Contributor

nielsmh commented Jan 24, 2019

FindFirstBit compiles to naive code (no intrinsics) in Visual C++ 2019 preview, x64 Release target, and takes 39 instructions to determine the result 1 for input 2. There are two places this function is used that may be performance relevant: In railroad signal handling (signal.cpp ExploreSegment), and road handling (road_cmd.cpp GetTileTrackStatus_Road).

CountBits also compiles to naive code, and is used significantly more.

@nielsmh
Copy link
Contributor

nielsmh commented Jan 24, 2019

I tried implementing CountBits via the popcnt instruction and found no performance benefits at all, testing on my Core i5-4460 CPU. On a second inspection this is what one should expect, since the only place in the common game loop CountBits is used is in DrawTile_Trees.

I also tried implementing FirstFirstBit, FindLastBit, and the macro/specialized versions, via the intrinsics for BSF and BSR instructions, and see at most a 1% improvement.

For testing I used ProZone Game 13, which has 2400 trains running, and Wentboune Transport, which has 4833 trains, 5499 rvs, 2818 ships, 749 aircraft.

@btzy
Copy link
Contributor Author

btzy commented Jan 26, 2019

Would it still be reasonable to change the code to use intrinsics then (perhaps on grounds that intrinsics might be clearer to the reader than the existing algorithms)?

Also, my PR #7080 uses FindFirstSet for rebuilding the linkgraph GUI cache. But while it would get a reduction in time complexity, I don't think there will be significant speedup there either.

@stale
Copy link

stale bot commented Mar 27, 2019

This issue has been automatically marked as stale because it has not had any activity in the last two months.
If you believe the issue is still relevant, please test on the latest nightly and report back.
It will be closed if no further activity occurs within 7 days.
Thank you for your contributions.

@stale stale bot added the stale Stale issues label Mar 27, 2019
@stale stale bot closed this as completed Apr 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Stale issues
Projects
None yet
Development

No branches or pull requests

5 participants