Bring some parallelism to boolean code #616
Conversation
I have tested this with sanitizers and found no issues. Using a test model with some booleans I saw generation going from 270ms -> 150ms -> 120ms on a 4-core, 8-thread machine.
Looking for a quick review from @ruevs. You've been studying booleans, so your opinion on this seems important. I don't think this pollutes the boolean code much at all. I spent a lot of time making sure the "into" shell would not be written to in SSurface::MakeCopyTrimAgainst, and it is not unless there is an error, so there is one critical section for that and the associated printing of the error message.
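To make the shape of the change concrete, here is a purely illustrative sketch (the type and function names below are made up, not the actual SolveSpace code) of a surface-level `#pragma omp parallel for` where only the error-reporting path is serialized, as described above:

```cpp
#include <cstdio>
#include <vector>

struct Surface { /* surface data */ };

// Hypothetical per-surface operation; assumed not to write shared state.
bool TrimOneSurface(Surface &s);

void TrimAllSurfaces(std::vector<Surface> &surfaces) {
    // Each iteration works only on its own surface, so the loop can run in parallel.
    #pragma omp parallel for
    for(int i = 0; i < (int)surfaces.size(); i++) {
        if(!TrimOneSurface(surfaces[i])) {
            // Only the error path touches shared output, so only it is
            // guarded by a critical section.
            #pragma omp critical
            {
                fprintf(stderr, "trim failed for surface %d\n", i);
            }
        }
    }
}
```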
@phkahler Looks like CI fails; I believe the Linux build fails because your code leaks, and the Windows build fails because of a broken invariant, and both of these are caused by using AllocTemporary from multiple threads. We can fix this by vendoring a better allocator, like jemalloc, and using arenas. I was going to postpone that until a later release, but I have no objection to doing it now-ish.
In a debug build on VS2019 this crashes on HeapAlloc here (00B798C4):
Call Stack:
…
Yep, that's the exact same bug that I fixed earlier (allocator with …)
Removing …

According to MSDN, "There is a small performance cost to serialization, but it must be used whenever multiple threads allocate and free memory from the same heap."

Of course other code paths using AllocTemporary that have not been parallelized will suffer a bit. I don't know how to measure it, though.
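A minimal Win32 sketch of what serialized access to a private heap looks like, assuming (this is an assumption, not confirmed by the truncated comment above) that the fix being discussed is creating the temporary heap without `HEAP_NO_SERIALIZE`, so that `HeapAlloc`/`HeapFree` take the heap's internal lock:

```cpp
#include <windows.h>

static HANDLE TempHeap;

void InitTempHeap() {
    // Passing 0 (instead of HEAP_NO_SERIALIZE) makes the heap serialize
    // concurrent HeapAlloc/HeapFree calls, at the small cost MSDN mentions.
    TempHeap = HeapCreate(/*flOptions=*/0, /*dwInitialSize=*/0, /*dwMaximumSize=*/0);
}

// Illustrative allocator entry point; the real AllocTemporary may differ.
void *AllocTemporary(size_t n) {
    return HeapAlloc(TempHeap, HEAP_ZERO_MEMORY, n);
}
```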
That's a reasonable fix if you only consider Windows, but there are several issues with it:
…
Points 1 and 3 - I agree.
Oh sorry, I misread. Can you redo the comparison without this PR merged, where one arm of the comparison has …
@ruevs Try the attached model. Drag the original hole around; all 8 will follow. The change here doubled my framerate on this one with chord tolerance 0.03 (0.1 is better too, but that's not how I timed it). I also had to quiet the "didn't converge" messages to see Generate times. @whitequark It's up to you. These OMP changes will help in various cases. I also think there is more performance to be had with it. Trying to keep the changes as non-invasive as possible while still making decent gains.
I think we should absolutely merge them. I am only explaining the problems that the current state of the codebase presents; I am not arguing against this change. We can and should solve these problems, especially given that the speedup is quite significant.
I did - on …
SolveSpace should print generation times in the terminal and in the debug output pane in MSVC (or wherever OutputDebugStringA ends up on your system), so you don't have to do that...
@whitequark I know; I was too lazy to comment out all the "trim was unterminated", "trim was empty", "didn't converge ????" etc. (the model fails many things :-) ). But it is cleaner to do it without all the debug output, so I did now.

For "A_wheel.zip", dragging the rotated circle with 0.03 chord tolerance: …

For LatheSplineIntersectCylinder.zip, changing from 0.5 to 0.001 chord tolerance: …

So the effect of …
This confirms my earlier suspicion that …
@whitequark Are you certain it's leaking on Linux? It runs fine and I don't see increasing memory usage. I also ran across this sanitizer bug: https://gcc.gnu.org/bugzilla//show_bug.cgi?id=85081 It says the fix was targeted for gcc 7.4, which is what Travis is using, but I haven't seen that confirmed. I also - perhaps foolishly - have a lot of faith in the GCC developers, and it seems odd that the tools would have a problem with this. I may have screwed something up, but I trust the tools. OTOH if you're sure, I trust you too.
Yes, I am. (It's not just a leak but actually UB that leaks as a side effect.) Consider two threads racing on solvespace/src/platform/utilunix.cpp, lines 32 to 34 (at d857e3e).
You probably can't see increasing memory usage because the threads don't allocate that much, but, unlike the Windows code, it's very much not safe. We could replace it with an atomic CAS loop, or a mutex, but I'm quite wary of writing our own multithreaded allocator on top of the system one for no especially good reason. Plus even if we do, it'll still be slower than a proper arena, like the one the Windows code uses.
My understanding of OpenMP is (to say the least) superficial, but can't we just put a `#pragma omp critical` around the list update, like this:

```cpp
AllocTempHeader *h =
    (AllocTempHeader *)malloc(n + sizeof(AllocTempHeader));
h->prev = NULL;
#pragma omp critical
{
    h->next = Head;
    if(Head) Head->prev = h;
    Head = h;
}
memset(&h[1], 0, n);
return (void *)&h[1];
```

I have no Linux to try it under (apart from a Raspberry Pi that I do not want to mess with) so I am not sure if this is allowed. All the examples I found put critical pragmas inside …

By the way, I think that the undefined behaviour caused by multiple threads "racing each other" on adding blocks to the list will at worst cause the list to break up into multiple disjoint lists and thus …
What is an "arena"? A separate heap like in |
The standard is clear that data races are undefined behavior. It doesn't matter whether it happens to work or not on your compiler; the important part is that you are violating the language contract, and when it is violated, the compiler gives you no guarantees whatsoever about the behavior of your program, either before or after the data race happens. In other words, given the way C and C++ define undefined behavior, there is no "just" in the nature of the bug, and you are not allowed to reason about the consequences of the bug, not even in terms of simple causality. If UB is ever invoked at any time then the entire execution of the program gets cursed.

That said, while it is very important to understand the concepts behind UB, in practice you will rarely see strange consequences like acausality. It does still happen though.
Yes.
Yes.
We can use an existing allocator like jemalloc or tcmalloc. They also tend to perform significantly better than the stock system allocator.
Just to be clear, implementing …
We can definitely solve the immediate problem here (thread-unsafety and unnecessary overhead on POSIX platforms) with mmap and a simple bump pointer allocator, but I'm thinking further than that.
According to the standard at a high level this is true. And it is absolutely unacceptable.
I'm not familiar with jemalloc or tcmalloc - I'll have to look. If they are fast on Windows (we cannot get any faster than the current HeapAlloc on Win32, but at least close), and fast on *nix (presumably using mmap internally), and portable, and integrated with Valgrind and sanitizers, they would probably be worth looking at. On the other hand, if we just want to integrate this parallel code - I agree that writing our own just for …

How about the critical section that will be needed in any case - would …
That is of course not true. Open the manual for your CPU and check if it has some variation on "if some conditions are fulfilled, the operation is UNPREDICTABLE". If we're talking about data races, then you can get fundamentally nondeterministic behavior if two of the cores of your multiprocessor system run at clocks that do not have a defined phase relationship. You could also get nondeterministic behavior if you're reading uninitialized SRAM after power-up. If you were to revise the "particular processor" to some variation of "particular processor that is implemented as a single-clock fully synchronous fully resettable design" I would agree but that excludes virtually every CPU that's currently shipping.
This is a consequence of having a non-checked contract in your language. In case of C and C++, this is the combination of the lack of memory safety with the desire to have optimizing compilers for multiple architectures. (What's the result of overwriting stack memory on an architecture that didn't even exist when your language got standardized? Does it even have a stack? C, of course, doesn't.) I agree that this is unacceptable for most code, which is why we should write most code in memory-safe languages. But even Rust has an unsafe subset, which not only includes undefined behavior, but it is also exploited by rustc even more aggressively than a C++ compiler would. This is well into off-topic territory, but as a language designer working on embedded systems I felt compelled to comment on the intersection of my two current areas of work :)
Why can't we? HeapAlloc does more accounting than is necessary; a per-thread bump allocator would be optimal, but HeapAlloc is nowhere close.
Yes, that's the idea. The main annoyance is that jemalloc is a PITA to build as it uses autoconf (there's a jemalloc-cmake "official fork" but it's outdated), and tcmalloc is a PITA to build as it uses bazel, which is even worse. So I'd have to rewrite their build system if we were to ship them as a part of SolveSpace.
If we implement our own per-thread bump allocator it will be as fast as or faster than any library we can use, but that's a bit annoying because we'd have to use pthread_key_create and some logic specific to the main thread for cleanup. If we do a global bump allocator there'll be contention on the head and the elements, but that's probably okay.
You don't need a critical section; you can update the pointer of a bump allocator with an atomic CAS loop.
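A minimal sketch of that CAS loop (not from the codebase; the arena here is a fixed static buffer purely for illustration): the bump pointer is a `std::atomic`, and each allocation claims its slice with `compare_exchange_weak` instead of taking a lock.

```cpp
#include <atomic>
#include <cstddef>

static char Arena[1 << 20];                       // pre-reserved backing storage
static std::atomic<size_t> ArenaUsed{0};

void *BumpAlloc(size_t n) {
    n = (n + 15) & ~size_t(15);                   // keep 16-byte alignment
    size_t old = ArenaUsed.load(std::memory_order_relaxed);
    size_t next;
    do {
        next = old + n;
        if(next > sizeof(Arena)) return nullptr;  // arena exhausted
        // On failure, `old` is reloaded with the current value and we retry.
    } while(!ArenaUsed.compare_exchange_weak(old, next,
                                             std::memory_order_relaxed));
    return Arena + old;
}
```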
I gave this approach a try. It yields a very nice compact implementation using standard C++11 atomics, but only if the backing storage for the allocator is reserved (but not committed) ahead of time. On 64-bit platforms this isn't really a problem, since there is essentially no cost to reserving a few GB (or a few hundred) of virtual memory, most of which will never be used. But on 32-bit platforms you have to be careful, or you'll step on the toes of the libc allocator. It doesn't seem very nice to have a more limited allocator on Linux than we have on Windows, where it's just a normal heap that doesn't interfere with other heaps. (macOS is always 64-bit these days.) If you have to manage multiple …
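For reference, "reserved but not committed" on POSIX can be sketched roughly like this (illustrative only, not the implementation referenced below; it assumes a 64-bit address space and that commits are serialized by the caller): reserve a large range with `PROT_NONE`, then commit pages with `mprotect` as the bump pointer grows into them.

```cpp
#include <sys/mman.h>
#include <cstddef>

static const size_t ReserveSize = size_t(1) << 32;    // 4 GB of address space (64-bit only)
static char *ArenaBase;
static size_t ArenaCommitted;

bool ReserveArena() {
    // Reserve address space without committing any pages.
    void *p = mmap(nullptr, ReserveSize, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if(p == MAP_FAILED) return false;
    ArenaBase = (char *)p;
    return true;
}

bool CommitUpTo(size_t used) {
    if(used <= ArenaCommitted) return true;
    // Commit in 64 KB steps; offsets stay page-aligned because mmap's result is.
    size_t newCommitted = (used + 0xFFFF) & ~size_t(0xFFFF);
    if(mprotect(ArenaBase + ArenaCommitted, newCommitted - ArenaCommitted,
                PROT_READ | PROT_WRITE) != 0) return false;
    ArenaCommitted = newCommitted;
    return true;
}
```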
I've implemented the approach with a critical section in #620. |
Nice! I like the …

Since I was curious whether it would work, I also tried replacing the C++11 mutex with …

Results from my test as above:

Debug build
…

Release build
…

So in debug mode the C++ mutex is a tiny bit faster than the other two? Interesting... Bottom line - in my opinion it is OK to merge #620 and then #616.
@ruevs Cool! Thanks for confirming this. I'm actually quite surprised that there seems to be essentially no difference between …
@phkahler Could you please rebase this on master? Then merge once the tests pass.
Off-topic
True all that. I did not mean undefined behaviour in general, but from "stupid" software. As long as the startup assembler initializes the …
By "absolutely unacceptable." I meant "it is absolutely unacceptable to (knowingly) leave undefined behaviour in the software (SolveSpace)". I was not at all complaining about the standards allowing undefined behaviour - as you said it is very useful.
Rust... ehhh... I'll find some time to go beyond "Hello world" at some point. I have it on two of my machines sitting collecting dust :-(
Do you work on embedded Rust? Both the "Discovery" and "The Embedded Rust Book" are very well written. I've also followed https://github.com/phil-opp/blog_os from almost day one, unfortunately just reading.

On-topic
You are right.
Ufff (arcane/baroque/custom) build systems put me off. You can say I am a "stupid user" of any/all build system(s)...
OK, the former will be complicated. The latter (somewhat) slower. The third-party ones - a PITA to integrate...
A little bit; I've ported rustc to OR1K, wrote a memory-safe TCP/IP stack for it, and ported a C firmware to pure Rust. But I don't really use it much these days.
Arguably pretty much every build system is arcane, baroque, and/or custom. :)
Based on your results I think we actually don't need to urgently do anything here. (And perhaps it would also be nice to avoid such drastic changes before 3.0.) So let's not do anything beyond #620 until such a time when a need clearly arises.
Use OpenMP to run SShell::CopySurfacesTrimAgainst() and SShell::MakeClassifyingBsps() in parallel at the surface level.