-
Notifications
You must be signed in to change notification settings - Fork 511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use OpenMP for triangulation #600
Conversation
Looks good. Can we use range based for loops here? We've been migrating away from using |
Is this better? I even used auto ;-) My testing seems to show force-to-triangle-mesh performing slower but indicating faster (lower time) at the bottom. Any idea why either of those might be? I'm using a single helix for testing so there shouldn't be any group combining going on. |
Yep, that's actually a lot easier to understand for me! Less things to track while reading the code.
I suspect it might be a measurement artifact, since the frame timer at the bottom is only supposed to measure rendering performance (the OpenGL display list stuff and such). Is triangulation even included in that timer? |
Looks like range-for on random-access container is ok for parallel for in open mp 5, but possibly not before? https://stackoverflow.com/questions/17848521/using-openmp-with-c11-range-based-for-loops |
Thanks for looking @rpavlik! I don't think we have OpenMP 5 on our version of MSVC (I'm not sure if any version of MSVC implements it?) so as I suspected that loop is best kept in a simpler form. |
@whitequark The frame timer seem to measure actual frame rate. Rendering time from OpenGL is utterly meaningless these days. I have AMD integrated graphics (APU) and I can rotate or zoom full screen at 4K with 1ms or less frame time. Stretching an extrusion shows sensible numbers in the 10ms or 100ms range and going into the 1000ms range for really sluggish stuff (that still rotates blazingly fast). Not sure what it's measuring when forced to mesh but the indicated frame time is lower while visually it's slower. |
A lot of effort, mostly by @Evil-Spirit, went into making that happen. The old immediate mode code was much slower.
I think it might be converting NURBS to mesh within the region measured by the timer (since we do it lazily and cache), but performing booleans on meshes outside of that region, or something like that. |
How should we handle updating the builds to take advantage of this change? |
Something like: diff --git a/CMakeLists.txt b/CMakeLists.txt
index 3e98325d..05c1bbf0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -96,6 +96,14 @@ if(CMAKE_SYSTEM_PROCESSOR STREQUAL "i686" OR CMAKE_SYSTEM_PROCESSOR STREQUAL "X8
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${FLOAT_FLAGS}")
endif()
+# We use OpenMP to speed up some geometric operations, but it's optional.
+include(FindOpenMP)
+if(OpenMP_FOUND)
+ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
+else()
+ message(WARNING "OpenMP not found, geometric operations will be single-threaded")
+endif()
+
if(APPLE OR CMAKE_SYSTEM_NAME STREQUAL "FreeBSD")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -stdlib=libc++")
endif()
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 928b4522..4834a655 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -224,6 +224,7 @@ add_dependencies(solvespace-core
q3d_header)
target_link_libraries(solvespace-core
+ ${OpenMP_CXX_LIBRARIES}
dxfrw
${util_LIBRARIES}
${ZLIB_LIBRARY} |
Did you forget to update the PR with the CMake OpenMP changes? |
Oops yes. |
Fixed in 47e8279. |
That broke the tests on Windows... could you take a look? |
I looked. Don't see any reason for it :( |
Do you have a VS12 install you can use to check the backtrace? |
I haven't used Windows or VS anything at home since 2005. If not for your awesome porting efforts I wouldn't even be participating! :-) |
Ah, OK! I'll take a look at it later then. |
@phkahler The testsuite crashes with error C0000374, which is |
The issue is most certainly real, you can reproduce it with ThreadSanitizer (try
|
Ah, those are likely false positives. |
I figured it out. The OMP code is probably correct. The problem was that the heap on Windows was created with diff --git a/src/platform/utilwin.cpp b/src/platform/utilwin.cpp
index 4e452802..b43a84ee 100644
--- a/src/platform/utilwin.cpp
+++ b/src/platform/utilwin.cpp
@@ -60 +60 @@ void *MemAlloc(size_t n) {
- void *p = HeapAlloc(PermHeap, HEAP_NO_SERIALIZE | HEAP_ZERO_MEMORY, n);
+ void *p = HeapAlloc(PermHeap, 0 | HEAP_ZERO_MEMORY, n);
@@ -65 +65 @@ void MemFree(void *p) {
- HeapFree(PermHeap, HEAP_NO_SERIALIZE, p);
+ HeapFree(PermHeap, 0, p);
@@ -70 +70 @@ void vl() {
- ssassert(HeapValidate(PermHeap, HEAP_NO_SERIALIZE, NULL), "Corrupted heap");
+ ssassert(HeapValidate(PermHeap, 0, NULL), "Corrupted heap");
@@ -85 +85 @@ std::vector<std::string> InitPlatform(int argc, char **argv) {
- PermHeap = HeapCreate(HEAP_NO_SERIALIZE, 1024*1024*20, 0);
+ PermHeap = HeapCreate(0, 1024*1024*20, 0); |
@jwesthues Do you have any objection to getting rid of the I'm going to leave Long term, unfortunately the way Thoughts? |
@phkahler I came up with this simplification / optimization: void SShell::TriangulateInto(SMesh *sm) {
- std::vector<SMesh> tm(surface.n);
-
-#pragma omp parallel for
+ #pragma omp parallel for
for(int i=0; i<surface.n; i++) {
SSurface *s = &surface[i];
- s->TriangulateInto(this, &tm[i]);
- }
-
- // merge the per-surface meshes
- for (auto& m : tm) {
+ SMesh m;
+ s->TriangulateInto(this, &m);
+ #pragma omp critical
sm->MakeFromCopyOf(&m);
m.Clear();
} Do you agree this is valid? Could you measure the performance impact? Also, I wonder if OpenMP fares any better if you LD_PRELOAD jemalloc into SolveSpace. |
No objection to switching the permanent heap to
|
That's precisely what jemalloc provides: it keeps per-thread memory pools (segregated by size as well, to reduce fragmentation), and uses optimistic locking so that most of the time it doesn't have to take a mutex on
I would actually be tempted to use a smart pointer--either |
I see the benefit of smart pointers for integration with a scripting language, but otherwise I doubt it would get us much. The structure of the code makes use-after-free unlikely, since there's a pretty clear boundary between state that does vs. doesn't persist across regenerations. Maybe we could get usefully more contiguous use of memory (thus better cache performance) from earlier freeing, but I doubt it. Perhaps worth considering, but I wouldn't personally touch it without really good reason--feels like a lot of work, and you could solve your immediate problem pretty easily with some slightly smarter allocator and the same global free.... |
The reason I wanted to use smart pointers for
I agree that if we stick solely to the current use of Also, while full support for scripting is of course very challenging, I think there's a specific niche where small improvements can lead to large gains--that would be the testing harness. It currently uses C++ (which I went with just to get it off the ground), but that's very unpleasant to write API (or worse, UI) acceptance tests in. With the recent addition of many new features I think testing is a very important area to focus on.
I've actually reviewed a patch that did this change--the changes required to get an allocator reference everywhere a new |
For #77 just re-parsing every time seems kind of ugly but easy and perhaps fine; but certainly no specific objection to reference counting here, if you really want to take on all that work. |
@whitequark No, I don't think that change is OK. The line: sm->MakeFromCopyOf(&m); Was pulled out of the parallel loop because it's not actually making a copy, it's adding triangle from m into sm. I can't say for certain, but I don't think adding to a list is thread-safe. I found it surprising that SovleSpace is actually written in a very functional style, which makes many things in there thread-safe. Sometimes whole data structures are created and deleted inside a complex function. It sometimes seems inefficient, but results in no shared/modified state or side effects. I'd say that's best for preventing bugs, while being able to use OMP is a side effect. BTW if it is causing a real bug this PR is not huge performance win, but using OMP opens the door to similar improvements. |
With "force NURBS booleans..." on a helix for me it crashes here solvespace/src/platform/utilwin.cpp Line 23 in c95a07a
With
Call stack:
If I remove "HEAP_NO_SERIALIZE" then it starts crashing with access violation
on the same spot. |
Hmmm even with HEAP_NO_SERIALIZE in place and -openmp removed it still crashes...
It's too late I'm missing something... |
@phkahler Of course it's not. That's why there is a |
@whitequark Your change works fine but it's hard to tell if it's any faster (might be). I was concerned about the time to restart the threads if they're blocked. Then I remembered, the vector of SMeshes might run into cache coherence issues as the lists update their lengths. It's probably good. |
This implements the change suggested in #481. Performance gain with simple helical test models was 15-25 percent on current master. Gain is more apparent at lower chord tolerance. Existing builds should not see any difference until they are told to use OpenMP. For GCC that just means passing -fopenmp.