
System::CalculateRank() is slow #386

Closed
ghost opened this issue Mar 3, 2019 · 31 comments

@ghost

ghost commented Mar 3, 2019

System information

SolveSpace version: master, 9d1601e
Operating system: Debian stable

In a file I'm working on (I'll attach it tomorrow, when I have more evidence), adding a 'same length' constraint between the top of the first square on the first line and the top of the first square on the second line is extremely slow (on my low-end computer).

Using LD_PRELOAD, I've narrowed the problem down to the method in the title, and implemented some optimisations.
I don't have enough test runs for reliable numbers yet, but the number of samples gperftools attributed to this method dropped from 47.3% of 3009 to 36.8% of 827 (those numbers make it obvious to me that I need more measurements).

I am opening this ticket in advance to ask for hints on how to automate the runs, which would give me more reliable numbers, and to provide a first patch with the optimised code I have so far (too lazy to do the usual fork/clone/branch/commit/push/pull-request dance for just 10 lines of code):

diff --git a/src/system.cpp b/src/system.cpp
index e90611a..24d9ebe 100644
--- a/src/system.cpp
+++ b/src/system.cpp
@@ -146,22 +146,31 @@ int System::CalculateRank() {
     for(i = 0; i < mat.m; i++) {
         // Subtract off this row's component in the direction of any
         // previous rows
+        double* curr_mat_num;
+        double* prev_mat_num;
+        const auto num_mat = mat.n;
         for(iprev = 0; iprev < i; iprev++) {
             if(rowMag[iprev] <= tol) continue; // ignore zero rows
 
             double dot = 0;
-            for(j = 0; j < mat.n; j++) {
-                dot += (mat.A.num[iprev][j]) * (mat.A.num[i][j]);
+            prev_mat_num = mat.A.num[iprev];
+            curr_mat_num = mat.A.num[i];
+            for(j = 0; j < num_mat; ++j, ++curr_mat_num, ++prev_mat_num) {
+                dot += (*prev_mat_num) * (*curr_mat_num);
             }
-            for(j = 0; j < mat.n; j++) {
-                mat.A.num[i][j] -= (dot/rowMag[iprev])*mat.A.num[iprev][j];
+            double tmp = dot/rowMag[iprev];
+            prev_mat_num = mat.A.num[iprev];
+            curr_mat_num = mat.A.num[i];
+            for(j = 0; j < num_mat; ++j, ++curr_mat_num, ++prev_mat_num) {
+                (*curr_mat_num) -= tmp*(*prev_mat_num);
             }
         }
         // Our row is now normal to all previous rows; calculate the
         // magnitude of what's left
         double mag = 0;
-        for(j = 0; j < mat.n; j++) {
-            mag += (mat.A.num[i][j]) * (mat.A.num[i][j]);
+        curr_mat_num = mat.A.num[i];
+        for(j = 0; j < num_mat; j++, ++curr_mat_num) {
+            mag += pow( *curr_mat_num, 2 );
         }
         if(mag > tol) {
             rank++;

If there are other things to improve, feel free to say so.

@Evil-Spirit
Collaborator

Evil-Spirit commented Mar 4, 2019

  1. Please ensure that using pow does not decrease performance.
  2. Please consider using loop unrolling: Evil-Spirit@6584093
  3. Please don't optimize non-inner loops; that is pointless.
  4. Look here for my other approaches to solving this issue: FindWhichToRemoveToFixJacobian can be very slow #131.
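For reference, the loop unrolling suggested in point 2 can be sketched like this (a minimal illustration under my own naming, not the actual code from Evil-Spirit@6584093):

```cpp
#include <cstddef>

// Dot product with 4-way manual unrolling: four independent accumulators
// let the CPU overlap the multiply-adds, and a scalar tail handles n % 4.
double dot_unrolled(const double *a, const double *b, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) sum += a[i] * b[i]; // leftover elements
    return sum;
}
```

Note that splitting the sum into several accumulators changes the rounding order slightly, which is usually acceptable here since the result is only compared against a tolerance.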

@whitequark
Contributor

@bmorel Can you attach the file?

@ghost
Author

ghost commented May 30, 2019

I have a few days to work on this again; I'll update my repo and redo the tests, and maybe spend some time trying to automate them.
@whitequark you mean the measurement files?

@whitequark
Contributor

Yes, the file on which you see speedups.

@ghost
Author

ghost commented Oct 20, 2019

Found the file I used for the report... I thought I had already linked it and then deleted it from my PC, but neither was true. I built a .zip archive containing a readme, the source file, and the results I have now.

profiling.zip

@baryluk

baryluk commented Jan 16, 2020

  1. Please ensure that using pow does not decrease performance.

x*x is definitely going to be faster and more accurate than pow(x, 2), in general.

x*x has well-defined, guaranteed accuracy in C and in hardware; pow doesn't.

The major reason is that pow(x, y) is typically implemented as exp(y * log(x)). Plus, the y is a double, not an int.

The compiler can sometimes optimize pow(x, 2) a little, by doing some partial inlining and constant propagation, but not much, because there are still things like NaN handling, errno, etc. to deal with. For the special case of y == 2, though, errno should never be set, and NaN/Inf handling should match pow.

I did test gcc -O2, and pow(x, 2) with x being a runtime variable actually does compile to a single instruction (mulsd %xmm0,%xmm0), which was a nice surprise, but I wouldn't want to rely on that.

However, this only happens at -O2 (or higher). Code built without optimization will be significantly slower, because of the conversion of 2 from int to double and the runtime conditionals that check y and dispatch to different algorithms depending on its value (repeated squaring, etc.).

Examples:

https://godbolt.org/z/875WfD (no optimization flags or -O0)

https://godbolt.org/z/3ZXaM3 (gcc-9.2 -O1 does work nicely)

https://godbolt.org/z/ZAA-4P (clang-9 -O1 doesn't do the smart transformation, so it will be VERY slow).

clang-9 -O2 works.

Anyway, just my two cents.
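A quick sanity check of the point above (a minimal sketch with my own function names; the values agree, it is the generated code that differs):

```cpp
#include <cmath>

// x * x is a single multiply at any optimization level; std::pow(x, 2.0)
// may go through a generic libm routine (roughly exp(2 * log(x))) unless
// the compiler recognizes the constant exponent and strength-reduces it.
double square_mul(double x) { return x * x; }
double square_pow(double x) { return std::pow(x, 2.0); }
```

Comparing the disassembly of these two functions at different optimization levels reproduces the godbolt observations below.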

@baryluk

baryluk commented Jan 16, 2020

As a person with quite a bit of experience with numerical algorithms, I am never a fan of seeing something like this:

double x = 0.0;
for (i for a lot of iterations) {
  x += something(i);
}

It is OK-ish if the number of iterations is small (<10) and the data are not pathological (a mix of very small and very big values).

In all other cases (including 100 or more seemingly normal values), it is better to use something like Kahan's summation algorithm. It can easily be abstracted in C++ too, and is rather fast.

class Accumulator {
 private: double sum_{0.0}, correction_{0.0};
 public: void add(double v) {
        const double corrected_term = v - correction_;
        const double new_sum = sum_ + corrected_term;
        correction_ = ((new_sum - sum_)) - corrected_term;
        sum_ = new_sum;
 }
 double get() { return sum_ + correction_; }
};

Accumulator x;
for (i for a lot of iterations) {
  x.add(something(i));
}
x.get();

It works correctly in gcc and clang even with -O3, and is very fast, with optimal assembly, even at -O1. (Do not compile with -Ofast or -fassociative-math: those WILL break it, by design; that is not a bug.)

https://godbolt.org/z/HVf2Af
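To make the difference concrete, here is the accumulator above exercised on a pathological sum (a self-contained sketch; the function names are mine, and the constants are chosen so that every tiny term is below the ulp of the running naive sum):

```cpp
// Kahan accumulator, as sketched above.
class Accumulator {
    double sum_ = 0.0, correction_ = 0.0;
public:
    void add(double v) {
        const double corrected_term = v - correction_;
        const double new_sum = sum_ + corrected_term;
        correction_ = (new_sum - sum_) - corrected_term;
        sum_ = new_sum;
    }
    double get() const { return sum_ + correction_; }
};

// Sum 1.0 followed by ten million terms of 1e-16. Each tiny term is
// below the ulp of 1.0, so the naive sum never moves; Kahan's carried
// correction recovers the true total of about 1.000000001.
double naive_sum() {
    double s = 1.0;
    for (int i = 0; i < 10000000; ++i) s += 1e-16;
    return s;
}

double kahan_sum() {
    Accumulator a;
    a.add(1.0);
    for (int i = 0; i < 10000000; ++i) a.add(1e-16);
    return a.get();
}
```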

@Evil-Spirit
Collaborator

Evil-Spirit commented Jan 23, 2020

Kahan's summation algorithm

My best question for interviewing developers: How to sum an array of floats?
The answers:

  1. What's the problem? Just iterate and sum! (failed)
  2. Hm... Maybe sort and sum from small to big? (my own answer ~14 years ago)
  3. Kahan! (never ever heard that answer)

@whitequark
Contributor

Oh, Kahan is like delta sigma modulation but for software. Very nice.

@ghost
Author

ghost commented Feb 18, 2020

My best question for interviewing developers: How to sum an array of floats?

My best reply: check that this is where the performance problem lies; otherwise it's better to keep the code easy to read.
Now, the changes I posted are not about adding more additions; they are about unrolling loops and storing temporary values. I even provided some real-world measurements of the changes.
I may not be as good as you guys at floating-point algorithms, but I did my best to provide evidence that my patch actually improves things.

Now, if there are lines that should be changed, given the knowledge you guys have to improve things further, I think you should just apply those changes, maybe without applying the patches I've sent. Otherwise, this issue should be closed.

@Evil-Spirit
Collaborator

@bmorel the problem with performance is the algorithmic complexity, O(n^3) or O(n^4), I don't actually remember which. Any optimization of the inner loops can be pointless at such complexity. We don't want to say your code is bad, etc. But neither my code nor yours solves this problem. The real solution is to use the Eigen library, whose sparse-matrix algorithms have better algorithmic complexity.

@ruevs
Member

ruevs commented Dec 21, 2021

@bmorel Indeed System::CalculateRank() is the slowest function for the attached test case (keyb_case.slvs). However, both with and without your optimization, it loads in about three seconds on this particular machine. Granted, the number of unknowns is limited to 1024 (and this sketch is just a few below that limit). So I increased MAX_UNKNOWNS to 2048 and copied all the squares a second time (keyb_case_MAX_UNKNOWNS_2048.zip).

keyb_case_MAX_UNKNOWNS_2048.slvs takes 25 seconds to load, and your optimizations do help. At least in my case (VS 2019, release with debug info), the fastest is your full-1, with 21s to load. I agree with @baryluk's argument about pow, and the profiling shows it.

master 25 seconds

| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module |
| --- | --- | --- | --- |
| solvespace.exe (PID: 25076) | 203448 (100.00%) | 0 (0.00%) | solvespace.exe |
| SolveSpace::System::CalculateRank | 112811 (55.45%) | 112615 (55.35%) | solvespace.exe |
| SolveSpace::System::SolveLeastSquares | 84205 (41.39%) | 71577 (35.18%) | solvespace.exe |
| SolveSpace::System::SolveLinearSystem | 12526 (6.16%) | 12508 (6.15%) | solvespace.exe |
| [External Code] | 203447 (100.00%) | 4390 (2.16%) | Multiple modules |
| [System Call] ntoskrnl.exe | 356 (0.17%) | 356 (0.17%) | ntoskrnl.exe |
| SolveSpace::Expr::Substitute | 483 (0.24%) | 343 (0.17%) | solvespace.exe |
| SolveSpace::Expr::Children | 259 (0.13%) | 259 (0.13%) | solvespace.exe |
| SolveSpace::Expr::Eval | 197 (0.10%) | 197 (0.10%) | solvespace.exe |
| SolveSpace::System::SolveBySubstitution | 716 (0.35%) | 140 (0.07%) | solvespace.exe |

full-1 21 seconds

| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module |
| --- | --- | --- | --- |
| solvespace.exe (PID: 14024) | 174927 (100.00%) | 0 (0.00%) | solvespace.exe |
| SolveSpace::System::CalculateRank | 84259 (48.17%) | 84102 (48.08%) | solvespace.exe |
| SolveSpace::System::SolveLeastSquares | 84299 (48.19%) | 71546 (40.90%) | solvespace.exe |
| SolveSpace::System::SolveLinearSystem | 12639 (7.23%) | 12623 (7.22%) | solvespace.exe |
| [External Code] | 174926 (100.00%) | 4546 (2.60%) | Multiple modules |
| SolveSpace::Expr::Substitute | 483 (0.28%) | 344 (0.20%) | solvespace.exe |
| SolveSpace::Expr::Children | 281 (0.16%) | 280 (0.16%) | solvespace.exe |
| [System Call] ntoskrnl.exe | 166 (0.09%) | 166 (0.09%) | ntoskrnl.exe |
| SolveSpace::Expr::Eval | 162 (0.09%) | 161 (0.09%) | solvespace.exe |

This is a decent gain and the code is simple - just taking the "row" indexing out of the inner loops. I'm inclined to make a pull request for this. @phkahler @rpavlik what do you think?

With the above in mind I agree with @Evil-Spirit: this is just an optimization of an O(n^3) algorithm, "doomed" by definition. For example, in our case it would not "allow" us to increase the unknowns to 2048.

@ghost
Author

ghost commented Dec 21, 2021

this is just an optimization of an O(n^3) algorithm. "Doomed" by definition

Well, the point was not to have a perfect solution; I knew perfectly well (and still know) that I lack the knowledge (and time) for bigger, more efficient changes. The point was to improve the situation.
From the numbers I provided, it does (and yours confirm it, two years later, on what is very likely a much more powerful system), but the answer was that the solution is to do intensive work to switch to a library I had never heard of, instead of accepting a patch that improves the situation.
Understand that this is not very encouraging.

@ruevs
Member

ruevs commented Dec 21, 2021

From the numbers I provided, it does (and yours confirm it, two years later, on what is very likely a much more powerful system),

The speed of the system is "irrelevant": this algorithm will be "slow" on any system, by its nature. (By slow I do not mean "the human will be annoyed", even though that will happen with enough objects.)

but the answer was that the solution is to do intensive work to switch to a library I never heard of, instead of accepting a patch that improves situation.
Understand that this is not very encouraging.

I understand. And as I said, "I'm inclined to make a pull request for this." 16% less time is 16% less time :-) But let's see what @phkahler and @rpavlik say.

If it makes you feel better, Eigen (the library in question) has been "waiting" since April 2017 ;-)

@ghost
Author

ghost commented Dec 21, 2021

The speed of the system is "irrelevant" [...] by slow I do not mean "the human will be annoyed"

Yep. That's what I had in mind :)

If it makes you feel better, Eigen (the library in question) has been "waiting" since April 2017 ;-)

It does not. I think it's sad to freeze improvements because the big, nice silver bullet is «coming soon™», and I like solvespace :)

@phkahler
Member

@ruevs I'm inclined not to make changes like this in the solver, because they would make the @Evil-Spirit solver patches less likely to apply cleanly. Those changes will definitely be a bigger performance increase, but the last time I tried there were some problems. I'd also like his patch series split in two: the solver changes, and then the Eigen integration. IIRC somewhere in all that a memory leak was introduced that was never tracked down.

My last attempt at doing part one:
https://github.com/phkahler/solvespace/commits/solver-patches

BTW I think solver performance is becoming a bigger issue than it used to be since so many other parts of the code are faster now. PRs that work toward better algorithmic complexity and use of Eigen sparse matrix code are very much welcome. Just be aware of what has been done before but never merged.

@ghost
Author

ghost commented Dec 21, 2021

So... the problem is that it might be harder to rebase? If that's the only obstacle to accepting the kind of patches I proposed years ago, then I'm OK with that.
At the time, I had no good reason to use GitHub that often (I was also doubtful about how GH projects are managed... doubts that I'm slowly turning into certainty), but this has changed.
Thus, there's no reason I can't put some effort into improving solvespace, considering that I use it from time to time. It would only be fair to give a hand back, if that hand is accepted.

I even have feature requests in mind, to help my own usage, that I might find time to implement myself. An easy task would certainly help to motivate (or demotivate, depending on how things go) me to find the willpower to implement them, or even just to open issues for them.

@phkahler
Member

Thus, there's no reason I can't put some effort into improving solvespace, considering that I use it from time to time. It would only be fair to give a hand back, if that hand is accepted.

@bmorel - Your help will be gladly accepted. I wasn't a contributor back when this issue was opened, nor when the other solver work was done (but not merged). I consider all the open issues on GitHub to be either nice ideas for when someone has time, or a backlog of things that really should be done. As with most things, I find the challenge is balancing short- and long-term goals. Leaning toward long term in this case can unfortunately mean very long term. Any help in pulling things together sooner is good. Even the resurfacing of this issue is making me see the importance of solver improvements.

I even have feature requests in mind, to help my own usage, that I might find time to implement myself. An easy task would certainly help to motivate (or demotivate, depending on how things go) me to find the willpower to implement them, or even just to open issues for them.

Please open those feature requests. If they're reasonable and not redundant they will be kept open even if nobody gets to them in the near term. Sometimes a feature request leads to a different solution to the same problem. Pick an area that interests you and jump in - just not the solver right now unless you want to take on the big changes ;-)

@rpavlik
Contributor

rpavlik commented Dec 22, 2021

I think I had a crack at integrating the two Eigen branches. Mine was a direct translation, while the other was a greater improvement along with the use of Eigen. (We have an exploded-view setting now? Fancy!) I've pushed it here, though my most recent commit messages on the branch make me nervous that it's possibly nonfunctional: https://github.com/rpavlik/solvespace/tree/eigen_over_master

@rpavlik
Contributor

rpavlik commented Dec 22, 2021

OK, and I split stuff up a bit: #1156 is a pretty minimal introduction of Eigen, from long ago (my previous job email!) that noticeably improved performance at that time. Tests pass. Doesn't touch anything dramatic so shouldn't introduce any leaks. I've updated my eigen_over_master branch now to build on it: if it's interesting to review the big refactor from @Evil-Spirit separately, it doesn't actually depend on Eigen before the "Convert main system Jacobian code to using Eigen" commit so it can be rebased away and evaluated separately. Unfortunately the weird List and IdList structures are and will continue to be leak-prone as long as they silently share memory, but my previous attempts (to replace their backing stores with std::shared_ptr<std::vector<T>>) were not fully successful. Perhaps an RAII-style Owned<> wrapper around them would be a better approach.
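The RAII idea at the end could look roughly like this (a purely hypothetical sketch with my own names; it assumes the wrapped list type exposes a Clear() that frees its backing store, as SolveSpace's List/IdList do):

```cpp
#include <utility>

// Owns a list-like value and guarantees its storage is released exactly
// once. Copying is forbidden so two owners can never silently share one
// backing store; moving transfers ownership and leaves the source empty.
template <typename L>
class Owned {
    L list_{};
public:
    Owned() = default;
    Owned(const Owned &) = delete;
    Owned &operator=(const Owned &) = delete;
    Owned(Owned &&other) noexcept : list_(std::exchange(other.list_, L{})) {}
    ~Owned() { list_.Clear(); } // release the backing store on scope exit
    L *operator->() { return &list_; }
    L &operator*() { return list_; }
};
```

The point of the sketch is that the deleted copy operations turn today's accidental aliasing of backing stores into compile errors, rather than leaks or double frees at runtime.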

@77maxikov
Contributor

Eigen 3.4.0 wants C++14 by default; I had to add #define EIGEN_MAX_CPP_VER 14 to be able to compile. Playing with keyb_case.slvs, I noticed a strange thing on the first load of this file: Generate::ALL (for the bounding box) took 12285 ms on master, while on eigen_over_master it took 14128 ms.

@rpavlik
Contributor

rpavlik commented Dec 22, 2021

#432 was my previous PR, so I went back to using that one

@77maxikov
Contributor

Sorry for my previous comment: all timings for the keyb_case.slvs sample were obtained from a debug build. With a release build I got a 10x speedup: 3727 ms on master vs. 369 ms on the eigen_over_master branch.

@ruevs
Member

ruevs commented Dec 23, 2021

And for my keyb_case_MAX_UNKNOWNS_2048.zip, the https://github.com/rpavlik/solvespace/tree/eigen_over_master branch takes about 2.7 seconds (down from 25 in trunk):

| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module |
| --- | --- | --- | --- |
| solvespace.exe (PID: 26804) | 21732 (100.00%) | 0 (0.00%) | solvespace.exe |
| Eigen::SparseMatrixBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::dot<Eigen::Matrix<double,-1,1,0,-1,1> > | 9625 (44.29%) | 9469 (43.57%) | solvespace.exe |
| Eigen::SparseQR<Eigen::SparseMatrix<double,0,int>,Eigen::COLAMDOrdering >::factorize | 17015 (78.29%) | 6767 (31.14%) | solvespace.exe |
| [External Code] | 21732 (100.00%) | 4138 (19.04%) | Multiple modules |
| inflate_fast | 144 (0.66%) | 140 (0.64%) | solvespace.exe |

However, if I go to 4096 (keyb_case_MAX_UNKNOWNS_4096.zip), it is still rather slow: 18 seconds. So @rpavlik, we should not remove the "too many unknowns" restriction (rpavlik@2847f2c, and partly rpavlik@349de5f), but increase it to about 2048. Anyway, it is a very bad habit to "copy-paste like crazy" in sketches.

@rpavlik
Contributor

rpavlik commented Dec 23, 2021

The removal of the too-many-unknowns check is tied to the replacement of a fixed-size array with a std::vector and other variable-size structures. I suppose we could still add back an artificial limit on the number of unknowns.

@ruevs
Member

ruevs commented Dec 23, 2021

Yes, I think an artificial limit is in order. Essentially, put back this condition:
rpavlik@349de5f#diff-d34d69445971e623152a1884dfe7d22e7b59ebc1edf082008dd4b1cc5b5b5acaL28

@ruevs
Member

ruevs commented Dec 23, 2021

I'm looking at the @Evil-Spirit optimizations other than Eigen (#1158) now. They do not help with the test case in this issue, but there are some interesting things there. I'll need some time to "parse" the math and make sure it makes sense, for these two: 8a7996b 5b014aa.

And if nothing else, we should consider whether we really want a "suppress DOF calculation" flag. What do you guys think? @phkahler @ppd @vespakoen

But overall I think that with your two fresh pull requests #1158 #1159 we have a perfect place to discuss, test, review, clean up, and finally merge this stuff.

@rpavlik
Contributor

rpavlik commented Dec 23, 2021

Cool, thanks for looking into this. I'd love to get any portion of this merged if it makes things faster (or more maintainable with a minimal speed penalty).

I have noticed that many of the times we call into WriteJacobian, we actually end up with at least one dimension equal to 0. It would be great to figure out how to short-circuit as much of the code as possible in those cases, because I'm sure there are allocations we don't need then.

We may also benefit from a "smallvec" type: basically like std::vector<>, but defaulting to no dynamic allocation, instead using an internal fixed-capacity buffer when possible. (There are a lot of cases where our lists have fewer than 10 elements, for instance.) If anybody knows of a good one, properly licensed and header-only, I'd love to hear about it.
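For illustration, the core of such a type might look like this (a hypothetical sketch with my own names; production-quality options include boost::container::small_vector and LLVM's SmallVector):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// A vector that stores up to N elements in an inline buffer and only
// touches the heap once that capacity is exceeded ("spilling").
template <typename T, std::size_t N>
class SmallVec {
    std::array<T, N> inline_{}; // fixed-capacity inline storage
    std::vector<T> heap_;       // used only after spilling
    std::size_t size_ = 0;
    bool spilled_ = false;
public:
    void push_back(const T &v) {
        if (!spilled_) {
            if (size_ < N) { inline_[size_++] = v; return; }
            // Out of inline room: copy everything to the heap once.
            heap_.assign(inline_.begin(), inline_.end());
            spilled_ = true;
        }
        heap_.push_back(v);
        ++size_;
    }
    std::size_t size() const { return size_; }
    const T &operator[](std::size_t i) const {
        return spilled_ ? heap_[i] : inline_[i];
    }
};
```

The win is that a list that stays under N elements never allocates at all, which matters when many small lists are created and destroyed in a hot loop.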

@phkahler
Member

@baryluk Since you've confessed to having some experience with numeric algorithms, we have a few pull requests pending that integrate Eigen into the solver. If you've got some time it might be worth having a look and offering any suggestions you might have. In particular #1158, #1159, and #1160. If not, then carry on ;-)

@ruevs
Member

ruevs commented Jan 4, 2022

Since #1158 and #1159 are now merged and the performance is much better I will close this.

Just for reference the 18s loading time with the 4096 test file:

| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module |
| --- | --- | --- | --- |
| solvespace.exe (PID: 22372) | 138841 (100.00%) | 0 (0.00%) | solvespace.exe |
| Eigen::SparseMatrixBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::dot<Eigen::Matrix<double,-1,1,0,-1,1> > | 76911 (55.40%) | 76347 (54.99%) | solvespace.exe |
| Eigen::SparseQR<Eigen::SparseMatrix<double,0,int>,Eigen::COLAMDOrdering >::factorize | 132404 (95.36%) | 53254 (38.36%) | solvespace.exe |
| [External Code] | 138841 (100.00%) | 6112 (4.40%) | Multiple modules |
| Eigen::SparseCompressedBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::InnerIterator::InnerIterator | 453 (0.33%) | 451 (0.32%) | solvespace.exe |
| [System Call] ntoskrnl.exe | 297 (0.21%) | 297 (0.21%) | ntoskrnl.exe |
| Eigen::SparseMatrix<double,0,int>::insertUncompressed | 238 (0.17%) | 198 (0.14%) | solvespace.exe |
| SolveSpace::IdList<SolveSpace::Entity,SolveSpace::hEntity>::FindByIdNoOops | 158 (0.11%) | 158 (0.11%) | solvespace.exe |
| Eigen::internal::coletree<Eigen::SparseMatrix<double,0,int>,Eigen::Matrix<int,-1,1,0,-1,1> > | 158 (0.11%) | 155 (0.11%) | solvespace.exe |
| [External Call] ucrtbase.dll | 154 (0.11%) | 154 (0.11%) | ucrtbase.dll |
| Eigen::internal::assign_sparse_to_sparse<Eigen::SparseMatrix<double,0,int>,Eigen::SparseMatrix<double,0,int> > | 433 (0.31%) | 139 (0.10%) | solvespace.exe |
| inflate_fast | 141 (0.10%) | 138 (0.10%) | solvespace.exe |
| SolveSpace::IdList<SolveSpace::Param,SolveSpace::hParam>::FindByIdNoOops | 108 (0.08%) | 108 (0.08%) | solvespace.exe |
| Eigen::internal::sparse_solve_triangular_selector<Eigen::Block<Eigen::SparseMatrix<double,0,int> const ,-1,-1,0> const ,Eigen::Block<Eigen::Matrix<double,-1,1,0,-1,1>,-1,1,0>,2,2,0>::run | 93 (0.07%) | 93 (0.07%) | solvespace.exe |
| Eigen::SparseMatrix<double,0,int>::insert | 317 (0.23%) | 87 (0.06%) | solvespace.exe |
| SolveSpace::SolveSpaceUI::LoadUsingTable | 157 (0.11%) | 80 (0.06%) | solvespace.exe |

Overall the code is now spending 93% of the time in Eigen doing multiplications (dot products) which sounds correct to me :-)

@ruevs ruevs closed this as completed Jan 4, 2022
@ruevs
Member

ruevs commented Jan 4, 2022

In addition, the new check box in the sketch group properties, suppress dof calculation (improves solver performance), vastly speeds things up when pasting thousands of objects (e.g. going from 2048 to 4096 in the examples from this issue).
