
Bring some parallelism to boolean code #616

Merged
merged 1 commit on May 23, 2020

Conversation

phkahler
Member

Use OpenMP to run SShell::CopySurfacesTrimAgainst() and SShell::MakeClassifyingBsps() in parallel at the surface level.

@phkahler
Member Author

I have tested this with sanitizers and found no issues. Using a test model with some booleans I saw generation going from 270ms -> 150ms -> 120ms on a 4-core/8-thread machine.

@phkahler
Member Author

Looking for a quick review from @ruevs. You've been studying booleans so your opinion on this seems important. I don't think this pollutes the boolean code much at all.

I spent a lot of time making sure the "into" shell is not written to in SSurface::MakeCopyTrimAgainst. It is not, except when an error occurs, so there is a single critical section covering that case and the associated printing of the error message.

@whitequark
Contributor

whitequark commented May 21, 2020

@phkahler Looks like CI fails; I believe the Linux build fails because your code leaks, and the Windows build fails because of a broken invariant, and both of these are caused by using AllocTemporary from multiple threads.

We can fix this by vendoring a better allocator, like jemalloc, and using arenas. I was going to postpone that until a later release but I have no objection to doing it now-ish.

@ruevs
Member

ruevs commented May 21, 2020

In a debug build on VS2019 this crashes on HeapAlloc here (00B798C4):

void *AllocTemporary(size_t n)
{
00B798A0  push        ebp  
00B798A1  mov         ebp,esp  
00B798A3  sub         esp,8  
00B798A6  push        esi  
00B798A7  mov         dword ptr [ebp-8],0CCCCCCCCh  
00B798AE  mov         dword ptr [v],0CCCCCCCCh  
    void *v = HeapAlloc(TempHeap, HEAP_NO_SERIALIZE | HEAP_ZERO_MEMORY, n);
00B798B5  mov         esi,esp  
00B798B7  mov         eax,dword ptr [n]  
00B798BA  push        eax  
00B798BB  push        9  
00B798BD  mov         ecx,dword ptr [SolveSpace::TempHeap (0135AFB0h)]  
00B798C3  push        ecx  
00B798C4  call        dword ptr [__imp__HeapAlloc@12 (010D507Ch)]       <<<<<<<<<<<<
00B798CA  cmp         esi,esp  
00B798CC  call        _RTC_CheckEsp (01039650h)  

Call Stack:

 	ntdll.dll!772d54a7()	Unknown
 	ntdll.dll![Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]	Unknown
 	[External Code]	
>	solvespace.exe!SolveSpace::AllocTemporary(unsigned int n) Line 23	C++
 	solvespace.exe!SolveSpace::SBspUv::Alloc() Line 787	C++
 	solvespace.exe!SolveSpace::SBspUv::InsertEdge(SolveSpace::Point2d ea, SolveSpace::Point2d eb, SolveSpace::SSurface * srf) Line 868	C++
 	solvespace.exe!SolveSpace::SBspUv::InsertOrCreateEdge(SolveSpace::SBspUv * where, SolveSpace::Point2d ea, SolveSpace::Point2d eb, SolveSpace::SSurface * srf) Line 859	C++
 	solvespace.exe!SolveSpace::SBspUv::InsertEdge(SolveSpace::Point2d ea, SolveSpace::Point2d eb, SolveSpace::SSurface * srf) Line 888	C++
 	solvespace.exe!SolveSpace::SBspUv::InsertOrCreateEdge(SolveSpace::SBspUv * where, SolveSpace::Point2d ea, SolveSpace::Point2d eb, SolveSpace::SSurface * srf) Line 859	C++
 	solvespace.exe!SolveSpace::SBspUv::InsertEdge(SolveSpace::Point2d ea, SolveSpace::Point2d eb, SolveSpace::SSurface * srf) Line 888	C++
 	solvespace.exe!SolveSpace::SBspUv::InsertOrCreateEdge(SolveSpace::SBspUv * where, SolveSpace::Point2d ea, SolveSpace::Point2d eb, SolveSpace::SSurface * srf) Line 859	C++
 	solvespace.exe!SolveSpace::SBspUv::From(SolveSpace::SEdgeList * el, SolveSpace::SSurface * srf) Line 805	C++
 	solvespace.exe!SolveSpace::SSurface::MakeClassifyingBsp(SolveSpace::SShell * shell, SolveSpace::SShell * useCurvesFrom) Line 779	C++
 	solvespace.exe!SolveSpace::SShell::MakeClassifyingBsps$omp$1() Line 771	C++
Exception thrown at 0x772D54A7 (ntdll.dll) in solvespace.exe: 0xC0000005: Access violation reading location 0x00000004.

@whitequark
Contributor

Yep, that's the exact same bug that I fixed earlier (allocator with HEAP_NO_SERIALIZE called from multiple threads), but now for the temporary heap.

@ruevs
Member

ruevs commented May 21, 2020

Removing HEAP_NO_SERIALIZE fixes the problem.

void *AllocTemporary(size_t n)
{
    void *v = HeapAlloc(TempHeap, HEAP_ZERO_MEMORY, n);
    ssassert(v != NULL, "Cannot allocate memory");
    return v;
}
void FreeAllTemporary()
{
    if(TempHeap) HeapDestroy(TempHeap);
    TempHeap = HeapCreate(0, 1024*1024*20, 0);
}

According to MSDN, "There is a small performance cost to serialization, but it must be used whenever multiple threads allocate and free memory from the same heap."
Despite this performance hit there is still a benefit from parallel execution. In a quick test I changed the chord tolerance from 0.5 to 0.001 on the attached model (in a debug build on a 4-core/4-thread CPU) and timed how long it takes to render:

  • 17 seconds for phkahler/parallel with HEAP_NO_SERIALIZE removed
  • 20 seconds for master

Of course other code paths using AllocTemporary that have not been parallelized will suffer a bit. I don't know how to measure it though.

LatheSplineIntersectCylinder.zip

@whitequark
Contributor

whitequark commented May 21, 2020

That's a reasonable fix if you only consider Windows, but there are several issues with it:

  • There is no equivalent fix for our Linux code;
  • The ~17% performance penalty is quite significant, though not enough to offset the gains from using OpenMP;
  • The change makes single-threaded builds (e.g. all MinGW ones) strictly slower.

@ruevs
Member

ruevs commented May 21, 2020

Points 1 and 3 - I agree.
Point 2 - what do you mean by "~17% performance penalty"? The parallel version is faster in my test.

@whitequark
Contributor

whitequark commented May 21, 2020

Point 2 - what do you mean by "~17% performance penalty"? The parallel version is faster in my test.

Oh sorry, I misread. Can you redo the comparison without this PR merged, where one arm of the comparison has HEAP_NO_SERIALIZE and the other arm does not?

@phkahler
Member Author

@ruevs Try the attached model.
A_wheel.zip

Drag the original hole around, all 8 will follow. The change here doubled my framerate on this one with chord tolerance 0.03 (0.1 is better too, but that's not how I timed it). Also had to quiet the "didn't converge" messages to see Generate times.

@whitequark It's up to you. These OMP changes will help in various cases. I also think there is more performance to be had with it. Trying to keep the changes as non-invasive as possible while still making decent gains.

@whitequark
Contributor

It's up to you. These OMP changes will help in various cases. I also think there is more performance to be had with it. Trying to keep the changes as non-invasive as possible while still making decent gains.

I think we should absolutely merge them. I am only explaining the problems that the current state of the codebase presents; I am not arguing against this change. We can and should solve these problems, especially given that the speedup is quite significant.

@ruevs
Member

ruevs commented May 21, 2020

Point 2 - what do you mean by "~17% performance penalty"? The parallel version is faster in my test.

Oh sorry, I misread. Can you redo the comparison without this PR merged, where one arm of the comparison has HEAP_NO_SERIALIZE and the other arm does not?

I did - on master with and without HEAP_NO_SERIALIZE - and I cannot measure a difference. However, I am "measuring" by watching the second hand on my clock, and over this 20-second interval I cannot notice a difference. So the penalty for removing HEAP_NO_SERIALIZE, on this one test, on one system, in debug mode, is under 1 second, or 5%.

@whitequark
Contributor

However I am "measuring" by looking at the second hand on my clock

SolveSpace should print generation times in the terminal and in the debug output pane in MSVC (or wherever OutputDebugStringA ends up on your system), so you don't have to do that...

@ruevs
Member

ruevs commented May 21, 2020

@whitequark I know, I was too lazy to comment out all the "trim was unterminated", "trim was empty", "didn't converge ????" etc. (the model fails many things :-) ). But it is cleaner to do it without all the debug output, so I have done that now.

@ruevs Try the attached model.
A_wheel.zip

Drag the original hole around, all 8 will follow. The change here doubled my framerate on this one with chord tolerance 0.03 (0.1 is better too, but that's not how I timed it). Also had to quiet the "didn't converge" messages to see Generate times.

For "A_wheel.zip" dragging the rotated circle with 0.03 chord tolerance

master
Generate::DIRTY took ~1320 ms

master with HEAP_NO_SERIALIZE removed
Generate::DIRTY took ~1312 ms

phkahler/parallel
Generate::DIRTY took ~480 ms

For LatheSplineIntersectCylinder.zip, changing from 0.5 to 0.001 chord tolerance:

master
Generate::ALL took ~19200 ms

master with HEAP_NO_SERIALIZE removed
Generate::ALL took ~19200 ms

phkahler/parallel
Generate::ALL took ~16400 ms

So the effect of HEAP_NO_SERIALIZE is in the noise on my system.

@whitequark
Contributor

This confirms my earlier suspicion that HEAP_NO_SERIALIZE probably doesn't do much on Windows anymore. So we can keep using heap functions from kernel32 to make an arena on Windows. Unfortunately we still have to come up with a better approach on Linux.

@phkahler
Member Author

@whitequark Are you certain it's leaking on Linux? It runs fine and I don't see increasing memory usage. Also ran across this sanitizer bug:

https://gcc.gnu.org/bugzilla//show_bug.cgi?id=85081

It says the fix was targeted for GCC 7.4, which is what Travis is using, but I haven't seen that confirmed. I also - perhaps foolishly - have a lot of faith in the GCC developers, and it seems odd that the tools would have a problem with this. I may have screwed something up, but I trust the tools. OTOH if you're sure, I trust you too.

@whitequark
Contributor

whitequark commented May 22, 2020

Are you certain it's leaking on Linux?

Yes, I am. (It's not just a leak but actually UB that leaks as a side effect.) Consider two threads racing on AllocTemporary here:

h->next = Head;
if(Head) Head->prev = h;
Head = h;

You probably can't see increasing memory usage because the threads don't allocate that much, but, unlike the Windows code, it's very much not safe. We could replace it with an atomic CAS loop, or a mutex, but I'm quite wary of writing our own multithreaded allocator on top of the system one for no especially good reason. Plus even if we do, it'll still be slower than a proper arena, like the one Windows code uses.
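
For illustration, a minimal sketch of the mutex variant (the mutex name and placement are assumptions for the sketch, not actual SolveSpace code; it reuses AllocTempHeader and Head from above):

#include <mutex>

static std::mutex TempArenaMutex;

void *AllocTemporary(size_t n)
{
    AllocTempHeader *h =
        (AllocTempHeader *)malloc(n + sizeof(AllocTempHeader));
    h->prev = NULL;
    {
        // Serialize only the list update; malloc itself is thread-safe.
        std::lock_guard<std::mutex> guard(TempArenaMutex);
        h->next = Head;
        if(Head) Head->prev = h;
        Head = h;
    }
    memset(&h[1], 0, n);
    return (void *)&h[1];
}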

@ruevs
Member

ruevs commented May 22, 2020

My understanding of OpenMP is (to say the least) superficial, but can't we just put a #pragma omp critical around those lines, so that race conditions while updating Head cannot corrupt the linked list of blocks?

    AllocTempHeader *h =
        (AllocTempHeader *)malloc(n + sizeof(AllocTempHeader));
    h->prev = NULL;
#pragma omp critical
    {
        h->next = Head;
        if(Head) Head->prev = h;
        Head = h;
    }
    memset(&h[1], 0, n);
    return (void *)&h[1];

I have no Linux to try it under (apart from a Raspberry Pi that I do not want to mess with) so I am not sure if this is allowed. All the examples I found put critical pragmas inside #pragma omp parallel, so maybe they are not allowed outside of one?
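
(As it turns out, OpenMP does allow this: critical may appear in a function that is merely called from inside a parallel region - an "orphaned" directive that binds to the enclosing parallel region at run time, and simply runs serially if there is none. A tiny self-contained sketch, with illustrative names:)

#include <stdio.h>

static int counter = 0;

static void bump(void)
{
    // Orphaned critical: no omp parallel in lexical scope here.
#pragma omp critical
    {
        counter++;
    }
}

int main(void)
{
#pragma omp parallel for
    for(int i = 0; i < 1000; i++) {
        bump();
    }
    printf("%d\n", counter);    // prints 1000 when built with -fopenmp
    return 0;
}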

By the way I think that the undefined behaviour caused by multiple threads "racing each other" on adding blocks to the list will at worst cause the list to break up into multiple disjoint lists, and thus FreeAllTemporary will not free everything. It will free only whatever happened to remain "hanging" off of Head. So it boils down to "just" a memory leak. When I say "just" I do not mean that it is a small problem - I hate both race conditions and memory leaks - so in my opinion it is unacceptable to allow it.

@ruevs
Member

ruevs commented May 22, 2020

What is an "arena"? A separate heap like in HeapCreate in Win32? I think there is no such thing under POSIX? Maybe an mmap, but then we would have to manage the memory inside it ourselves. As you said - writing our own allocator...

@whitequark
Contributor

By the way I think that the undefined behaviour caused by multiple threads "racing each other" on adding blocks to the list will at worst cause the list to break up into multiple disjoint lists and thus FreeAllTemporary will not free everything.

The standard is clear that data races are undefined behavior. It doesn't matter whether it happens to work or not on your compiler; the important part is that you are violating the language contract, and when it is violated, the compiler gives you no guarantees whatsoever about the behavior of your program, either before or after the data race happens.

In other words, given the way C and C++ define undefined behavior, there is no "just" in the nature of the bug, and you are not allowed to reason about the consequences of the bug, not even in terms of simple causality. If UB is ever invoked at any time then the entire execution of the program gets cursed.

That said, while it is very important to understand the concepts behind UB, in practice you will rarely see strange consequences like acausality. It does still happen though.

@whitequark
Contributor

What is an "arena"? A separate heap like in HeapCreate in Win32?

Yes.

I think there is no such thing under POSIX?

Yes.

Maybe an mmap but then we have to manage the memory inside ourselves. As you said - writing our own allocator...

We can use an existing allocator like jemalloc or tcmalloc. They also tend to perform significantly better than the stock system allocator.

@whitequark
Contributor

Maybe an mmap but then we have to manage the memory inside ourselves. As you said - writing our own allocator...

Just to be clear, implementing AllocTemporary on POSIX platforms with mmap and an atomically bumped pointer isn't very hard. The reason I'm interested in using a more proper allocator is threefold:

  • First, system malloc got better lately, but it's still system-dependent (so in practice we have some OS on which the allocator happens to be fastest, and the rest are unfairly penalized), and often not that good with multithreaded workloads.
  • Second, writing our own allocator means we'll need to integrate with the memory checking tools like valgrind, AddressSanitizer, and LeakSanitizer, and I'm less fond of that idea;
  • Third, eventually I'd like to be able to use Exprs via some scripting language, and using a proper allocator would also help there.

We can definitely solve the immediate problem here (thread-unsafety and unnecessary overhead on POSIX platforms) with mmap and a simple bump pointer allocator, but I'm thinking further than that.

@ruevs
Member

ruevs commented May 22, 2020

By the way I think that the undefined behaviour [...]

The standard is clear that data races are undefined behavior. It doesn't matter whether it happens to work or not on your compiler [...]
If UB is ever invoked at any time then the entire execution of the program gets cursed.

According to the standard at a high level this is true. And it is absolutely unacceptable.
Because of my day-to-day work I am used to thinking about what would happen in such cases. On embedded systems (it is very obvious that) there is no such thing as "undefined behavior". When there is a race condition, or a deadlock, or someone is wiping over memory they should not be touching (most fun - the stack), the particular code on the particular processor with the particular compiler at the particular optimization level does something specific. And one needs to debug it and understand it (sometimes at assembler level) in order to find and fix the bug. But I agree that in this case, with "some compiler" with a modern optimizer on "some" system, this way of thinking is not "useful".

We can definitely solve the immediate problem here (thread-unsafety and unnecessary overhead on POSIX platforms) with mmap and a simple bump pointer allocator, but I'm thinking further than that.

I'm not familiar with jemalloc or tcmalloc - I'll have to look. If they are fast on Windows (we can not get any faster than the current HeapAlloc on Win32, but at least close), and fast on *nix (presumably using mmap internally), and portable, and integrated with Valgrind and sanitizers, they would probably be worth looking at.

On the other hand if we just want to integrate this parallel code - I agree that writing our own just for utilunix.cpp with mmap should be easy and faster than the current malloc one (and, because it would be so specific and simple maybe as fast as any third party allocator library).

How about the critical section that will be needed in any case - would #pragma omp critical work?

@whitequark
Contributor

On embedded systems (it is very obvious that) there is no such thing as "undefined behavior".

That is of course not true. Open the manual for your CPU and check if it has some variation on "if some conditions are fulfilled, the operation is UNPREDICTABLE". If we're talking about data races, then you can get fundamentally nondeterministic behavior if two of the cores of your multiprocessor system run at clocks that do not have a defined phase relationship. You could also get nondeterministic behavior if you're reading uninitialized SRAM after power-up.

If you were to revise the "particular processor" to some variation of "particular processor that is implemented as a single-clock fully synchronous fully resettable design" I would agree but that excludes virtually every CPU that's currently shipping.

According to the standard at a high level this is true. And it is absolutely unacceptable.

This is a consequence of having a non-checked contract in your language. In case of C and C++, this is the combination of the lack of memory safety with the desire to have optimizing compilers for multiple architectures. (What's the result of overwriting stack memory on an architecture that didn't even exist when your language got standardized? Does it even have a stack? C, of course, doesn't.)

I agree that this is unacceptable for most code, which is why we should write most code in memory-safe languages. But even Rust has an unsafe subset, which not only includes undefined behavior, but it is also exploited by rustc even more aggressively than a C++ compiler would.

This is well into off-topic territory, but as a language designer working on embedded systems I felt compelled to comment on the intersection of my two current areas of work :)


I'll have to look. If they are fast on Windows (we can not get any faster than the current HeapAlloc on Win32 but at least close)

Why can't we? HeapAlloc does more accounting than is necessary; a per-thread bump allocator would be optimal, but HeapAlloc is nowhere close.

and fast on *nix (presumably using mmap internally), and portable, and integrated with Valgrind and sanitizers

Yes, that's the idea. The main annoyance is that jemalloc is a PITA to build as it uses autoconf (there's a jemalloc-cmake "official fork" but it's outdated), and tcmalloc is a PITA to build as it uses bazel, which is even worse. So I'd have to rewrite their build system if we were to ship them as a part of SolveSpace.

and, because it would be so specific and simple maybe as fast as any third party allocator library

If we implement our own per-thread bump allocator it will be as fast as or faster than any library we can use, but that's a bit annoying because we'd have to use pthread_key_create and some logic specific to the main thread for cleanup. If we do a global bump allocator there'll be contention on the head and the elements, but that's probably okay.

How about the critical section that will be needed in any case - would #pragma omp critical work?

You don't need a critical section; you can update the pointer of a bump allocator with an atomic CAS loop.
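
A minimal sketch of that idea, assuming a fixed pre-reserved buffer as backing storage (all names and sizes are illustrative, not the actual implementation):

#include <atomic>
#include <cstddef>
#include <cstring>

static char Arena[16 * 1024 * 1024];        // pre-reserved backing storage
static std::atomic<size_t> ArenaUsed{0};

void *AllocTemporary(size_t n)
{
    n = (n + 15) & ~(size_t)15;             // keep 16-byte alignment
    size_t old = ArenaUsed.load(std::memory_order_relaxed);
    size_t next;
    do {
        next = old + n;
        if(next > sizeof(Arena)) return nullptr;  // arena exhausted
    } while(!ArenaUsed.compare_exchange_weak(old, next,
                                             std::memory_order_relaxed));
    void *v = Arena + old;
    memset(v, 0, n);                        // match HEAP_ZERO_MEMORY semantics
    return v;
}

void FreeAllTemporary()
{
    // Only valid while no other thread is allocating.
    ArenaUsed.store(0, std::memory_order_relaxed);
}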

@whitequark
Contributor

whitequark commented May 22, 2020

You don't need a critical section; you can update the pointer of a bump allocator with an atomic CAS loop.

I gave this approach a try. It yields a very nice compact implementation using standard C++11 atomics, but only if the backing storage for the allocator is reserved (but not committed) ahead of time. On 64-bit platforms this isn't really a problem, since there is essentially no cost to reserving a few GB (or a few hundred) of virtual memory, most of which will never be used. But on 32-bit platforms you have to be careful, or you'll step on the toes of the libc allocator.

It doesn't seem very nice to have a more limited allocator on Linux than we have on Windows, where it's just a normal heap that doesn't interfere with other heaps. (macOS is always 64-bit these days.)

If you have to manage multiple mmapped chunks of virtual memory then the complexity of this approach quickly spirals out of control and you're better off with a critical section. Maybe we should go with that for now.
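
To illustrate the "reserve but don't commit" part on POSIX (a sketch with illustrative sizes, assuming a 64-bit platform): address space reserved with PROT_NONE costs essentially nothing, and chunks are committed with mprotect as the allocator grows into them.

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t reserve = (size_t)4 << 30;       // reserve 4 GB of address space
    // Reserved, not committed: no physical memory or swap is used yet.
    void *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if(base == MAP_FAILED) { perror("mmap"); return 1; }
    // Commit the first 16 MB; repeat as the bump pointer approaches the end.
    if(mprotect(base, 16 << 20, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect"); return 1;
    }
    munmap(base, reserve);
    return 0;
}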

@whitequark
Contributor

I've implemented the approach with a critical section in #620.

@ruevs
Member

ruevs commented May 22, 2020

Nice! I like the calloc instead of malloc + memset ;-).
I tested the calloc version on Windows with #616 also merged in.

Since I was curious whether it would work I also tried replacing the C++11 mutex with #pragma omp critical in both places. And it also works.

Results from my test as above:

Debug build

calloc with std::lock_guard<std::mutex> guard(TempArenaMutex);
Generate::ALL took 15545 ms
Generate::ALL took 15540 ms
Generate::ALL took 15551 ms

calloc with #pragma omp critical
Generate::ALL took 15799 ms
Generate::ALL took 15855 ms
Generate::ALL took 15912 ms

HeapAlloc
Generate::ALL took 15793 ms
Generate::ALL took 15803 ms
Generate::ALL took 15765 ms

Release build

calloc with std::lock_guard<std::mutex> guard(TempArenaMutex);
Generate::ALL took 3978 ms
Generate::ALL took 3969 ms
Generate::ALL took 3965 ms

calloc with #pragma omp critical
Generate::ALL took 3979 ms
Generate::ALL took 3985 ms
Generate::ALL took 3985 ms

HeapAlloc
Generate::ALL took 3975 ms
Generate::ALL took 3972 ms
Generate::ALL took 3973 ms

So in debug mode the C++ mutex is a tiny bit faster than the other two? Interesting...
There is practically no difference in the release build.

Bottom line - in my opinion it is OK to merge #620 and then #616 .

@whitequark
Contributor

@ruevs Cool! Thanks for confirming this. I'm actually quite surprised that there seems to be essentially no difference between HeapFree that frees the entire chunk at once, and the list-walking with piecewise free. If that holds for a wide range of workloads then it's good that we didn't bother with jemalloc or tcmalloc since for now the benefit is minimal but the amount of integration work is not.

@whitequark
Contributor

@phkahler Could you please rebase this on master? Then merge once the tests pass.

@ruevs
Member

ruevs commented May 22, 2020

Off-topic

On embedded systems (it is very obvious that) there is no such thing as "undefined behavior".

That is of course not true. Open the manual for your CPU and check if it has some variation on "if some conditions are fulfilled, the operation is UNPREDICTABLE". If we're talking about data races, then you can get fundamentally nondeterministic behavior if two of the cores of your multiprocessor system run at clocks that do not have a defined phase relationship. You could also get nondeterministic behavior if you're reading uninitialized SRAM after power-up.

True, all that. I did not mean undefined behaviour in general, but from "stupid" software. As long as the startup assembler initializes .bss and .data, and you don't read from random addresses, and don't intentionally/by mistake hit the "undefined in the data sheet" stuff, the rest is pretty deterministic (and still hard to debug).
As for data races on shared RAM from different cores (I've had it both on Infineon TriCore and Renesas multi-core MCUs), I have not seen truly "random" behaviour, only typical "simple" race conditions - just as with multiple tasks/interrupts preempting each other on the same core.

According to the standard at a high level this is true. And it is absolutely unacceptable.

This is a consequence of having a non-checked contract in your language. In case of C and C++, this is the combination of the lack of memory safety with the desire to have optimizing compilers for multiple architectures.

By "absolutely unacceptable." I meant "it is absolutely unacceptable to (knowingly) leave undefined behaviour in the software (SolveSpace)". I was not at all complaining about the standards allowing undefined behaviour - as you said it is very useful.

I agree that this is unacceptable for most code, which is why we should write most code in memory-safe languages. But even Rust has an unsafe subset, which not only includes undefined behavior, but it is also exploited by rustc even more aggressively than a C++ compiler would.

Rust... ehhh... I'll find some time to go beyond "Hello world" at some point. I have it on two of my machines sitting collecting dust :-(

This is well into off-topic territory, but as a language designer working on embedded systems I felt compelled to comment on the intersection of my two current areas of work :)

Do you work on embedded Rust? Both "Discovery" and "The Embedded Rust Book" are very well written. I've also followed https://github.com/phil-opp/blog_os from almost day one, unfortunately just reading.

On-topic

I'll have to look. If they are fast on Windows (we can not get any faster than the current HeapAlloc on Win32 but at least close)

Why can't we? HeapAlloc does more accounting than is necessary; a per-thread bump allocator would be optimal, but HeapAlloc is nowhere close.

You are right.

[...] The main annoyance is that jemalloc is a PITA to build as it uses autoconf
[...] and tcmalloc is a PITA to build as it uses bazel, which is even worse.

Ufff (arcane/baroque/custom) build systems put me off. You can say I am a "stupid user" of any/all build system(s)...

and, because it would be so specific and simple maybe as fast as any third party allocator library

If we implement our own per-thread bump allocator it will be as fast as or faster than any library we can use, but that's a bit annoying because we'd have to use pthread_key_create and some logic specific to the main thread for cleanup. If we do a global bump allocator there'll be contention on the head and the elements, but that's probably okay.

OK, the former will be complicated. The latter (somewhat) slower. The third party ones - PITA to integrate,...
I'll leave this up to you :-) You obviously have much more experience with heaps than I do. I've been forced by "law"/rules to use only static objects and stack for 80% of my programming experience :-)

@whitequark
Contributor

whitequark commented May 22, 2020

Do you work on embedded Rust?

A little bit; I've ported rustc to OR1K, wrote a memory-safe TCP/IP stack for it, and ported a C firmware to pure Rust. But I don't really use it much these days.

Ufff (arcane/baroque/custom) build systems put me off.

Arguably pretty much every build system is arcane, baroque, and/or custom. :)

I'll leave this up to you

Based on your results I think we actually don't need to urgently do anything here. (And perhaps it would be also nice to avoid such drastic changes before 3.0.) So let's not do anything beyond #620 until such a time when a need clearly arises.
