Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proof of Concept: detect stack overflows (x86_64-linux-gnu target only) #6376

Closed

Conversation

ysbaddaden
Copy link
Contributor

I noticed that memory accesses that segfaults are near the bottom of stack, that is within the stack guard allocated by the kernel —I initially expected the guard to be after the stack, but it's actually within it. Checking if addr in the segfault handler (the faulting memory address) is within the stack of the current fiber stack is enough to detect a stack overflow.

Since the main fiber doesn't have a @stack, I use getrlimit(2) to get the default stack size and determine the main stack limit. I only added bindings for x86_64-linux-gnu, but they should be the same for all Linux libc.

I don't know how other POSIX systems (Darwin, OpenBSD, FreeBSD) handle their stack guard.

NOTE: there could a Crystal::System.within_current_stack?(addr) or stack_overflow? method.

refs #271

@wooster0
Copy link
Contributor

wooster0 commented Jul 12, 2018

Or why not checking if the signal (sig) is 11? When I do this:

def method
  method
end
method

then I always get signal 11.

@jhass
Copy link
Member

jhass commented Jul 12, 2018

@r00ster91 because you also get it if you do Pointer(String).new(0).value https://carc.in/#/r/4i4r

@straight-shoota
Copy link
Member

@r00ster91 SIGSEGV (signal 11) can have a number of reasons, one of them can be a stack overflow. More details at https://en.wikipedia.org/wiki/Segmentation_fault

stack = Pointer(Void).new(Fiber.current.@stack_bottom.address - stack_size)
end

if stack <= addr < Fiber.current.@stack_top
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we've come from a segfault, wouldn't @stack_top be out of date? What about detecting if the fault is anywhere inside @stack_bottom - stack_size <= addr <= @stack_bottom?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's a typo: it should be stack_bottom instead of stack_top. It still works thought, since stack_top was a valid stack pointer, but stack_bottom would be better, yeah.

Yet we must check with stack which is the far limit of the stack:

                       <- stack grows down
+-------x--+----------+---------------------------------+
^       ^  ^          ^                                 ^
|       |  |          |                                 |
+ stack |  + guard    + top                             + bottom
        |
        + invalid memory access: stack overflow

@RX14
Copy link
Contributor

RX14 commented Aug 3, 2018

@ysbaddaden ping, this seems like a couple characters fix then ready to merge. And doing the extra lib_c definitions.

@RX14
Copy link
Contributor

RX14 commented Sep 11, 2018

@ysbaddaden are you interested in continuing this PR? It'd solve a long-standing problem in crystal and the code looks good

@ysbaddaden
Copy link
Contributor Author

If someone is willing to add the missing sys/resource bindings, please do :-)

@damaxwell
Copy link
Contributor

damaxwell commented Oct 4, 2018

@RX14 , @ysbaddaden: On macOS it turns out that the SEGV always happens at an address lower than the top of the stack, but near the top of the stack. The actual amount depends on the size of the stack frames being allocated in recursion. So the check being done here never detects stack overflow.

Even on x86_64-linux-gnu, the SEGV isn't always within the body of the stack. E.g., in some calls of a recursive function that allocates 4kb on the stack, sometimes I see something like

top 0x7fff4a664000 addr 0x7fff4a665c98 bottom 0x7fff4ae64000

where addr is the location of the SEGV. This triggers the check. But other times it's something like

top 0x7ffc2cfd6000 addr 0x7ffc2cfd57e8 bottom 0x7ffc2d7d6000

which does not.

I think the thing to do is

  1. Expand the size of the check region to include a few reasonable sized stack frames (4K or so?) beyond the top of the stack.

  2. If the check passes, report an error message like:

Invalid memory access (signal 11) at address 0x7fff4a665c98
Potential stack overflow (e.g., infinite or very deep recursion)

I.e. keep the original message (after all, although unlikely, it might have just been a vanilla bad pointer dereference). But add on a clause that it looked like a stack overrun to help the diagnosis. The aim isn't to be perfect, just helpful much of the time.

@damaxwell
Copy link
Contributor

About those bindings, I'd offer to help but I'm a little intimidated. Presumably the linux versions will all be the same, and the darwin and BSD versions should be the same. But that's a lot of hardware to test and verify on (esp. arm-linux-gnueabihf). What have you guys done in the past?

Oh, and Windows. Windows doesn't have the POSIX getrlimit but has GetCurrentThreadStackLimits which might work instead.

@RX14
Copy link
Contributor

RX14 commented Oct 4, 2018

I'd say the last page in the stack and then the page after the stack.

@ysbaddaden
Copy link
Contributor Author

We use mprotect to protect the last 4KB. This can be bypassed, but it's weird that macOS is always below the stack, it should usually trigger a SIGSEGV within the last 4KB as on Linux.

@RX14
Copy link
Contributor

RX14 commented Oct 4, 2018

@damaxwell mentioned functions which allocate 4k on the stack, which could bypass a whole page.

@damaxwell
Copy link
Contributor

I took a look and I don't think there is a call to mprotect for the main fiber; that happens in allocate_stack, which isn't called. My initials tests only involved overrunning the stack on the main fiber, and even the trivial recursive program on macOS seg faults at an address below the top of the stack.

I just checked, and for spawned fibers on macOS, the seg faults are indeed within the stack for small recursive functions, just like Linux.

So the difference seems to be that the two OS's protect the end of the main stack differently, one within and one outside.

I think the easiest thing to do is just expand the check zone a page or two regardless. This will address the macOS main thread, and the extra buffer will capture some other cases where stack allocation leads to a SIGSEGV beyond the end of the stack.

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Oct 5, 2018

Oh. We can't call mprotect on the main stack obviously. Good catch!

Since the program segfaulted, and if that happened within the stack or the next 4KB below it, but we can safely say "Potential stack overflow (e.g., infinite or very deep recursion)", yes.

@damaxwell
Copy link
Contributor

I've got the resource.h bindings tested on all of the POSIX platforms except for aarch64-linux-musl. Alpine looks like a reasonable distro, but I haven't managed to find the right incantation for QEMU to work with it. Hints on what you've used in the past?

@j8r
Copy link
Contributor

j8r commented Oct 6, 2018

@damaxwell you can try https://github.com/multiarch/alpine

@straight-shoota
Copy link
Member

With #6928 merged, this can be closed.

@ysbaddaden ysbaddaden closed this Oct 18, 2018
@ysbaddaden ysbaddaden deleted the poc-detect-stack-overflow branch October 23, 2018 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants