Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: fork and signal child handlers #6426

Merged

Conversation

ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Jul 22, 2018

Fixes an issue where a parent process could receive signals directed to the child or vice-versa.

fixes #6416

@RX14
Copy link
Contributor

RX14 commented Jul 22, 2018

Thanks for doing the debugging! It really shows how dangerous fork is with these subtle ordering and error handling issues.

@forksaber
Copy link

I am not sure if this should be handled as part of this PR, but I think there still exists a subtle race condition in the signal handler implementation. Consider the following order of events:

Parent  : fork()
Child   : spawns
Child   : receives a signal
Child   : writes the signal to parent's pipe
Child   : Signal.after_fork is called
Parent  : processes the signal meant for child (**BUG**)

I haven't actually tried writing down a fix but something like this should work:

#Pseudo code

# block all the signals for the current thread
pthread_sigmask(block all signals) 
pid =LibC.fork
if pid > 0
  #Parent
  pthread_sigmask(unblock all signals)
else
 #child
  if run_hooks       # (Process.fork)
    Signal.after_fork
    pthread_sigmask(unblock all signals)
    run_remaining_hooks
  else                    #(Process.exec)
    # reset all signal handlers  to SIG_DFL or  call Signal.after_fork  
    (signals.each { |signal| signal.reset } ) OR (Signal.after_fork) 
    pthread_sigmask(unblock all signals)
    execvp
  end
end
   

References:
https://www.redhat.com/archives/libvir-list/2008-August/msg00303.html
https://github.com/libvirt/libvirt/blob/e83da1990cfb75ae9a83559b6b702bb9fc4bc61e/src/util/vircommand.c#L297

@ysbaddaden
Copy link
Contributor Author

You're right again!

I added a commit to disable signals then reenable them after fork (parent and child). Now we should be fine 🤞

I chose to continue to not reset signal handlers in the child process. Either developers don't care about signals and we'll keep the default handlers (i.e. SIGSEGV, SIGCHLD) or she cares about them and she's responsible for resetting them if needed.

@forksaber
Copy link

Thank you @ysbaddaden for working on this PR. Everything is fine except that the race condition still exists in fork+exec scenario as signals are not blocked before forking when run_hooks is false. So the following changes should make this patch perfect:

module Crystal::Signal
  def self.clear_signal_handlers
    @@handlers.keys.each { |sig| sig.reset }
  end
end
-if run_hooks
-  LibC.sigfillset(pointerof(newmask))
-  ret = LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(newmask), pointerof(oldmask))
-  raise Errno.new("pthread_sigmask") unless ret == 0
-end
+LibC.sigfillset(pointerof(newmask))
+ret = LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(newmask), pointerof(oldmask))
+raise Errno.new("pthread_sigmask") unless ret == 0
 
 pid = LibC.fork
 errno = Errno.value
 
 case pid
 when 0
   # child:
   pid = nil
   if run_hooks
     Process.after_fork_child_callbacks.each(&.call)
-    LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(oldmask), nil)
+  else
+    # exec scenario
+    Crystal::Signal.clear_signal_handlers
   end
+  LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(oldmask), nil)
 when -1
   # error:
-  LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(oldmask), nil) if run_hooks
+  LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(oldmask), nil)
   raise Errno.new("fork", errno)
 else
   # parent:
-  LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(oldmask), nil) if run_hooks
+  LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(oldmask), nil)
 end

Someone experienced a similar in go:
golang/go@df0892c

PS: Ideally, the method clear_signal_handlers should be implemented using an array of atomic flags for registered handlers as when multiple threads are present the @@handlers hash could be in an inconsistent state and might cause segfaults. But that can be tackled in a later PR.

# Psuedo code for thread-safe  clear_signal_handlers
module Crystal::Signal
  @@handled = StaticArray(AtomicFlag, 32).new(AtomicFlag.new(0_u8))
  
  def self.trap(signal, handler) : Nil
    @@handled[signal].set(1_u8)
    previous_def
  end
       
  def self.clear_signal_handlers
    @@handled.each_with_index do |flag, signal|
      next if flag.get == 0_u8
      signal.reset
     end
   end
end

@bararchy
Copy link
Contributor

@RX14 or @jhass
Two approves here, LGTM?

@Sija
Copy link
Contributor

Sija commented Jul 26, 2018

@bararchy Hold your horses, let's wait for conclusions regarding newest @forksaber's comment.

@bararchy
Copy link
Contributor

@Sija sure, I was under the impression this has already been reviewed.

@Sija
Copy link
Contributor

Sija commented Jul 26, 2018

Well, it was. Before that comment ;)

@ysbaddaden
Copy link
Contributor Author

Yes, @forksaber is right again! I'm fixing this.

src/process.cr Outdated
if run_hooks
Process.after_fork_child_callbacks.each(&.call)
LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(oldmask), nil)
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In chose to not reenable signals in a fork/exec situation. I just leave the handlers and mask in place, and let execve(2) override that for the new process.

Copy link

@forksaber forksaber Jul 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you don't reenable the signals, they will remain blocked across the exec. So, it is necessary to reenable the signals and reset them.

The new process shall inherit at least the following attributes from 
the calling process image:

Nice value (see nice())
semadj values (see semop())
Process ID
Parent process ID
Process group ID
Session membership
Real user ID
Real group ID
Supplementary group IDs
Time left until an alarm clock signal (see alarm())
Current working directory
Root directory
File mode creation mask (see umask())
File size limit (see ulimit())
Process signal mask (see sigprocmask())
Pending signal (see sigpending())
tms_utime, tms_stime, tms_cutime, and tms_cstime (see times())
Resource limits
Controlling terminal
Interval timers

Reference:
http://pubs.opengroup.org/onlinepubs/009695399/functions/exec.html

@ysbaddaden
Copy link
Contributor Author

Sigh... Okay.

@ysbaddaden ysbaddaden force-pushed the fix-fork-and-signal-child-handlers branch from 87e6430 to 5d101bf Compare July 27, 2018 20:30
@forksaber
Copy link

forksaber commented Jul 28, 2018

Sorry for being such an annoying person/reviewer :(
I can't help but point out that the race condition(fork/exec scenario) has shifted from

# run_hooks: false
pid = LibC.fork
case pid
when 0
  # child
  # RACE here
  exec
end

to

# run_hooks: false
block_signals
pid = LibC.fork
case pid
when 0
  # child
  unblock_signals
  # RACE here : since parent's pipe exists, any signal 
  # delivered here will end up in parent process
  exec
end

A crucial step is missing to fix this race condition is to reset the signal handlers to SIG_DFL before unblocking the signals:

case pid
when 0
  # child
+  Crystal::Signal.reset_signal_handlers!
  unblock_signals
  # RACE here : since parent's pipe exists, any signal 
  # delivered here will end up in parent process
  exec
end

An attempt to fix this:

module Crystal::Signal
  def self.mutex
    @@mutex
  end

  # safe to invoke only in post-fork/pre-exec state
  def self.reset_signal_handlers!
    @@handlers.keys.each { |sig| sig.reset }
  end
end
 protected def self.fork_internal(run_hooks : Bool = true)
   newmask = uninitialized LibC::SigsetT
   oldmask = uninitialized LibC::SigsetT
 
   LibC.sigfillset(pointerof(newmask))
   ret = LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(newmask), pointerof(oldmask))
   raise Errno.new("Failed to disable signals") unless ret == 0
 
+  # lock signal mutex before forking to ensure @@handlers is in
+  # a consistent state
+  mutex = Crystal::Signal.mutex
+  mutex.lock if ! run_hooks
+
   case pid = LibC.fork
   when 0
     # child:
     pid = nil
     if run_hooks
       Process.after_fork_child_callbacks.each(&.call)
       LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(oldmask), nil)
     else
+      # no need to unlock mutex as we won't be registering
+      # any custom signal handlers before exec.
+      # Also unlocking mutex not safe here as scheduler is not reinitialized
+      Crystal::Signal.reset_signal_handlers!
       # sigmask is inherited: must reset it
       LibC.sigemptyset(pointerof(newmask))
       LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(newmask), nil)
     end
   when -1
+    # unlock mutex immediately after fork in parent
+    mutex.unlock if ! run_hooks
     # error:
     errno = Errno.value
     LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(oldmask), nil)
     raise Errno.new("fork", errno)
   else
+    # unlock mutex immediately after fork in parent
+    mutex.unlock if ! run_hooks
     # parent:
     LibC.pthread_sigmask(LibC::SIG_SETMASK, pointerof(oldmask), nil)
   end
 
   pid
 end

@RX14
Copy link
Contributor

RX14 commented Jul 28, 2018

@forksaber honestly we really appreciate your comments! It's much nicer to sort this out now instead of debugging an intermittent race condition in production years later.

@ysbaddaden
Copy link
Contributor Author

Yes, you're right again! Sigh me, too focused on the child state than on the child -> parent leak.

@ysbaddaden ysbaddaden force-pushed the fix-fork-and-signal-child-handlers branch from 5d101bf to a146666 Compare July 29, 2018 09:08
@ysbaddaden
Copy link
Contributor Author

Now I hope it's fine 🤞

@forksaber
Copy link

The patch LGTM 👍 when there exists only a single thread (which is currently the case).

However, consider this multi-threaded case

Thread-1 : locks Crystal::Signal mutex

Thread-1 : is in the middle of modifying @@handlers hash....execution state could be 
           inside any method of the Hash implementation ( hash's internal state might 
           be inconsistent depending on the implementation)

Thread-2 : fork

Child : now accesses said inconsistent @@handlers hash possibly 
        causing segfault or undefined behaviour

Thread-1 : finishes modifying @@handlers

So, mutex.lock is needed before fork to ensure that no other thread modifies the @@handlers hash while the current thread forks.

I can understand that you may not want to include the mutex.lock since Crystal is currently single threaded.
An alternative is to add some TODO comment related to this, so we can revisit this once thread-support lands.

@ysbaddaden ysbaddaden force-pushed the fix-fork-and-signal-child-handlers branch 2 times, most recently from 81d93d5 to 9aa18c2 Compare July 30, 2018 10:16
@ysbaddaden
Copy link
Contributor Author

Let's reset all signals and be done with it now.

@forksaber
Copy link

@ysbaddaden Agreed

@forksaber
Copy link

Assumption: An average crystal program defines 3 custom signal handlers

To make this decision easier, I am adding a benchmark that compares the following approaches:

  • hash: @@handlers.keys is used to find signals and reset them
  • reset all: reset all signal handlers irrespective of whether they have non default values
  • array of atomic flags: used alongside @@handlers hash to register and query handler state change
require "benchmark"
class AtomicFlag
  @flag = Atomic(UInt8).new(0_u8)

  def get : Bool
    @flag.get == 1_u8
  end

  def set(val : Bool)
    if val
      @flag.set(1_u8)
    else
      @flag.set(0_u8)
    end
  end
end

arr = StaticArray(AtomicFlag, 32).new { AtomicFlag.new }
handlers = Hash(Int32, Proc(Signal, Nil)).new

n = 3
(1..n).each do |i|
  arr[i].set true
  handlers[i] = ->(sig : Signal) { nil }
end

Benchmark.ips do |x|
  x.report("reset changed - hash  ") do
    handlers.each_key do |i|
      LibC.signal(i, LibC::SIG_DFL)
    end
  end

  x.report("reset changed - atomic") do
    (1..31).each do |i|
      flag = arr[i-1].get
      LibC.signal(i, LibC::SIG_DFL) if flag
    end
  end

  x.report("reset all             ") do
    (1..31).each { |i| LibC.signal(i, LibC::SIG_DFL) }
  end
end

Output

reset changed - hash     1.47M (678.12ns) (± 0.42%)  0 B/op        fastest
reset changed - atomic   1.42M (705.08ns) (± 0.29%)  0 B/op   1.04× slower
reset all              144.33k (  6.93µs) (± 0.42%)  0 B/op  10.22× slower

I am fine with either solution....using atomics to reset select signals or resetting all the signals.

@ysbaddaden ysbaddaden force-pushed the fix-fork-and-signal-child-handlers branch from 1905a26 to 17c7fee Compare August 1, 2018 09:30
@ysbaddaden
Copy link
Contributor Author

Thanks for the benchmark!

I maintain a sigset_t, because signals are tricky: they usually range 1-32, but there ain't no guarantee about that. In addition the POSIX functions are fast enough. With this patch it's only 1.07× slower than the unsafe @@handlers.each_key, so Process.run won't see a significant performance loss.

The set is only mutated within a mutex so mutation is thread safe. It's also modified before the actual trap and after the actual reset. I assume it's fine to use without the mutex in fork/exec. It's just a fixed set of longs, and either a thread is adding a signal to the set and we don't care (the signal isn't trapped, yet) or it's being removed and we'll just have a useless reset.

All cases look fine to me. I hope you agree @forksaber 🤞

@ysbaddaden ysbaddaden force-pushed the fix-fork-and-signal-child-handlers branch from 17c7fee to 7d17917 Compare August 1, 2018 09:56
@bcardiff bcardiff added this to the 0.26.0 milestone Aug 4, 2018
@bcardiff
Copy link
Member

bcardiff commented Aug 4, 2018

@ysbaddaden unless the last message from @forksaber push you to do something else in this PR we are good to go and include this PR as is.

@forksaber you probably know this already, but you input was of great great value. Thanks! 🙇

@ysbaddaden
Copy link
Contributor Author

It condos my understanding: Go is using an atomic to avoid a mutex that would block.

The race is quite edge case: it requires that a thread would modify a signal while another is exactly calling fork.

I naively assume that a program does setup a few signals at startup and won't change them much afterwards.

Maybe we can just close the pipe in fork/exec and add a "write byte unless closed" in the signal handler. That would avoid passing child signal to parent and exec will reset any signals we missed to SIG_DFL anyway.

@forksaber
Copy link

forksaber commented Aug 5, 2018

Nice idea @ysbaddaden
Building on your idea....something like this should work great.

module Crystal::Signal
  @@after_fork_before_exec = false

  def self.after_fork_before_exec
    @@after_fork_before_exec = true
    ::Signal.each do |signal|
      LibC.signal(signal, LibC::SIG_DFL) if signal.set?
    end
  end

  def self.trap(signal, handler) : Nil
    @@mutex.synchronize do
      unless @@handlers[signal]?
        signal.set_add
####
        LibC.signal(signal.value, ->(value : Int32) {
          if @@after_fork_before_exec
             # signal handler was not reset due to the above race...lets reset it
             LibC.signal(value, LibC::SIG_DFL)
             # re raise the signal so that its handled with SIG_DFL semantics
             LibC.raise(value)
          else
             writer.write_bytes(value)
          end
        })
####
      end
      @@handlers[signal] = handler
    end
  end

end

We can even double down on this idea and forget about maintaining a sigset. Then, every signal handled in after_fork_before_exec state will be reset dynamically. 😁

Thanks @bcardiff, I appreciate it 😃

Copy link
Member

@sdogruyol sdogruyol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some really great improvements here 🌮 Thank you @ysbaddaden @forksaber 👍

@bcardiff
Copy link
Member

bcardiff commented Aug 6, 2018

@ysbaddaden given that we should release soon, maybe either we merge as is this pr and defer future work, or we release 0.26 without this. Your call.

@bcardiff bcardiff removed this from the 0.26.0 milestone Aug 8, 2018
@ysbaddaden ysbaddaden force-pushed the fix-fork-and-signal-child-handlers branch from 7d17917 to fa84884 Compare August 8, 2018 20:10
@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Aug 8, 2018

I pushed a commit to close the pipe anyway, and to check if it's closed when writing the signal. I'm fixing I fixed the conflict.

@ysbaddaden ysbaddaden force-pushed the fix-fork-and-signal-child-handlers branch from fa84884 to 8184d6e Compare August 8, 2018 20:15
@ysbaddaden
Copy link
Contributor Author

I think it's okay for now. An atomic sigset would allow to avoid a mutex, but that can wait for multithreaded fibers. We can probably avoid more race conditions, but that's really an edge case, and we figure something better someday if it proves to really be an issue —I doubt an app would trap/reset signals in loops.

@ghastmask
Copy link

Hi it sounds like you guys already got everything figured out but I wanted to point you to the facebook folly Subprocess c++ implementation which is pretty well documented in case you hadn't seen it:

https://github.com/facebook/folly/blob/master/folly/Subprocess.cpp

@RX14 RX14 added this to the 0.26.1 milestone Aug 21, 2018
@RX14 RX14 requested a review from bcardiff August 21, 2018 23:31
@bcardiff
Copy link
Member

I would rather include this in 0.27. It is a bit sensible and I would prefer to minimize the risk of needing a 0.26.2 for this.

@bcardiff bcardiff modified the milestones: 0.26.1, 0.27.0 Aug 22, 2018
@bcardiff
Copy link
Member

bcardiff commented Sep 6, 2018

@ysbaddaden would you rebase on master so we merge this without squashing?

Fixes read operation after fork, forcing the signal handler fiber to
retry on the new pipe.

Also avoids the child from receiving a parent signal from the pipe.

refs crystal-lang#6416
This will avoid a future issue where a thread could process signal
in parallel to the signal resets.
Always block signal reception before forking, even in fork/exec
cases, so the child will never receive a signal and wrongly forward
it to the parent process.

Always restores the sigmask after fork. During fork/exec the signal
pipe isn't replaced but signal handlers are resetted to `SIG_DFL` so
they won't be received anymore (and won't be forwarded to the
parent) before we restore the sigmask (inherited after exec).
Improves fork/exec executation by limiting the number of signals to
reset to SIG_DFL to the signals that have actually been trapped. We
now reset only 1 signal by default, instead of 32.
@ysbaddaden ysbaddaden force-pushed the fix-fork-and-signal-child-handlers branch from 8184d6e to e5b401f Compare September 6, 2018 08:11
@ysbaddaden
Copy link
Contributor Author

I just did. There was no conflicts, thought.

@bcardiff bcardiff merged commit a350db8 into crystal-lang:master Sep 6, 2018
@ysbaddaden ysbaddaden deleted the fix-fork-and-signal-child-handlers branch September 7, 2018 09:31
oprypin added a commit to oprypin/crystal that referenced this pull request Apr 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Process.run() not returning in forked process
9 participants