Skip to content

Commit

Permalink
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[WIP] subprocesses: unify termination logic
Browse files Browse the repository at this point in the history
jordens committed Feb 1, 2016
1 parent cf465da commit 34403b4
Showing 2 changed files with 55 additions and 38 deletions.
54 changes: 32 additions & 22 deletions artiq/devices/ctlmgr.py
Original file line number Diff line number Diff line change
@@ -106,30 +106,40 @@ async def launcher(self):
await self._terminate()

async def _terminate(self):
logger.info("Terminating controller %s", self.name)
if self.process is not None and self.process.returncode is None:
if self.process is None or self.process.returncode is not None:
logger.info("Controller %s already terminated", self.name)
return
logger.debug("Terminating controller %s", self.name)
try:
await asyncio.wait_for(self.call("terminate"), self.term_timeout)
await asyncio.wait_for(self.process.wait(), self.term_timeout)
logger.info("Controller %s terminated", self.name)
return
except:
logger.warning("Controller %s did not exit on request, "
"ending the process", self.name)
if os.name != "nt":
try:
await asyncio.wait_for(self.call("terminate"),
self.term_timeout)
except:
logger.warning("Controller %s did not respond to terminate "
"command, killing", self.name)
try:
self.process.kill()
except ProcessLookupError:
pass
self.process.terminate()
except ProcessLookupError:
pass
try:
await asyncio.wait_for(self.process.wait(),
self.term_timeout)
except:
logger.warning("Controller %s failed to exit, killing",
self.name)
try:
self.process.kill()
except ProcessLookupError:
pass
await self.process.wait()
logger.debug("Controller %s terminated", self.name)
await asyncio.wait_for(self.process.wait(), self.term_timeout)
logger.info("Controller process %s terminated", self.name)
return
except asyncio.TimeoutError:
logger.warning("Controller process %s did not terminate, "
"killing", self.name)
try:
self.process.kill()
except ProcessLookupError:
pass
try:
await asyncio.wait_for(self.process.wait(), self.term_timeout)
logger.info("Controller process %s killed", self.name)
return
except asyncio.TimeoutError:
logger.warning("Controller process %s failed to die", self.name)


def get_ip_addresses(host):
39 changes: 23 additions & 16 deletions artiq/master/worker.py
Original file line number Diff line number Diff line change
@@ -115,30 +115,37 @@ async def close(self, term_timeout=1.0):
" (RID %s)", self.ipc.process.returncode,
self.rid)
return
obj = {"action": "terminate"}
try:
await self._send(obj, cancellable=False)
await self._send({"action": "terminate"}, cancellable=False)
await asyncio.wait_for(self.ipc.process.wait(), term_timeout)
logger.debug("worker exited on request (RID %s)", self.rid)
return
except:
logger.debug("failed to send terminate command to worker"
" (RID %s), killing", self.rid, exc_info=True)
logger.debug("worker failed to exit on request"
" (RID %s), ending the process", self.rid,
exc_info=True)
if os.name != "nt":
try:
self.ipc.process.kill()
self.ipc.process.terminate()
except ProcessLookupError:
pass
await self.ipc.process.wait()
return
try:
await asyncio.wait_for(self.ipc.process.wait(), term_timeout)
logger.debug("worker terminated (RID %s)", self.rid)
return
except asyncio.TimeoutError:
logger.warning("worker did not terminate (RID %s), killing",
self.rid)
try:
self.ipc.process.kill()
except ProcessLookupError:
pass
try:
await asyncio.wait_for(self.ipc.process.wait(), term_timeout)
logger.debug("worker killed (RID %s)", self.rid)
return
except asyncio.TimeoutError:
logger.debug("worker did not exit by itself (RID %s), killing",
self.rid)
try:
self.ipc.process.kill()
except ProcessLookupError:
pass
await self.ipc.process.wait()
else:
logger.debug("worker exited by itself (RID %s)", self.rid)
logger.warning("worker refuses to die (RID %s)", self.rid)
finally:
self.io_lock.release()

13 comments on commit 34403b4

@sbourdeauducq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there drivers that actually do something useful with SIGTERM on Linux?

@jordens
Copy link
Member Author

@jordens jordens commented on 34403b4 Feb 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if there is not, that would not be a reason not to use it here. If I was writing a controller, I would always want to handle SIGTERM as it is the only thing that I can resonably expect to see before being killed, e.g. if the system is shutting down.

On windows we could (and maybe should) send_signal(CTRL_BREAK_EVENT).

@sbourdeauducq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you typically want to handle it? Install a signal handler that raises an exception (similar to KeyboardInterrupt) so that e.g. finally clauses are properly executed?

This adds boilerplate to every controller, and it's still an approximate solution as asynchronous exceptions in Python have races (http://dept.cs.williams.edu/~freund/papers/python.pdf).

@sbourdeauducq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that clean shutdown is already possible with the current mechanism. Call terminate on the controller manager using artiq_rpctool in your system's shutdown scripts.

@jordens
Copy link
Member Author

@jordens jordens commented on 34403b4 Feb 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who will write those scripts? SIGTERM is already there and is something that you see without any further work and without having to invent any further buggy scripts.
Also remember that terminate() relies on the entire networking stack working and the interface being up.

@sbourdeauducq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARTIQ servers are always bound on localhost now in addition to any specified network address, so I do not think the network setup should be a problem.

SIGTERM is generally difficult to handle reliably (without race conditions) as it is asynchronous. Python messed it up. In addition Unix signals have subtle issues wrt the restarting of system calls. Do you really want to use SIGTERM as the recommended graceful process termination mechanism?

@sbourdeauducq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I will concede that the ability to interrupt a blocking system call at all is an advantage for signals.

@jordens
Copy link
Member Author

@jordens jordens commented on 34403b4 Feb 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is --no-localhost-bind... Why was that added anyway? I seem to have needed it because of some ITAC networking problems.

I would not use SIGTERM as the primary way (among other things it does not exist on Windows) but as a (non-ideal) fallback for the cases where I don't get a terminate().

@sbourdeauducq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially, it makes it easier to use clients when you are logged on the same machine as the server, as they work without any network options then.

@jordens
Copy link
Member Author

@jordens jordens commented on 34403b4 Feb 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting address already in use from the server without that option. Even though there was nothing using that port.

@sbourdeauducq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh. Seems my usual pessimism about computers that gave me the idea to add that --no-localhost-bind option was totally justified. How did that happen though? It tries to bind on both 127.0.0.1 and ::1 by default, maybe there are systems on which binding a on v4 address also binds on the v6?

@jordens
Copy link
Member Author

@jordens jordens commented on 34403b4 Feb 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into it if I see it again. My guess would be IPv6 related in some way.

@sbourdeauducq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the problem, we could bind on v6 only.

Please sign in to comment.