Skip to content
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

Commit 34403b4

Browse files
committedFeb 1, 2016
[WIP] subprocesses: unify termination logic
1 parent cf465da commit 34403b4

File tree

2 files changed

+55
-38
lines changed

2 files changed

+55
-38
lines changed
 

‎artiq/devices/ctlmgr.py

+32-22
Original file line numberDiff line numberDiff line change
@@ -106,30 +106,40 @@ async def launcher(self):
106106
await self._terminate()
107107

108108
async def _terminate(self):
109-
logger.info("Terminating controller %s", self.name)
110-
if self.process is not None and self.process.returncode is None:
109+
if self.process is None or self.process.returncode is not None:
110+
logger.info("Controller %s already terminated", self.name)
111+
return
112+
logger.debug("Terminating controller %s", self.name)
113+
try:
114+
await asyncio.wait_for(self.call("terminate"), self.term_timeout)
115+
await asyncio.wait_for(self.process.wait(), self.term_timeout)
116+
logger.info("Controller %s terminated", self.name)
117+
return
118+
except:
119+
logger.warning("Controller %s did not exit on request, "
120+
"ending the process", self.name)
121+
if os.name != "nt":
111122
try:
112-
await asyncio.wait_for(self.call("terminate"),
113-
self.term_timeout)
114-
except:
115-
logger.warning("Controller %s did not respond to terminate "
116-
"command, killing", self.name)
117-
try:
118-
self.process.kill()
119-
except ProcessLookupError:
120-
pass
123+
self.process.terminate()
124+
except ProcessLookupError:
125+
pass
121126
try:
122-
await asyncio.wait_for(self.process.wait(),
123-
self.term_timeout)
124-
except:
125-
logger.warning("Controller %s failed to exit, killing",
126-
self.name)
127-
try:
128-
self.process.kill()
129-
except ProcessLookupError:
130-
pass
131-
await self.process.wait()
132-
logger.debug("Controller %s terminated", self.name)
127+
await asyncio.wait_for(self.process.wait(), self.term_timeout)
128+
logger.info("Controller process %s terminated", self.name)
129+
return
130+
except asyncio.TimeoutError:
131+
logger.warning("Controller process %s did not terminate, "
132+
"killing", self.name)
133+
try:
134+
self.process.kill()
135+
except ProcessLookupError:
136+
pass
137+
try:
138+
await asyncio.wait_for(self.process.wait(), self.term_timeout)
139+
logger.info("Controller process %s killed", self.name)
140+
return
141+
except asyncio.TimeoutError:
142+
logger.warning("Controller process %s failed to die", self.name)
133143

134144

135145
def get_ip_addresses(host):

‎artiq/master/worker.py

+23-16
Original file line numberDiff line numberDiff line change
@@ -115,30 +115,37 @@ async def close(self, term_timeout=1.0):
115115
" (RID %s)", self.ipc.process.returncode,
116116
self.rid)
117117
return
118-
obj = {"action": "terminate"}
119118
try:
120-
await self._send(obj, cancellable=False)
119+
await self._send({"action": "terminate"}, cancellable=False)
120+
await asyncio.wait_for(self.ipc.process.wait(), term_timeout)
121+
logger.debug("worker exited on request (RID %s)", self.rid)
122+
return
121123
except:
122-
logger.debug("failed to send terminate command to worker"
123-
" (RID %s), killing", self.rid, exc_info=True)
124+
logger.debug("worker failed to exit on request"
125+
" (RID %s), ending the process", self.rid,
126+
exc_info=True)
127+
if os.name != "nt":
124128
try:
125-
self.ipc.process.kill()
129+
self.ipc.process.terminate()
126130
except ProcessLookupError:
127131
pass
128-
await self.ipc.process.wait()
129-
return
132+
try:
133+
await asyncio.wait_for(self.ipc.process.wait(), term_timeout)
134+
logger.debug("worker terminated (RID %s)", self.rid)
135+
return
136+
except asyncio.TimeoutError:
137+
logger.warning("worker did not terminate (RID %s), killing",
138+
self.rid)
139+
try:
140+
self.ipc.process.kill()
141+
except ProcessLookupError:
142+
pass
130143
try:
131144
await asyncio.wait_for(self.ipc.process.wait(), term_timeout)
145+
logger.debug("worker killed (RID %s)", self.rid)
146+
return
132147
except asyncio.TimeoutError:
133-
logger.debug("worker did not exit by itself (RID %s), killing",
134-
self.rid)
135-
try:
136-
self.ipc.process.kill()
137-
except ProcessLookupError:
138-
pass
139-
await self.ipc.process.wait()
140-
else:
141-
logger.debug("worker exited by itself (RID %s)", self.rid)
148+
logger.warning("worker refuses to die (RID %s)", self.rid)
142149
finally:
143150
self.io_lock.release()
144151

13 commit comments

Comments
 (13)

sbourdeauducq commented on Feb 1, 2016

@sbourdeauducq
Member

Are there drivers that actually do something useful with SIGTERM on Linux?

jordens commented on Feb 2, 2016

@jordens
MemberAuthor

Even if there is not, that would not be a reason not to use it here. If I was writing a controller, I would always want to handle SIGTERM as it is the only thing that I can resonably expect to see before being killed, e.g. if the system is shutting down.

On windows we could (and maybe should) send_signal(CTRL_BREAK_EVENT).

sbourdeauducq commented on Feb 2, 2016

@sbourdeauducq
Member

How do you typically want to handle it? Install a signal handler that raises an exception (similar to KeyboardInterrupt) so that e.g. finally clauses are properly executed?

This adds boilerplate to every controller, and it's still an approximate solution as asynchronous exceptions in Python have races (http://dept.cs.williams.edu/~freund/papers/python.pdf).

sbourdeauducq commented on Feb 2, 2016

@sbourdeauducq
Member

Note that clean shutdown is already possible with the current mechanism. Call terminate on the controller manager using artiq_rpctool in your system's shutdown scripts.

jordens commented on Feb 2, 2016

@jordens
MemberAuthor

Who will write those scripts? SIGTERM is already there and is something that you see without any further work and without having to invent any further buggy scripts.
Also remember that terminate() relies on the entire networking stack working and the interface being up.

sbourdeauducq commented on Feb 2, 2016

@sbourdeauducq
Member

ARTIQ servers are always bound on localhost now in addition to any specified network address, so I do not think the network setup should be a problem.

SIGTERM is generally difficult to handle reliably (without race conditions) as it is asynchronous. Python messed it up. In addition Unix signals have subtle issues wrt the restarting of system calls. Do you really want to use SIGTERM as the recommended graceful process termination mechanism?

sbourdeauducq commented on Feb 2, 2016

@sbourdeauducq
Member

But I will concede that the ability to interrupt a blocking system call at all is an advantage for signals.

jordens commented on Feb 2, 2016

@jordens
MemberAuthor

There is --no-localhost-bind... Why was that added anyway? I seem to have needed it because of some ITAC networking problems.

I would not use SIGTERM as the primary way (among other things it does not exist on Windows) but as a (non-ideal) fallback for the cases where I don't get a terminate().

sbourdeauducq commented on Feb 2, 2016

@sbourdeauducq
Member

Essentially, it makes it easier to use clients when you are logged on the same machine as the server, as they work without any network options then.

jordens commented on Feb 2, 2016

@jordens
MemberAuthor

I was getting address already in use from the server without that option. Even though there was nothing using that port.

sbourdeauducq commented on Feb 2, 2016

@sbourdeauducq
Member

Meh. Seems my usual pessimism about computers that gave me the idea to add that --no-localhost-bind option was totally justified. How did that happen though? It tries to bind on both 127.0.0.1 and ::1 by default, maybe there are systems on which binding a on v4 address also binds on the v6?

jordens commented on Feb 2, 2016

@jordens
MemberAuthor

I'll look into it if I see it again. My guess would be IPv6 related in some way.

sbourdeauducq commented on Feb 2, 2016

@sbourdeauducq
Member

If that is the problem, we could bind on v6 only.

Please sign in to comment.