I've often had difficulties stopping complex shell scripts that run several children in parallel, which may in turn start their own children, and so on. The problem is that when something in the "middle" crashes, some children end up reparented to init and stay there forever. Killing the parent does not stop these children; in fact, the parent may exit and lose track of them.
One idea is to fix the scripts so they track their children. But what if the scripts are used, for example, in unit testing, and some parts are expected to break in unknown ways? Take a testing pipeline for tint2: a config creates executors, which start various scripts running long-running processes, and these scripts or programs must be assumed to be broken.
A possible idea is to use process groups (wikipedia) and send the kill signal to the process group, which essentially broadcasts it to all processes in the group. The problem is that if one of the descendants starts its own process group, signals sent to the original group no longer reach it.
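For reference, a minimal C sketch of the process-group approach (the script name is a placeholder); note the limitation in the last comment:

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t child = fork();
    if (child == 0) {
        setpgid(0, 0);  /* make the child the leader of a new process group */
        execlp("some_script", "some_script", (char *)NULL);  /* placeholder */
        _exit(127);
    }
    setpgid(child, child);  /* set it from the parent too, to avoid a race */
    sleep(5);
    /* Negative PID: the signal is delivered to every process in the group,
       but NOT to a descendant that has since called setsid()/setpgid()
       and left the group. */
    kill(-child, SIGTERM);
    waitpid(child, NULL, 0);
    return 0;
}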
What comes to mind are containers, as provided by unshare, systemd-nspawn etc. They isolate processes in a new "PID namespace", effectively running a new process that is viewed as init by its descendants; when this process dies, all its descendants are killed by the Linux kernel.
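At its core this boils down to a clone() with CLONE_NEWPID; here is a minimal sketch, assuming it runs with CAP_SYS_ADMIN:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char stack[1024 * 1024];

static int child_fn(void *arg) {
    (void)arg;
    printf("inside the namespace: pid=%d\n", getpid());  /* prints 1 */
    execlp("sh", "sh", "-c", "sleep 300 & wait", (char *)NULL);
    return 127;
}

int main(void) {
    /* CLONE_NEWPID requires CAP_SYS_ADMIN */
    pid_t pid = clone(child_fn, stack + sizeof(stack),
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid < 0) { perror("clone"); return 1; }
    waitpid(pid, NULL, 0);  /* once PID 1 of the namespace dies, the
                               kernel kills everything left inside it */
    return 0;
}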
The problems with containers are:
1. One needs to be root to use them; there is no safe way to expose this functionality to a user's shell script, for example.
2. They support a gazillion options for changing mount points, networking etc., which (i) are not needed for this problem and (ii) make it impossible to give a user access to this functionality without compromising the security of the system (for example, setuid or passwordless sudo on unshare is a very bad idea).
I thought about trimming down `unshare` to remove everything but PID namespaces, but I wasn't happy with its internal logic (it always exits as soon as its direct child exits, which is not always what I want, and it does not pass the child's exit code back to the parent).
So I rewrote from scratch something similar to `unshare -fp --mount-proc` and I called it runpidns: https://gitlab.com/o9000/runpidns
During make install, it gives the executable CAP_SYS_ADMIN privileges. These privileges are dropped right after creating the PID namespace and mounting /proc for the children.
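With libcap, the drop amounts to something like this (a sketch of the idea, not necessarily the exact code in runpidns; link with -lcap):

#include <sys/capability.h>  /* libcap */

static int drop_all_caps(void) {
    cap_t empty = cap_init();      /* a capability set with all flags cleared */
    if (empty == NULL)
        return -1;
    int rc = cap_set_proc(empty);  /* clear permitted/effective/inheritable */
    cap_free(empty);
    return rc;
}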
Example usage:
(Shell 1)
runpidns bash -c "(sleep 300&)"
(blocks)
(Shell 2)
pstree -a $(pidof runpidns)
runpidns bash -c (sleep 300&)
  └─ns-init bash -c (sleep 300&)
      └─sleep 300
runpidns starts ns-init, which watches over all its descendant processes. When all of them have exited, ns-init exits with the exit code of the direct child (here, bash). If runpidns exits for any reason, ns-init exits and kills all its children. Likewise, if ns-init is killed in any way, all its children are killed by the kernel. It is impossible for children to escape the PID namespace and get reparented to the real init.
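In outline, ns-init is an init-style reaping loop; the following is a sketch of the described behavior, not the actual source (child is the PID of the direct child):

#include <sys/types.h>
#include <sys/wait.h>

static int ns_init_loop(pid_t child) {
    int exit_code = 0;
    for (;;) {
        int status;
        pid_t pid = wait(&status);  /* reaps direct children and any orphan
                                       reparented to us inside the namespace */
        if (pid < 0)
            break;                  /* ECHILD: no processes left */
        if (pid == child && WIFEXITED(status))
            exit_code = WEXITSTATUS(status);  /* remember direct child's code */
    }
    return exit_code;  /* when we (PID 1) exit, the kernel kills any survivors */
}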
I am wondering whether this approach is secure. runpidns holds CAP_SYS_ADMIN from startup through a malloc call and a clone call, after which the capability is dropped. ns-init holds CAP_SYS_ADMIN while doing mount("proc", "/proc", "proc", 0, NULL); and then drops it. To me this seems safe, but I'm not sure. Should I worry about environment variables (e.g. LD_PRELOAD)? Should I worry about /proc permissions etc.?
Also, do you have any other comments?
A possible use case for me: start a script running Xvfb, openbox, xsettingsd, compton and tint2 plus other graphical applications in a PID namespace; let them run for a while, perhaps interacting with them in various ways (e.g. restarting tint2, killing the compositor, starting new programs); eventually check whether tint2 is alive, check memory usage etc., then kill everything. This is a pain to clean up after when something goes wrong. It could be solved with proper containers, but that seems overkill.
This may sound crazy, but as a suggestion, you could try wrapping your shell scripts in Python via the subprocess module. This would allow you to use a structure like this:
import subprocess

try:
    # shell=False with an argument list avoids going through a shell
    result = subprocess.check_output(["your_bash_script", arg1, arg2],  # ...more args as needed
                                     shell=False)
except subprocess.CalledProcessError as e:
    pass  # handle the error, e.g. inspect e.returncode and e.output
Also, if you are just interested in the return code, you can use subprocess.call(). This approach gives you exception handling and a way to handle any children. I would use Python as the logic controller in this situation: the Python environment becomes the container, so long as you handle any exceptions that get raised. So instead of starting a bash script that starts other scripts, start a Python script that manages the suite of scripts you require. This way you get exception handling and garbage collection as well.
Your other options for getting around the root requirement are PolicyKit, or setting up and sending messages across D-Bus when something dies. This would basically be a signal-handling mechanism that monitors your processes and cleans up your mess if something dies unexpectedly. I am just now learning how to set that sort of thing up, however, so I would probably not be much help with this method. D-Bus is supposed to be our IPC mechanism, though.
Actually, I am using Python; the problem is that it keeps track only of direct children, not of their descendants.
I think python-psutil might provide you with what you need. Coincidentally, it gives me a bunch of stuff to work with while I play with executors in tint2. I haven't read the full documentation yet, but it does seem to offer quite a bit in terms of process management. I keep thinking of a data structure like a deque or stack that you can push and pop PIDs off of as they spawn. It would likely need to be a thread that watches the processes spawning off of the main thread.
firejail is definitely interesting, but I didn't want to use a full-blown container and/or sandbox solution.
I've used psutil before; it cannot track forked processes after they've detached from their parent. Think about something like this: bash -c '(sleep 1000 &)'. After you run that command, sleep is detached from the current tree and attached as a child of init. The direct child (bash) exits, and there's no way to find out that sleep "belongs" to you (except maybe by tracing the fork and clone syscalls of all children, but that has its own problems).