Tiny implementation of the GNU/Linux CGroupFS (sans resource controllers) as a PUFFS or FUSE filesystem for BSD platforms
InitWare/CGrpFS
CGrpFS is a tiny implementation of the GNU/Linux CGroup filesystem for BSD platforms. It takes the form of either a PUFFS or FUSE filesystem, and implements robust tracking of processes. Resource control, however, is not present; the different BSD platforms each provide different mechanisms for this, none of which are trivially adapted to CGroups semantics. The process tracking alone is sufficient for the main user of CGrpFS, InitWare, a service manager derived from systemd.
CGrpFS is available under the Modified BSD Licence. It is not yet very well tested, but seems to work fine for InitWare's purposes.
CGrpFS was implemented quickly and is not necessarily the most efficient in design.
Process tracking is implemented with the process filter for Kernel Queues. This notifies CGrpFS whenever a process to which it has attached a filter forks or exits. On all BSD platforms except macOS, the filter is automatically applied to all the transitive subprocesses spawned by a process after the filter is attached. A filter is attached as soon as a PID is added to a CGroup, so the Linux semantics are matched.
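As a rough illustration of the process filter described above, the following sketch attaches an `EVFILT_PROC` filter with `NOTE_EXIT`, `NOTE_FORK` and `NOTE_TRACK` to a PID. The function name `track_pid` and the `HAVE_KQUEUE_SKETCH` macro are hypothetical and not taken from the CGrpFS sources; macOS is deliberately left out of the kqueue branch, and on platforms without Kernel Queues the function is only a stub.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#if defined(__FreeBSD__) || defined(__NetBSD__) || \
    defined(__OpenBSD__) || defined(__DragonFly__)
#define HAVE_KQUEUE_SKETCH 1 /* hypothetical macro, for this sketch only */
#include <sys/event.h>
#include <sys/time.h>
#endif

/*
 * Attach a process filter for `pid` to the kqueue `kq`.
 * NOTE_TRACK asks the kernel to carry the filter across fork(), so
 * descendants are watched without further calls.
 * Returns 0 on success, -1 with errno set on failure.
 */
int track_pid(int kq, pid_t pid)
{
    assert(pid > 0);
#ifdef HAVE_KQUEUE_SKETCH
    struct kevent kev;

    EV_SET(&kev, pid, EVFILT_PROC, EV_ADD,
           NOTE_EXIT | NOTE_FORK | NOTE_TRACK, 0, 0);
    return kevent(kq, &kev, 1, NULL, 0, NULL) == -1 ? -1 : 0;
#else
    /* No Kernel Queues here: the sketch compiles to a stub. */
    (void)kq;
    errno = ENOSYS;
    return -1;
#endif
}
```

A caller would obtain `kq` from `kqueue()` and call `track_pid` at the point where a PID is written into a CGroup, then read the resulting fork and exit events back with `kevent()`.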
For simplicity, all the files and directories of the CGroup filesystem are backed by node structures, which are akin to a combination of an `inode` and `dirent` structure. These nodes are hierarchically ordered and each stores a name, `stat` structure, a type (CGroup directory, `cgroup.procs` file, ...) and type-specific data. A CGroup directory node, for example, stores a linked list of all PIDs within it. It might be better to take an approach that maintains less data, but bear in mind that at least permissions data must be stored for nodes, as the GNU/Linux CGroup filesystem allows changing permissions, e.g. to facilitate delegation.
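To make the node layout above more concrete, here is a minimal sketch of such a structure. All type, field and function names (`struct node`, `node_new`, `node_child`, `cgroup_add_pid`) are hypothetical and chosen for illustration; the real CGrpFS definitions may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

enum node_type { NODE_CGROUP_DIR, NODE_PROCS_FILE };

struct pid_entry {
    pid_t pid;
    struct pid_entry *next;
};

struct node {
    char *name;             /* dirent-like part */
    struct stat attr;       /* inode-like part, including permissions */
    enum node_type type;
    struct node *parent;
    struct node *children;  /* first child */
    struct node *sibling;   /* next child of the same parent */
    struct pid_entry *pids; /* type-specific data: NODE_CGROUP_DIR only */
};

/* Allocate a node and link it under `parent` (NULL for the root). */
struct node *node_new(struct node *parent, const char *name,
                      enum node_type type, mode_t mode)
{
    size_t len = strlen(name) + 1;
    struct node *n = calloc(1, sizeof *n);

    if (n == NULL)
        return NULL;
    n->name = malloc(len);
    if (n->name == NULL) {
        free(n);
        return NULL;
    }
    memcpy(n->name, name, len);
    n->type = type;
    n->attr.st_mode = mode;
    n->parent = parent;
    if (parent != NULL) {
        n->sibling = parent->children;
        parent->children = n;
    }
    return n;
}

/* Find a direct child of `dir` by name, or NULL if there is none. */
struct node *node_child(const struct node *dir, const char *name)
{
    for (struct node *c = dir->children; c != NULL; c = c->sibling)
        if (strcmp(c->name, name) == 0)
            return c;
    return NULL;
}

/* Prepend `pid` to the PID list of a CGroup directory node. */
int cgroup_add_pid(struct node *dir, pid_t pid)
{
    struct pid_entry *e;

    assert(dir->type == NODE_CGROUP_DIR);
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->pid = pid;
    e->next = dir->pids;
    dir->pids = e;
    return 0;
}
```

Keeping a full `struct stat` per node is what makes permission changes trivial to store, at the cost of the extra memory discussed above.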
To try to ensure consistency of file contents over the course of multiple reads, each `open` operation in the FUSE version of CGrpFS allocates a buffer into which the contents of the associated file are generated in full, and this buffer is used for each read with that FUSE file handle. This may not work properly in every case because the SunOS VFS (as imitated by BSD) enforces a distinction between the file and vnode levels absent from GNU/Linux. The likely result of this distinction is that read operations may not be mapped to the right file handle. The only viable fix (which would also work for PUFFS) would be the generation of a fresh vnode for every open.
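The buffer-per-open approach can be sketched independently of libfuse as follows. The names `open_snapshot`, `snapshot_open`, `snapshot_read` and `snapshot_release` are hypothetical, and the one-PID-per-line rendering stands in for whatever the real file generator produces. In a libfuse handler the snapshot pointer would typically be stored in the `fh` field of `struct fuse_file_info` at open time and retrieved on each read.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Contents of one file, frozen at open time. */
struct open_snapshot {
    char *data;
    size_t len;
};

/* Render a PID list in full, one PID per line, at open time. */
struct open_snapshot *snapshot_open(const pid_t *pids, size_t npids)
{
    /* 24 bytes per entry is enough for a 64-bit decimal plus '\n'. */
    size_t cap = npids * 24 + 1;
    struct open_snapshot *s = calloc(1, sizeof *s);

    if (s == NULL)
        return NULL;
    s->data = malloc(cap);
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    s->data[0] = '\0';
    for (size_t i = 0; i < npids; i++)
        s->len += (size_t)snprintf(s->data + s->len, cap - s->len,
                                   "%ld\n", (long)pids[i]);
    return s;
}

/* Serve a read at `off` from the frozen buffer; returns bytes copied. */
size_t snapshot_read(const struct open_snapshot *s, char *out,
                     size_t size, off_t off)
{
    assert(off >= 0);
    if ((size_t)off >= s->len)
        return 0;
    if (size > s->len - (size_t)off)
        size = s->len - (size_t)off;
    memcpy(out, s->data + off, size);
    return size;
}

/* Called when the file handle is released. */
void snapshot_release(struct open_snapshot *s)
{
    free(s->data);
    free(s);
}
```

Because every read is answered from the same frozen buffer, a reader that fetches the file in several chunks sees one consistent view even if PIDs enter or leave the CGroup in the meantime.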
A mini-ProcFS is also provided with only a minimal `cgroup` file present in each PID's directory. The nodes for directories (and the contained `cgroup` file) within that hierarchy are generated dynamically in response to getattr() events to eliminate the need to preallocate the entire lot, and these might feasibly be pruned if unused for some time. Their purpose is to allow InitWare to determine the containing CGroup of a PID. If a PID is inquired about which does not currently belong to any CGroup, it is automatically added to the root CGroup, in line with the behaviour on Linux.
Because only NetBSD's PUFFS (and its FUSE emulation, PERFUSE) supports poll() (but not the installation of Kernel Queues filters), while FUSE for other BSDs doesn't, and because the `release_agent` mechanism is fundamentally fragile, CGrpFS listens on a sequenced-packet socket in the Unix domain at `/var/run/cgrpfs.notify`. On a process exiting, a `siginfo_t` structure is prepared and sent as a message to every peer connected to that socket. InitWare uses this to help track process lifecycle.
Some effort is made to be resilient to out-of-memory conditions. This is untested and may not work. Whether libfuse is similarly resilient is another question. There is also the problem that under OOM conditions, it is no longer possible to update the structures in CGrpFS which describe which processes belong to what CGroup. This might be mitigated in part by keeping some spare memory around to use under OOM conditions, and hoping that the number of tracked processes doesn't grow beyond its capacity while the OOM state persists. Finally, the process filter itself can fail in-kernel under OOM conditions, and return NOTE_TRACKERR. There is no easy way out of this without modifying the kernel itself.
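The "spare memory" idea mentioned above could be sketched as a small static reserve that is only touched when `malloc` fails. This is a hypothetical illustration, not something CGrpFS is known to implement; the names `reserve_alloc`, `alloc_or_reserve` and `alloc_release` and the 4096-byte size are arbitrary assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RESERVE_BYTES 4096 /* arbitrary size for this sketch */

static _Alignas(max_align_t) unsigned char reserve[RESERVE_BYTES];
static size_t reserve_used;

/* Bump-allocate from the static reserve; NULL once it is exhausted. */
void *reserve_alloc(size_t size)
{
    size_t align = _Alignof(max_align_t);
    size_t rounded;
    void *p;

    assert(size > 0);
    rounded = (size + align - 1) / align * align;
    if (rounded < size || rounded > RESERVE_BYTES - reserve_used)
        return NULL;
    p = reserve + reserve_used;
    reserve_used += rounded;
    return p;
}

/* Prefer the heap; fall back to the reserve only when malloc fails. */
void *alloc_or_reserve(size_t size)
{
    void *p = malloc(size);

    return p != NULL ? p : reserve_alloc(size);
}

/* Release memory from either source; reserve memory is never free()d. */
void alloc_release(void *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)reserve;

    if (a >= lo && a < lo + RESERVE_BYTES)
        return; /* bump arena: space is not reclaimed in this sketch */
    free(p);
}
```

As the text notes, this only helps while the number of tracked processes stays within the reserve's capacity; a bump arena like this one also never reclaims space, so a real implementation would need some way to reset or recycle it once the OOM state has passed.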
There are several ways in which CGrpFS could be improved.
The mini-ProcFS is immutable by users and stateless, only providing information maintained by the actual CGroups tree; it could therefore be implemented without backing nodes to save some memory use.
Much more data than necessary is stored in each node (a full struct `stat`); this can be reduced. And proper nodes for each pseudo-file in a CGroup directory could be abolished too.
Much unnecessary copying goes on due to CGrpFS using the high-level libfuse interface. Lowering to the fuse_lowlevel interface (or even directly to the `/dev/fuse` device) could help reduce that, and hence reduce the risk of OOM conditions causing a crash. Needless lookups also occur with the high-level interface because it's based on path strings; the architecture of CGrpFS more readily fits the lower-level inode-based interface. Path lookup would also become simpler since there would be one lookup request for each component of the path; currently it has ugly special-cases for e.g. `mkdir`.
OOM resilience could be improved in line with the notes in the Architecture section above.
Release agent support should be implemented for compatibility, though it's not a reliable mechanism.
FreeBSD provides hierarchical resource control via the `rctl` system. It's not clear whether this usefully maps to CGroups semantics, but it certainly is worth exploring whether it could be used to provide some CGroup resource controllers.
CGrpFS could be implemented as an in-kernel filesystem within the various BSD kernels. It could then be hooked up more directly with the kernel's process management, and benefit from the kernel's capacity to deal with OOM conditions more aggressively.
Furthering an in-kernel implementation of CGrpFS, hierarchical resource control mechanisms could be implemented in those BSDs without them.
Contributing poll() and kevent() support to each BSD's FUSE/PUFFS implementation would allow the CGroups 2.0 `cgroup.events` file to be implemented.