From e2191095c3c60b10908d36f8ef081f63e72500a9 Mon Sep 17 00:00:00 2001 From: Rob Landley Date: Sun, 24 Feb 2019 11:36:00 -0600 Subject: A document I wrote ages ago about how mount works under the covers. --- www/doc/mount.txt | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 163 insertions(+) create mode 100644 www/doc/mount.txt (limited to 'www/doc') diff --git a/www/doc/mount.txt b/www/doc/mount.txt new file mode 100644 index 00000000..f538c467 --- /dev/null +++ b/www/doc/mount.txt @@ -0,0 +1,163 @@ +Here's how mount actually works: + +The mount comand calls the mount system call, which has five arguments you +can see on the "man 2 mount" page: + + int mount(const char *source, const char *target, const char *filesystemtype, + unsigned long mountflags, const void *data); + +The command "mount -t ext2 /dev/sda1 /path/to/mntpoint -o ro,noatime", +parses its command line arguments to feed them into those five system call +arguments. In this example, the source is "/dev/sda1", the target is +"/path/to/mountpoint", and the filesystemtype is "ext2". + +The other two syscall arguments (mountflags and data) come from the +"-o option,option,option" argument. The mountflags argument goes to the VFS +(explained below), and the data argument is passed to the filesystem driver. + +The mount command's options string is a list of comma separated values. If +there's more than one -o argument on the mount command line, they get glued +together (in order) with a comma. The mount command also checks the file +/etc/fstab for default options, and the options you specify on the command +line get appended to those defaults (if any). Most other command line mount +flags are just synonyms for adding option flags (for example +"mount -o remount -w" is equivalent to "mount -o remount,rw"). Behind the +scenes they all get appended to the -o string and fed to a common parser. + +VFS stands for "Virtual File System" and is the common infrastructure shared +by different filesystems. It handles common things like making the filesystem +read only. The mount command assembles an option string to supply to the "data" +argument of the option syscall, but first it parses it for VFS options +(ro,noexec,nodev,nosuid,noatime...) each of which corresponds to a flag +from #include . The mount command removes those options from the +sting and sets the corresponding bit in mountflags, then the remaining options +(if any) form the data argument for the filesystem driver. + +A few quick implementation details: the mountflag MS_SILENCE gets set by +default even if there's nothing in /etc/fstab. Some actions (such as --bind +and --move mounts, I.E. -o bind and -o move) are just VFS actions and don't +require any specific filesystem at all. The "-o remount" flag requires looking +up the filesystem in /proc/mounts and reassembling the full option string +because you don't _just_ pass in the changed flags but have to reassemble +the complete new filesystem state to give the system call. Some of the options +in /etc/fstab are for the mount command (such as "user" which only does +anything if the mount command has the suid bit set) and don't get passed +through to the system call. + +When mounting a new filesystem, the "filesystem" argument to the mount system +call specifies which filesystem driver to use. All the loaded drivers are +listed in /proc/filesystems, but calling mount can also trigger a module load +request to add another. A filesystem driver is responsible for putting files +and subdirectories under the mount point: any time you open, close, read, +write, truncate, list the contents of a directory, move, or delete a file, +you're talking to a filesystem driver to do it. (Or when you call +ioctl(), stat(), statvfs(), utime()...) + +Different drivers implement different filesystems, which have four categories: + +1) Block device backed filesystems, such as ext2 and vfat. + +This kind of filesystem driver acts as a lens to look at a block device +through. The source argument for block backed filesystems is a path to a +block device, such as "/dev/hda1", which stores the contents of the +filesystem in a fixed block of sequential storage, and there's a seperate +driver providing that block device. + +Block backed filesystems are the "conventional" filesystem type most people +think of when they mount things. The name means that the "backing store" +(where the data lives when the system is switched off) is on a block device. + +2) Server backed filesystems, such as cifs/samba or fuse. + +These drivers convert filesystem operations into a sequential stream of +bytes, which it can send through a pipe to talk to a program. The filesystem +server could be a local Filesystem in Userspace daemon (connected to a local +process through a pipe filehandle), behind a network socket (CIFS and v9fs), +behind a char device (/dev/ttyS0), and so on. The common attribute is there's +some program on the other end sending and receiving a sequential bytestream. +The backing store is a server somewhere, and the filesystem driver is talking +to a process that reads and writes data in some known protocol. + +The source argument for these filesystems indicates where the filesystem lives. It's often in a URL-like format for network filesystems, but it's really just a blob of data that the filesystem driver understands. + +A lot of server backed filesystems want to open their own connection so they +don't have to pass their data through a persistent local userspace process, +not really for performance reasons but because in low memory situations a +chicken-and-egg situation can develop where all the process's pages have +been swapped out but the filesystem needs to write data to its backing +store in order to free up memory so it can swap the process's pages back in. +If this mechanism is providing the root filesystem, this can deadlock and +freeze the system solid. So while you _can_ pass some of them a filehandle, +more often than not you don't. + +These are also known as "pipe backed" filesystems (or "network filesystems" +because that's a common case, although a network doesn't need to be inolved). +Conceptually they're char device backed filesystems (analogus to the block +device backed ones), but you don't commonly specify a character device in +/dev when mounting them because you're talking to a specific server process, +not a whole machine. + +3) Ram backed filesystems, such as ramfs and tmpfs. + +These are very simple filesystems that don't implement a backing store. Data +written to these gets stored in the disk cache, and the driver ignores requests +to flush it to backing store (reporting all the pages as pinned and +unfreeable). + +These drivers essentially mount the VFS's page/dentry cache as if it was a +filesystem. (Page cache stores file contents, dentry cache stores directory +entries.) They grow and shrink dynamically, as needed: when you write files +into them they allocate more memory to store it, and when you delete files +the memory is freed. + +There's a simple one (ramfs) that does only that, and a more complex one (tmpfs) +which adds a size limitation (by default 50%, but it's adjustable as a mount +option) so the system doesn't run out of memory and lock up if you +"cat /dev/zero > file", and can also report how much space is remaining +when asked (ramfs always says 0 bytes free). The other thing tmpfs does +is write its data out to swap space (like processes do) when the system +is under memory proessure. + +Note that "ramdisk" is not the same as "ramfs". The ramdisk driver uses a +chunk of memory to implement a block device, and then you can format that +block device and mount it with a block device backed filesystem driver. +(This is the same "two device drivers" approach you always have with block +backed filesystems: one driver provides /dev/ram0 and the second driver mounts +it as vfat.) Ram disks are significantly less efficient than ramfs, +allocating a fixed amount of memory up front for the block device instead of +dynamically resizing itself as files are written into an deleted from the +page and dentry caches the way ramfs does. + +Note: initramfs cpio, tmpfs as rootfs. + +4) Synthetic filesystems, such as proc, sysfs, devpts... + +These filesystems don't have any backing store either, because they don't +store arbitrary data the way the first three types of filesystems do. + +Instead they present artificial contents, which can represent processes or +hardware or anything the driver writer wants them to show. Listing or reading +from these files calls a driver function that produces whatever output it's +programmed to, and writing to these files submits data to the driver which +can do anything it wants with it. + +Synthetic ilesystems are often implemented to provide monitoring and control +knobs for parts of the operating system. It's an alternative to adding more +system calls (or ioctl, sysctl, etc), and provides a more human friendly user +interface which programs can use but which users can also interact with +directly from the command line via "cat" and redirecting the output of +"echo" into special files. + + +Those are the four types of filesystems: backing store can be a fixed length +block of storage, backing store can be some server the driver connects to, +backing store can not exist and the files merely reside in the disk cache, +or the filesystem driver can just make up its contents programmatically. + +And that's how filesystems get mounted, using the mount system call which has +five arguments. The "filesystem" argument specifies the driver implementing +one of those filesystems, and the "source" and "data" arguments get fed to +that driver. The "target" and "mountflags" arguments get parsed (and handled) +by the generic VFS infrastructure. (The filesystem driver can peek at the +VFS data, but generally doesn't need to care. The VFS tells the filesystem +what to do, in response to what userspace said to do.) -- cgit v1.2.3