From e2191095c3c60b10908d36f8ef081f63e72500a9 Mon Sep 17 00:00:00 2001
From: Rob Landley <rob@landley.net>
Date: Sun, 24 Feb 2019 11:36:00 -0600
Subject: A document I wrote ages ago about how mount works under the covers.

---
 www/doc/mount.txt | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 163 insertions(+)
 create mode 100644 www/doc/mount.txt

diff --git a/www/doc/mount.txt b/www/doc/mount.txt
new file mode 100644
index 00000000..f538c467
--- /dev/null
+++ b/www/doc/mount.txt
@@ -0,0 +1,163 @@
+Here's how mount actually works:
+
+The mount command calls the mount system call, which has five arguments you
+can see on the "man 2 mount" page:
+
+  int mount(const char *source, const char *target, const char *filesystemtype,
+            unsigned long mountflags, const void *data);
+
+The command "mount -t ext2 /dev/sda1 /path/to/mntpoint -o ro,noatime"
+parses its command line arguments to feed them into those five system call
+arguments. In this example, the source is "/dev/sda1", the target is
+"/path/to/mntpoint", and the filesystemtype is "ext2".
+
+The other two syscall arguments (mountflags and data) come from the
+"-o option,option,option" argument. The mountflags argument goes to the VFS
+(explained below), and the data argument is passed to the filesystem driver.
+
+The mount command's options string is a list of comma separated values. If
+there's more than one -o argument on the mount command line, they get glued
+together (in order) with a comma. The mount command also checks the file
+/etc/fstab for default options, and the options you specify on the command
+line get appended to those defaults (if any). Most other command line mount
+flags are just synonyms for adding option flags (for example
+"mount -o remount -w" is equivalent to "mount -o remount,rw"). Behind the
+scenes they all get appended to the -o string and fed to a common parser.
+
+VFS stands for "Virtual File System" and is the common infrastructure shared
+by different filesystems. It handles common things like making the filesystem
+read only. The mount command assembles an option string to supply to the "data"
+argument of the mount syscall, but first it parses it for VFS options
+(ro,noexec,nodev,nosuid,noatime...) each of which corresponds to a flag
+from #include <sys/mount.h>. The mount command removes those options from the
+string and sets the corresponding bit in mountflags, then the remaining options
+(if any) form the data argument for the filesystem driver.
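
The option split described above can be sketched roughly as follows. This is a minimal illustration, not the parser of any real mount command: the vfs_opts table, split_mount_options() and mount_with_options() are hypothetical names, the table covers only a handful of VFS options, and the 4096 byte buffer is an arbitrary limit. The flag constants are the ones from <sys/mount.h> on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mount.h>

/* Hypothetical table of VFS options; a real mount command knows many more. */
struct vfs_opt {
    const char *name;
    unsigned long set;    /* bits to turn on in mountflags */
    unsigned long clear;  /* bits to turn off in mountflags */
};

static const struct vfs_opt vfs_opts[] = {
    { "ro",      MS_RDONLY,  0 },
    { "rw",      0,          MS_RDONLY },
    { "nosuid",  MS_NOSUID,  0 },
    { "nodev",   MS_NODEV,   0 },
    { "noexec",  MS_NOEXEC,  0 },
    { "noatime", MS_NOATIME, 0 },
    { "bind",    MS_BIND,    0 },
    { "remount", MS_REMOUNT, 0 },
};

/*
 * Split a comma separated option string. Recognized VFS options become bits
 * in the returned flags; everything else is copied to data (comma separated,
 * in order). data must have room for strlen(opts) + 1 bytes.
 */
unsigned long split_mount_options(const char *opts, char *data)
{
    unsigned long flags = MS_SILENT;  /* set by default */
    size_t n_known = sizeof(vfs_opts) / sizeof(vfs_opts[0]);

    assert(opts != NULL && data != NULL);
    data[0] = '\0';

    while (*opts) {
        size_t len = strcspn(opts, ",");
        size_t i;

        for (i = 0; i < n_known; i++) {
            if (strlen(vfs_opts[i].name) == len &&
                strncmp(opts, vfs_opts[i].name, len) == 0) {
                flags |= vfs_opts[i].set;
                flags &= ~vfs_opts[i].clear;
                break;
            }
        }
        if (i == n_known && len > 0) {  /* not a VFS option: keep for driver */
            if (data[0])
                strcat(data, ",");
            strncat(data, opts, len);
        }
        opts += len;
        if (*opts == ',')
            opts++;
    }
    return flags;
}

/* Feed the split result into the five argument system call. */
int mount_with_options(const char *source, const char *target,
                       const char *type, const char *opts)
{
    char data[4096];  /* arbitrary limit for this sketch */
    unsigned long flags;

    if (strlen(opts) >= sizeof(data)) {
        errno = EINVAL;
        return -1;
    }
    flags = split_mount_options(opts, data);
    return mount(source, target, type, flags, data[0] ? data : NULL);
}
```

With this sketch, mount_with_options("/dev/sda1", "/path/to/mntpoint", "ext2", "ro,noatime") would pass MS_RDONLY and MS_NOATIME in mountflags and a NULL data argument. The call itself needs root privileges to succeed.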
+
+A few quick implementation details: the mountflag MS_SILENT gets set by
+default even if there's nothing in /etc/fstab. Some actions (such as --bind
+and --move mounts, i.e. -o bind and -o move) are just VFS actions and don't
+require any specific filesystem at all. The "-o remount" flag requires looking
+up the filesystem in /proc/mounts and reassembling the full option string
+because you don't _just_ pass in the changed flags but have to reassemble
+the complete new filesystem state to give the system call. Some of the options
+in /etc/fstab are for the mount command (such as "user" which only does
+anything if the mount command has the suid bit set) and don't get passed
+through to the system call.
+
+When mounting a new filesystem, the "filesystemtype" argument to the mount
+system call specifies which filesystem driver to use. All the loaded drivers are
+listed in /proc/filesystems, but calling mount can also trigger a module load
+request to add another. A filesystem driver is responsible for putting files
+and subdirectories under the mount point: any time you open, close, read,
+write, truncate, list the contents of a directory, move, or delete a file,
+you're talking to a filesystem driver to do it. (Or when you call
+ioctl(), stat(), statvfs(), utime()...)
+
+Different drivers implement different filesystems, which fall into four categories:
+
+1) Block device backed filesystems, such as ext2 and vfat.
+
+This kind of filesystem driver acts as a lens to look at a block device
+through. The source argument for block backed filesystems is a path to a
+block device, such as "/dev/hda1", which stores the contents of the
+filesystem in a fixed block of sequential storage, and there's a separate
+driver providing that block device.
+
+Block backed filesystems are the "conventional" filesystem type most people
+think of when they mount things. The name means that the "backing store"
+(where the data lives when the system is switched off) is on a block device.
+
+2) Server backed filesystems, such as cifs/samba or fuse.
+
+These drivers convert filesystem operations into a sequential stream of
+bytes, which they can send through a pipe to talk to a program. The filesystem
+server could be a local Filesystem in Userspace daemon (connected to a local
+process through a pipe filehandle), behind a network socket (CIFS and v9fs),
+behind a char device (/dev/ttyS0), and so on. The common attribute is there's
+some program on the other end sending and receiving a sequential bytestream.
+The backing store is a server somewhere, and the filesystem driver is talking
+to a process that reads and writes data in some known protocol.
+
+The source argument for these filesystems indicates where the filesystem
+lives. It's often in a URL-like format for network filesystems, but it's
+really just a blob of data that the filesystem driver understands.
+
+A lot of server backed filesystems want to open their own connection so they
+don't have to pass their data through a persistent local userspace process,
+not really for performance reasons but because in low memory situations a
+chicken-and-egg situation can develop where all the process's pages have
+been swapped out but the filesystem needs to write data to its backing
+store in order to free up memory so it can swap the process's pages back in.
+If this mechanism is providing the root filesystem, this can deadlock and
+freeze the system solid. So while you _can_ pass some of them a filehandle,
+more often than not you don't.
+
+These are also known as "pipe backed" filesystems (or "network filesystems"
+because that's a common case, although a network doesn't need to be involved).
+Conceptually they're char device backed filesystems (analogous to the block
+device backed ones), but you don't commonly specify a character device in
+/dev when mounting them because you're talking to a specific server process,
+not a whole machine.
+
+3) Ram backed filesystems, such as ramfs and tmpfs.
+
+These are very simple filesystems that don't implement a backing store. Data
+written to these gets stored in the disk cache, and the driver ignores requests
+to flush it to backing store (reporting all the pages as pinned and
+unfreeable).
+
+These drivers essentially mount the VFS's page/dentry cache as if it was a
+filesystem. (Page cache stores file contents, dentry cache stores directory
+entries.) They grow and shrink dynamically, as needed: when you write files
+into them they allocate more memory to store it, and when you delete files
+the memory is freed.
+
+There's a simple one (ramfs) that does only that, and a more complex one
+(tmpfs) which adds a size limitation (by default 50% of physical RAM, but
+it's adjustable as a mount option) so the system doesn't run out of memory
+and lock up if you "cat /dev/zero > file", and can also report how much space
+is remaining when asked (ramfs always says 0 bytes free). The other thing
+tmpfs does is write its data out to swap space (like processes do) when the
+system is under memory pressure.
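
"Asking how much space is remaining" is a statvfs() call from userspace. Here's a small sketch of that query; free_bytes() is a hypothetical helper name, and it reports the space available to unprivileged users (f_bavail) rather than the total free space.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/*
 * Ask the filesystem mounted at (or above) path how much space is left.
 * Returns 0 and stores the byte count in *out on success, -1 on failure.
 */
int free_bytes(const char *path, unsigned long long *out)
{
    struct statvfs sv;

    assert(path != NULL && out != NULL);

    if (statvfs(path, &sv) != 0)
        return -1;
    /* f_bavail counts blocks of f_frsize bytes each. */
    *out = (unsigned long long)sv.f_bavail * sv.f_frsize;
    return 0;
}
```

Pointed at a tmpfs mount this should track the size limit; the limit itself is set with the "size=" mount option, which ends up in the data argument of the mount system call.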
+
+Note that "ramdisk" is not the same as "ramfs". The ramdisk driver uses a
+chunk of memory to implement a block device, and then you can format that
+block device and mount it with a block device backed filesystem driver.
+(This is the same "two device drivers" approach you always have with block
+backed filesystems: one driver provides /dev/ram0 and the second driver mounts
+it as vfat.) Ram disks are significantly less efficient than ramfs,
+allocating a fixed amount of memory up front for the block device instead of
+dynamically resizing itself as files are written into and deleted from the
+page and dentry caches the way ramfs does.
+
+Note: initramfs cpio, tmpfs as rootfs.
+
+4) Synthetic filesystems, such as proc, sysfs, devpts...
+
+These filesystems don't have any backing store either, because they don't
+store arbitrary data the way the first three types of filesystems do.
+
+Instead they present artificial contents, which can represent processes or
+hardware or anything the driver writer wants them to show. Listing or reading
+from these files calls a driver function that produces whatever output it's
+programmed to, and writing to these files submits data to the driver which
+can do anything it wants with it.
+
+Synthetic filesystems are often implemented to provide monitoring and control
+knobs for parts of the operating system. It's an alternative to adding more
+system calls (or ioctl, sysctl, etc), and provides a more human friendly user
+interface which programs can use but which users can also interact with
+directly from the command line via "cat" and redirecting the output of
+"echo" into special files.
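
From a program's point of view, a synthetic file is read with the same calls as any other file. The sketch below is the programmatic equivalent of "cat" on such a file; read_first_line() is a hypothetical helper, and /proc/sys/kernel/ostype in the note afterwards is just one convenient example that assumes /proc is mounted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a file into buf, stripping the trailing newline.
 * Returns 0 on success, -1 if the file can't be opened or is empty.
 */
int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    assert(path != NULL && buf != NULL && len > 1);

    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

Calling read_first_line("/proc/sys/kernel/ostype", buf, sizeof(buf)) asks the proc driver to generate the contents on the spot; nothing is fetched from a backing store.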
+
+
+Those are the four types of filesystems: backing store can be a fixed length
+block of storage, backing store can be some server the driver connects to,
+there can be no backing store at all (the files merely reside in the disk
+cache), or the filesystem driver can just make up its contents programmatically.
+
+And that's how filesystems get mounted, using the mount system call which has
+five arguments. The "filesystemtype" argument specifies the driver implementing
+one of those filesystems, and the "source" and "data" arguments get fed to
+that driver. The "target" and "mountflags" arguments get parsed (and handled)
+by the generic VFS infrastructure. (The filesystem driver can peek at the
+VFS data, but generally doesn't need to care. The VFS tells the filesystem
+what to do, in response to what userspace said to do.)
-- 
cgit v1.2.3