Friday, August 24, 2007

The Solaris/BSD Virtual File System Paradigm

A brief prefatory note: I thought I would have covered the xv6 file system prior to jumping to the virtual file system, but I got bored and decided to shake things up. The next file system we study will be the xv6 file system...I hope.

Introduction

All right, after having studied some file systems, we move on to the next big advancement in the theory of file systems: the virtual file system.

The virtual file system was made in late '85, early '86, in order to let a single mechanism access various disks that are formatted with different file systems. Sun was the one who made it for the SunOS system, which was little more than a glorified, proprietary BSD system...so it wasn't too hard to get it working for BSD. For this reason, I shall call this first approach the Solaris/BSD paradigm.

The idea was to generalize the UNIX file system of inodes to have virtual inodes, or "vnodes". The virtual file system had a linked list of mounted file systems (represented by the struct vfs data structure). The first struct vfs in the linked list was always the native file system.

Despite being written in C, the Solaris/BSD paradigm used an object oriented approach. Roughly put, the classes were structs and the methods were function pointers.

The approach that we will take to investigate the Solaris/BSD paradigm will be loosely based on the influential technical paper explaining it [1]. We'll explore the data structures and their respective "member functions". The data structures that we shall examine are taken from the paper. There are more modern additions to the virtual file system in Solaris (e.g. the introduction of virtual events as a vnode type) and in BSD. We will not discuss them here.

The VFS Data Structure

The vfs data structure can be thought of as a sort of virtual file system equivalent to the mount table entry. Each mounted file system shows up as a directory on the root file system. So if we were running a native ext3 file system (it would be the first in the linked list of struct vfs, mind you), and we mounted, say, a UFS partition, we would have a directory dedicated to it (e.g. /ufs_partition/ or something).

The VFS struct is rather simple to look at:
struct vfs {
        struct vfs    *vfs_next;
        struct vfsops *vfs_op;
        struct vnode  *vfs_vnodecovered;
        int           vfs_flag;
        int           vfs_bsize;
        caddr_t       vfs_data;
};
The first field, struct vfs *vfs_next is a pointer to the next mounted file system. The struct vfs is a linked list after all.

The next field is a pointer to the operations of the struct vfs. This is the object orientedness of the virtual file system implementation.

The struct vnode *vfs_vnodecovered is the mount point for the file system. It is worth noting that this is NULL for the root file system (the head of the linked list).

The next several fields are rather...straightforward (read: dull). The int vfs_flag holds the flags for the struct vfs, and the int vfs_bsize is the file system's native block size.

The caddr_t vfs_data is somewhat interesting...only because I had no clue what the hell a caddr_t type was! It turns out it is a character pointer. This field is supposed to be for private, file system dependent data. The example given in the text was - for the 4.2 BSD file system - vfs_data points to a mount table entry.

It should be noted that - in general - the struct vfs should be thought of as a sort of virtual file system mount table entry.
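
To make the "linked list of mounts" picture concrete, here is a minimal sketch of walking that list. The struct vfs is the one from above (with caddr_t spelled as char *), the struct vnode is cut down to a placeholder, and the helper names count_mounts() and find_mounted_on() are my own inventions, not anything from the paper.

```c
#include <assert.h>
#include <stddef.h>

struct vfsops;                          /* not needed for this sketch */
struct vnode { int placeholder; };      /* stand-in for the real vnode */

struct vfs {
        struct vfs    *vfs_next;
        struct vfsops *vfs_op;
        struct vnode  *vfs_vnodecovered;
        int           vfs_flag;
        int           vfs_bsize;
        char          *vfs_data;        /* caddr_t in the original */
};

/* Count the mounted file systems by following vfs_next from the head. */
static int count_mounts(const struct vfs *rootvfs)
{
        int n = 0;

        /* The head of the list is the root file system, which covers nothing. */
        assert(rootvfs == NULL || rootvfs->vfs_vnodecovered == NULL);
        for (const struct vfs *v = rootvfs; v != NULL; v = v->vfs_next)
                n++;
        return n;
}

/* Find the vfs mounted on top of a given vnode, or NULL if there is none. */
static struct vfs *find_mounted_on(struct vfs *rootvfs, const struct vnode *covered)
{
        if (covered == NULL)
                return NULL;
        for (struct vfs *v = rootvfs; v != NULL; v = v->vfs_next)
                if (v->vfs_vnodecovered == covered)
                        return v;
        return NULL;
}
```

With a root vfs whose vfs_next points at one mounted UFS vfs, count_mounts() returns 2 and find_mounted_on() hands back the UFS vfs when given the vnode it covers.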

There are some fundamental data structures which should be inspected prior to investigating struct vfsops: the struct statfs data structure (which holds the results of a vfs_statfs() operation) and the struct fid data structure (which is a file identifier).

The struct statfs is the first thing we shall examine:
struct statfs {
        long f_type;
        long f_bsize;
        long f_blocks;
        long f_bfree;
        long f_bavail;
        long f_files;
        long f_ffree;
        fsid_t f_fsid;
        long f_spare[7];
};
This is basically the result of the statfs() operation. The entries are mostly self-explanatory: long f_type is a type tag for the information returned (it describes the file system, not the type of an individual file); long f_bsize is the native block size of the file system; long f_blocks is the total number of blocks; long f_bfree is the number of free blocks; long f_bavail is the number of free blocks available to non-superusers; long f_files is the total number of file nodes; long f_ffree is the number of free file nodes; fsid_t f_fsid is the file system id; and the last field, long f_spare[7], is spare space reserved for later use.

The file identifier is the next and last data structure we need to investigate prior to going on to the struct vfsops:
struct fid {
        u_short fid_len;
        char fid_data[1];
};
The unsigned short fid_len is the length of the data, and char fid_data[1] is the start of the actual data encapsulated in the file identifier: the array is declared with one element, but it really extends for fid_len bytes.
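
Since fid_data is really a variable-length tail, whoever builds a struct fid has to allocate room for it by hand. Here is a minimal sketch of that idiom; fid_alloc() and fid_equal() are hypothetical helpers of my own, and using an inode number as the identifier data is just an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct fid {
        unsigned short fid_len;         /* u_short in the original */
        char fid_data[1];               /* really fid_len bytes long */
};

/* Allocate a fid with room for len bytes of identifier data. */
static struct fid *fid_alloc(const void *data, unsigned short len)
{
        /* Header up to fid_data, plus the data itself... */
        size_t size = offsetof(struct fid, fid_data) + len;
        struct fid *f;

        assert(data != NULL);
        /* ...but never less than the declared struct. */
        if (size < sizeof(struct fid))
                size = sizeof(struct fid);
        f = malloc(size);
        if (f == NULL)
                return NULL;
        f->fid_len = len;
        memcpy(f->fid_data, data, len);
        return f;
}

/* Two identifiers match when their lengths and bytes match. */
static int fid_equal(const struct fid *a, const struct fid *b)
{
        return a->fid_len == b->fid_len &&
               memcmp(a->fid_data, b->fid_data, a->fid_len) == 0;
}
```

The caller owns the result and releases it with free(). The layer above the file system only ever compares or copies the bytes; what they mean is the file system's own business.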

The operations for the struct vfs are:
struct vfsops {
        int     (*vfs_mount)(struct vfs* vfs_ptr, char *path, char *data);
        int     (*vfs_unmount)(struct vfs* vfs_ptr);
        int     (*vfs_root)(struct vfs* vfs_ptr, struct vnode** vpp);
        int     (*vfs_statfs)(struct vfs* vfs_ptr, struct statfs* putResultsHere);
        int     (*vfs_sync)(struct vfs* vfs_ptr);
        int     (*vfs_fid)(struct vfs* vfs_ptr, struct vnode *file, struct fid* fid_ptr);
        int     (*vfs_vget)(struct vfs* vfs_ptr, struct vnode** vpp, struct fid* file);
};
This is a struct full of function pointers! Let us investigate each one in turn.

The vfs_mount() function mounts the file system described by struct vfs* vfs_ptr (that is to say, it reads the superblock, etc.). The char *path points to the path name being mounted on, which is kept for recording purposes. The char *data points to file system dependent data.

The vfs_unmount() function simply unmounts the vfs (syncs the superblock, etc.). It needs nothing beyond the struct vfs* vfs_ptr itself.

Our next function/method/whatever is vfs_root() which returns the root vnode for the file system represented by struct vfs* vfs_ptr. The struct vnode** vpp argument points to a vnode pointer which receives the result, just like the vpp argument of vfs_vget().

Now we have int vfs_statfs() which returns the file system information. The struct statfs* putResultsHere argument is a pointer to a statfs structure for the results.

Then int vfs_sync() writes out all cached information for the struct vfs* vfs_ptr. This is not necessarily done synchronously: when the operation returns, the data has not necessarily been written out...but it has been scheduled.

Next the int vfs_fid() gets a unique file identifier for the struct vnode* file which represents a file in this file system. The result is put in a struct fid, and the struct fid* fid_ptr argument of vfs_fid() points to that struct fid.

Last but not least we have int vfs_vget() which turns a unique file identifier struct fid* file into a vnode representing the file which the file identifier identifies. The struct vnode** vpp points to a pointer to a vnode for the result.
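
Here is a minimal sketch of how a file system might plug into this table. Everything named toyfs is hypothetical, the structures are cut down to the fields the example touches, only two of the seven operations are filled in, and I am assuming that vfs_root() hands its result back through a pointer to a vnode pointer the way vfs_vget() does. The VFS_ROOT and VFS_STATFS macros are my guess at how the generic layer would hide the indirection.

```c
#include <assert.h>
#include <stddef.h>

struct vfs;
struct vnode  { struct vfs *v_vfsp; };                          /* cut down */
struct statfs { long f_bsize; long f_blocks; long f_bfree; };   /* cut down */

/* Reduced operations table: two entries instead of seven. */
struct vfsops {
        int (*vfs_root)(struct vfs *vfs_ptr, struct vnode **vpp);
        int (*vfs_statfs)(struct vfs *vfs_ptr, struct statfs *putResultsHere);
};

struct vfs {
        struct vfs    *vfs_next;
        struct vfsops *vfs_op;
        int           vfs_bsize;
        char          *vfs_data;        /* caddr_t in the original */
};

/* Private, file system dependent data hanging off vfs_data. */
struct toyfs_mount {
        struct vnode root;
        long nblocks, nfree;
};

static int toyfs_root(struct vfs *vfs_ptr, struct vnode **vpp)
{
        struct toyfs_mount *m = (struct toyfs_mount *)vfs_ptr->vfs_data;

        assert(vpp != NULL);
        *vpp = &m->root;
        return 0;
}

static int toyfs_statfs(struct vfs *vfs_ptr, struct statfs *putResultsHere)
{
        struct toyfs_mount *m = (struct toyfs_mount *)vfs_ptr->vfs_data;

        putResultsHere->f_bsize  = vfs_ptr->vfs_bsize;
        putResultsHere->f_blocks = m->nblocks;
        putResultsHere->f_bfree  = m->nfree;
        return 0;
}

static struct vfsops toyfs_vfsops = { toyfs_root, toyfs_statfs };

/* Generic code goes through the table and never names toyfs_* directly. */
#define VFS_ROOT(vfsp, vpp)   ((vfsp)->vfs_op->vfs_root((vfsp), (vpp)))
#define VFS_STATFS(vfsp, sbp) ((vfsp)->vfs_op->vfs_statfs((vfsp), (sbp)))
```

The point of the design is that a caller holding only a struct vfs * can ask for the root vnode or the statistics without knowing, or caring, which file system is behind it.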

The VNODE Data Structure

The vnode data structure is given to us, from the aforementioned paper, as:
enum vtype      { VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK, VBAD };
struct vnode {
        u_short         v_flag;
        u_short         v_count;
        u_short         v_shlockc;
        u_short         v_exlockc;
        struct vfs      *v_vfsmountedhere;
        struct vnodeops *v_op;
        union {
                struct socket   *v_Socket;
                struct stdata   *v_Stream;
        };
        struct vfs      *v_vfsp;
        enum vtype      v_type;
        caddr_t         v_data;
};
The various vnode types are given to us by an enumeration of all the various types.

The u_short v_flag holds the standard flags. The u_short v_count is the reference count for the vnode. It is maintained by the generic vnode macros VN_HOLD and VN_RELE.
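
The paper only names the two macros, so take the following as a sketch of how such reference counting could work rather than as the real definitions. I am assuming that dropping the last reference is what triggers the vn_inactive operation (covered with the other vnode operations below), and the inactive_calls counter exists purely so the effect can be observed.

```c
#include <assert.h>

struct vnode;
struct vnodeops {
        int (*vn_inactive)(struct vnode *vn_ptr);   /* cred argument omitted */
};
struct vnode {
        unsigned short   v_count;
        struct vnodeops *v_op;
        int              inactive_calls;            /* hypothetical, demo only */
};

/* Take a reference. */
#define VN_HOLD(vp)  ((void)(vp)->v_count++)

/* Drop a reference; tell the file system when the last one goes away. */
static void vn_rele(struct vnode *vp)
{
        assert(vp->v_count > 0);
        if (--vp->v_count == 0)
                vp->v_op->vn_inactive(vp);
}
#define VN_RELE(vp)  vn_rele(vp)

static int toy_inactive(struct vnode *vn_ptr)
{
        /* A real file system might release its private inode here. */
        vn_ptr->inactive_calls++;
        return 0;
}

static struct vnodeops toy_vnodeops = { toy_inactive };
```

Two holds followed by one release leave v_count at 1 and nothing happens; the second release brings it to 0 and the file system gets its one notification.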

The next two fields deal with the number of shared locks and exclusive locks used by the vnode.

The struct vfs *v_vfsmountedhere points to a vfs if and only if the vnode is a mount point for that vfs; otherwise it is null. Either way, struct vfs* v_vfsp points to the vfs which the vnode is in.

The private data pointer (caddr_t v_data) holds file system dependent data. E.g. for the 4.2 BSD file system, v_data points to an in-memory inode table entry.
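
These two pointers, together with vfs_vnodecovered, are what let a path walk step across mount points in both directions. A minimal sketch, assuming cut-down structures: the root field is a hypothetical stand-in for calling vfs_root(), and the function names are mine.

```c
#include <assert.h>
#include <stddef.h>

struct vfs;
struct vnode {
        struct vfs *v_vfsmountedhere;   /* non-NULL only on a mount point */
        struct vfs *v_vfsp;             /* file system this vnode lives in */
};
struct vfs {
        struct vnode *vfs_vnodecovered; /* NULL for the root file system */
        struct vnode *root;             /* hypothetical stand-in for vfs_root() */
};

/* Going down: if something is mounted on vp, continue from its root. */
static struct vnode *cross_mount_down(struct vnode *vp)
{
        assert(vp != NULL);
        while (vp->v_vfsmountedhere != NULL)
                vp = vp->v_vfsmountedhere->root;
        return vp;
}

/* Going up (".." at the root of a mounted file system):
 * step back to the vnode that the mount covers. */
static struct vnode *cross_mount_up(struct vnode *vp)
{
        assert(vp != NULL);
        while (vp == vp->v_vfsp->root && vp->v_vfsp->vfs_vnodecovered != NULL)
                vp = vp->v_vfsp->vfs_vnodecovered;
        return vp;
}
```

The loops rather than single steps are there so that a file system mounted on the root of another mounted file system is handled too.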

The vnode has an interprocess communication apparatus...that's the anonymous union of the socket and the data stream.

Before continuing on to discuss the vnodeops structure, we need to investigate a few structures. First the struct vattr data structure:
struct vattr {
        enum vtype     va_type;      /* vnode type */
        u_short        va_mode;      /* acc mode */
        short          va_uid;       /* owner uid */
        short          va_gid;       /* owner gid */
        long           va_fsid;      /* fs id */
        long           va_nodeid;    /* node # */
        short          va_nlink;     /* # links */
        u_long         va_size;      /* file size */
        long           va_blocksize; /* block size */
        struct timeval va_atime;     /* last acc */
        struct timeval va_mtime;     /* last mod */
        struct timeval va_ctime;     /* last chg */
        dev_t          va_rdev;      /* dev */
        long           va_blocks;    /* space used */
};
The various fields are self explanatory, especially since the comments explain all the fields! The only ones worthy of note would be the file system identifier long va_fsid, and dev_t va_rdev, which (for a device special file) is the device the file represents rather than the device the file is stored on.

From the modern OpenSolaris source, we find the uio_t type's definition:
typedef struct uio {
        struct iovec    *uio_iov;
        void    *uio_file;
        char    *uio_buf;
        int     uio_iovcnt;
        int     uio_offset;
        size_t  uio_resid;
        int     uio_rw;
} uio_t;
I honestly do not understand this, and I suspect that this is far more complicated than it was when the original virtual file system was implemented.

And now a nightmarishly long structure of the operations on the vnode: the vnodeops! Note that struct ucred cred holds the credentials of the user; it is used to check permissions while performing these operations.
struct vnodeops {
        int (*vn_open)(struct vnode* vn_ptr, unsigned short flags, struct ucred cred);
        int (*vn_close)(struct vnode* vn_ptr, unsigned short flags, struct ucred cred);
        int (*vn_rdwr)(struct vnode* vn_ptr, struct uio* args, bool read, unsigned short flags, struct ucred cred);
        int (*vn_ioctl)(struct vnode* vn_ptr, char* command,void* data, unsigned short flags, struct ucred cred);
        int (*vn_select)(struct vnode* vn_ptr, unsigned short ioDirection, struct ucred cred);
        int (*vn_getattr)(struct vnode* vn_ptr, struct vattr* va, struct ucred cred);
        int (*vn_setattr)(struct vnode* vn_ptr, struct vattr* va, struct ucred cred);
        int (*vn_access)(struct vnode* vn_ptr, unsigned short access_mode, struct ucred cred);
        int (*vn_lookup)(struct vnode* vn_ptr, char* name, struct vnode** vpp, struct ucred cred);
        int (*vn_create)(struct vnode* vn_ptr, char* name, struct vattr* va, bool exclusive, unsigned short open, struct vnode** vpp, struct ucred cred);
        int (*vn_remove)(struct vnode* vn_ptr, char* name, struct ucred cred);
        int (*vn_link)(struct vnode* vn_ptr, struct vnode* targetDir, char* targetName, struct ucred cred);
        int (*vn_rename)(struct vnode* vn_ptr, char* name, struct vnode* target_dir, char* target_name, struct ucred cred);
        int (*vn_mkdir)(struct vnode* vn_ptr, char* name, struct vattr* va, struct vnode** vpp, struct ucred cred);
        int (*vn_rmdir)(struct vnode* vn_ptr, char* nm, struct ucred cred);
        int (*vn_readdir)(struct vnode* vn_ptr, struct uio* uiop, struct ucred cred);
        int (*vn_symlink)(struct vnode* vn_ptr, char *linkName, struct vattr* va, char* path, struct ucred cred);
        int (*vn_readlink)(struct vnode* vn_ptr, struct uio* uiop, struct ucred cred);
        int (*vn_fsync)(struct vnode* vn_ptr, struct ucred cred);
        int (*vn_inactive)(struct vnode* vn_ptr, struct ucred cred);
        int (*vn_bmap)(struct vnode* vn_ptr, unsigned int logicalBlockNumber, struct vnode** vpp, unsigned int* block_nmbr);
        int (*vn_strategy)(struct buf* buf_ptr);
        int (*vn_bread)(struct vnode* vn_ptr, unsigned int block_no, struct buf** bpp);
        int (*vn_brelse)(struct vnode* vn_ptr, struct buf* buf_ptr);
};
The int (*vn_open)(struct vnode* vn_ptr, unsigned short flags, struct ucred cred) function performs any open protocol on the vnode pointed to by struct vnode* vn_ptr (for example, for devices). If the open is a clone open, the operation may return a new vnode. The open flags are given by unsigned short flags.

Next int (*vn_close)(struct vnode* vn_ptr, unsigned short flags, struct ucred cred) corresponds to the previous operation. This performs any close protocol on the vnode pointed to by struct vnode* vn_ptr. If the vnode is a device, it is called on the closing of the last reference to the vnode from the file table. Otherwise it is called on the last user close of a file descriptor. The flags are the open flags.

Then int (*vn_rdwr)(struct vnode* vn_ptr, struct uio* args, bool read, unsigned short flags, struct ucred cred) reads or writes the vnode pointed to by struct vnode* vn_ptr. It reads or writes a number of bytes at a specified offset in the file. The input/output arguments are pointed to by the struct uio* args argument. The bool read argument selects the direction: read if true, write if false. The input/output flags are given by unsigned short flags, which specify whether the input/output is done synchronously (the call doesn't return until all the volatile data is on disk) and/or in a unit (the file is locked while a large unit is written).
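
To see why one entry point can serve both directions, here is a minimal sketch of a read/write operation for a hypothetical memory-backed file. The struct toy_uio is a drastically simplified stand-in for struct uio (one buffer, one offset, one residual count), the flags and cred arguments are left out, and the direction flag is called reading here only to avoid clashing with the C library's read().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct uio. */
struct toy_uio {
        char   *buf;      /* caller's buffer */
        size_t  offset;   /* byte offset in the file */
        size_t  resid;    /* bytes still to transfer; updated on return */
};

/* Hypothetical memory-backed file hanging off v_data. */
struct memfile { char data[64]; size_t size; };
struct vnode   { char *v_data; };       /* caddr_t in the original */

static int memfs_rdwr(struct vnode *vn_ptr, struct toy_uio *args, bool reading)
{
        struct memfile *f = (struct memfile *)vn_ptr->v_data;
        /* Reads stop at end of file, writes at the end of the storage. */
        size_t limit = reading ? f->size : sizeof f->data;
        size_t n;

        assert(args->buf != NULL);
        if (args->offset > limit)
                return -1;                      /* offset out of range */
        n = limit - args->offset;
        if (n > args->resid)
                n = args->resid;
        if (reading) {
                memcpy(args->buf, f->data + args->offset, n);
        } else {
                memcpy(f->data + args->offset, args->buf, n);
                if (args->offset + n > f->size)
                        f->size = args->offset + n;
        }
        args->offset += n;
        args->resid  -= n;                      /* 0 means fully transferred */
        return 0;
}
```

The caller learns how much was transferred from what is left in resid, which is why a short read at end of file can still return success.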

The infamous int (*vn_ioctl)(struct vnode* vn_ptr, char* command, void* data, unsigned short flags, struct ucred cred) function performs an ioctl on the vnode pointed to by struct vnode* vn_ptr. It performs (or more accurately invokes) the command char* command, with the data given by the void* data argument. The unsigned short flags are the open flags.

Next int (*vn_select)(struct vnode* vn_ptr, unsigned short ioDirection, struct ucred cred) performs a "select" operation on the vnode pointed to by struct vnode* vn_ptr. The unsigned short ioDirection argument specifies the input/output direction.

The int (*vn_getattr)(struct vnode* vn_ptr, struct vattr* va, struct ucred cred) operation gets the attributes for the struct vnode* vn_ptr vnode. The results are written to the struct vattr that struct vattr* va points to.

Our next operation int (*vn_setattr)(struct vnode* vn_ptr, struct vattr* va, struct ucred cred) sets the attributes for the struct vnode* vn_ptr. We set the vnode's attributes to be those pointed to by struct vattr* va. The catch is that only the mode, uid, gid, file size, and times can be set. The file system must therefore map UNIX file attributes to its own file system dependent attributes.

The int (*vn_access)(struct vnode* vn_ptr, unsigned short access_mode, struct ucred cred) operation checks access permissions for the struct vnode* vn_ptr vnode. If access is denied, an error is returned. The unsigned short access_mode is the mode to check for access (e.g. read, write, execute). The file system must map UNIX file protection information to its own file system dependent protection information.
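
For a file system that stores classic UNIX mode bits, the check can be sketched roughly as follows. This is my own illustration in the style of the traditional owner/group/other test, not code from the paper: struct toy_cred and struct toy_inode are hypothetical cut-down types, and the V* constants are defined here as the usual owner permission bits.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define VREAD  0400     /* owner read bit */
#define VWRITE 0200     /* owner write bit */
#define VEXEC  0100     /* owner execute bit */

struct toy_cred  { short cr_uid; short cr_gid; };               /* cut down */
struct toy_inode { unsigned short mode; short uid; short gid; };

/* Return 0 if every requested permission is granted, EACCES otherwise. */
static int toy_access(const struct toy_inode *ip, unsigned short access_mode,
                      const struct toy_cred *cred)
{
        assert(ip != NULL && cred != NULL);
        if (cred->cr_uid == 0)
                return 0;                       /* superuser: always allowed */
        if (cred->cr_uid != ip->uid) {
                access_mode >>= 3;              /* not the owner: group bits */
                if (cred->cr_gid != ip->gid)
                        access_mode >>= 3;      /* not in the group: other bits */
        }
        return (ip->mode & access_mode) == access_mode ? 0 : EACCES;
}
```

A file system with a different protection model (access lists on a remote server, say) would keep the same signature and do its own translation inside.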

Next is the int (*vn_lookup)(struct vnode* vn_ptr, char* name, struct vnode** vpp, struct ucred cred) operation, which looks up a component name char* name in the directory struct vnode* vn_ptr. The result is put in a vnode, and struct vnode** vpp points to a pointer which points to this resultant vnode.

Now the int (*vn_create)(struct vnode* vn_ptr, char* name, struct vattr* va, bool exclusive, unsigned short open, struct vnode** vpp, struct ucred cred) operation creates a new file char* name in a directory struct vnode* vn_ptr. The attributes of the new file are given by struct vattr* va. The bool exclusive is the exclusive/non-exclusive create flag, and unsigned short open is the open mode. The struct vnode** vpp points to a pointer pointing to the resulting file.
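
Because the lookup operation handles exactly one component in one directory, resolving a whole path means calling it in a loop. Here is a minimal sketch with hypothetical in-memory directories; toy_lookup() has the shape of vn_lookup minus the cred argument, and toy_namei() is my own name for the loop around it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct vnode;
struct toy_dirent { const char *name; struct vnode *vp; };
struct vnode {
        struct toy_dirent *entries;     /* hypothetical; NULL for non-directories */
        size_t nentries;
};

/* One directory, one component name. */
static int toy_lookup(struct vnode *dir, const char *name, struct vnode **vpp)
{
        assert(dir != NULL && name != NULL);
        for (size_t i = 0; i < dir->nentries; i++) {
                if (strcmp(dir->entries[i].name, name) == 0) {
                        *vpp = dir->entries[i].vp;
                        return 0;
                }
        }
        return -1;      /* no such entry */
}

/* Resolve "a/b/c" by calling the lookup once per component. */
static int toy_namei(struct vnode *start, const char *path, struct vnode **vpp)
{
        char comp[32];
        struct vnode *vp = start;

        while (*path != '\0') {
                size_t len = strcspn(path, "/");

                if (len >= sizeof comp)
                        return -1;      /* component too long for this sketch */
                if (len > 0) {
                        memcpy(comp, path, len);
                        comp[len] = '\0';
                        if (toy_lookup(vp, comp, &vp) != 0)
                                return -1;
                }
                path += len;
                while (*path == '/')    /* skip separators, including doubled ones */
                        path++;
        }
        *vpp = vp;
        return 0;
}
```

Since each step goes through the directory's own lookup, consecutive components of one path can live in entirely different file systems without the loop noticing.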

The int (*vn_remove)(struct vnode* vn_ptr, char* name, struct ucred cred) operation is simple: it removes a file char* name in a directory struct vnode* vn_ptr.

To link, the int (*vn_link)(struct vnode* vn_ptr, struct vnode* targetDir, char* targetName, struct ucred cred) operation links the struct vnode* vn_ptr to the target name char* targetName in the directory struct vnode* targetDir.

Then the int (*vn_rename)(struct vnode* vn_ptr, char* name, struct vnode* target_dir, char* target_name, struct ucred cred) function renames the file char* name in the directory struct vnode* vn_ptr to a new name char* target_name in the target directory struct vnode* target_dir. Note that even if the system crashes in the middle of this operation, the vnode is not lost.

Next the int (*vn_mkdir)(struct vnode* vn_ptr, char* name, struct vattr* va, struct vnode** vpp, struct ucred cred) method creates a directory char* name in the directory struct vnode* vn_ptr. The resulting directory's attributes are set to be struct vattr* va, and struct vnode** vpp points to a pointer which points to the resulting directory.

The int (*vn_rmdir)(struct vnode* vn_ptr, char* nm, struct ucred cred) method removes the char* nm directory from the struct vnode* vn_ptr directory.

Now, the int (*vn_readdir)(struct vnode* vn_ptr, struct uio* uiop, struct ucred cred) operation reads entries from the struct vnode* vn_ptr directory. The input/output arguments are given by struct uio* uiop pointer. The uio offset is notionally made to be a file system dependent number...it's supposed to represent the logical offset in the directory when the reading is done. Not only is this a good idea, but it's necessary because the number of bytes returned by vn_readdir is not necessarily the number of bytes in the equivalent part of the on disk directory.

Then int (*vn_symlink)(struct vnode* vn_ptr, char *linkName, struct vattr* va, char* path, struct ucred cred) symbolically links the path char* path to the name char* linkName in the struct vnode* vn_ptr directory.

The int (*vn_readlink)(struct vnode* vn_ptr, struct uio* uiop, struct ucred cred) operation reads the symbolic link struct vnode* vn_ptr with the input/output arguments supplied with the struct uio* uiop pointer.

Next the int (*vn_fsync)(struct vnode* vn_ptr, struct ucred cred) function writes out all cached information for the struct vnode* vn_ptr file...this is synchronous and does not return until the input/output is done.

Then the int (*vn_inactive)(struct vnode* vn_ptr, struct ucred cred) operation tells the file system that the struct vnode* vn_ptr is no longer referenced by the vnode layer, so it may now be deallocated.

The int (*vn_bmap)(struct vnode* vn_ptr, unsigned int logicalBlockNumber, struct vnode** vpp, unsigned int* block_nmbr) operation maps the logical block number unsigned int logicalBlockNumber in the struct vnode* vn_ptr file to a physical block number and a physical device. The unsigned int* block_nmbr points to a block number for the physical device and struct vnode** vpp is a pointer to a vnode pointer for the physical device. The returned vnode may or may not be a physical device.
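
As a rough illustration, here is what the mapping might look like for a hypothetical file with four direct block pointers. The struct toy_inode layout, the use of 0 to mean "no block allocated", and the -1 error return are all assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NDIRECT 4

struct vnode { char *v_data; };         /* caddr_t in the original */

/* Hypothetical per-file private data. */
struct toy_inode {
        unsigned int  direct[TOY_NDIRECT];  /* physical block per logical block; 0 = hole */
        struct vnode *devvp;                /* vnode of the underlying device */
};

static int toy_bmap(struct vnode *vn_ptr, unsigned int logicalBlockNumber,
                    struct vnode **vpp, unsigned int *block_nmbr)
{
        struct toy_inode *ip = (struct toy_inode *)vn_ptr->v_data;

        assert(ip->devvp != NULL);
        if (logicalBlockNumber >= TOY_NDIRECT ||
            ip->direct[logicalBlockNumber] == 0)
                return -1;                  /* out of range, or nothing mapped */
        if (vpp != NULL)
                *vpp = ip->devvp;           /* which device the block lives on */
        if (block_nmbr != NULL)
                *block_nmbr = ip->direct[logicalBlockNumber];
        return 0;
}
```

A file system with no local blocks at all could simply hand back its own vnode and the same block number, which is one way to read the remark that the returned vnode need not be a physical device.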

And now the int (*vn_strategy)(struct buf* buf_ptr) function is a block oriented interface to read or write a logical block from a file into or out of a buffer. The struct buf* buf_ptr pointer is a pointer to a buffer header which contains a pointer to the vnode to be operated on. This does not copy through the buffer cache if the file system uses it. This function is used by the buffer cache routines and the paging system to read blocks into memory.

Next int (*vn_bread)(struct vnode* vn_ptr, unsigned int block_no, struct buf** bpp) reads a logical block unsigned int block_no from the struct vnode* vn_ptr file and returns a pointer to a buffer header in struct buf** bpp, which contains a pointer to the data. This does not necessarily imply the use of the buffer cache; this function is useful in avoiding extra data copying on the server side of a remote file system.

Our last function int (*vn_brelse)(struct vnode* vn_ptr, struct buf* buf_ptr) basically releases the buffer returned by vn_bread().
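
The two calls come as a pair: whatever vn_bread() hands out must be given back with vn_brelse(). A minimal sketch of that contract, assuming a hypothetical file that keeps four small blocks in memory; the struct buf here is cut down to a data pointer plus a made-up busy flag, and refusing a second read of a busy block is a simplification of my own (a real kernel would more likely sleep).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BSIZE   16
#define TOY_NBLOCKS 4

struct buf   { char *b_addr; int busy; };   /* cut down; busy is hypothetical */
struct vnode { char *v_data; };             /* caddr_t in the original */

/* Hypothetical file: its blocks and one buffer header per block. */
struct toyfile {
        char       blocks[TOY_NBLOCKS][TOY_BSIZE];
        struct buf hdr[TOY_NBLOCKS];
};

static int toy_bread(struct vnode *vn_ptr, unsigned int block_no, struct buf **bpp)
{
        struct toyfile *f = (struct toyfile *)vn_ptr->v_data;

        if (block_no >= TOY_NBLOCKS || f->hdr[block_no].busy)
                return -1;
        /* No copy: the header just points at the data in place. */
        f->hdr[block_no].b_addr = f->blocks[block_no];
        f->hdr[block_no].busy = 1;
        *bpp = &f->hdr[block_no];
        return 0;
}

static int toy_brelse(struct vnode *vn_ptr, struct buf *buf_ptr)
{
        (void)vn_ptr;
        assert(buf_ptr->busy);
        buf_ptr->busy = 0;
        return 0;
}
```

Handing out a pointer instead of copying the block is the whole attraction: the caller can ship the data onward straight from where it already sits.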

So...What?

Well, this is nice for handling data on various partitions that are formatted with different file systems...but what if one is smart and formats all partitions to have the same file system? What's the advantage of the virtual file system?

One could argue that it's object oriented...that's always nice ;)

A serious advantage of the virtual file system is that it allows one to mount pseudo-file systems as struct vfs-es. It is pointed out in [1] (sections 4.7 and 4.8) that the /dev/ and /proc/ pseudo-file systems were implemented in this manner in SunOS back in the day.

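This way, one could imagine typing into the command prompt:

$ sudo rm -rf /proc/63

and having it kill the process with pid == 63, since a pseudo-file system is free to interpret vn_remove() and vn_rmdir() however it likes. (Don't try this at home: mainstream /proc implementations, such as Linux's, refuse to remove process entries, so this is an illustration of the idea rather than a recipe.) The usefulness of pseudo-file systems is more in keeping in line with the UNIX philosophy ("Everything is a file" -- something object oriented programmers would like, a sort of parallel to "Everything is an object"). So it only should appeal to zealots ;)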
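
To make the pseudo-file system idea a little more concrete, here is a minimal sketch of a vnode whose contents are made up on the spot when somebody reads it. Everything here is hypothetical: the single vn_read entry is a simplified stand-in for vn_rdwr, struct toy_proc stands in for whatever the kernel knows about a process, and the "pid command" output format is invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct vnode;
struct vnodeops {
        int (*vn_read)(struct vnode *vn_ptr, char *buf, size_t len);
};
struct vnode {
        struct vnodeops *v_op;
        char            *v_data;        /* caddr_t in the original */
};

/* Stand-in for kernel process state. */
struct toy_proc { int pid; const char *cmd; };

/* No disk blocks anywhere: the "file" is generated at read time.
 * Returns the number of bytes produced, or -1 if buf is too small. */
static int procfs_read(struct vnode *vn_ptr, char *buf, size_t len)
{
        struct toy_proc *p = (struct toy_proc *)vn_ptr->v_data;
        int n;

        assert(buf != NULL && len > 0);
        n = snprintf(buf, len, "%d %s\n", p->pid, p->cmd);
        return (n < 0 || (size_t)n >= len) ? -1 : n;
}

static struct vnodeops procfs_vnodeops = { procfs_read };
```

To the code above the vnode layer this is just another vnode with an operations table, which is exactly why such things can be mounted alongside real disks.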

References

[1] Kleiman, S.R. Vnodes: An Architecture for Multiple File System Types in Sun UNIX (1986)

[2] Rosenthal, D. Evolving the Virtual File System (1992?)

[3] A 4.3 BSD vnode header, 4.3 BSD UFS_VNOPS.C

Revision History

Revision 0: 24 August 2007 - published.
Revision 1: 24 August 2007 - revised to fit code snippets on page correctly.
