Introduction
All right, after having studied some file systems, we move on to the next big advancement in the theory of file systems: the virtual file system. The virtual file system was made in late '85, early '86, in order to let a single mechanism access various disks that are formatted with different file systems. Sun made it for the SunOS system, which was little more than a glorified, proprietary BSD system...so it wasn't too hard to get it working for BSD. For this reason, I shall call this first approach the Solaris/BSD paradigm.
The idea was to generalize the UNIX file system of inodes to have virtual inodes, or "vnodes". The virtual file system had a linked list of mounted file systems (represented by the
struct vfs
data structure). The first struct vfs
in the linked list was always the native file system.
Despite being written in C, the Solaris/BSD paradigm used an object oriented approach. Roughly put, the classes were structs and the methods were function pointers.
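To make the "structs as classes, function pointers as methods" idea concrete, here is a minimal sketch in C. The names (toy_fs, toy_fsops, ufs_name, nfs_name, describe) are hypothetical and do not come from the paper; only the pattern is the point.

```c
#include <string.h>

struct toy_fs;                        /* forward declaration */

/* The "method table": one function pointer per operation. */
struct toy_fsops {
    const char *(*fs_name)(struct toy_fs *fs);
    int         (*fs_block_size)(struct toy_fs *fs);
};

/* The "object": a pointer to its method table plus some data. */
struct toy_fs {
    struct toy_fsops *ops;
    int               bsize;
};

/* Two hypothetical file system types sharing one interface. */
static const char *ufs_name(struct toy_fs *fs) { (void)fs; return "ufs"; }
static const char *nfs_name(struct toy_fs *fs) { (void)fs; return "nfs"; }
static int generic_bsize(struct toy_fs *fs)    { return fs->bsize; }

static struct toy_fsops ufs_ops = { ufs_name, generic_bsize };
static struct toy_fsops nfs_ops = { nfs_name, generic_bsize };

/* Generic code calls through the table and never names a concrete type. */
static const char *describe(struct toy_fs *fs)
{
    return fs->ops->fs_name(fs);
}
```

Calling describe() on a toy_fs built with ufs_ops and on one built with nfs_ops goes through the same call site but reaches different functions, which is all that "virtual" means here.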
The approach that we will take to investigate the Solaris/BSD paradigm will be loosely based on the influential technical paper explaining it [1]. We'll explore the data structures and their respective "member functions". The data structures that we shall examine will be given to us verbatim from the paper. There are more modern additions to the virtual file system in Solaris (e.g. the introduction of virtual events as a vnode type) and in BSD. We will not discuss them here.
The VFS Data Structure
The vfs data structure can be thought of as a sort of virtual file system equivalent to the mount table entry. Each mounted file system is represented by a directory on the root file system. So if we were running a native ext3 file system (it would be the first in the linked list of struct vfs
mind you), and we mounted say a UFS partition we would have a directory dedicated to it (e.g. /ufs_partition/
or something).
The VFS struct is rather simple to look at:
    struct vfs {
        struct vfs    *vfs_next;
        struct vfsops *vfs_op;
        struct vnode  *vfs_vnodecovered;
        int            vfs_flag;
        int            vfs_bsize;
        caddr_t        vfs_data;
    };
The first field,
struct vfs *vfs_next
is a pointer to the next mounted file system. The struct vfs
is a linked list after all.
The next field is a pointer to the operations of the
struct vfs
. This is the object orientedness of the virtual file system implementation.
The
struct vnode *vfs_vnodecovered
is the mount point for the file system, that is, the vnode it is mounted on. It is worth noting that this is NULL
for the root file system (the head of the linked list).
The next several fields are rather...straightforward (read: dull). The
int vfs_flag
holds the flags for the struct vfs
, and the int vfs_bsize
is the native block size of the file system.
The
caddr_t vfs_data
is somewhat interesting...only because I had no clue what the hell a caddr_t
type was! It turns out it is a character pointer. This field is supposed to be for private, file system dependent data. The example given in the text was - for the 4.2 BSD file system - vfs_data
points to a mount table entry.
It should be noted that - in general - the
struct vfs
should be thought of as a sort of virtual file system mount table entry.
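Since the mounted file systems form a singly linked list, generic code can inspect the "mount table" simply by following vfs_next. The sketch below reuses the field names of the struct vfs shown earlier; the helper names count_mounts and find_mounted_on are hypothetical, and caddr_t is spelled as char * to keep the example portable.

```c
#include <stddef.h>

struct vfsops;   /* opaque in this sketch */
struct vnode;    /* opaque in this sketch */

/* Same fields as the struct vfs from the paper (caddr_t written as char *). */
struct vfs {
    struct vfs    *vfs_next;
    struct vfsops *vfs_op;
    struct vnode  *vfs_vnodecovered;
    int            vfs_flag;
    int            vfs_bsize;
    char          *vfs_data;
};

/* Count the mounted file systems by following vfs_next from the head. */
static int count_mounts(const struct vfs *head)
{
    int n = 0;
    for (const struct vfs *v = head; v != NULL; v = v->vfs_next)
        n++;
    return n;
}

/* Find the vfs mounted on the given vnode. Passing NULL finds the root
 * file system, whose vfs_vnodecovered is NULL. */
static struct vfs *find_mounted_on(struct vfs *head, const struct vnode *covered)
{
    for (struct vfs *v = head; v != NULL; v = v->vfs_next)
        if (v->vfs_vnodecovered == covered)
            return v;
    return NULL;
}
```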
There are some fundamental data structures which should be inspected prior to investigating
struct vfsops
: the struct statfs
data structure (which holds the results for a vfs_statfs()
operation) and the struct fid
data structure (which is a file identifier).
The
struct statfs
is the first thing we shall examine:
    struct statfs {
        long   f_type;
        long   f_bsize;
        long   f_blocks;
        long   f_bfree;
        long   f_bavail;
        long   f_files;
        long   f_ffree;
        fsid_t f_fsid;
        long   f_spare[7];
    };
This is basically the result of the statfs() operation. The entries are mostly self-explanatory: the type of information returned (this describes the file system as a whole, not a per-file type such as directory or socket) is represented by
long f_type
; the native block size of the file system is represented by the long f_bsize
; the long f_blocks
is the total number of blocks; the long f_bfree
is the number of free blocks; the long f_bavail
is the "non-su blocks", that is, the free blocks available to non-superusers; the total number of file nodes is then represented by long f_files
; the long f_ffree
is the number of free nodes in the file system; the fsid_t f_fsid
is the file system id; and the last field long f_spare[7]
is spare space reserved for later use.
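As a small worked example of how these counters combine, here is a sketch that turns a filled-in statfs result into more familiar numbers. The struct is renamed toy_statfs and fsid_t is replaced by a stand-in type so the block compiles on its own; the helper names are hypothetical. Treating the difference between f_bfree and f_bavail as space held back for the superuser is an assumption based on the usual BSD convention.

```c
/* Hypothetical stand-in for fsid_t so the sketch is self-contained. */
typedef struct { long val[2]; } toy_fsid_t;

/* Same fields as the struct statfs from the paper. */
struct toy_statfs {
    long       f_type;
    long       f_bsize;
    long       f_blocks;
    long       f_bfree;
    long       f_bavail;
    long       f_files;
    long       f_ffree;
    toy_fsid_t f_fsid;
    long       f_spare[7];
};

/* Bytes an ordinary (non-superuser) process may still allocate. */
static long long bytes_available(const struct toy_statfs *s)
{
    return (long long)s->f_bavail * s->f_bsize;
}

/* Blocks held back for the superuser: free blocks minus non-su blocks. */
static long reserved_blocks(const struct toy_statfs *s)
{
    return s->f_bfree - s->f_bavail;
}
```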
The file identifier is the next and last data structure we need to investigate prior to going on to the
struct vfsops
:
    struct fid {
        u_short fid_len;
        char    fid_data[1];
    };
The
unsigned short fid_len
is the length of the data, and char fid_data
is the actual data encapsulated in the file identifier.
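The one-element fid_data array is the old C idiom for a variable-length trailer: the structure is allocated with extra room so that fid_data really holds fid_len bytes. A minimal sketch of how such an identifier might be built and compared follows; the struct is renamed toy_fid, and fid_alloc and fid_equal are hypothetical helpers, not part of the paper's interface.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct toy_fid {
    unsigned short fid_len;      /* number of bytes in fid_data */
    char           fid_data[1];  /* really fid_len bytes, allocated past the end */
};

/* Allocate a file identifier holding a copy of len bytes from data. */
static struct toy_fid *fid_alloc(const void *data, unsigned short len)
{
    /* offsetof gives the header size without the 1-byte placeholder. */
    size_t size = offsetof(struct toy_fid, fid_data) + (len ? len : 1);
    struct toy_fid *f = malloc(size);

    if (f == NULL)
        return NULL;
    f->fid_len = len;
    if (len > 0)
        memcpy(f->fid_data, data, len);
    return f;
}

/* Two identifiers name the same file if length and bytes both match. */
static int fid_equal(const struct toy_fid *a, const struct toy_fid *b)
{
    return a->fid_len == b->fid_len &&
           memcmp(a->fid_data, b->fid_data, a->fid_len) == 0;
}
```

What goes into fid_data is entirely up to the file system; an inode number plus a generation count is one plausible choice, but that is an assumption rather than something the structure itself dictates.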
The operations for the
struct vfs
are:
    struct vfsops {
        int (*vfs_mount)(struct vfs* vfs_ptr, char *path, char *data);
        int (*vfs_unmount)(struct vfs* vfs_ptr, struct vnode* stuffResultsHere);
        int (*vfs_root)(struct vfs* vfs_ptr, struct vnode* stuffResultsHere);
        int (*vfs_statfs)(struct vfs* vfs_ptr, struct statfs* putResultsHere);
        int (*vfs_sync)(struct vfs* vfs_ptr);
        int (*vfs_fid)(struct vfs* vfs_ptr, struct vnode *file, struct fid* fid_ptr);
        int (*vfs_vget)(struct vfs* vfs_ptr, struct vnode** vpp, struct fid* file);
    };
This is a struct full of function pointers! Let us investigate each one in turn.
The
vfs_mount()
function mounts the vfs
pointer (that is to say, it reads the superblock, etc.). The char *path
points to the path name to be mounted on, which is kept for recording purposes. The char *data
points to file system dependent data.
The
vfs_unmount()
function simply unmounts the vfs
(syncs the superblock, etc.).
Our next function/method/whatever is
vfs_root()
which returns the root vnode for the file system represented by struct vfs* vfs_ptr
. The struct vnode* stuffResultsHere
vnode is a pointer to a vnode for the results.
Now we have
int vfs_statfs()
which returns the file system information. The struct statfs* putResultsHere
argument is a pointer to a statfs structure for the results.
Then
int vfs_sync()
writes out all cached information for the struct vfs* vfs_ptr
. This is not necessarily done synchronously. When the operation returns, not all of the data has necessarily been written out...but it has been scheduled.
Next the
int vfs_fid()
gets a unique file identifier for the struct vnode* file
which represents a file in this file system. The results are put in a struct fid
and then struct fid* fid_ptr
- the argument in the vfs_fid()
function - points to the resulting struct fid
.
Last but not least we have
int vfs_vget()
which turns a unique file identifier struct fid* file
into a vnode representing the file which the file identifier identifies. The struct vnode** vpp
points to a pointer to a vnode for the result.
The VNODE Data Structure
The vnode data structure is given to us, from the aforementioned paper, as:
    enum vtype { VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK, VBAD };

    struct vnode {
        u_short          v_flag;
        u_short          v_count;
        u_short          v_shlockc;
        u_short          v_exlockc;
        struct vfs      *v_vfsmountedhere;
        struct vnodeops *v_op;
        union {
            struct socket *v_Socket;
            struct stdata *v_Stream;
        };
        struct vfs      *v_vfsp;
        enum vtype       v_type;
        caddr_t          v_data;
    };
The various vnode types are given to us by the enum vtype enumeration.
The
u_short v_flag
holds the standard flags (it is a plain integer, not a pointer). The u_short v_count
is the reference count for the vnode. It is maintained by generic vnode macros VN_HOLD
and VN_RELE
.
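The reference count can be pictured with the following sketch. The macro bodies here are an assumption written for illustration (prefixed TOY_ to make that obvious), not the real SunOS or BSD definitions; the idea is only that releasing the last reference is the moment the file system gets told, via something like the vn_inactive operation described further down, that the vnode is no longer in use.

```c
struct toy_vnode {
    unsigned short v_count;         /* reference count */
    int            inactive_calls;  /* demo hook: how often "inactive" ran */
};

/* Stand-in for the file system's vn_inactive operation. */
static void toy_inactive(struct toy_vnode *vp)
{
    vp->inactive_calls++;
}

/* Take a reference. */
#define TOY_VN_HOLD(vp) ((void)(vp)->v_count++)

/* Drop a reference; on the last one, notify the file system. */
#define TOY_VN_RELE(vp)                     \
    do {                                    \
        if (--(vp)->v_count == 0)           \
            toy_inactive(vp);               \
    } while (0)
```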
The next two fields deal with the number of shared locks and exclusive locks used by the vnode.
The
struct vfs *v_vfsmountedhere
points to a vfs if and only if the vnode is a mount point for that vfs; otherwise it is NULL. In either case, struct vfs* v_vfsp
points to the vfs which the vnode is in.
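These two pointers are what let a path walk step from one file system into another. A hedged sketch: whenever the current vnode has something mounted on it, replace it with the root vnode of the mounted file system. All names are toy_ stand-ins, the root field is hypothetical private data, and the sketch assumes a vfs_root that returns its result through a pointer to a vnode pointer.

```c
#include <stddef.h>

struct toy_vnode;
struct toy_vfs;

struct toy_vfsops {
    /* Plays the role of vfs_root(): store the root vnode in *vpp. */
    int (*vfs_root)(struct toy_vfs *vfsp, struct toy_vnode **vpp);
};

struct toy_vfs {
    struct toy_vfsops *vfs_op;
    struct toy_vnode  *root;            /* hypothetical private data */
};

struct toy_vnode {
    struct toy_vfs *v_vfsmountedhere;   /* non-NULL only on a mount point */
    struct toy_vfs *v_vfsp;             /* vfs this vnode belongs to */
};

static int toy_root(struct toy_vfs *vfsp, struct toy_vnode **vpp)
{
    *vpp = vfsp->root;
    return 0;
}

/* If vp is covered by a mounted file system, continue at that file
 * system's root; the loop also handles a mount on top of a mount. */
static int cross_mount_points(struct toy_vnode *vp, struct toy_vnode **vpp)
{
    while (vp->v_vfsmountedhere != NULL) {
        struct toy_vfs *mounted = vp->v_vfsmountedhere;
        int err = mounted->vfs_op->vfs_root(mounted, &vp);
        if (err != 0)
            return err;
    }
    *vpp = vp;
    return 0;
}
```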
The private data pointer (
caddr_t v_data
) holds file system dependent data. E.g. for the 4.2 BSD file system, v_data
points to an in-memory inode table entry.
The vnode has an interprocess communication apparatus...that's the anonymous union of the socket and the data stream.
Before continuing on to discuss the
vnodeops
structure, we need to investigate a few structures. First the struct vattr
data structure:
    struct vattr {
        enum vtype     va_type;      /* vnode type */
        u_short        va_mode;      /* acc mode */
        short          va_uid;       /* owner uid */
        short          va_gid;       /* owner gid */
        long           va_fsid;      /* fs id */
        long           va_nodeid;    /* node # */
        short          va_nlink;     /* # links */
        u_long         va_size;      /* file size */
        long           va_blocksize; /* block size */
        struct timeval va_atime;     /* last acc */
        struct timeval va_mtime;     /* last mod */
        struct timeval va_ctime;     /* last chg */
        dev_t          va_rdev;      /* dev */
        long           va_blocks;    /* space used */
    };
The various fields are self explanatory, especially since the comments explain all the fields! The only ones worthy of note would be the file system identifier
long va_fsid
, and the device that a special file represents, dev_t va_rdev
.
From the modern OpenSolaris OS, we find the
uio_t
type's definition:
    typedef struct uio {
        struct iovec *uio_iov;
        void         *uio_file;
        char         *uio_buf;
        int           uio_iovcnt;
        int           uio_offset;
        size_t        uio_resid;
        int           uio_rw;
    } uio_t;
I honestly do not understand this, and I suspect that this is far more complicated than it was when the original virtual file system was implemented.
And now a nightmarishly long structure of the operations on the vnode: the
vnodeops
! Note that struct ucred cred
holds the credentials of the user; it is used to check for permissions while performing these operations.
    struct vnodeops {
        int (*vn_open)(struct vnode* vn_ptr, unsigned short flags, struct ucred cred);
        int (*vn_close)(struct vnode* vn_ptr, unsigned short flags, struct ucred cred);
        int (*vn_rdwr)(struct vnode* vn_ptr, struct uio* args, bool read, unsigned short flags, struct ucred cred);
        int (*vn_ioctl)(struct vnode* vn_ptr, char* command, void* data, unsigned short flags, struct ucred cred);
        int (*vn_select)(struct vnode* vn_ptr, unsigned short ioDirection, struct ucred cred);
        int (*vn_getattr)(struct vnode* vn_ptr, struct vattr* va, struct ucred cred);
        int (*vn_setattr)(struct vnode* vn_ptr, struct vattr* va, struct ucred cred);
        int (*vn_access)(struct vnode* vn_ptr, unsigned short access_mode, struct ucred cred);
        int (*vn_lookup)(struct vnode* vn_ptr, char* name, struct vnode** vpp, struct ucred cred);
        int (*vn_create)(struct vnode* vn_ptr, char* name, struct vattr* va, bool exclusive, unsigned short open, struct vnode** vpp, struct ucred cred);
        int (*vn_remove)(struct vnode* vn_ptr, char* name, struct ucred cred);
        int (*vn_link)(struct vnode* vn_ptr, struct vnode* targetDir, char* targetName, struct ucred cred);
        int (*vn_rename)(struct vnode* vn_ptr, char* name, struct vnode* target_dir, char* target_name, struct ucred cred);
        int (*vn_mkdir)(struct vnode* vn_ptr, char* name, struct vattr* va, struct vnode** vpp, struct ucred cred);
        int (*vn_rmdir)(struct vnode* vn_ptr, char* nm, struct ucred cred);
        int (*vn_readdir)(struct vnode* vn_ptr, struct uio* uiop, struct ucred cred);
        int (*vn_symlink)(struct vnode* vn_ptr, char *linkName, struct vattr* va, char* path, struct ucred cred);
        int (*vn_readlink)(struct vnode* vn_ptr, struct uio* uiop, struct ucred cred);
        int (*vn_fsync)(struct vnode* vn_ptr, struct ucred cred);
        int (*vn_inactive)(struct vnode* vn_ptr, struct ucred cred);
        int (*vn_bmap)(struct vnode* vn_ptr, unsigned int logicalBlockNumber, struct vnode** vpp, unsigned int* block_nmbr);
        int (*vn_strategy)(struct buf* buf_ptr);
        int (*vn_bread)(struct vnode* vn_ptr, unsigned int block_no, struct buf** bpp);
        int (*vn_brelse)(struct vnode* vn_ptr, struct buf* buf_ptr);
    };
The
int (*vn_open)(struct vnode* vn_ptr, unsigned short flags, struct ucred cred)
function performs any open
protocol on a vnode pointed to by struct vnode* vn_ptr
(for example, devices). If the open is a clone open, the operation may return a new vnode. The various open flags are given by unsigned short flags
.
Next
int (*vn_close)(struct vnode* vn_ptr, unsigned short flags, struct ucred cred)
is the counterpart of the previous operation. It performs any close protocol on the vnode pointed to by struct vnode* vn_ptr
. It is called on the closing of the last reference to the vnode from the file table if the vnode is a device. Otherwise this is called on the last user close of a file descriptor. The flags are the open flags.
Then
int (*vn_rdwr)(struct vnode* vn_ptr, struct uio* args, bool read, unsigned short flags, struct ucred cred)
reads or writes to the vnode pointed to by struct vnode* vn_ptr
. It reads or writes a number of bytes at a specified offset in the file. The input/output arguments are pointed to by the struct uio* args
argument. The bool read
argument selects the direction: read if true, write if false. The input/output flags are given by unsigned short flags
, which specify whether the input/output is done synchronously (it doesn't return until all the volatile data is on disk) and/or as a unit (the file is locked so that a large unit can be written).
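To get a feel for what "input/output arguments" means, here is a sketch of a read/write routine over a small in-memory file. The struct toy_uio is a deliberately simplified stand-in (one buffer, an offset, and a count of bytes still to move) and is not the real struct uio; toy_file and toy_rdwr are likewise hypothetical. Flags and credentials are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct uio: one buffer instead of an iovec list. */
struct toy_uio {
    char   *uio_buf;     /* caller's buffer */
    size_t  uio_offset;  /* byte offset in the file */
    size_t  uio_resid;   /* bytes still to transfer */
};

/* A tiny fixed-capacity in-memory "file". */
struct toy_file {
    char   data[64];
    size_t size;
};

/* Move up to uio_resid bytes; on return uio_resid says what was NOT moved. */
static int toy_rdwr(struct toy_file *f, struct toy_uio *uio, int read)
{
    size_t n = uio->uio_resid;

    if (read) {
        if (uio->uio_offset >= f->size)
            return 0;                               /* at end of file */
        if (n > f->size - uio->uio_offset)
            n = f->size - uio->uio_offset;          /* short read */
        memcpy(uio->uio_buf, f->data + uio->uio_offset, n);
    } else {
        if (uio->uio_offset > sizeof f->data ||
            n > sizeof f->data - uio->uio_offset)
            return ENOSPC;                          /* would not fit */
        if (uio->uio_offset > f->size)              /* zero-fill any hole */
            memset(f->data + f->size, 0, uio->uio_offset - f->size);
        memcpy(f->data + uio->uio_offset, uio->uio_buf, n);
        if (uio->uio_offset + n > f->size)
            f->size = uio->uio_offset + n;
    }
    uio->uio_buf    += n;
    uio->uio_offset += n;
    uio->uio_resid  -= n;
    return 0;
}
```

The caller learns how much was transferred by comparing uio_resid before and after the call, which is why the routine can simply return 0 on a short read.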
The infamous
int (*vn_ioctl)(struct vnode* vn_ptr, char* command, void* data, unsigned short flags, struct ucred cred)
function performs an ioctl
on the vnode pointed to by struct vnode* vn_ptr
. It performs (or more accurately invokes) the command char* command
, with the data given by the argument of void* data
. The unsigned short flags
deal with the open flags.
Next
int (*vn_select)(struct vnode* vn_ptr, unsigned short ioDirection, struct ucred cred)
performs a "select" operation on the vnode pointed to by struct vnode* vn_ptr
. The unsigned short ioDirection flag specifies the input/output direction.
The
int (*vn_getattr)(struct vnode* vn_ptr, struct vattr* va, struct ucred cred)
operation gets the attributes for the struct vnode* vn_ptr
vnode. It is written, I think, to the struct vattr
that is given as an argument.
Our next operation
int (*vn_setattr)(struct vnode* vn_ptr, struct vattr* va, struct ucred cred)
sets the attributes for the struct vnode* vn_ptr
. We set the vnode's attributes to be those pointed to by struct vattr* va
. The catch is that only the mode, uid, gid, file size, and times can be set. This necessarily maps UNIX file attributes to file system dependent attributes.
The
int (*vn_access)(struct vnode* vn_ptr, unsigned short access_mode, struct ucred cred)
operation checks access permissions for the struct vnode* vn_ptr
vnode. If access is denied, an error is returned. The unsigned short access_mode
is the mode to check for access (e.g. read, write, execute). It is necessary that this maps UNIX file protection information to file system dependent protection information.
Next the
int (*vn_lookup)(struct vnode* vn_ptr, char* name, struct vnode** vpp, struct ucred cred)
operation, which looks up a component name char* name
in the directory struct vnode* vn_ptr
. The result is put in a vnode, and struct vnode** vpp
points to a pointer which points to this resultant vnode.
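Because vn_lookup handles exactly one component, resolving a whole path means calling it once per component, each time on whatever directory the previous call returned. The following sketch shows that loop over a toy in-memory directory tree. Everything prefixed toy_ is hypothetical, credentials are omitted, and symbolic links and mount points are ignored.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_vnode;

struct toy_vnodeops {
    /* Plays the role of vn_lookup(): find name in dvp, result in *vpp. */
    int (*vn_lookup)(struct toy_vnode *dvp, const char *name,
                     struct toy_vnode **vpp);
};

struct toy_dirent {
    const char       *name;
    struct toy_vnode *vp;
};

struct toy_vnode {
    struct toy_vnodeops *v_op;
    struct toy_dirent   *entries;   /* NULL for non-directories */
    size_t               nentries;
};

/* One file system's lookup: a linear search of the directory entries. */
static int toy_lookup(struct toy_vnode *dvp, const char *name,
                      struct toy_vnode **vpp)
{
    if (dvp->entries == NULL)
        return ENOTDIR;
    for (size_t i = 0; i < dvp->nentries; i++) {
        if (strcmp(dvp->entries[i].name, name) == 0) {
            *vpp = dvp->entries[i].vp;
            return 0;
        }
    }
    return ENOENT;
}

/* Resolve a '/'-separated path one component at a time. */
static int toy_namei(struct toy_vnode *start, const char *path,
                     struct toy_vnode **vpp)
{
    char comp[256];
    struct toy_vnode *vp = start;

    while (*path != '\0') {
        size_t len = strcspn(path, "/");     /* length of next component */
        if (len >= sizeof comp)
            return ENAMETOOLONG;
        if (len > 0) {
            memcpy(comp, path, len);
            comp[len] = '\0';
            int err = vp->v_op->vn_lookup(vp, comp, &vp);
            if (err != 0)
                return err;
        }
        path += len;
        if (*path == '/')
            path++;                          /* skip the separator */
    }
    *vpp = vp;
    return 0;
}
```

Note that toy_namei only ever calls through v_op, so each directory along the path could belong to a different file system type without the loop changing.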
Now the
int (*vn_create)(struct vnode* vn_ptr, char* name, struct vattr* va, bool exclusive, unsigned short open, struct vnode** vpp, struct ucred cred)
operation creates a new file char* name
in a directory struct vnode* vn_ptr
. The attributes of the new file are given by struct vattr* va
. The bool exclusive
is the exclusive/non-exclusive create flag, unsigned short open
is the open mode. The struct vnode** vpp
points to a pointer pointing to the resulting file.
The
int (*vn_remove)(struct vnode* vn_ptr, char* name, struct ucred cred)
operation is simple: it removes a file char* name
in a directory struct vnode* vn_ptr
.
To link, the
int (*vn_link)(struct vnode* vn_ptr, struct vnode* targetDir, char* targetName, struct ucred cred)
operation links the struct vnode* vn_ptr
to the target name char* targetName
in the directory struct vnode* targetDir
.
Then the
int (*vn_rename)(struct vnode* vn_ptr, char* name, struct vnode* target_dir, char* target_name, struct ucred cred)
function renames the file char* name
in the directory struct vnode* vn_ptr
to a new name char* target_name
in the target directory struct vnode* target_dir
. Note that the vnode must not be lost even if the system crashes in the middle of this operation.
Next the
int (*vn_mkdir)(struct vnode* vn_ptr, char* name, struct vattr* va, struct vnode** vpp, struct ucred cred)
method creates a directory char* name
in the directory struct vnode* vn_ptr
. The resulting directory's attributes are set to be struct vattr* va
, and struct vnode** vpp
points to a pointer which points to the resulting directory.
The
int (*vn_rmdir)(struct vnode* vn_ptr, char* nm, struct ucred cred)
method removes the char* nm
directory from the struct vnode* vn_ptr
directory.
Now, the
int (*vn_readdir)(struct vnode* vn_ptr, struct uio* uiop, struct ucred cred)
operation reads entries from the struct vnode* vn_ptr
directory. The input/output arguments are given by struct uio* uiop
pointer. The uio offset is notionally made to be a file system dependent number...it's supposed to represent the logical offset in the directory when the reading is done. Not only is this a good idea, but it's necessary because the number of bytes returned by vn_readdir
is not necessarily the number of bytes in the equivalent part of the on disk directory.
Then
int (*vn_symlink)(struct vnode* vn_ptr, char *linkName, struct vattr* va, char* path, struct ucred cred)
symbolically links the path char* path
to the name char* linkName
in the struct vnode* vn_ptr
directory.
The
int (*vn_readlink)(struct vnode* vn_ptr, struct uio* uiop, struct ucred cred)
operation reads the symbolic link struct vnode* vn_ptr
with the input/output arguments supplied with the struct uio* uiop
pointer.
Next the
int (*vn_fsync)(struct vnode* vn_ptr, struct ucred cred)
function writes out all cached information for the struct vnode* vn_ptr
file...this is synchronous and does not return until the input/output is done.
Then the
int (*vn_inactive)(struct vnode* vn_ptr, struct ucred cred)
operation tells the file system that the struct vnode* vn_ptr
is no longer referenced by the vnode layer, so it may now be deallocated.
The
int (*vn_bmap)(struct vnode* vn_ptr, unsigned int logicalBlockNumber, struct vnode** vpp, unsigned int* block_nmbr)
operation maps the logical block number unsigned int logicalBlockNumber
in the struct vnode* vn_ptr
file to a physical block number and a physical device. The unsigned int* block_nmbr
points to a block number for the physical device and struct vnode** vpp
is a pointer to a vnode pointer for the physical device. The returned vnode may or may not be a physical device.
And now the
int (*vn_strategy)(struct buf* buf_ptr)
function is a block oriented interface to read or write a logical block from a file into or out of a buffer. The struct buf* buf_ptr
pointer is a pointer to a buffer header which contains a pointer to the vnode to be operated on. This does not copy through the buffer cache if the file system uses it. This function is used by the buffer cache routines and the paging system to read blocks into memory.
Next
int (*vn_bread)(struct vnode* vn_ptr, unsigned int block_no, struct buf** bpp)
reads a logical block unsigned int block_no
from the struct vnode* vn_ptr
file and returns a pointer to a buffer header in struct buf** bpp
which contains a pointer to the data. This does not necessarily imply the use of the buffer cache; this function is useful in avoiding extra data copying on the server side of a remote file system.
Our last function
int (*vn_brelse)(struct vnode* vn_ptr, struct buf* buf_ptr)
basically releases the buffer returned by vn_bread()
.
So...What?
Well, this is nice for handling data on various partitions that are formatted with different file systems...but what if one is smart and formats all partitions to have the same file system? What's the advantage of the virtual file system? One could argue that it's object oriented...that's always nice ;)
A serious advantage of the virtual file system is that it allows one to mount pseudo-file systems as
struct vfs
-es. It is pointed out in [1] (sections 4.7 and 4.8) that the /dev/
and /proc/
pseudo-file systems are implemented in this manner in SunOS back in the day.
This way, when one types into the command prompt:
$ sudo rm -rf /proc/63
One would kill the process with
pid == 63
. The usefulness of pseudo-file systems is more in keeping in line with the UNIX philosophy ("Everything is a file" -- something object oriented programmers would like, a sort of parallel to "Everything is an object"). So it only should appeal to zealots ;)
References
[1] Kleiman, S.R. Vnodes: An Architecture for Multiple File System Types in Sun UNIX (1986)
[2] Rosenthal, D. Evolving the Virtual File System (1992?)
[3] A 4.3 BSD vnode header, 4.3 BSD UFS_VNOPS.C
Revision History
Revision 0: 24 August 2007 - published.
Revision 1: 24 August 2007 - revised to fit code snippets on page correctly.