Saturday, August 4, 2007

The Skelix File System

The Skelix File System

The Skelix Operating System was created by a hacker that wanted to hack FreeBSD but found its code too disorganized (a common flaw with old operating systems). He wrote everything from scratch. It is (sort of) a Unix-like operating system...it is perhaps more of a tutorial to introduce a youngling to the scary inner workings of operating systems.

The Skelix File System is loosely based off of the Unix file system. Thus it is the first along the line of file systems to be studied because of its simplicity. Note that the author's permission is granted to use the source code.

Data Structures

The investigation shall begin with the introduction of the inode. The inode is a file, basically. Tanenbaum and Woodhull [1] note that:
our last method for keeping track of which blocks belong to which file is to associate with each file a data structure called an i-node (index-node), which lists the attributes and disk addresses of the file's blocks. [...] Given the i-node, it is then possible to find all the blocks of the file. The big advantage of this scheme over linked files using in-memory table is that the i-node need only be in memory when the corresponding file is open. If each i-node occupies n bytes and a maximum of k files may be open at once, the total memory occupied by the array holding the i-nodes for the open files is kn bytes. Only this much space needs to be reserved in advance. (Page 502 --emphasis is Tanenbaum and Woodhull's)
The inode method is the technique which will be investigated thoroughly, as it is the technique that most (if not, all) Unix-like operating systems use. We use Skelix because, as previously noted, it is the simplest introduction to the underlying concepts and picture. It should be noted that there is an inode table (that is, in practice, little more than a dynamic array or linked list of the inodes that are loaded into the memory).
06. #define FT_NML           1
07. #define FT_DIR          2
08. 
09. struct INODE {
10.       unsigned int i_mode;                /* file mode */
11.           unsigned int i_size;                /* size in bytes */
12.           unsigned int i_block[8];
13. };Skelix Revision 7: /skelix/include/fs.h
Line 10 indicates what sort of file the inode is: a FT_NML = 1 (normal file) or a FT_DIR = 2 (directory). That is the entire raison d'etre for the inode mode: it tells us what sort of file the inode is. Luckily, the Skelix operating system is simple enough to support only two file types: the ordinary file, and the directory. So we aren't concerned here about pipes, sockets, block or character device files.

Line 11, unsigned int i_size, tells us the size of the file in units of bytes. It's a convenient field that simplifies things greatly later on, so we won't need a complicated function to figure out the size of the file.

A little discussion about the structure of the inode's last field (line 12). First it should be noted that disk geometry is organized into blocks which consist of 512 bytes. The i_blocks are addresses to blocks. The first 6 are direct addresses. The seventh is an address of a block which consists solely of addresses to other blocks. It is a singly indirect block. The last address is an address to a doubly indirect block. That is, it refers to a block full of addresses to singly indirect blocks.

For those like me who are lazy and want to figure out how much space this is total that is supported by the inode, allow a quick calculation. There are 6 direct blocks, that is:
1. (6 blocks) * (512 bytes per block) = 3072 bytes
Now time to calculate out the singly indirect block. There are:
2. (512 bytes per block) * (4 bytes per address) = 128 addresses
3. (128 addresses) * (1 block per address) * (512 bytes per block) = 65536 bytes
Lastly, the doubly indirect block. By our calculations in step 3, we have 128 addresses per block. Each address refers to another indirect block. So that is 128 addresses referring to 128 addresses each, for a grand total of:
4. (128 addresses) * (128 addresses) = 16384 addresses
5. (16384 addresses) * (1 block per address) * (512 bytes per address) = 8388608 bytes
So that means we are dealing with a total of: 8388608 + 65536 + 3072 = 8457216 bytes. Note that 1024 bytes are a kilobyte, 1024 kilobytes are a megabyte, and "thus" 1048576 bytes are a megabyte. We are then dealing with 8 megabytes, 67 kilobytes total from 1 inode. That means that the file supports at most 8 megabytes, 67 kilobytes. No more, but certainly it supports less. This concludes our investigation of the inode data structure.

We can now continue our investigation of the /skelix/include/fs.h header file for more data structures:
15. extern struct INODE iroot;
16. 
17. #define MAX_NAME_LEN 11
18. 
19. struct DIR_ENTRY {
20.        char de_name[MAX_NAME_LEN];
21.        int de_inode;
22. };
Skelix Revision 7: /skelix/include/fs.h
Line 15 refers to the global variable which refers to the inode of the root file system.

Line 17 defines the maximum name length for a file or directory. (Remember, a directory is-a file!) For Skelix, it is 11 characters. That is 11 bytes.

Line 19 introduces a new data structure: the dir_entry. It is the directory entry, which stores the data about the files within the directory. For skelix this is two fields. One is the name of the file (line 20), the other is the inode number that is used to refer to the inode table (line 21). The size of the struct dir_entry structure is:
(11 chars)*(1 byte per char) + (1 int) * (4 bytes per int) = 15 bytes
These directory entries are used to find other files. The way one searches through the inodes via directory entries is to first select an entry, then the search method looks up the inode number, it then looks it up in the inode table.

The "root" directory is the directory which contains a few key directories. The exact directories vary according to the operating system. Linux, e.g., has the root hierarchy of: bin, dev, initrd, lib, mnt, root, sys, usr, boot, etc, lost+found, opt, sbin, tmp, var, home, media, proc, srv, u. Note that in Skelix has reserved the zeroeth inode for the root directory.

Now, there is an important block that we should pay some attention to: the super block. It is the first block. It contains all the key parameters about the file system, and is read into memory when the computer is booted or when the file system is first touched. Often times the super block is a complicated beast, luckily for us Skelix provides a simpler version of it.
35. struct SUPER_BLOCK {
36.        unsigned char sb_magic;
37.        /* DPT 0x08 */
38.        unsigned int sb_start;
39.        /* DPT 0x0c */
40.        unsigned int sb_blocks;
41. 
42.        unsigned int sb_dmap_blks;
43.        unsigned int sb_imap_blks;
44.        unsigned int sb_inode_blks;
45. };
46. 
47. #define FST_FS       0x2e                     /* normal partition */
48. #define FST_SW       0x2f                     /* swap partition */
Skelix Revision 7: /skelix/include/fs.h
Line 36 tells us what type of partition we are dealing with. The two possible partitions are: swap and the regular file system. That is the role of the magic number. We can tell what identifier to use by looking at lines 47 and 48.

The next line of importance is line 39. It tells us the address to the start sector number. Line 40 tells us the capacity of the partition. The count is in units of 512 block sectors.

The next field, unsigned int sb_dmap_blks, tells us the number of 512 byte blocks we have to contain the bit map of blocks. This "bitmap" is a series of ones and zeroes that tell us if the block is in use (1) or if it is free (0).

Correspondingly, we have a bitmap for the inodes. It tells us if inodes are in use or not. The idea is the allocate a certain amount of space to the inodes before hand. The partition is of size x bytes, and we know that a skelix inode can deal with 8457216 bytes. So the number of inodes that are allocated are thus x/8457216 inodes, roughly.

The last field in the structure is the sb_inode_blks. It tells us how many block we use for the inode array.

Now there are some fundamental macros that are defined which should be noted.

24. #define ABS_BOOT_BLK(sb)              ((sb).sb_start)
25. #define ABS_SUPER_BLK(sb)             ((ABS_BOOT_BLK(sb))+1)
26. #define ABS_DMAP_BLK(sb)              ((ABS_SUPER_BLK(sb))+1)
27. #define ABS_IMAP_BLK(sb)              ((ABS_DMAP_BLK(sb))+(sb).sb_dmap_blks)
28. #define ABS_INODE_BLK(sb)             ((ABS_IMAP_BLK(sb))+(sb).sb_imap_blks)
29. #define ABS_DATA_BLK(sb)              ((ABS_INODE_BLK(sb))+INODE_BLKS)
30. 
31. #define INODE_BIT_BLKS              1       /* 512*8 = 4096 inodes */
32. #define INODES_PER_BLK              (512/sizeof(struct INODE))
33. #define INODE_BLKS                  ((INODE_BIT_BLKS*512*8+INODES_PER_BLK-1)/(INODES_PER_BLK))
Skelix Revision 7: /skelix/include/fs.h
The macros defined on lines 24-29 deal specifically with the superblock data structure. On the other hand, lines 31-33 deal specifically with the inode mechanics. So, logically, we shall first investigate the first group and then the second group of macros.

The first super block macro (ABS_BOOT_BLK) is rather simple. It returns the boot block. The second macro (ABS_SUPER_BLK) gives where the super block is. The third macro (ABS_DMAP_BLK) tells us where the block bitmap starts. Correspondingly the fourth macro (ABS_IMAP_BLK) tells us where the inode bitmap starts. The fifth macro (ABS_INODE_BLK) returns the beginning of the inode array (the inode table). The sixth and last super block macro (ABS_DATA_BLK) returns the block where the data really starts. These macros are merely glorified accessor methods of a given super block structure.

On the other hand, the next set of macros deals with the geometry of the inodes. For example, line 31 defines INODE_BIT_BLKS. This tells us the number of 512 byte blocks dedicated to the inode bitmap block.

Line 32 tells us the inner workings of the mysterious INODES_PER_BLK block. It is essentially: (the size of a block) / (the size of an inode). There is not too much in this.

The last macro (line 33) INODE_BLKS tells us roughly this: (Inodes supported by the inode bitmap + Inodes per block - 1)/(Inodes per block). Mathematically, this is formally equivalent to: (Inodes supported by the inode bitmap - 1)/(Inodes per block) + 1. In a hand-wavy argument, we can roughly argue that this is approximately:

(Inodes supported by the inode bitmap)/(Inodes per block).

Intuitively this rough approximation (I emphasize this is a ROUGH APPROXIMATION) gives us the

(number of inodes)/(x inodes/block) = block * (total inodes / x inodes).

This gives us the number of blocks of the inode bitmap.

File System Source Code Analysis

Now that we have examined the fundamental data structures of the Skelix file system, we shall turn our focus on the bread and butter functions. First we shall examine the atomic block operations:
10. void
11. setb(void *s, unsigned int i) {
12.        unsigned char *v = s;
13.        v += i>>3;
14.        *v |= 1<<(7-(i%8));
15. }
16. 
17. void
18. clrb(void *s, unsigned int i) {
19.        unsigned char *v = s;
20.        v += i>>3;
21.        *v &= ~(1<<(7-(i%8)));
22. }
23. 
24. int
25. testb(void *s, unsigned int i) {
26.        unsigned char *v = s;
27.        v += i>>3;
28.        return (*v&(1<<(7-(i%8)))) !=0;
29. }

Skelix Revision 7: /skelix/fs.c
The first function, setb(void *buffer, unsigned int bitNumberToSet), tells us: given a buffer void *buffer, we set bit bitNumberToSet (offset is always 0). The function uses the voodoo of bit operators, as do the other three.

The second method is used to set a bit in the block bitmap to 0.

The third method is actually a test. It is used to test if a bit is set or unset. It returns 1 if it is set, 0 if it is not set I think (I could very well be wrong!).
125. unsigned int
126. alloc_blk(struct SUPER_BLOCK *sb) {
127.        unsigned int i = 0, j = 0, n = 0, m = 0;
128.        unsigned char sect[512] = {0};
129. 
130.        n = ABS_DMAP_BLK(*sb);
131.        for (; isb_dmap_blks; ++i) {
132.               hd_rw(n, HD_READ, 1, sect);
133.               for (j=0; j<512*8; ++j) {
134.                      if (testb(sect, j)) {
135.                             ++m;
136.                      } else {                     /* gotcha */
137.                             setb(sect, j);
138.                             if (m >= sb->sb_blocks)
139.                                    return 0;
140.                             else {
141.                                    hd_rw(n, HD_WRITE, 1, sect);
142.                                    return ABS_BOOT_BLK(*sb) + m;
143.                             }
144.                      }
145.               }
146.               ++n;
147.        }
148.        return 0;
149. }

Skelix Revision 7: /skelix/fs.c
Now it is probably important to note that the purpose of this function is for locating the first free block on the disk, then returning the logical block address. It takes one argument, struct SUPER_BLOCK *sb, a pointer to a superblock. It returns 0 if the allocation didn't happen. Otherwise it returns return ABS_BOOT_BLK(*sb) + m (line 142)...which is the address in units of blocks offset from the boot block.

Line 127 initiates four dummy variables: i, j, m, and n. Line 128 initiates a dummy sector that is 512 (signed) bytes big. We set the variable n to be the start of the block bitmap (line 130). We let m be the offset from the beginning of the disk. The variable j is used as a counter in the for-loop started on line 133. Similarly, i is used as a counter in the for-loop on line 131.

There is an odd function that is called on line 132 and 141 that we haven't seen before: hd_rw()! So a brief "explanation" of what this function does will be given. Behold the stripped down code (yes, it is modified so it simplifies the explanation and changes nothing): void hd_rw(unsigned int logicalBlockAddress, unsigned int ReadOrWrite, unsigned int SectorsToAccess, void *buf);

There are 4 arguments the function takes in: the logical block address of the block we are going to read from or write to, the flag that tells us whether we are reading from or writing to the block, an unsigned int referring to the sectors to access, and a buffer that is either being written to (if we are reading from the hardware) or being read from (if we are writing it to the hardware).

So line 132, the logical block address is the beginning of the block bitmap. We are reading from the bitmap (HD_READ tells us this). We are accessing the first sector, and we are using a buffer sect that is the dummy variable for 512 bytes defined on line 128.

We cycle through the block bitmap (by using the for-loop on line 131), and we cycle through each bit of each block (by using the for-loop on line 133). On each bit we test to see if the block is being used. (line 134). If it is, we increment the m dummy variable (line 135). If it is free, we set the block (line 137). However, if we somehow run through more blocks than we have, than we have to return 0 (lines 138-139). If we have not run through more blocks than we have, then we write to the super block that we have no modified the blocks in use slightly. Then we return the address offset from the boot block (lines 141-142).

151. void
152. free_blk(struct SUPER_BLOCK *sb, unsigned int n) {
153.        unsigned char sect[512] = {0};
154.        unsigned int t = (n-ABS_BOOT_BLK(*sb))/(512*8)+ABS_DMAP_BLK(*sb);
155.        hd_rw(t, HD_READ, 1, sect);
156.        clrb(sect, (n-ABS_BOOT_BLK(*sb))%(512*8));
157.        hd_rw(t, HD_WRITE, 1, sect);
158. }

Skelix Revision 7: /skelix/fs.c
The purpose of free_blk() is clear: it works in the opposite manner of alloc_blk(). It takes in two arguments: struct SUPER_BLOCK *sb which is a super block pointer, and unsigned int n which is the address of the

We initialize a dummy variable on line 153. It is used as a buffer for the block bitmap for a 512 byte block. Line 154 is a little tricky. It defines an unsigned int t to be equal to n minus the boot block address, divided by the number of bits in a block (512 bytes per block * 8 bits per byte = 4096 bits per block), added to the start of the block bit map. So the numerator (number on top of the fraction) is the offset from the beginning of the partition, the denominator is the number of bits in a block; i.e. (offset to the block)/(bits per block) + (start of the block bitmap).

We then read in the block bitmap, and store it in the buffer sect which is 512 bytes. We clear the block, and then write these changes back to the hardware on the next line. That's all there is to this method.

160. static int
161. alloc_inode(struct SUPER_BLOCK *sb) {
162.        unsigned char sect[512] = {0};
163.        int i = 0;
164.        hd_rw(ABS_IMAP_BLK(*sb), HD_READ, 1, sect);
165.        for (; i<512; ++i) {
166.               if (! testb(sect, i)) {
167.                      setb(sect, i);
168.                      hd_rw(ABS_IMAP_BLK(*sb), HD_WRITE, 1, sect);
169.                      break;
170.               }
171.        }
172.        return (i==512)?-1:i;
173. }

Skelix Revision 7: /skelix/fs.c


This method allocates a new inode. It takes in one argument: struct SUPER_BLOCK *sb which is the pointer to a super block.

It first initializes a 512 byte buffer on line 162 sect. The next line initializes a dummy variable i. Then on line 164, the inode bitmap is read into the sect buffer on line 164.

Then we cycle through every byte of the sect buffer (using the for-loop on line 165). It tests to see if the block is free (line 166); if so, then it writes to the block (line 167). It then writes to the inode bitmap the changes (a new inode has been allocated!) on line 168, and exits the for-loop.

Line 172, we check to see if we cycled through the entire inode bitmap (i==512). If so, we return -1: there was an error! Largely that we ran out of space for the inode. Otherwise, if i<512 we return i.

175. static void
176. free_inode(struct SUPER_BLOCK *sb, int n) {
177.        unsigned char sect[512] = {0};
178.        hd_rw(ABS_IMAP_BLK(*sb), HD_READ, 1, sect);
179.        clrb(sect, n);
180.        hd_rw(ABS_IMAP_BLK(*sb), HD_WRITE, 1, sect);
181. }

Skelix Revision 7: /skelix/fs.c
The point of free_inode() is clear: it frees an inode. It takes in two arguments: struct SUPER_BLOCK *sb, a pointer to a super block; and int n which is the number of the inode in the inode table which we would like to have freed.

We have a dummy variable that represents a 512 byte block of data initialized on line 177. The next line, we read from the inode bitmap into the buffer sect. We then call the clear block method clrb() to clear the block (which modifies the buffer representing the inode bitmap block), and then write the modifications to the disk (the importance of line 180).

183. static struct INODE *
184. iget(struct SUPER_BLOCK *sb, struct INODE *inode, int n) {
185.        unsigned char sect[512] = {0};
186.        int i = n/INODES_PER_BLK;
187.        int j = n%INODES_PER_BLK;
188. 
189.        hd_rw(ABS_INODE_BLK(*sb)+i, HD_READ, 1, sect);
190.        memcpy(inode, sect+j*sizeof(struct INODE), sizeof(struct INODE));
191.        return inode;
192. } 

Skelix Revision 7: /skelix/fs.c


The point of this function is to get an inode. It takes in 3 arguments: struct SUPER_BLOCK *sb, a pointer to a super block; struct INODE *inode, a pointer to an inode; and int n, which I assume is the inode number in the inode table. The method: copies the inode into the argument struct INODE *inode and then returns it.

We allocate a 512 byte buffer sect (as usual!). We then have a variable int i and int j which gives us the block and offset therein respectively (by lines 186 and 187).

We then read into sect the block where the inode lives (ABS_INODE_BLK(*sb)+i) on line 189. We then copy this into the inode pointer, and return it on lines 190 and 191 respectively.

194. static void
195. iput(struct SUPER_BLOCK *sb, struct INODE *inode, int n) {
196.        unsigned char sect[512] = {0};
197.        int i = n/INODES_PER_BLK;
198.        int j = n%INODES_PER_BLK;
199.        hd_rw(ABS_INODE_BLK(*sb)+i, HD_READ, 1, sect);
200.        memcpy(sect+j*sizeof(struct INODE), inode, sizeof(struct INODE));
201.        hd_rw(ABS_INODE_BLK(*sb)+i, HD_WRITE, 1, sect);
202. }

Skelix Revision 7: /skelix/fs.c


This method essentially is the opposite of iget(). It takes in, also, three arguments: struct SUPER_BLOCK *sb, which is a pointer to a super block; struct INODE *inode, which is an inode pointer; and lastly int n which is the inode number. This method returns nothing (it is static void after all!).

We begin, as usual, by initializing a buffer for a 512 byte block and call it sect. We then figure out which block (offset from the beginning of the inode section) the inode lives on, and the offset within that block (lines 197 and 198 respectively).

We then read into sect the inode we wish to "put away" (line 199). We then copy the memory location of this inode, the inode argument where the changes were done, to the buffer (sect+j*sizeof(struct INODE) is where to be precise) (line 200) and then write to disk the change in the sect buffer (line 201).

That is the end of this method.

(NOTE: for those interested about the root directory's inode, it is dealt with on line 204 in /skelix/fs.c: static struct INODE iroot = {FT_DIR, 2*sizeof(struct DIR_ENTRY), {0,}}; Recall the inode Structure which had three fields: i_mode which told us whether we were dealing with a directory or an ordinary file; i_size which told us the size of the inode; and i_blocks which is the addresses. The initialization of iroot tells us several things: 1) the root directory is a FT_DIR (directory); 2) the size is 2*sizeof(struct DIR_ENTRY), 3) it's blocks include the zeroeth block among others.)

There are other methods which are not really going to be surveyed here (e.g. void verify_dir() or static void check_root() but these can be inspected on the reader's time, they are not difficult to understand and is fairly straightforward). The last method we shall discuss is:
206. static void
207. stat(struct INODE *inode) {
208.        unsigned int i = 0;
209.        char sect[512] = {0};
210.        struct DIR_ENTRY *de;
211. 
212.        kprintf(KPL_DUMP, "======== stat / ========\n");
213.        switch (inode->i_mode) {
214.        case FT_NML:
215.               kprintf(KPL_DUMP, "File, ");
216.               break;
217.        case FT_DIR:
218.               kprintf(KPL_DUMP, "Dir,  ");
219.               break;
220.        default:
221.               kprintf(KPL_PANIC, "UNKNOWN FILE TYPE!!");
222.               halt();
223.        }
224.        kprintf(KPL_DUMP, "Size: %d, ", inode->i_size);
225.        kprintf(KPL_DUMP, "Blocks: ");
226.        for (; i<8; ++i)
227.               kprintf(KPL_DUMP, "%d, ", inode->i_block[i]);
228.        hd_rw(inode->i_block[0], HD_READ, 1, sect);
229.        switch (inode->i_mode) {
230.        case FT_DIR:
231.               kprintf(KPL_DUMP, "\nName\tINode\n");
232.               de = (struct DIR_ENTRY *)sect;
233.               for (i=0; ii_size/sizeof(struct DIR_ENTRY); ++i) {
234.                      kprintf(KPL_DUMP, "%s\t%x\n", de[i].de_name, de[i].de_inode);
235.               }
236.               break;
237.        default:
238.               break;
239.        }
240. }

Skelix Revision 7: /skelix/fs.c


The static void stat() method is used to get the stats on a certain file. That is why the only argument it takes in is struct INODE *inode - an inode pointer to the inode we'd like to gauge.

We have a few dummy variables and buffers defined first on lines 208-210. An unsigned int i, a 512 byte block buffer char sect[512], and a struct DIR_ENTRY *de buffer.

We announce we are gauging a given inode on line 212 by announcing it to the world with kprintf(). Note that, if you didn't guess already, kprintf() is essentially printf() written from scratch with a few extra parameters.

Line 213 we are investigating what sort of file we are gauging. If it is a file or directory (or unknown) we announce it (lines 215, 218, and 221 respectively - only if the file type is unknown will we actually halt()).

We then announce how big the file is with line 224. There is also an announcement of the blocks that are being used by the inode (we cycle through the blocks on line 227).

Recall the parameters for the hd_rw() (I reiterate it here only because I myself forgot!): void hd_rw(unsigned int logicalBlockAddress, unsigned int ReadOrWrite, unsigned int SectorsToAccess, void *buf);

So looking to line 228 with knowledge of this method renewed, we see that we read from inode->i_block[0], and we stuff it into the block buffer char sect[512].

IF we are dealing with a directory, then we print out the contents of the directory (lines 230-236). Otherwise we simply end the method.

References

Tanenbaum, Andrew S. and Woodhull, Albert S. Operating Systems: Design and Implementation. Prentice Hall: Upper Saddle River, NJ. (2006)

No comments: