Friday, August 3, 2007

Hard Disk Geometry Notes

Introduction and Road Map

All right, just to tell a little about myself, I'm a file system programmer (some call me a hacker, others call me a crack pot; the truth lies somewhere in between I suppose). I want to start a series of blog entries of "How to write a file system from scratch in the D Programming language", but I have to start small. Really small. So I will start with the geometry of the hard disk.

If all goes well, then the rough road map will be: introduction to Unix-like file systems, the Skelix file system, perhaps the Minix file system, the Fast File System (UFS), the ext2 file system, the ext3 file system, the ext4 file system, and the ZFS ("zeta byte file system"). Hopefully, I can pull off writing a ZFS clone or a file system superior to it. Ah, dare to dream!

Of course, perhaps I should be a little more specific: I will write a toy file system, then either hack it to make it better depending on the file system I just covered or I will write a completely new one from scratch. Of course, it will be in the D programming language ;)

Cylinder-Head-Sector Geometry

We need to begin with the fundamental part of the file system: the hardware. Without it, there is no need for a file system! It would be nice to discuss it first. The hardware, more specifically the hard disks, floppy disks, etc. that the file system keeps track of, has a sort of geometry to it.

The hard disks are actually a collection of magnetic plates, each plate has 2 sides called a head. Each head has grooves in it like a record, these grooves are called the track or the cylinders of the disk. The head is cut into pieces like a pie as well, each slice of the head is called a Sector.

That's the vocabulary lesson!

I'd hate to quote wikipedia, but I'm too lazy to discuss Cylinder-Head-Sector Addressing in detail:

The number of blocks on one side of a platter is:

blocksPerPlatterSide = (cylindersPerPlatter)*(SectorsPerPlatter)

The number of blocks per platter is:

blocksPerPlatter = (blocksPerPlatterSide)*(sidesUsedPerPlatter)

which is usually written in terms of the number of heads used:

blocksPerPlatter = (blocksPerPlatterSide)*(HeadsPerPlatter)

This is usually expanded to:

blocksPerPlatter =(cylindersPerPlatter)*(SectorsPerPlatter)*(HeadsPerPlatter)

and rearranged:

blocksPerPlatter =(cylindersPerPlatter)*(HeadsPerPlatter)*(SectorsPerPlatter)

Since all the platters are the same size and hard drives usually have more than one platter, the total number of blocks on the drive can be written as:

totalBlocks =(Cylinders)*(HeadsPerPlatter)*(Sectors)*(NumberOfPlatters)

If the number of platters is combined with the number of heads per platter to form the single parameter Heads, the equation can be written in its final form as:

totalBlocks = (Cylinders)*(Heads)*(Sectors)

The numbering for cylinders, heads, and sectors all begin at 0. There is a triple: (cylinder, head, sectors) that we use. Usually, it is given the number of sectors per track (A) and the number of heads per cylinder (B). The sectors component of the triple ranges from 0 to A-1. The head ranges from 0 to B-1. However the cylinder component is unbounded.

Logical Block Addressing ("LBA") Geometry

To contrast the cylinder-head-sector method, we can use a lazier approach: just count the 512 byte blocks from a given starting point (the so-called Boot Block)! So we would have block 0, block 1, ..., block N.

There is a reason this is called the Logical block addressing, and that's because it's completely logical rather than based on how the disks look, etc.

Vocabulary:

Block (aka Sector) is a 512 byte atom of disk space.

The hard disks are actually a collection of magnetic plates called a platter, each plate has 2 sides called a head. Each head has grooves in it like a record, these grooves are called the track or the cylinders of the disk. The head is cut into pieces like a pie as well, each slice of the head is called a Sector.

Note that: 1 byte = 8 bits,
1 kilobyte = 1024 bytes,
1 megabyte = 1024 kilobytes,
1 gigabyte = 1024 megabytes,
1 terabyte = 1024 gigabytes,
1 petabyte = 1024 terabytes,
1 exabyte = 1024 petabytes,
1 zettabyte = 1024 petabytes,
1 yottabyte = 1024 zettabyte.
And 1 bit is a "1-or-0" value!

The D Programming Language: Disabling Garbage Collection

I have become impressed with the D programming language (especially since it has a GNU Compiler front end!). However, it has a garbage collector! How terrible for systems programmers (ahem). So here is a way to manually manage memory with the classes.

class foo
{
     new(uint sz)
     {
         void* p;
         p = std.c.stdlib.malloc(sz);
         if (!p)
             throw new OutOfMemory();
         gc.addRange(p, p + sz);
         return p;
     }
     delete(void* p)
     {
         if (p)
         {   gc.removeRange(p);
             std.c.stdlib.free(p);
         }
     }
}
This example is available and analyzed elsewhere on the web, from the founder of the D programming language:
The critical features of new() are:
new() does not have a return type specified, but it is defined to be void*. new() must return a void*.
• If new() cannot allocate memory, it must not return null, but must throw an exception.
• The pointer returned from new() must be to memory aligned to the default alignment. This is 8 on win32 systems.
• The size parameter is needed in case the allocator is called from a class derived from Foo and is a larger size than Foo.
• A null is not returned if storage cannot be allocated. Instead, an exception is thrown. Which exception gets thrown is up to the programmer, in this case, OutOfMemory() is.
• When scanning memory for root pointers into the garbage collected heap, the static data segment and the stack are scanned automatically. The C heap is not. Therefore, if Foo or any class derived from Foo using the allocator contains any references to data allocated by the garbage collector, the gc needs to be notified. This is done with the gc.addRange() method.
• No initialization of the memory is necessary, as code is automatically inserted after the call to new() to set the class instance members to their defaults and then the constructor (if any) is run.

The critical features of delete() are: • The destructor (if any) has already been called on the argument p, so the data it points to should be assumed to be garbage.
• The pointer p may be null.
• If the gc was notified with gc.addRange(), a corresponding call to gc.removeRange() must happen in the deallocator.
• If there is a delete(), there should be a corresponding new().

If memory is allocated using class specific allocators and deallocators, careful coding practices must be followed to avoid memory leaks and dangling references. In the presence of exceptions, it is particularly important to practice RAII to prevent memory leaks.
Also, I just learned by experience that the following code works as well:

extern (C) void* kmalloc(uint sz); //alternatively malloc()
extern (C) void kfree(void *p); //and free() defined in an external C file

class foo
{
     new(uint sz)
     {
         void* p;
         p = kmalloc(sz);
         return p;
     }
     delete(void* p)
     {
         kfree(p);
     }
}

Wednesday, August 1, 2007

(Object Oriented) Python Rant

Python is the way of the future...like Zeppelins and Autogyros. It's a lot more natural programming in Python than it is in Perl, at least for Object Oriented Programming. (Note that there is an interesting series called the Python Papers that is a good read!)

When I first learned it, I hated it because it wasn't like C/C++/D/Java at all! However, I gave it high marks for having the lambda anonymous functions. It made me reminisce about the bad old days of LISP and SCHEME and The Structure and Interpretation of Programming.

Now, my tune has changed completely since I'm working on an operating system (Brainix). I recognize the value of having scripts, despite the fact that Brainix is not mature enough to run a simple shell. It's the benefit of platform independence at the cost of performance.

The problem is that there are no good scripting languages! Perl is esoteric as death, and BASH is not all that better...don't get me wrong, BASH is fabulous as a shell but terrible as a scripting language. Python is perhaps the best scripting language out there, and that's not saying all that much.

One particular problem that I have is, for perl I can do things like:

@files = `ls`;
for $file in @files
{
#do stuff!
}
However, for Python, having this neat scripting feature is changed! In the immortal words of "Grandpa" Simpson "I was with it once...then it changed into something horrible and scary." After a bit of study, I figured out that there is an OS module where I can use the os.system() method to invoke shell commands. Now I could write something like:

foreach file in os.command("ls"):
# do stuff translated into python!
Perhaps a more disturbing difference for my inner C/C++/D/Java programmer is the lack of curly brackets. I got over this by using Python in a more functional manner that would make John Armstrong proud.

Finally I fought my fears and started programming in an object oriented manner. Object oriented programming is the way of my people! But this sort of object orientedness seemed odd. The self pointer (I assume, I thought it to be a python parallel to the this pointer at first) is the first argument in every method, which was bizarre. It reminded me of Phil's Object Oriented ANSI C.

Perhaps what the open source movement needs is a good shell that's also a good scripting language...one that's open source, object oriented, and is an affront to the Windoze PowerShell. BeanShell is a possibility, but I think there is hope for python yet.

Well, this isn't much of a first post, but that's all I have to rant about so far about Python.