Sunday, October 14, 2012

Databases, Part 1: Components

1. Overview. So I've been using databases recently, and wanted to write a series of notes on them.

1.1. Warning: We will NOT talk about:

  1. What's going on "under the hood" in a database
  2. Compare various engines (SQL vs. NoSQL vs. etc.)
  3. How to code a database engine from scratch
  4. How to optimize a given database engine (e.g., "How do I tune MySQL?")

1.2. Topics Discussed in the Series. We will review a database agnostic pseudocode for queries, data definition, data manipulation. We will introduce the basic concepts, using black boxes along the way (e.g., "The database typically has a storage engine translating the data to the machine" and specify it no further!).

1.3. Topics Discussed in this post. We will discuss the components of a database, namely: a database record, record's field, query, table, and views.

1.4. Intended Audience. If you're a programmer, have not learned databases, but need to learn how to use them, then this article is for you! If you are trying to write a database in, say, C...this isn't for you.

Remark. It dawns on me others have pointed out SQL is a DSL [andrejkoelewijn.com]; this is true, but we need to abuse language when working with full generality. (Say what? We need to pretend the terms are not "strict" and "loose", e.g., weakening the demand that all entries in the database have identical format.)

Database Components

2.Terminology. So, suppose we're programmers for a library. They want a program to keep track of their books. What do we do?

2.1. Records. Well, each book is represented by some "Record" (also called a row or, in CS theory, tuple) which is like a plain old data structure.

I think theoretical computer scientists would want me to say "A record is a well-defined, ordered, finite list of heterogeneous fields" or something similar.

At any rate, returning to the library example, our records would be the books.

We call a record's components its "Fields" (or Columns). For a book, its fields would be the author, the title, the publisher, the year published, etc.

Right, so far a Database Record is just a composite data structure, and a Field is an entry in a record.

2.1.1. Pseudocode Conventions. I try to adhere to CLRS' pseudocode conventions (according to their third edition). So foobar.spam accesses the field spam of the variable foobar, a composite data structure. (This is eerily similar to JSON in JavaScript.)

So if hamlet represents a record, then the field hamlet.author should be the string "William Shakespeare".

2.1.2. Conventions. All records have some default fields: id (which is assigned by the engine "under the hood" to keep track of records). More may be added later, for now id is all that's needed.

2.2. Queries. A database is great because we can look stuff up. For libraries, we can find all books within a given subject, or find all the books one given author wrote, or...

Each of these are a "Query" which retrieve records from the database based on specific criteria (e.g., year published, author, subject, etc.).

2.3. Tables, Views. We usually stick these records into a file (or sometimes several files). The technical term for such a file is a "Table".

One may be tempted to ask "For our library example, what are the tables?" Good question!

The answer is non-unique: i.e., we have many different acceptable answers! We could have one giant file containing all the books, we could have one file per subject, or have each file correspond to the books published in different years, or...

Sometimes it's useful to do all of these! Why? Because they are the results of queries, and it may be faster to pre-store the results rather than dynamically generate it at runtime.

When we store in a table the results of a query for future use, we call this table a "View".

2.4. Future Directions: Modus Operandi. We have just introduced the components for a database, but there are four aspects the programmer needs to know about: (1) Queries, (2) Data manipulation, (3) Transaction controls, (4) Data definition.

We will begin discussing queries next time.

Custom Stacktrace/Logging Function for Clojure

1. I've been learning Clojure, and sometimes I need to know what my functions are doing. To do this, I usually use some format:

file.clj:<lineno> <message>

Where the message could be notifying me some line is being evaluated, or printing the value of some variable.

1.1. I prefer writing <file-name>:<lineno>, in a nod to grep -n output.

2. I know, this is bad form, inelegant, whatever. But it is useful...and that's why I do it!

2.1. To pre-emptively answer criticisms, I know there are precisely one million and one different logging, debugging, stack-tracing libraries. They are all quite heavy weight and complicated.

For these reasons (too complicated and too heavy-weight), I prefer writing my own.

3. What code can do this? Well, the code I typically use is a couple of simple macros:

(defmacro dbg [x]
  `(do (printf "%s:%s> %s\n"
               ~*source-path* 
               ~(:line (meta &form))
               ~x)
       (flush)))

(defmacro dbg-eval [x]
  `(do (printf "%s:%s> %s \n;=> %s\n"
               ~*source-path*
               ~(:line (meta &form))
               ~(pr-str x)
               ~x)
       (flush)))

3.1. So writing (dbg "foo() is starting"), I can tell when the function foo is being called.

Stylistic Problems

4. This is based on the Java stacktrace/logging format class.methodName(file.java:lineno) <message>, and namely takes advantage using method calling syntax: method().

4.1. It'd be nice if there were a LISP-y way to do likewise, but for now...it works.

Perhaps write (filename.clj:lineno/function-name <message>)? Or some appropriate variation.