Sunday, October 14, 2012

Databases, Part 1: Components

1. Overview. So I've been using databases recently, and wanted to write a series of notes on them.

1.1. Warning: We will NOT talk about:

  1. What's going on "under the hood" in a database
  2. Compare various engines (SQL vs. NoSQL vs. etc.)
  3. How to code a database engine from scratch
  4. How to optimize a given database engine (e.g., "How do I tune MySQL?")

1.2. Topics Discussed in the Series. We will review a database agnostic pseudocode for queries, data definition, data manipulation. We will introduce the basic concepts, using black boxes along the way (e.g., "The database typically has a storage engine translating the data to the machine" and specify it no further!).

1.3. Topics Discussed in this post. We will discuss the components of a database, namely: a database record, record's field, query, table, and views.

1.4. Intended Audience. If you're a programmer, have not learned databases, but need to learn how to use them, then this article is for you! If you are trying to write a database in, say, C...this isn't for you.

Remark. It dawns on me others have pointed out SQL is a DSL [andrejkoelewijn.com]; this is true, but we need to abuse language when working with full generality. (Say what? We need to pretend the terms are not "strict" and "loose", e.g., weakening the demand that all entries in the database have identical format.)

Database Components

2.Terminology. So, suppose we're programmers for a library. They want a program to keep track of their books. What do we do?

2.1. Records. Well, each book is represented by some "Record" (also called a row or, in CS theory, tuple) which is like a plain old data structure.

I think theoretical computer scientists would want me to say "A record is a well-defined, ordered, finite list of heterogeneous fields" or something similar.

At any rate, returning to the library example, our records would be the books.

We call a record's components its "Fields" (or Columns). For a book, its fields would be the author, the title, the publisher, the year published, etc.

Right, so far a Database Record is just a composite data structure, and a Field is an entry in a record.

2.1.1. Pseudocode Conventions. I try to adhere to CLRS' pseudocode conventions (according to their third edition). So foobar.spam accesses the field spam of the variable foobar, a composite data structure. (This is eerily similar to JSON in JavaScript.)

So if hamlet represents a record, then the field hamlet.author should be the string "William Shakespeare".

2.1.2. Conventions. All records have some default fields: id (which is assigned by the engine "under the hood" to keep track of records). More may be added later, for now id is all that's needed.

2.2. Queries. A database is great because we can look stuff up. For libraries, we can find all books within a given subject, or find all the books one given author wrote, or...

Each of these are a "Query" which retrieve records from the database based on specific criteria (e.g., year published, author, subject, etc.).

2.3. Tables, Views. We usually stick these records into a file (or sometimes several files). The technical term for such a file is a "Table".

One may be tempted to ask "For our library example, what are the tables?" Good question!

The answer is non-unique: i.e., we have many different acceptable answers! We could have one giant file containing all the books, we could have one file per subject, or have each file correspond to the books published in different years, or...

Sometimes it's useful to do all of these! Why? Because they are the results of queries, and it may be faster to pre-store the results rather than dynamically generate it at runtime.

When we store in a table the results of a query for future use, we call this table a "View".

2.4. Future Directions: Modus Operandi. We have just introduced the components for a database, but there are four aspects the programmer needs to know about: (1) Queries, (2) Data manipulation, (3) Transaction controls, (4) Data definition.

We will begin discussing queries next time.

Custom Stacktrace/Logging Function for Clojure

1. I've been learning Clojure, and sometimes I need to know what my functions are doing. To do this, I usually use some format:

file.clj:<lineno> <message>

Where the message could be notifying me some line is being evaluated, or printing the value of some variable.

1.1. I prefer writing <file-name>:<lineno>, in a nod to grep -n output.

2. I know, this is bad form, inelegant, whatever. But it is useful...and that's why I do it!

2.1. To pre-emptively answer criticisms, I know there are precisely one million and one different logging, debugging, stack-tracing libraries. They are all quite heavy weight and complicated.

For these reasons (too complicated and too heavy-weight), I prefer writing my own.

3. What code can do this? Well, the code I typically use is a couple of simple macros:

(defmacro dbg [x]
  `(do (printf "%s:%s> %s\n"
               ~*source-path* 
               ~(:line (meta &form))
               ~x)
       (flush)))

(defmacro dbg-eval [x]
  `(do (printf "%s:%s> %s \n;=> %s\n"
               ~*source-path*
               ~(:line (meta &form))
               ~(pr-str x)
               ~x)
       (flush)))

3.1. So writing (dbg "foo() is starting"), I can tell when the function foo is being called.

Stylistic Problems

4. This is based on the Java stacktrace/logging format class.methodName(file.java:lineno) <message>, and namely takes advantage using method calling syntax: method().

4.1. It'd be nice if there were a LISP-y way to do likewise, but for now...it works.

Perhaps write (filename.clj:lineno/function-name <message>)? Or some appropriate variation.

Friday, June 8, 2012

What else is wrong with Code Academy?

I had posted a number of missing topics on Code Academy's curriculum last time.

Since then I was asked if I could prepare some material for Ruby or Python. After carefully inspecting the terms of agreement:

You hereby grant us the right to own all right, title and interest (including patent rights, copyright rights, trade secret rights, mask work rights, trademark rights, sui generis database rights and all other rights of any sort throughout the world) to any and all Curriculum Contributions you make to the Website, and you hereby make all assignments necessary to accomplish the foregoing ownership. To the extent allowed by law, you also agree to waive all rights of paternity, integrity, disclosure and withdrawal and any other rights that may be known as or referred to as “moral rights,” “artist’s rights,” “droit moral,” or the like with respect to any Curriculum Contributions you make.

Ah, here's another thing wrong: I do your work and you get the credit for it. Well, not credit, but the "patent rights, copyright rights, trade secret rights, mask work rights, trademark rights, sui generis database rights and all other rights of any sort throughout the world".

I wonder why that doesn't work in school...?

Thursday, May 31, 2012

Ruby Koans

If anyone is learning Ruby, and wants to become more familiar with various nuances, go download the Ruby Koans.

There are about two dozen files, each discussing a specific topic. You go through each file, looking for __ segments and replace them with statements which would result in true assertions.

After completing a few, you can run ruby path_to_enlightenment.rb and you get the following prompt:

user:~/koans$ ruby path_to_enlightenment.rb 
AboutNil#test_nil_is_an_object has expanded your awareness.
AboutNil#test_you_dont_get_null_pointer_errors_when_calling_methods_on_nil has 
expanded your awareness.
AboutNil#test_nil_has_a_few_methods_defined_on_it has expanded your awareness.
AboutObjects#test_everything_is_an_object has damaged your karma.

The Master says:
  You have not yet reached enlightenment.
  You are progressing. Excellent. 8 completed.

The answers you seek...
  <"FILL ME IN"> expected but was  <true>.

Please meditate on the following code:
  /home/alex/local/ruby/koans/about_objects.rb:5:in `test_everything_is_an_object'

when you lose, don't lose the lesson
your path thus far [.X________________________________________________] 8/280
user:~/koans$

This really helps understand some of the nuances underpinning Ruby.

Thursday, May 24, 2012

Code Academy: What's Missing?

Recently, Code Academy launched with the intention to teach computer programming to "the masses".

I'd like to discuss a short-coming in their material (I won't discuss the method used currently).

Someone in their forums has asked[deadlink] "What topics need new courses the most?"

[NB: apparently code academy has taken down their forums...I guess they couldn't stand the complaints about their EULA.]

I gave a short answer, but it gave me something to think about: what should a programmer know?

Compounding the problem, the material a working coder should know differs from the material constituting an programmer's education.

I'm just going to rattle off items belonging to both categories...

Programmer's Education

The education includes stuff that may or may not be useful when working.

0. Difference between Expressions and Statements. This important distinction is never taught, so students never figure why foo = function() { ... } differs from function bar() { .... }.

1. Algorithms. This topic is not discussed thus far in any course on Code Academy. Being more specific, I mean CLRS-level stuff, not rudimentary "make a loop, do stuff, blah".

Specifically, algorithms that should be covered:

  1. Sorting Algorithms
    • Bubblesort, as an example of a bad idea;
    • Insertion sort
    • Mergesort
    • Quicksort
    • Heap sort
  2. Searching Algorithms
    • Linear search
    • Binary search
    • A* search algorithm (and other graph search algorithms)
    • etc.

2. Data Structures. Although discussed a little, the Academy does not discuss linked lists, stacks, queues, binary trees, etc.

Advanced data structures might include, but not limited to, Fibonacci heaps, B-trees, van Emde Boas Trees, etc.

3. Dynamic Programming. Both the "Bottom-Up" and "Top-Down" approaches should be discussed; memoization ought to be covered, etc.

4. Greedy Algorithms. This, I feel, is important for pedagogical purposes...greedy algorithms can fail in practice, but it is important to expand your mind.

5. Algorithm Analysis. I learned this in numerical analysis, where we say "Stop: Why does this algorithm work? How can we make it better?"

This cannot be taught on Code Academy, since it would require a nonlinear framework...i.e., actually taking the user's response and generating questions/problems/exercises based off of it.

Think of it as the Socratic method applied to programming. Code Academy cannot handle the Socratic method (perhaps it would not be too far to suggest they are...sophists?).

But seriously: nurturing curiosity in a way specific to the pupil, that's the mark of a good teacher. Code Academy currently cannot facilitate this.

6. Various Rites of Passage. It seems computer science majors have certain "rites of passage": writing an operating system from scratch (or implementing their own file system, process manager, etc., for *nix); writing their own compiler or interpreter; understanding the networking stack; etc.

These should be included in any coder's education...but they require low level languages to do it (assembly or C). Code Academy cannot handle this, at the present moment...

Working Coder's Knowledge

Perhaps one should consult What every CS majors should know...

1. AJAX. Certainly, the most glaring feature missing is AJAX, which I think any browser can handle (the XMLHttpRequest...does it depend on a library?).

2. Runtime, Parse Time, etc. Code Academy does not mention these distinguished times. Nor is there any discussion of "server-side" vs. "client-side", what the difference is, etc.

3. Debugging Techniques. Really, this is most useful, when you run code and cannot figure out what the problem is. What to do?

Well, insert console.log(...) calls indicating what is happening. "I am doing this", "function blah() has been called", etc.

4. How to Optimize Code. I have just written a JavaScript program. Wonderful. It's slow as death. What do I do to make it faster?

Final Thoughts

Certainly many things are missing, but these are my thoughts on the matter of Code Academy and a programmer's pedagogy.

Thursday, May 17, 2012

Pseudocode in HTML

So I've been typing up my own notes in html, by hand (like in the bad old days of the internet), and I've discovered some rather interesting behaviour of CSS2.

My pseudocode conventions have composite data structures resemble C's struct, so for example:

struct myDataStructure {
    // code not shown
}
Comments are done in traditional C method too. How to accomplish this using <span> classes? Well, we first specify a class pseudo for pre denoting our pseudocode:
pre.pseudo { } /* pre's pseudocode class */

Now we will investigate each part of pseudocode we want to implement!

Data Structures

I use Plain Old Data Structures in my pseudocode, often writing <span class="struct">myStructName</span> and formatting it as struct myStructName...

/* write a structure as <span class="struct">myStructName</span> */
pre.pseudo span.struct:before {
    content:"struct ";
    font-weight:bold;
}

Functions

My shorthand for the word "function" is always "fn", so this is the class I created. The pseudocode conventions used is simply function doSomething(...), and to implement this:

pre.pseudo span.fn:before {
    content:"function ";
    font-weight:bold;
}

This is my personal convention, but there is also the C-style convention which I'll discuss.

Alternative Function Pseudocode

Some people like writing their function pseudocode to resemble C's ret-type function-name (arg-list), and it is possible to do this:

/* <span class="function" ret="void">foobar</span>() produces "void foobar()" */
pre.pseudocode span.function:before {
    content:attr(ret) " ";
/*
    font-weight:bold;
*/
}

If you prefer the function's ret-type to be bold, you can uncomment the second line.

For Loops

Our trusty for(...) construct is readily implemented:

/* for-loop formatting to write 
   <span="for">i=1; i<j; i++</span>
   be formatted as
   for(i=1; i<j; i++) */
pre.pseudo span.for:before {
    content:"for (";
    font-style:normal;
}
pre.pseudo span.for:after {
    content:") ";
    font-style:normal;
}

It would be nice if I could write <span class="for>...</span> to produce for(...), but I cannot think of a way to do so.

Comments

I end up using the comment convention // my comment which works well on paper, but due to html resizing...it can be problematical. So here's the CSS implementation of the paper-way, and commented out is the traditional C way /* writing a comment like this */.

/* comments are italicized texts beginning with a double slash // */
pre.pseudo span.comment {
    font-style:italic;
}
pre.pseudo span.comment:before {
    content:"// ";
    font-style:normal;
}
/* alternatively uncomment the following to have C++ like comments */
/*
pre.pseudo span.comment:before {
    content:"/* ";
    font-style:normal;
}
pre.pseudo span.comment:after {
    content:" */";
    font-style:normal;
}
*/

Constants

What about constants? For example, I like referring to the null object as NULL, can we do it?

pre.pseudo span#NULL:before {
    content:"NULL";
}

This would require us to write <span id="NULL"/>. This may seem like a waste of time and energy, but when you decide lowercase "null" looks prettier...you have to change only one line of css to get it!

Flaws

None!

No, I'm just kidding, the only problems I can think of is that these <span> classes work only within the <pre class="pseudo"> environment.

So if you were working in the <code> environment...you'd have trouble!

Of course, my pseudocode listings provided are incomplete (conditionals and while-loops are omitted, among other things), but you get the idea how to take advantage of CSS when "typesetting" pseudocode in html.

Wednesday, May 16, 2012

Lectures on Algorithms

Although I'm devoted to Knuth, CLRS' Introduction to Algorithms is a good read.

Leiserson (the "L" in CLRS) has his lectures [youtube.com] free online. Caution: skip the first 16 or 17 minutes of the first lecture, it's administrative stuff.

The course website is available [mit.edu] thanks to MIT's OpenCourseWare.

The pseudocode appears a little foreign for neophytes (e.g. writing foo←bar instead of foo=bar for assignment). The assignment pseudocode operator actually remains the same across the board...at least, among theoretical computer scientists.

CLRS fixes their pseudocode conventions in their third edition, however; which was published after these lectures were Youtubed.

A personal quibble: array indices for CLRS begin at 1. This annoys me. But it is a misdemeanor, not a felony.

Also, if you are wondering about the difference between big-O notation and big-Theta notation: suppose some algorithm for a computer were Θ(n), then it is O(n). But not all O(n) algorithms are Θ(n) --- there are some rare instances when this is true.

So Θ(n) gives a more strict criteria than O(n), but in practice it's "ok" to mix up notations (since we usually examine worst-case scenarios anyways!).

Addendum: when is it not ok to make this mistake that [f(n)=Θ(g(n))]==[f(n)=O(g(n))] for some f(n), g(n)?

When f(n)<g(n).

However, if g(n)<f(n)<c*g(n) for every n and for some fixed positive integer c, then [f(n)=Θ(g(n))]==[f(n)=O(g(n))]

So if we study the worst case situations, it's okay to mix up big-O and big-Θ notation. The difference occurs when we study the best-case behaviour (which Leiserson calls "bogus"), and possibly when we study the average-case behaviour.

Punch-line: the big-Θ convention is "stronger" than (i.e., implies) the big-O notation, but big-O notation does not imply big-Θ in general.

For studying worst-case behaviour, they are equivalent.

When studying average-case behaviour, we do not know if they are equivalent (it needs to be proven for each algorithm).

When studying best-case behaviour, you are committing fraud: best-case scenarios are bogus.