TOAST

The other day I was working through my (quite massive) backlog of saved Hacker News stories and read this gem: Introduction to PostgreSQL physical storage. As someone who has loved databases forever, and has spent over a year building LinkedIn’s next-generation distributed graph database, I found this post absolutely fascinating.

As the title suggests, this article talks about how the data in PostgreSQL tables and databases is actually stored on disk, and how free space is tracked so PostgreSQL can figure out where to store incoming data. The diagram of the page structure is very helpful in understanding how data is laid out within a page. I also really liked the PostgreSQL queries used throughout the article to explain the topic at hand by examining a real PostgreSQL instance. The author does a great job of explaining concepts at just the right level of detail, with several links provided for those interested in learning more.
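
To make the page layout concrete, here is a simplified C sketch of PostgreSQL’s heap page header, loosely based on PageHeaderData in src/include/storage/bufpage.h (field names follow the real struct, but the types are simplified here). Line pointers grow forward from the header while tuple data grows backward from the end of the page, and pd_lower/pd_upper bound the free space in between:

    /* Simplified sketch of PostgreSQL's heap page header, loosely based
     * on PageHeaderData in src/include/storage/bufpage.h. Types are
     * simplified; the real struct uses PostgreSQL typedefs. */
    #include <stdint.h>

    typedef struct ItemIdData {
        unsigned lp_off   : 15;  /* byte offset of the tuple within the page */
        unsigned lp_flags : 2;   /* line pointer state (unused/normal/redirect/dead) */
        unsigned lp_len   : 15;  /* length of the tuple in bytes */
    } ItemIdData;

    typedef struct PageHeaderData {
        uint64_t pd_lsn;              /* LSN of the last WAL record touching this page */
        uint16_t pd_checksum;         /* page checksum */
        uint16_t pd_flags;            /* flag bits */
        uint16_t pd_lower;            /* offset to start of free space */
        uint16_t pd_upper;            /* offset to end of free space */
        uint16_t pd_special;          /* offset to "special" space used by indexes */
        uint16_t pd_pagesize_version; /* page size and layout version */
        uint32_t pd_prune_xid;        /* oldest prunable XID on the page, or 0 */
        ItemIdData pd_linp[];         /* line pointer array, grows toward pd_upper */
    } PageHeaderData;

    /* Free space on a page is simply the gap between the line pointer
     * array (growing forward) and the tuple data (growing backward). */
    static inline uint16_t page_free_space(const PageHeaderData *p) {
        return p->pd_upper - p->pd_lower;
    }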

(Relevant sidebar: the PostgreSQL source code documentation is amazing)

Research Paper: “Hekaton: SQL Server’s Memory-Optimized OLTP Engine”

(I’ve noticed that when I’m sad I tend to throw myself at whatever activity catches my fancy at the moment. I do this to distract myself, and in general this seems to work pretty well. To deal with my sadness this time around I will be reading research papers. And blogging. Here’s the first paper I read.)

(Hekaton was one of the systems mentioned in the “Red Book” that piqued my interest)

Hekaton: SQL Server’s Memory-Optimized OLTP Engine gives the reader an overview of Hekaton, a database engine that is part of Microsoft SQL Server and has been designed for data that fits entirely in main memory. The main motivations driving the design and implementation of Hekaton are the dropping cost of memory and the ever-growing popularity of multi-core CPUs. To achieve the best performance and take full advantage of the multiple cores, Hekaton embraces a lock/latch-free design: all of the index structures (a lock-free hash table and the Bw-tree) are lock/latch-free (details on their design are in [1] and [2]), and transactions use MVCC.
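
As a taste of what lock-free means in practice, here is a minimal C11 sketch of inserting into one bucket of a hash table using compare-and-swap. This is my own illustration, not code from the paper; the genuinely hard parts that [1] solves, such as safe deletion and memory reclamation, are deliberately left out:

    /* Minimal illustration of a lock-free insert into a hash bucket
     * using C11 atomics. My own sketch, not code from [1]; deletion and
     * safe memory reclamation (the hard parts) are omitted. */
    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct Node {
        long key;
        _Atomic(struct Node *) next;
    } Node;

    typedef struct {
        _Atomic(Node *) head;
    } Bucket;

    /* Push a new key onto the front of the bucket's list. If another
     * thread changes the head between our read and our CAS, the CAS
     * fails and we simply retry; no thread ever blocks. */
    void bucket_insert(Bucket *b, long key) {
        Node *n = malloc(sizeof(Node));
        n->key = key;
        Node *old = atomic_load(&b->head);
        do {
            atomic_store(&n->next, old);
        } while (!atomic_compare_exchange_weak(&b->head, &old, n));
    }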

While the details of the implementation of the index data structures are in other papers, this paper does go into the details of the MVCC design and of the garbage collection mechanism used to delete old record versions. Sections 6, 7, and 8 cover transactions, logging, and garbage collection. These sections are incredibly well written and do a great job of explaining these complex, core components of the system. The logging and checkpointing system is quite unique, and I thought its avoidance of traditional write-ahead logging (WAL) was interesting. Section 8, which details the garbage collection mechanism used in Hekaton, is definitely my favorite section of the paper. I think the GC algorithm is, simply put, beautiful.
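
To make the MVCC scheme concrete: each record version in Hekaton carries a begin and an end timestamp bounding its valid time, and a version becomes garbage once its valid time ended before the logical read time of the oldest active transaction, since no current or future transaction can see it. A rough C sketch of the two core checks (my own simplification; in the real system these fields can also temporarily hold transaction IDs while a commit is in flight):

    /* Rough sketch of Hekaton-style version visibility and garbage
     * detection. Simplified: the paper's begin/end fields can also hold
     * transaction IDs during the commit protocol. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct Version {
        uint64_t begin_ts;      /* commit time of the creating transaction */
        uint64_t end_ts;        /* commit time of the replacing/deleting
                                   transaction (UINT64_MAX while latest) */
        struct Version *older;  /* chain of older versions of this record */
        /* ... record payload ... */
    } Version;

    /* A version is visible to a transaction reading at logical time
     * read_ts iff its valid interval [begin_ts, end_ts) contains it. */
    bool is_visible(const Version *v, uint64_t read_ts) {
        return v->begin_ts <= read_ts && read_ts < v->end_ts;
    }

    /* A version is garbage once it ended before the read time of the
     * oldest still-active transaction: nobody can ever see it again. */
    bool is_garbage(const Version *v, uint64_t oldest_active_read_ts) {
        return v->end_ts < oldest_active_read_ts;
    }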

Another unique aspect of the system: T-SQL queries and procedures are compiled down into native code to achieve high performance. Section 5 goes into the details of how this is done. What is interesting about this conversion process is that the generated code is one big function with labels and goto statements.
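
To illustrate what that might look like, here is a hypothetical C sketch of a trivial filter-scan query compiled into a single function: instead of each operator being a separate function that calls the next, the operators become labeled blocks and control moves between them via goto. (This is my own illustrative example, not Hekaton’s actual generated code.)

    /* Hypothetical sketch of "SELECT * FROM t WHERE x > 10" compiled
     * into one big function. Operators are labeled blocks rather than
     * separate functions; control flows between them with goto.
     * Illustrative only -- not Hekaton's actual generated code. */
    #include <stddef.h>

    typedef struct Row { int x; /* ... */ } Row;

    void compiled_query(Row *rows, size_t nrows, void (*emit)(Row *)) {
        size_t i = 0;

    scan_next:                     /* "scan" operator */
        if (i >= nrows) goto done;
        goto filter;

    filter:                        /* "filter" operator: WHERE x > 10 */
        if (rows[i].x > 10) goto output;
        i++;
        goto scan_next;

    output:                        /* "output" operator */
        emit(&rows[i]);
        i++;
        goto scan_next;

    done:
        return;
    }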

This was a great paper to begin 2016 with.

References

[1] Maged M. Michael. High Performance Dynamic Lock-Free Hash Tables and List-Based Sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA ’02), pp. 73–82, 2002.

[2] Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. The Bw-Tree: A B-tree for New Hardware Platforms. In Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE), pp. 302–313, April 2013.

Research Paper: “AsterixDB: A Scalable, Open Source BDMS”

(AsterixDB was one of the systems mentioned in the “Red Book” that piqued my interest)

AsterixDB: A Scalable, Open Source BDMS gives the reader an overview of the AsterixDB system. AsterixDB is an impressive “big data management system” (BDMS) with several interesting features, including a flexible data model, a powerful query language, data ingestion capabilities, and distributed query execution. Two features that stood out to me were the ability to declare custom index types (B+-tree, R-tree, etc.) on your data, and the ability to query data that “lives” outside the system.

The majority of the paper covers the data definition and manipulation layer. The authors use the example of a social networking website to illustrate the power of AsterixDB’s data model and query language. Most of this section consists of code snippets (to define, load, and query the data), each followed by an explanation of what exactly that snippet does and what happens under the hood when it is run. These code snippets make this section of the paper very easy to read and understand.

The data storage, indexing, and query execution components are described in the System Architecture section of the paper. These subsystems have separate papers ([1] and [2]) devoted to them; here we are just given a brief overview of how they function and what their key features are. One thing that stood out to me in this section was the software layer that gives any index data structure LSM (log-structured merge) update semantics. I thought this was a very novel way to speed up data ingestion and index building while still allowing diverse index data structures suited to the type of data being stored and indexed. The secondary index design is also interesting.
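
A rough sketch of the idea as I understand it (my own C pseudocode, not AsterixDB’s actual interface): all writes go to an in-memory instance of the wrapped index; when it fills up, it is frozen and flushed wholesale to disk as an immutable component; reads consult the in-memory component first and then the disk components, newest to oldest, with deletes represented as tombstone entries.

    /* Rough sketch of a generic "LSM-ifying" layer: any index that
     * supports insert/lookup can be wrapped so all writes go to an
     * in-memory component, flushed to an immutable on-disk component
     * when full. My own illustration, not AsterixDB's actual interface. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Index Index;   /* any index: B+-tree, R-tree, ... */
    typedef struct {
        Index *(*create)(void);
        void   (*insert)(Index *, const void *key, const void *val);
        bool   (*lookup)(Index *, const void *key, void **val_out);
        size_t (*size)(Index *);
        Index *(*flush_to_disk)(Index *);  /* freeze into immutable component */
    } IndexOps;

    #define MAX_COMPONENTS 64

    typedef struct {
        const IndexOps *ops;
        Index *mem;                   /* current in-memory component */
        Index *disk[MAX_COMPONENTS];  /* immutable components, newest first */
        size_t ndisk;                 /* assumed < MAX_COMPONENTS; merging/
                                         compaction is omitted from this sketch */
        size_t mem_budget;
    } LsmIndex;

    void lsm_insert(LsmIndex *l, const void *key, const void *val) {
        /* All writes (including deletes, as tombstones) hit memory only. */
        l->ops->insert(l->mem, key, val);
        if (l->ops->size(l->mem) >= l->mem_budget) {
            /* Flush: freeze the memory component into a disk component. */
            for (size_t i = l->ndisk; i > 0; i--) l->disk[i] = l->disk[i - 1];
            l->disk[0] = l->ops->flush_to_disk(l->mem);
            l->ndisk++;
            l->mem = l->ops->create();
        }
    }

    bool lsm_lookup(LsmIndex *l, const void *key, void **val_out) {
        /* Check components newest to oldest; the first hit wins. */
        if (l->ops->lookup(l->mem, key, val_out)) return true;
        for (size_t i = 0; i < l->ndisk; i++)
            if (l->ops->lookup(l->disk[i], key, val_out)) return true;
        return false;
    }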

I really enjoyed reading this paper. I’ve added [1] and [2] to my “research papers to read next” list, and hope to get to them very soon.

[1] S. Alsubaiee, A. Behm, V. Borkar, Z. Heilbron, Y.-S. Kim, M. Carey, M. Dressler, and C. Li. Storage Management in AsterixDB. Proc. VLDB Endow., 7(10), June 2014.

[2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A Flexible and Extensible Foundation for Data-Intensive Computing. In Proceedings of ICDE, pp. 1151–1162, 2011.