Have you ever wondered why scientists don’t use databases, as in “relational databases”? Well, they actually do, but not as often as they could. Using a relational database is worth the effort in many cases, and in this post, I will be writing about SQLite, an engine that is easy to use both from C++ and Python.
What’s a relational database anyway? Simply put, it is a container where you put a bunch of tables. A table as in “database table” is not very different than a table that you would find in a written document, with rows and columns. In a database, each column of a table has a name. The names of the columns are kept easy to type, so that you can query the database using those names. Following is an example table with data from an hypothetical experiment.
Since you can have more than one table in a database, it is quite common to link different tables together by way of having the same values in some particular column or columns. For example, if you are performing a Bayesian simulation, you might want a table to store data for each step of the simulation, and another table if you need to differentiate between chains in the simulation. Each row of the data table would be associated to a single row of the chains table, since each reported data-point would correspond to a unique chain. The association is built simply by having the same chain id in one column of each table. Here is how the chains table would look:
Did you note the liberal use of “Binary Blob” in the two previous examples? This is related to the kind of data that database engines can naturally store. Normally, there is only a single scalar piece of data in each cell of a table. For example, an integer or a real number. There are a few more simple types available: true/false values, decimals and strings. The database engine knows how to work with these simple scalar types, and you will find them handy in most situations. However, chances are that certain pieces of the experiment data will only make sense to you and your program and not to the database engine. An example would be a vector of numbers of variable length that is always written and read in full, and where you don’t use the vector’s components to do searches in the database. Another example would be a custom numerical type, like very long integers. Or even some more general algebraic object, like a sparse matrix or a rational polynomial fraction. For those cases the database engine has something called “blobs”, i.e. fields of raw data which are not interpreted in any way by the database engine. I use them liberally, since they help to keep the data closer to their natural representation.
I’m singling out SQLite because:
- There is not a lot to configure to start using the database… basically you just open or create a file that will contain it,
- and because the entire database fits in that file, you can do things that you normally do with a file, like store multiple copies, or send it to a colleague.
- SQLite has bindings for virtually all programming languages, which is good is you want to generate your data in one programming language (say, C++) and analyze it in another programming language (Python); as we will be doing soon here.
- If you are new to SQL, SQLite offers a compromise between the more conventional file-based approach to data handling and full-blown database servers.
- Late but not least, SQLite runs virtually everywhere already, from your phone to your desktop, and in many places in the middle. It can be the ideal solution if you need to collect data using esoteric hardware.
Two of the issues mentioned before are moot with SQLite: with some dexterity it is possible to get good speed, and cost is only a problem if you really need to spend some expiring grant-money, because SQLite is free both as in free beer and as in royalty free. The most serious issue is, arguably, lack of knowledge about how to use this engine. This blog post is a tutorial providing a full example that hopefully will move SQLite closer to your comfort zone.
An example: simulating the negative binomial
A full-blown Bayesian simulation is perhaps a notch too heavy for an introductory example, so instead let’s focus here in something simpler. Say we want to make an experiment to obtain an empirical probability distribution for the number of times that you roll a dice and obtain something different than one, before you obtain a one for the fifth time (I’m stealing the idea from this Wikipedia article). Since we are interested in recovering the empirical distribution, we want to run this experiment many, many times.
Writing the code that performs the actual simulation should be straightforward enough, you can find a C++ candidate here.
In the first version of the example, the only fragment related to the data-saving part so far is argv, which we use to indicate that a if the file with the database exists when the program is run, we want to remove it so that we can create it anew. In this version of the file we have taken care of actually removing the file. For simplicity, we have also introduced a fixed file name for the database.
Introducing the first SQLite code
There are several ways of using SQLite in your code. For example, if you have it installed in your system, then it is a matter of #including the header and linking against the library. Another approach, the one that we will follow here, is to include SQLite sources together with ours. This way our code can be compiled even if the library is not installed in the system or the headers are not present. Including SQLite sources in the project is even more convenient if you develop in multiple platforms/computers.
Grab SQLite’s amalgamation from SQLite’s download page; there are several files there, just choose the most recent, paying attention to the version number separated with dots. One suitable candidate would be this file. From the amalgamation archive, copy both sqlite3.h and sqlite3.c to the directory where you have the project, in this case the file sim1_main.cpp. To ensure that everything is ok, try to compile the project now using this revision of sim1_main.cpp with the following commands:
$ gcc sqlite3.c -c -o sqlite3.o
$ g++ -std=c++11 sqlite3.o sim1_main.cpp -o sim1_main
The new detail in the code is the function “setup_db”, where we will try to create and setup the database. Two functions are used: “sqlite3_open“, and “sqlite3_close”. The first one is called with a file name and a pointer to sqlite3* ; this function opens the database file if it exists or otherwise creates a new database file. The second function is used for closing the connections, and you should invoke it even if sqlite3_open fails. However, there are some good news: even if you forget to call this function, or if your program crashes and you don’t have an opportunity to call it, all the data that you have already committed to the database will be safe.
Up to this point we have taken care of opening the database. We also have some code that ensures that we will open a brand new database, by deleting any file with the same name previously. Since sqlite3_open will be creating a new database, we need to create the tables inside the database before putting any actual data there. For that, we will need to use SQL.
SQL is a specialized programming language used for databases. It includes sentences for creating, modifying and deleting tables, and more importantly, for extracting the data from the tables while at the same time doing some processing in the data. This last possibility is more commonly known as “making queries”, and since database engines know a lot about how to work with relational data, you can get the best performance by expressing how do you want your data without having to know anything about all the tricky details of how to read and navigate the tables on disk. Unfortunately a full tutorial on SQL is out of the scope of this tutorial, nonetheless I will explain the sentences that we use.
In this revision, we use a little of SQL to create a single table in the database. As you see, the SQL is nothing extraordinary: the “CREATE TABLE” sentence instructs the SQLite to create a table with just three fields: simulation_id, steps_to_goal, and, just to make things a little bit more interesting, a field dice_points where we will store the full sequence of dice points values during the casts of each individual experiment. Notice that we embed the SQL sentence in a C++ 11 raw string literal; you can use conventional C/C++ strings for the same purpose of course, but raw string literals are just more convenient.
A smarter way of managing errors
There is a bit more going on in this revision. You saw that before this revision, we had a check directly after sqlite3_open
, where we, on error, both closed the database and raised an exception. In this new revision, we are using std::shared_ptr
to return the sqlite3* connection and to close the database automatically both in normal application shutdown and in failure.
Writing data to the database file
All what is left is to write the actual data in the database. Let’s take care of that. We need to compile a SQL sentence. Compiling the sentence is done once, and from there on the compiled sentence is used many times to insert each row of data.
The update revision of the code is here. First notice the function sqlite3_prepare_v2 outside the insertion loop, giving SQLite the SQL insert statement, which contains placeholders in the form of question signs. These placeholders will be filled inside the simulation loop, with invocations to sqlite3_bind for each field that needs to be inserted. In each loop iteration, after the parameters have been bound to the prepared statement, a call to sqlite3_step is issued, which takes care of actually sending the data to the database. If we were interested in getting data out of the database, we would need to fetch a row of data immediately after sqlite3_step, but since we don’t to get any data here, we reset the statement so that we can bind new parameters to it in the next simulation loop iteration. Finally, when we are out of the insertion loop, we take care of disposing the statement with sqlite3_finalize.
This covers the basic tutorial of how to write data out from a simulation to a SQLite file, but there are a few important things that you should try in your own.
First, ensure that you get correct numbers in the table. If you are new to SQL, this would be a good to learn a bit more about queries, you can use them with SQLite both with the command-line tool sqlite3 and from Python, using the module of the same name.
Second, check the speed of the program we just finished. Insertions in SQLite are a lot faster if the inserted blocks are surrounded by “START TRANSACTION” and “COMMIT” statements. You should use them, but not before deciding what would be the atomic unit of data that you want to send as a whole to the database.
Third, notice that I have not given any use to the field “dice_points” in the table. It has space for a blob. It is possible to save there the entire array of dice points for each simulation, even if these arrays of course would have different length. There are many ways of working with blobs, the most basic one for this case would be with sqlite3_bind_blob. Read the documentation and think about ways in which you can fill the blob with array data.