Why you should switch to Python 3, now

There are two major versions of Python: Python 2 and Python 3. Python 3 was released in December 2008, and was designed to rectify certain fundamental flaws in the language. Fixing them unavoidably broke backwards compatibility, which has slowed adoption by people who depend on legacy features and libraries. I myself decided to switch to Python 3 only recently, after popular libraries like Numpy started supporting it. But after completing a few projects with Python 3, I’m not looking back. On the contrary, I’m very happy with the renewed Python.

The main reason to switch is that Python 3 is the version of the language that is evolving, while Python 2 is being slowly euthanized. If this alone is not enough to convince you, I have written down some more direct reasons from my experience using Python for scientific programming.

Practical reasons

The changes with the biggest impact on my day-to-day work are the following:

  1. Python 3 makes the programmer more aware of the difference between encoded text, which is treated as binary data and has type bytes, and plain text, which is good old Unicode regardless of the human language it contains and has type str. This is a good thing that in practice reduces the mental bookkeeping that programmers do to handle Unicode text correctly. It is especially handy if you don’t intend to become a specialist in text and Unicode handling, but want to work comfortably analyzing data in multiple languages.
  2. Python 3 removes the print statement. Instead, an ordinary function with the same name is introduced in the builtins namespace. The new function has support for printing non-default separators and line terminators, and can also write to arbitrary files (just like the old print statement could, modulo the awkward syntax), and it even has an extra argument for flushing contents immediately, so that it is an excellent and simple way of logging things to a file or standard output. Quite handy for those long-running processes that you leave running in the cluster with nohup.
  3. In Python 2, range and map returned fully constructed lists. In Python 3, range returns a lazy sequence object and map returns an iterator. In practice this means that large loops are more memory-efficient, because range(1000000) won’t allocate a list with one million elements, and map can be used to find the longest line of a file that is terabytes long: max(map(len, open("my_www_dump.txt"))).
  4. Python 3 has improved and simplified the handling of files: you can simply do open("my_mandarin_glossary.txt", encoding="utf-8") and get strings with Mandarin characters directly out of the file. Conversely, if you just need bytes, you can do open("my_binary_matrix.bin", "rb") and read bytes with the binary representation of your numbers. No need to twist your mind doing a-posteriori conversions.
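
The four points above can be tried out in a few lines of Python 3. This is only a sketch: the file name demo_glossary.txt and the sample strings are made up for the illustration, and a StringIO object stands in for a real log file.

```python
import io
import os
import tempfile

# 1. Text (str) and encoded text (bytes) are distinct types.
text = "señal"                       # str: a sequence of Unicode characters
encoded = text.encode("utf-8")       # bytes: the UTF-8 representation
n_chars, n_bytes = len(text), len(encoded)   # the "ñ" needs two bytes

# 2. print is a function with sep, end, file and flush arguments.
log = io.StringIO()                  # stands in for a log file
print("step", 42, "done", sep=";", end="\n", file=log, flush=True)
logged = log.getvalue()

# 3. map is lazy: nothing is computed until the values are consumed.
lengths = map(len, ["a", "bbb", "cc"])
longest = max(lengths)

# 4. open() decodes text for you when given an encoding.
path = os.path.join(tempfile.mkdtemp(), "demo_glossary.txt")  # hypothetical file
with open(path, "w", encoding="utf-8") as out:
    out.write("你好\n")
with open(path, encoding="utf-8") as f:
    as_text = f.read()               # str, already decoded
with open(path, "rb") as f:
    as_bytes = f.read()              # bytes, exactly as stored on disk

print(n_chars, n_bytes, longest, type(as_text).__name__, type(as_bytes).__name__)
```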

Improvements to the standard library

There have also been improvements to other parts of the standard library which are relevant to the way we work when doing scientific programming. Here are some:

  1. Better support for compressed files in the bz2 module.
  2. Improved Pickle support. No need to import cPickle anymore.
  3. Improved xml.etree support.
  4. Unified handling of internet resources in submodules of urllib. Before, one had to use two modules with similar and confusing names: urllib and urllib2. They did different things, and I could never remember which one did what.
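
As a small illustration of three of these modules in Python 3, here is a sketch that pickles a record, compresses it with bz2 and takes a URL apart with urllib.parse. The record and the URL are made up for the example.

```python
import bz2
import pickle
from urllib.parse import urlparse

# pickle: there is a single module; the C accelerator is used when available.
record = {"run": 7, "values": [0.5, 1.5, 2.5]}
payload = pickle.dumps(record)

# bz2: compress and decompress bytes in memory.
compressed = bz2.compress(payload)
restored = pickle.loads(bz2.decompress(compressed))

# urllib.parse: one of the submodules of the unified urllib package.
parts = urlparse("http://example.org/data/run7.csv?format=raw")  # made-up URL

print(restored == record, parts.netloc, parts.path, parts.query)
```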

Popular libraries that you don’t need to fret about any longer when using Python 3

The following libraries work already out of the box with Python 3:

  • Numpy, Scipy, Pandas
  • NetworkX
  • Apsw, for SQLite databases.
  • Oursql, for connecting to MySQL databases.
  • Jinja2 and Django, in case you want to show off your interesting data on the web.

This list is of course much longer; I have written only the most interesting cases.

 

Cheat-sheet for Cython and C++

Iterators: Even if Cython has some support for operator overloading in C++, at version 0.19.1 that support is a bit sketchy and at times you will find some quirks. But the following code compiles:

# The line below does not work:
# from cython.operator cimport dereference, preincrement

# This one does work:
cimport cython.operator as co

cdef extern from "myheader" namespace "my":
    cppclass iterator:
        iterator& operator++() 
        bint operator==(iterator)
        int operator*()

cdef class wi:
    cdef iterator* it
    cdef iterator* end

    def __cinit__(self):
        self.end = new iterator()

    # Most likely, you will be calling this directly from this 
    # or another Cython module, not from Python. 
    cdef set_iter(self, iterator* it):
        self.it = it

    def __iter__(self):
        return self 

    def __dealloc__(self):
        # This works by calling "delete" in C++, you should not
        # fear that Cython will call "free"
        del self.it 
        del self.end

    def __next__(self):
        # This works correctly by using "*it" and "*end" in the code,
        if  co.dereference( self.it ) == co.dereference( self.end ) :
            raise StopIteration()
        result =  co.dereference( co.dereference( self.it ) )
        # This also does the expected thing.
        co.preincrement(  co.dereference( self.it ) )
        return result

It can’t be None: this snippet comes from here. The “not None” clause is a Cython annotation on a typed argument: without it, Cython accepts None for an extension-type parameter; with it, passing None raises a TypeError.

def __cinit__(self, Connection conn not None, bint show_table=False,
            **kwargs):

Reading structured text files in C++ with boost::xpressive

Introduction

The easier your data is to understand, the better off you are. Now, “understand” refers both to humans and to computers. At times, the two sides are in conflict. For example, natural speech is really hard for computers to understand. On the other side, we humans need special goggles even to see binary data, not to mention all the brain power required to understand the raw thing. The situation can be summarized in the following way: humans like things that resemble anthropomorphic expressions, computers like bare logical structure. Well, maybe there is an easy solution: put structure in text. That’s what programming languages do, after all, and we use them just fine…(or almost).

How does this apply to your data? And more importantly, is it practical? The answer to both questions is, unsurprisingly, “it depends”. If you have a large budget and loads of time, you can use them to build editing/viewing tools around an efficient binary representation. Alternatively, at no cost, you can use existing software like MySQL or SQLite, which amounts to using a well-tested and robust third-party binary representation with an ecosystem of libraries and tools already in place. This last option is a very good method for medium and large datasets. However, for small datasets that humans need to inspect and edit by hand, nothing trumps text files.

Self-explanatory text-files, I mean. Let’s see first an example that doesn’t fit the bill:

12 13 15 9
1 8 14 5
2 1 2 3 8
...

That gray block above is a fragment of a text data file that a colleague and I exchanged once. The numbers in each line are codes representing cities, and each line represents a vehicle traveling and visiting those cities… up to the last number, which represents the number of times that the trip was made.

If you skip the previous explanation, you have no way of knowing what those numbers are. Even if you were the author of the file, it is very likely that after a few months you wouldn’t remember what it was about either. There are other little problems as well. For example, since the file contains only integers, you are tempted to use them as indices into arrays… and that can work, albeit with the hassle of determining the highest number in the file, and perhaps the lowest one, if it doesn’t start at zero, and whether they are consecutive… If you don’t validate the input for this kind of mishap, one day the file will be prepared in a slightly different way and the program processing it will fail suddenly. Admittedly the previous format is very compact, but since

Readability counts.

(where does that quote come from?), we can try a more explicit and readable way of writing the vehicle traveling information:

( "New York" -> Reykjavik -> Oslo ) x 9
( Madrid -> Frankfurt -> Stockholm ) x 5
( London -> Berlin -> London -> Vienna ) x 8
...

I bet you understand the second version better, right? Instead of codes for the cities, the actual names of the cities are used, and the lines have such a structure that you can get very close to the meaning of the data contained in the file on your own. Now, if you use numerical codes to represent the cities inside your program, you can have those codes in whatever numerical space you prefer, for example, starting at zero and going up consecutively. It is not much work compared with the situation before, where you had to validate the numerical ranges, or with the low-quality time that you may spend debugging your program if you don’t do any validation.

Now, you must be thinking that this second version is more difficult for your program to parse. It probably is, if you already know how to robustly read numbers into your program, say, into a std::vector, and put them in a vector of vectors. For yours truly, reading both formats is equally difficult, since the interaction of std::cin with spaces and newlines is not entirely straightforward. In this post, I’m using the second format to build an example of a text-parsing C++ program. I hope you will find the example useful, since text parsing is pretty much unavoidable once the format of the text file, or rather the information that it represents, goes beyond a certain complexity threshold. Lastly, text parsing is frequently required for information that is extracted from diverse sources, like when crawling web pages or computer logs.

Be Xpressive

One of the good reasons for using C++, even if it is not too pretty a language, is the excellent library support that it has. Here I will be using boost::xpressive for the parsing task outlined above. This is the Swiss-army knife of parsing in C++: it can be used for anything from simple text patterns all the way up to more structured snippets, like the trip descriptions above.

I will start with a complete initial program here, which you can use to check that your C++ build environment works. To compile this program, you need to make sure that your C++ compiler can find the boost headers. If you have not used boost before, please get it from www.boost.org.

How to build with boost

Before you go to www.boost.org and try to follow the installation and build instructions there, I would like to dispense some advice. The Boost build system works great in Linux with gcc, but on other platforms things might be more complicated. Fortunately, many boost libraries are header-only, meaning that all you need to do is to download and decompress the boost package. Then, when you want to build your program, for example, the code referred to above, you add a “-I” flag (or equivalent) to the compiler’s command line indicating the path where you just unpacked boost.

Here is a brief description of the parts of this program. First, we define a C++ struct to contain the elements of a parsing grammar.  A parsing grammar is, simply put, a set of rules describing the correct syntax of the input. In our case that syntax is as follows:

     file ::= line + 

     line ::= LEFT_PARENTHESIS CITY_NAME ( ARROW CITY_NAME )+ RIGHT_PARENTHESIS LETTER_X POSITIVE_INTEGER

In the previous fragment, the ‘::=’ sign is read as “is made of”, and the “+” indicates “one or more times” for the thing immediately preceding it. Thus, the first line is an abbreviated way of saying “a file is made of one or more lines”. The second line says “a line is made of a left parenthesis first, then a city name, then one or more combos of arrow and city name, then…”. The C++ struct in the code says exactly what is above, but using the syntax available to C++ programs, where the “+” sign can only be written before the expression affected, and the “>>” operator has to be used instead of a plain space to mean “match first what is on the left side and then what is on the right side”. We actually only define the second rule, which we name “e_line” in the C++ example, because our information is delimited by lines anyway and we can use C++’s built-in support for reading and processing one line at a time. As you will see in a bit, this helps to keep things simple. For the details of boost::xpressive that make this grammar possible, please consult the official documentation, it is really good!
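
To make the “line” rule concrete without a C++ toolchain at hand, here is a rough Python sketch of the same grammar written as an ordinary regular expression. It is not the boost::xpressive code of the post; the names CITY, LINE and parse_line are mine, and I am assuming that a city name is either a double-quoted string or a single word made of letters.

```python
import re

# CITY_NAME: either a double-quoted string or a single word of letters.
CITY = r'(?:"[^"]+"|[A-Za-z]+)'

# line ::= "(" CITY_NAME ( "->" CITY_NAME )+ ")" "x" POSITIVE_INTEGER
LINE = re.compile(
    r'^\s*\(\s*(?P<cities>{c}(?:\s*->\s*{c})+)\s*\)\s*x\s*(?P<count>[1-9][0-9]*)\s*$'
    .format(c=CITY)
)

def parse_line(line):
    """Return (list_of_city_names, count), or None if the line is malformed."""
    match = LINE.match(line)
    if match is None:
        return None
    # Pull the individual names out of the matched "A -> B -> C" fragment.
    cities = [name.strip('"') for name in re.findall(CITY, match.group("cities"))]
    return cities, int(match.group("count"))

print(parse_line('( "New York" -> Reykjavik -> Oslo ) x 9'))
print(parse_line('12 13 15 9'))
```

The first call returns the three city names together with the number 9, while the second one returns None, because the bare-numbers format does not conform to the rule.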

Now, our little example program compiles, reads a file and says whether the file conforms to the syntactic rules or not. If we were just interested in “compliance enforcement”, we would be done here. But we actually want to use the information inside the program; reading it is just the first step before processing it, right? There are many ways of representing the information read from the file in your program. This next version of the file shows one such way, even if the example program still doesn’t put any data there. To see what is going on, note the typedefs and the new struct from line 59 to line 81. The idea is to represent places as numbers, and to keep a couple of maps with the one-to-one equivalence between numbers and place names. The vector with the numbers representing each itinerary is put in an itinerary_detail_t struct together with an integer indicating how many times that route was taken. The information from all the itineraries and the maps is packed in a new struct, document_t, that can be passed around as a single entity.

[collapse title="Why to use numbers instead of the full name"]

Not only does a number use less space than the full name, it is also faster in many operations where the only thing that matters is the object’s identity, and, if the numbers are assigned consecutively, they can also be used as indices into an array.

[/collapse]
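
The same idea can be sketched in a few lines of Python. The classes below only mirror the roles of itinerary_detail_t and document_t; their names, fields and methods are my own choices for the illustration and not a transcription of the C++ code.

```python
from dataclasses import dataclass, field

@dataclass
class Itinerary:
    places: list      # consecutive integer ids, in visiting order
    times: int        # how many times the trip was made

@dataclass
class Document:
    id_of: dict = field(default_factory=dict)        # place name -> id
    name_of: list = field(default_factory=list)      # id -> place name
    itineraries: list = field(default_factory=list)

    def place_id(self, name):
        """Return the id of a place, assigning the next free one if it is new."""
        if name not in self.id_of:
            self.id_of[name] = len(self.name_of)
            self.name_of.append(name)
        return self.id_of[name]

    def add(self, names, times):
        """Store one itinerary, translating the names to ids on the way."""
        self.itineraries.append(
            Itinerary([self.place_id(n) for n in names], times))

doc = Document()
doc.add(["London", "Berlin", "London", "Vienna"], 8)
doc.add(["Madrid", "Berlin"], 5)
print(doc.itineraries[0].places, doc.name_of)
```

Because the ids are handed out consecutively starting at zero, name_of can be a plain list, and any per-city array in the rest of the program can be indexed directly with the ids.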

Once the data structure to store the information is ready, it is time to actually put it there. There are two ways of doing this with boost::xpressive. One of them is often used with regular expressions: the so-called captures. The other is more frequently used when building compilers, and that is to attach semantic actions to the rules. We are going to use this last way in the last version of the code. Here are the things that I want to call to your attention in this last evolution:

  • Semantic actions are inside square brackets. They look almost like ordinary code, but the expressions there don’t actually execute until the preceding pattern is matched. If you want to know more about how this is possible, google “expression templates”; Wikipedia’s article is also a good reference. In order to use semantic actions with boost::xpressive, you need to include the boost/xpressive/regex_actions.hpp header.
  • The parser_t struct has gained two member variables: one vector of strings and one string to store the string representation of the multiplicity part (the number after the “x”). These variables are used to exchange data, since they are visible to both the xpressive terms (e_city_name and e_positive_integer) and to the code running the line-reading loop. The reset method in parser_t takes care of resetting the vector with the names of the places.
  • The loop consuming lines from the file has also grown a bit, with logic for assigning numbers to the cities and storing the actual numbers in document_t.

Since the boost::xpressive grammar is still just “legal” C++, keep in mind that the usual rules about operator precedence apply.

Conclusion

My hope with this example is that you gain a little confidence in the text-parsing abilities of C++, for the occasions when you need them. There are many more things that you can do, and the documentation of boost::xpressive is a good start.

The many comprehensions in Python…

List comprehensions in Python are kind of no-frills, yet extremely convenient:


my_list_of_squares = [ x**2 for x in range( 10 ) if x % 2==0 ]
print( my_list_of_squares )
# It prints [0, 4, 16, 36, 64]

But beware, since they “just” add convenience, they are not very emphasized in introductory classes to the language, and they tend to fade away quickly from memory. What is even less known is that Python has comprehensions for dictionaries:

from_x_to_x_squared = { x: x*x for x in range( 10 ) }
print( from_x_to_x_squared) 
# It prints {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

and for sets:

x_squared = { x*x for x in range( 10 ) }
print( x_squared) 
# Python 2 prints set([0, 1, 4, 81, 64, 9, 16, 49, 25, 36]); Python 3 prints
# the same elements inside braces. The order of the elements is arbitrary.

There are no tuple comprehensions; instead, the parentheses syntax has a surprising use: to create small generators. In the following example, only the squares which are less than or equal to 100 are printed, although the for inside the parentheses can potentially generate squares with a value as high as needed:

import itertools
all_the_integer_squares = (x*x for x in itertools.count())
while True:
    y = next( all_the_integer_squares )
    # Stop before printing, so that nothing above 100 is shown.
    if y > 100:
        break
    print( y )

Building for Win64: a collection of patches

Building things in Linux is easy; building them in Windows requires a lot of patching, as Windows tends to be messy with compilers and runtime libraries. I need a few packages in Windows for when I work from my laptop, and I think that it is a good idea to share the patches. They probably won’t be straightforward to use (although they can be, if you follow the instructions in this post and on top of that learn a bit about Pacman, the Arch Linux package manager, which I easily ported to Windows), but by reading them you can get an idea of the changes required to the code. So follow this url to get to the repository.

Handling scientific data with SQLite: a tutorial for C++

Have you ever wondered why scientists don’t use databases, as in “relational databases”?   Well, they actually do, but not as often as they could.  Using a relational database is worth the effort in many cases, and in this post, I will be writing about SQLite, an engine that is easy to use both from C++ and Python.

What’s a relational database anyway? Simply put, it is a container where you put a bunch of tables. A table as in “database table” is not very different from a table that you would find in a written document, with rows and columns. In a database, each column of a table has a name. The names of the columns are kept easy to type, so that you can query the database using those names. The following is an example table with data from a hypothetical experiment.

step_number   chain_id   variables
0             0          [Binary blob]
0             1          [Binary blob]
1             0          [Binary blob]

Since you can have more than one table in a database, it is quite common to link different tables together by having the same values in some particular column or columns. For example, if you are performing a Bayesian simulation, you might want a table to store data for each step of the simulation, and another table if you need to differentiate between chains in the simulation. Each row of the data table would be associated with a single row of the chains table, since each reported data point would correspond to a unique chain. The association is built simply by having the same chain id in one column of each table. Here is how the chains table would look:

chain_id   chain_parameters
0          [Binary blob]
1          [Binary blob]
2          [Binary blob]

Did you note the liberal use of “Binary Blob” in the two previous examples? This is related to the kind of data that database engines can naturally store. Normally, there is only a single scalar piece of data in each cell of a table. For example, an integer or a real number. There are a few more simple types available: true/false values, decimals and strings. The database engine knows how to work with these simple scalar types, and you will find them handy in most situations. However,  chances are that certain pieces of the experiment data will only make sense to you and your program and not to the database engine. An example would be a vector of numbers of variable length that is always written and read in full,  and where you don’t use the vector’s components to do searches in the database. Another example would be a custom numerical type, like very long integers. Or even some more general algebraic object, like a sparse matrix or a rational polynomial fraction. For those cases the database engine has something called “blobs”, i.e. fields of raw data which are not interpreted in any way by the database engine. I use them liberally, since they help to keep the data closer to their natural representation.
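
Here is a rough sketch, using Python’s sqlite3 module, of the two tables above and of the link between them. The table names (data and chains), the helper functions pack and unpack, and the sample numbers are made up for the illustration; the blobs are simply vectors of doubles written as raw bytes.

```python
import sqlite3
import struct

conn = sqlite3.connect(":memory:")   # throwaway database, nothing touches the disk
conn.execute("CREATE TABLE chains (chain_id INTEGER PRIMARY KEY, chain_parameters BLOB)")
conn.execute("CREATE TABLE data (step_number INTEGER, chain_id INTEGER, variables BLOB)")

def pack(values):
    """Serialize a list of floats as raw little-endian doubles."""
    return struct.pack("<%dd" % len(values), *values)

def unpack(blob):
    """Inverse of pack(): each double takes 8 bytes."""
    return list(struct.unpack("<%dd" % (len(blob) // 8), blob))

conn.execute("INSERT INTO chains VALUES (?, ?)", (0, pack([0.1, 0.9])))
conn.execute("INSERT INTO chains VALUES (?, ?)", (1, pack([0.5])))
conn.execute("INSERT INTO data VALUES (?, ?, ?)", (0, 0, pack([1.0, 2.0, 3.0])))
conn.execute("INSERT INTO data VALUES (?, ?, ?)", (0, 1, pack([4.0])))
conn.execute("INSERT INTO data VALUES (?, ?, ?)", (1, 0, pack([5.0, 6.0])))

# Link the two tables through the shared chain_id column.
rows = conn.execute(
    "SELECT d.step_number, d.chain_id, d.variables, c.chain_parameters "
    "FROM data AS d JOIN chains AS c ON d.chain_id = c.chain_id "
    "ORDER BY d.step_number, d.chain_id"
).fetchall()
decoded = [(step, chain, unpack(v), unpack(p)) for step, chain, v, p in rows]
conn.close()
print(decoded)
```

The database engine only sees opaque bytes in the blob columns; the vectors of different lengths are reconstructed by the program that reads them back.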

Why SQLite?

I’m singling out SQLite because:

  • There is not a lot to configure to start using the database… basically you just open or create a file that will contain it,
  • and because the entire database fits in that file, you can do things that you normally do with a file, like store multiple copies, or send it to a colleague.
  • SQLite has bindings for virtually all programming languages, which is good if you want to generate your data in one programming language (say, C++) and analyze it in another programming language (Python), as we will be doing soon here.
  • If you are new to SQL, SQLite offers a  compromise between the more conventional file-based approach to data handling and full-blown database servers.
  • Last but not least, SQLite runs virtually everywhere already, from your phone to your desktop, and in many places in the middle. It can be the ideal solution if you need to collect data using esoteric hardware.

Two of the issues mentioned before are moot with SQLite:  with some dexterity it is possible to get  good speed, and cost is only a problem if you really need to spend some expiring grant-money, because SQLite is free both as in free beer and as in royalty free.  The most serious issue is, arguably,  lack of knowledge about how to use this engine. This blog post is a tutorial providing a full example that hopefully will move SQLite closer to your comfort zone.

An example: simulating the negative binomial

A full-blown Bayesian simulation is perhaps a notch too heavy for an introductory example, so instead let’s focus here on something simpler. Say we want to run an experiment to obtain an empirical probability distribution for the number of times that you roll a die and obtain something different from one, before you obtain a one for the fifth time (I’m stealing the idea from this Wikipedia article). Since we are interested in recovering the empirical distribution, we want to run this experiment many, many times.
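
Before going to the C++ version, the experiment itself can be sketched in Python. The function name, the seed and the number of repetitions are arbitrary choices of mine; the only thing taken from the text is the rule of the game.

```python
import random
from collections import Counter

def rolls_before_fifth_one(rng, target_ones=5):
    """Roll a die until the target number of ones has shown up.

    Returns the number of rolls that were NOT a one."""
    ones = 0
    others = 0
    while ones < target_ones:
        if rng.randint(1, 6) == 1:
            ones += 1
        else:
            others += 1
    return others

rng = random.Random(12345)           # fixed seed, so that runs are repeatable
n_experiments = 20000
counts = Counter(rolls_before_fifth_one(rng) for _ in range(n_experiments))

# Empirical probability of each outcome, and the empirical mean.
empirical = {k: v / n_experiments for k, v in sorted(counts.items())}
mean = sum(k * p for k, p in empirical.items())
print("empirical mean:", mean)       # theory: 5 * (5/6) / (1/6) = 25
```

With this many repetitions the empirical mean should land close to the theoretical value of 25 for the negative binomial distribution with five successes and success probability 1/6.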

Writing the code that performs the actual simulation should be straightforward enough, you can find a C++ candidate here.

In the first version of the example, the only fragment related to the data-saving part so far is argv[3], which we use to indicate that if the file with the database exists when the program is run, we want to remove it so that we can create it anew. In this version of the file we have taken care of actually removing the file. For simplicity, we have also introduced a fixed file name for the database.

Introducing the first SQLite code

There are several ways of using SQLite in your code. For example, if you have it installed in your system, then it is a matter of #including the header and linking against the library. Another approach, the one that we will follow here, is to include SQLite sources together with ours. This way our code can be compiled even if the library is not installed in the system or the headers are not present. Including SQLite sources in the  project is even more  convenient if you develop in multiple platforms/computers.

Grab SQLite’s amalgamation from  SQLite’s download page; there are several files there, just choose the most recent, paying attention to the version number separated with dots. One suitable candidate would be this file. From the amalgamation archive, copy both sqlite3.h and sqlite3.c to the directory where you have the project, in this case the file sim1_main.cpp. To ensure that everything is ok, try to compile the project now using this revision of sim1_main.cpp with the following commands:

$ gcc sqlite3.c -c -o sqlite3.o
$ g++ -std=c++11 sqlite3.o sim1_main.cpp -o sim1_main

The new detail in the code is the function “setup_db”, where we will try to create and set up the database. Two functions are used: “sqlite3_open” and “sqlite3_close”. The first one is called with a file name and a pointer to sqlite3*; this function opens the database file if it exists or otherwise creates a new database file. The second function is used for closing the connection, and you should invoke it even if sqlite3_open fails. However, there is some good news: even if you forget to call this function, or if your program crashes and you don’t have an opportunity to call it, all the data that you have already committed to the database will be safe.

Up to this point we have taken care of opening the database. We also have some code that ensures that we will open a brand new database, by deleting any file with the same name previously. Since sqlite3_open will be creating a new database, we need to create the tables inside the database before putting any actual data there. For that, we will need to use SQL.

SQL is a specialized programming language used for databases. It includes statements for creating, modifying and deleting tables, and more importantly, for extracting the data from the tables while at the same time doing some processing on the data. This last possibility is more commonly known as “making queries”, and since database engines know a lot about how to work with relational data, you can get the best performance by expressing how you want your data, without having to know anything about all the tricky details of how to read and navigate the tables on disk. Unfortunately a full introduction to SQL is out of the scope of this tutorial; nonetheless I will explain the statements that we use.

In this revision, we use a little SQL to create a single table in the database. As you see, the SQL is nothing extraordinary: the “CREATE TABLE” statement instructs SQLite to create a table with just three fields: simulation_id, steps_to_goal, and, just to make things a little bit more interesting, a field dice_points where we will store the full sequence of dice point values during the casts of each individual experiment. Notice that we embed the SQL statement in a C++11 raw string literal; you can of course use conventional C/C++ strings for the same purpose, but raw string literals are just more convenient.

A smarter way of managing errors

There is a bit more going on in this revision. You saw that before this revision, we had a check directly after sqlite3_open, where we, on error, both closed the database and raised an exception. In this new revision, we are using std::shared_ptr to return the sqlite3* connection and to close the database automatically both in normal application shutdown and in failure.

Writing data to the database file

All that is left is to write the actual data to the database. Let’s take care of that. We need to compile a SQL statement. Compiling the statement is done once, and from there on the compiled statement is used many times to insert each row of data.

The updated revision of the code is here. First notice the function sqlite3_prepare_v2 outside the insertion loop, giving SQLite the SQL insert statement, which contains placeholders in the form of question marks. These placeholders are filled inside the simulation loop, with invocations of the sqlite3_bind family of functions (sqlite3_bind_int, sqlite3_bind_blob and so on) for each field that needs to be inserted. In each loop iteration, after the parameters have been bound to the prepared statement, a call to sqlite3_step is issued, which takes care of actually sending the data to the database. If we were interested in getting data out of the database, we would need to fetch a row of data immediately after sqlite3_step, but since we don’t need to get any data here, we reset the statement (sqlite3_reset) so that we can bind new parameters to it in the next iteration of the simulation loop. Finally, when we are out of the insertion loop, we take care of disposing of the statement with sqlite3_finalize.
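
For comparison, this is roughly how the same insertion loop looks from Python. The sqlite3 module does the prepare, bind, step and reset calls internally, but the question-mark placeholders are the same. The table name “simulations” and the sample rows are assumptions of mine; the column names are the ones described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE simulations ("
    " simulation_id INTEGER PRIMARY KEY,"
    " steps_to_goal INTEGER,"
    " dice_points BLOB)"
)

insert_sql = "INSERT INTO simulations (simulation_id, steps_to_goal) VALUES (?, ?)"
results = [(0, 31), (1, 18), (2, 27)]      # made-up (id, steps) pairs

for simulation_id, steps in results:
    # The module compiles the statement on first use, caches it, and binds
    # the values of the tuple to the "?" placeholders on each call.
    conn.execute(insert_sql, (simulation_id, steps))
conn.commit()

stored = conn.execute(
    "SELECT simulation_id, steps_to_goal FROM simulations ORDER BY simulation_id"
).fetchall()
conn.close()
print(stored)
```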

Concluding

This covers the basic tutorial of how to write data out from a simulation to a SQLite file, but there are a few important things that you should try on your own.

First, ensure that you get correct numbers in the table. If you are new to SQL, this would be a good time to learn a bit more about queries; you can use them with SQLite both with the command-line tool sqlite3 and from Python, using the module of the same name.

Second, check the speed of the program we just finished. Insertions in SQLite are a lot faster if the inserted blocks are surrounded by “BEGIN TRANSACTION” and “COMMIT” statements. You should use them, but not before deciding what would be the atomic unit of data that you want to send as a whole to the database.
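
A minimal sketch of the idea in Python, assuming that one batch of rows is the atomic unit. The table name and the helper insert_batch are made up; the connection is opened with isolation_level=None so that the module does not start transactions on its own and the BEGIN and COMMIT statements are ours.

```python
import sqlite3

# isolation_level=None: autocommit mode, we manage the transactions ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE simulations (simulation_id INTEGER, steps_to_goal INTEGER)")

def insert_batch(conn, rows):
    """Insert all rows atomically: either every row is stored or none is."""
    conn.execute("BEGIN")
    try:
        conn.executemany("INSERT INTO simulations VALUES (?, ?)", rows)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

insert_batch(conn, [(i, 20 + i % 7) for i in range(1000)])

try:
    # The second row has too many values, so the whole batch is rolled back.
    insert_batch(conn, [(1000, 25), (1001, 25, "extra")])
except sqlite3.Error:
    pass

n_rows = conn.execute("SELECT COUNT(*) FROM simulations").fetchone()[0]
conn.close()
print(n_rows)
```

The failed batch leaves no trace in the table, which is exactly what “atomic unit” means here; the speed-up comes from SQLite syncing to disk once per transaction instead of once per row when the database lives in a file.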

Third, notice that I have not made any use of the field “dice_points” in the table. It has space for a blob. It is possible to save there the entire array of dice points for each simulation, even if these arrays would of course have different lengths. There are many ways of working with blobs; the most basic one for this case would be sqlite3_bind_blob. Read the documentation and think about ways in which you can fill the blob with array data.
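
One possible answer, sketched in Python rather than C++: since each cast is a number between 1 and 6, a sequence of casts fits in one byte per cast. The table name, the sample casts and the choice of storing the count of non-one casts in steps_to_goal are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE simulations ("
    " simulation_id INTEGER PRIMARY KEY,"
    " steps_to_goal INTEGER,"
    " dice_points BLOB)"
)

# Made-up casts; each sequence ends with the fifth one.
casts = {0: [3, 1, 6, 1, 1, 2, 1, 1], 1: [1, 1, 5, 1, 1, 4, 1]}
for simulation_id, points in casts.items():
    blob = bytes(points)                # one byte per cast; values are 1..6
    steps = sum(1 for p in points if p != 1)
    conn.execute("INSERT INTO simulations VALUES (?, ?, ?)",
                 (simulation_id, steps, blob))
conn.commit()

recovered = {
    sim_id: list(blob)                  # iterating over bytes yields integers
    for sim_id, blob in conn.execute(
        "SELECT simulation_id, dice_points FROM simulations")
}
conn.close()
print(recovered)
```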

JSON: a quick data-format for science

Science is based on data. Even mathematicians, who at times seem to pull new knowledge out of thin air, use and produce data. So, exchanging data is a must. If the focus is on exchanging data, then you need people to understand your data. That’s something JSON is good for. It is a text format where you can embed some metadata in the form of meaningful names for object fields. Here is an example of JSON that shows what I mean:

[
    {
        "latitude": 39.59,
        "longitude": 12.2
    },
    {
        "latitude": 39.51,
        "longitude": 12.123
    }
]

The previous snippet could be, for example, a fragment of a position tracking-record of a whale. By the way, the previous snippet can be read from Python with this code:

import json
with open("the_whale.json", "r") as infile:
    fragment = json.load( infile )

and could be written with:

import json
obj = [
    {
        "latitude": 39.59,
        "longitude": 12.2
    },
    {
        "latitude": 39.51,
        "longitude": 12.123
    }
]
with open("the_whale.json", "w") as outfile:
    json.dump( obj, outfile )

Now, you might say that JSON is verbose. It is. Furthermore, it is comparatively slow to process and most tools for dealing with JSON load the entire file into memory, so JSON is not well suited for large data. But for small datasets JSON is hard to beat in terms of simplicity and interoperability, for both humans and computer programs. Here is what I would suggest as a rule of thumb for using or not using it:

Use if:

  • Dataset size at most a few megabytes,
  • dataset structure is complicated and requires nesting and,
  • you want to be able to handle your dataset in several programming languages easily

Do not use if:

  • Dataset size is one hundred megabytes or more,
  • and you really want your programs to be fast.

What alternatives to JSON are out there? Quite a few. I will be covering some of them in coming blog-posts.

Windows for Scientific Programming

Scientific programming tools are mostly open-source, and as such their natural environment is Linux. However, Linux might not be an option for you. For example, very few laptops come pre-installed with Linux, and chances are that even if you install it yourself you will be short-changed in driver support. I use Linux on my desktops and on all the servers, but a shiny new laptop is a different matter. In any case, here’s a newbie guide to setting up a friendly environment for scientific programming in Windows. When designing this guide, I have been guided by the desire to re-use knowledge, so that you can adapt very easily to a true Unix environment when you need to. Additionally, I have written this guide for people with minimal technical knowledge, so if you consider yourself a power user, blast through it to the parts that are interesting for you.

Do you want it portable?

If you think that you will be using more than one Windows machine and want to save some work, you might consider doing the steps below in such a way that you can port the results (i.e., your new environment) easily. A portable environment is one that you can simply copy from one machine to another. You can also put it on a thumb drive, but I wouldn’t recommend that, as things would run very slowly from there.

To achieve a portable environment, follow the optional portability notes included in the sections below.

Text Editor

We start by getting a decent text editor. Windows comes with Notepad, but you will need something with more muscle. Popular options among hard-core hackers are Emacs and Vim, but if you know nothing about these editors, don’t go for them now. Instead, download and install Notepad++, which, according to its website, also happens to be environmentally friendly and nice to your laptop’s battery.

Do you want to make it portable? There are .zip and .7z files on Notepad++’s website; those correspond to portable versions of the program. So, do as follows:

  1. Create a folder in “C:” or “D:” (hard drives in your computer) for your new development environment, for example, “my_dev_env”. Avoid using spaces in folder and file names: some programs have problems with them, and you will find yourself typing many commands incorrectly because of spaces in file paths. For the same reason, try to stick to a convention for folder names.
  2. Inside “my_dev_env”, create a folder called “notepadpp”.  Unpack the .zip or .7z file of Notepad++ that you just downloaded in this new folder.
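
The advice about avoiding spaces can be checked mechanically. Here is a minimal sketch; parts_with_spaces is a hypothetical helper of mine, and it relies only on the standard ntpath module, which applies Windows path rules on any platform.

```python
import ntpath


def parts_with_spaces(path):
    """Return the components of a Windows-style path that contain spaces."""
    # Drop the drive letter, then split on either kind of slash.
    _drive, rest = ntpath.splitdrive(path)
    parts = [p for p in rest.replace("/", "\\").split("\\") if p]
    return [p for p in parts if " " in p]
```

An empty list means the path is safe by this guide’s convention; a path under “Program Files” would be flagged.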

Cygwin

We will be using Cygwin, an open-source platform that makes you feel more as if you were working on a Linux box. This has several advantages. First, you will be able to use programs and software libraries written for Unix. Second, when using them you will be able to refer to their online documentation, which most of the time is written for Unix, instead of scrolling down to the part where Windows support, if present at all, is explained.

Start by downloading the setup file from Cygwin’s page and running it. The installer will take you through several screens asking for different settings. One of them is the installation path; try to choose a path without spaces, preferably close to “C:\”. In another screen you will be asked for the mirror from which packages are downloaded; select one that is geographically close to you. If you are behind an HTTP proxy, you may have better luck choosing an HTTP mirror. There is also a package selection screen where you can choose packages to install; for now, just press “Next”, which installs a bare-minimum subset of Cygwin. Keep the setup file, since you will need it later to install or uninstall parts of Cygwin.

If you want it portable, select as Cygwin’s setup path one directly under “my_dev_env”, for example, “cygwin”; if you already installed it somewhere else, you can simply copy the installation. Also, copy Cygwin’s setup file to the “my_dev_env” directory, so that you can locate it easily later.

Since you will be moving your development environment from one place to another, you need a slightly smarter way to start Cygwin. So, go to “C:\my_dev_env\cygwin” and inside this folder find a file called “Cygwin.bat” (if you can’t see the “.bat” part, follow the instructions here to make file extensions visible). Right-click on it and choose “Edit with Notepad++”. Change the file contents so that they look like this:

@echo off

rem Jump to the bin folder next to this script, on whatever drive it lives
set scriptpath=%~dp0
chdir /d "%scriptpath%bin"

bash --login -i

Save the file and close it. If everything went OK, double-clicking the “Cygwin.bat” file should open a terminal like the one in the figure below.

Default Cygwin terminal.

A better console

The figure above shows Cygwin’s terminal, which is friendlier than the Windows default, but not by much. Let’s get a bit fancier: you will come to appreciate a more fluid text terminal. Here I suggest you use ConEmu, again an open-source tool that will simplify your life. Download and install ConEmu, and after that, follow these steps so that you can enter Cygwin from ConEmu.

Configuring Con-Emu.
  1. Open ConEmu, and then click or right-click on its icon. In the menu that appears, go to “Settings”, and from there choose Startup, Tasks, and press the “+” button to add a new task. You can identify the button by the red “1” in the figure.
  2. Use the field “task name” (identified by a red “2” in the figure) to enter a name for the task, for example, “New Cygwin Terminal”.
  3. In the task parameters (identified by a “3”) we pass ConEmu some information needed to wake up the Cygwin environment. Strictly speaking, only one parameter is needed, the one starting with “/dir”, but the icon can come in handy if you start using ConEmu for something other than Cygwin. The entire line reads /icon "c:\my_dev_env\cygwin\Cygwin.ico" /dir "c:\my_dev_env\cygwin\bin". Replace the file paths “c:\my_dev_env\cygwin” with the ones where you actually installed Cygwin.
  4. In the big box at the bottom (marked with a red “4”), enter the Cygwin command that opens the shell: bash --login -i.

Don’t forget to press “Save settings” when you are done typing the new configuration. Now you can open a new Cygwin terminal in ConEmu just by pressing Ctrl-W and selecting the task that you just created in the first drop-down that appears.

ConEmu is a perfectly portable Windows program, but we have used absolute paths in ConEmu configuration, and that won’t be portable. A simple solution is to:

  1. First, create a special folder for ConEmu right below “my_dev_env” with some easy to remember name,  for example, conemu. Copy the contents of the installed ConEmu, or the uncompressed contents of some of its downloadable bundles, inside the new “conemu” directory, so that, in our hypothetical situation, you end up with ConEmu64.exe right below “C:\my_dev_env\conemu\”. Double click this new ConEmu64.exe that you just copied.
  2. Now, save the config of ConEmu right where ConEmu64.exe is. You achieve that most easily in ConEmu’s settings dialog, using the top “Export…” button. Give it the name ConEmu.xml and let it exist directly on “C:\my_dev_env\conemu\ConEmu.xml”, by the example directory convention that we are using here.
  3. Last, in the settings dialog of ConEmu, go to Tasks again and select the task that you created before, i.e., “New Cygwin Terminal”. Go to the line marked with a red “3” in the figure, and change its contents to /icon "%ConEmuDir%\..\cygwin\Cygwin.ico" /dir "%ConEmuDir%\..\cygwin\bin". What we have done here is use ConEmu’s own position as a clue to locate the shell.
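
To see why the relative form is portable, here is a small sketch of how a path like “%ConEmuDir%\..\cygwin\bin” resolves once the variable is expanded. It only illustrates the path arithmetic with Python’s standard ntpath module; it is not how ConEmu itself is implemented, and resolve_from_conemu and the second example folder are hypothetical.

```python
import ntpath  # Windows path rules, importable on any platform


def resolve_from_conemu(conemu_dir, relative):
    """Resolve `relative` against the folder that holds ConEmu64.exe."""
    return ntpath.normpath(ntpath.join(conemu_dir, relative))


# The example layout from this guide, placed on two different drives.
for root in (r"C:\my_dev_env\conemu", r"E:\tools\my_dev_env\conemu"):
    print(resolve_from_conemu(root, r"..\cygwin\bin"))
```

Wherever the “my_dev_env” folder ends up, the result always points at the “cygwin\bin” folder sitting next to “conemu”, which is exactly what the task needs.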

The development tools

Now that you have a basic Cygwin environment alive and kicking, it is time to add the development tools themselves, that is, the things that will allow you to build, at the very least, C/C++ programs. To that end, run the Cygwin setup program again and navigate the setup screens. Be sure to select the folder that already contains your Cygwin installation; the setup program can infer which packages are already installed straight from there. When presented with the package selection screen, search for the following packages and mark them for installation:

  • make
  • gcc, gcc-g++
  • cmake
  • autoconf
  • automake
  • python (not python 3)

Press “Next” in the dialog and have all these packages installed.
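
Once the installation finishes, you can check from the Cygwin terminal that the tools are reachable. The sketch below is one way to do it, under two assumptions of mine: that the packages provide commands with the names listed in TOOLS, and that you run it with a Python 3.3 or newer interpreter, since shutil.which does not exist in Python 2.

```python
import shutil

# Command names assumed to be provided by the packages listed above.
TOOLS = ["make", "gcc", "g++", "cmake", "autoconf", "automake", "python"]


def missing_tools(names):
    """Return the commands from `names` that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All development tools were found.")
```

If something shows up as missing, run the Cygwin setup program again and mark the corresponding package.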

With this, your setup is ready and you can start coding. I will follow this post with others on customizations you can make, and on tricks and quirks of the platform. I would like to finish with a general word of advice: for desktops and servers, simply use the real thing, Linux. It will be both simpler and safer.

Choosing a programming language

If you are really, really new to science, you might think that you are well-off with tools like Matlab and Mathematica. They are easy and handy for many tasks, but they are also expensive. Furthermore, doing science is about stretching limits, and it might well be that no single tool will magically solve all the problems. This goes not only for software tools, but also for the experiments that you design and the mathematical algorithms that you use. While well-known scientific programming packages will be useful at times, you had better keep your toolbox well stocked.

Wars are waged over which programming language is better (you can find the spoils just by googling a little), but I will tell you what I know from my just-a-decade experience. First, get yourself a programming language that lets you get work done. If you have never programmed before, either for a living or for fun, and have barely taken a few courses, then I suggest you go with Python. It is an easy and happy language, with extremely powerful libraries. In time you might need a bigger caliber, but you ought to start somewhere!

If you have a little programming experience and need to code something that will run fast, go for C++. This language has a steep learning curve, but for reasons that I don’t entirely understand, well-coded programs in C++ are faster than anything else. C++ has other advantages that set it apart, like portability, availability of libraries, and industry support.

You can do almost anything with Python and C++, but don’t be shy about experimenting with other programming languages and paradigms. I have tried a few and backed away from most of them; the main reason was almost always the same: language X offered some very interesting ideas, but then X was not very proficient at handling, say, matrices. Or, after a week coding a delicate algorithm, I discovered that the compiler/interpreter for X had a fatal bug in a corner case that nobody had tested (because programming languages have exponential numbers of corner cases, and not always enough users to trip on them and developers to fix things). Also, in many cases I found that learning a new programming language well was a heavy investment of time. But even if I didn’t end up adopting language X for everyday use, it probably taught me a few tricks, so the time was not at all wasted. So, I leave you with a short list of programming languages that you might want to check, ordered by my subjective scale of suitability for scientific coding:

  • Python
  • C++
  • Haskell
  • Scala
  • JavaScript (yes, you can get a nice surprise with this one… although admittedly library support for scientific coding is rather weak)
  • Java
  • Fortran

A preliminary check-list for scientific programming

Just starting a research-oriented profession where you will need lots of wits, maths and statistics? Will you need to process some data? A lot of data? I envy you, because there are a lot of thrills in your path. You will need to become computer-savvy, however, if you are not already. This entire blog is devoted to things you might find useful on your way, but this particular post is about the things you will need before you even start. The most basic tips, I should say.

OK, no more delays, here is the list:

  1. Willingness to keep learning… forever. Both science and the art of computer programming are constantly changing. Your most important skill is the will to keep pace.
  2. A computer. This one goes without saying. You will need a computer running Linux, Mac OS or Windows, as these operating systems have broad support for the kind of things you will be doing. At the time of writing, they are very popular operating systems. There are other operating systems that will be more or less useless for doing any serious scientific coding: Chrome OS, iOS, Android; avoid devices with those. Also, a fancy computer is… errr…. fancier, but chances are that a cheap computer will do if you are just starting.

A close third, ubiquitous Internet access, is a really, really helpful asset.

Moving on, if you have secured the basics, you may be interested in a list of programming languages, libraries and frameworks that I have found extremely useful and time-saving in my daily work.