Python – why it's the best language for cheminformatics
If you're wondering why Python is used so commonly in all things cheminformatics, bioinformatics or drug discovery and start your search for an answer, here's what you'll most likely get:
We use Python for cheminformatics projects, because we use it in science. We use it in science, because it has a lot of open source libraries. It has a lot of open libraries, because we use it in science.
Wait a minute... we're just running round in circles. Let's try and dig deeper, shall we?
People in chemistry or biology usually venture out into the cold informatics world when they need something from it for their project. They don't really spend their free time fantasizing about software development, latest programming tools or five free Java tips — they're in it for a specific purpose. And there are several aspects of the Python programming language that make it so useful for coding amateurs with a weakness for science.
Garbage, duck, quack and Guido van Rossum
To put it simply, Python makes it relatively easy to write code. Its garbage collector means you don't have to manage memory by hand, and thanks to dynamic typing there's no need to declare a variable's type and keep an eye on it. Actually, you might have heard about the "duck typing" approach in Python, which goes something like this:
If it walks like a duck and quacks like a duck — it’s a duck.
Or, to translate: the object's type doesn't matter as much as whether it does what you need it to do. This allows for a very flexible way of building data structures and processing them. The flexibility and speed you get as a result are essential when building prototypes.
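To make the duck idea concrete, here is a minimal sketch. The class and function names (Duck, RobotDuck, make_it_speak) are made up for illustration; the point is only that the function never checks what type it was given.

```python
class Duck:
    def speak(self):
        return "Quack"


class RobotDuck:
    """Unrelated to Duck, but it offers the same method."""

    def speak(self):
        return "Beep-quack"


def make_it_speak(thing):
    # No type check here: anything with a speak() method is accepted.
    return thing.speak()


sounds = [make_it_speak(obj) for obj in (Duck(), RobotDuck())]
print(sounds)
```

Both objects work because both can "quack", even though they share no common base class.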
The third reason Python is an easy-to-use programming language (sometimes referred to as a Swiss army knife of coding) is the so-called Guido's time machine. What's that, you might ask, and who exactly is Guido? Guido van Rossum is a Dutch programmer and, not coincidentally, the creator of Python. His "time machine" stands for the following phenomenon: (almost) every time you realize that there's a functionality or a tool you'd benefit from in Python, it turns out it's already been implemented. Matrix – check, complex numbers – in place, csv parser – you got it. Yep, looks like everything's already there.
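As a small taste of the "it's already there" effect, the sketch below uses only the standard library: the built-in complex type, the cmath module and the csv module. The two-row table is just sample input, and io.StringIO stands in for a real file.

```python
import cmath
import csv
import io

# Complex numbers are a built-in type.
z = complex(3, 4)
magnitude = abs(z)
root = cmath.sqrt(-1)

# The csv module parses tabular text; io.StringIO stands in for a file here.
raw = "compound,mass\nwater,18.015\nethanol,46.069\n"
rows = list(csv.DictReader(io.StringIO(raw)))
masses = {row["compound"]: float(row["mass"]) for row in rows}

print(magnitude, root, masses)
```

Nothing had to be installed for any of this, which is exactly what you want when the script is needed today.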
The above three mean that if you need software for a specific purpose (computation, simulation, data extraction or analysis etc.), and you need it asap, Python will work great for you. While learning new programming languages is always a good thing, when you start new and diverse projects all the time it is way better to be a master in one jack-of-all-trades tool than a novice in many tools that are perfect in very specific applications.
Prototyping, good enough over perfect and only one way to do it right
The difference between a programming scientist and a professional developer is that a scientist often needs to code a lot of completely different functionalities for different projects. Prototyping and one-off scripts are both Python's strengths, as it doesn't need a long compilation step before you can implement and test changes. There are specialized languages on the IT "market", but a scientist rarely repeats the same thing over and over. A non-ideal but “good enough for everything” tool is therefore much better.
Coming back to the quite obvious fact that scientists aren't programmers, it's also worth mentioning that Python has an advantage over other languages in that it aims to give one (and preferably only one) obvious way to do something. A lot of other languages lack ready-to-use tools that make your life easier, so e.g. a matrix can end up represented in two different ways by two libraries, and then it's hard to combine them. When there is only one obvious option to represent and handle data, almost every Python library will do it this way. Thanks to that, getting many libraries to work together is really simple – almost like building from blocks that fit perfectly.
Then there are tools like memoryview, capsules and the NumPy array, which allow you to perform low-level tasks with the ease of a high-level, garbage-collected programming language. There's also the ease of creating interfaces with other languages and low-level data. What does it mean in cheminformatics practice? Basically, you can process data from sensors, thermometers and other lab equipment without any hassle, or process an image like an ordinary array of numbers. What's more, Python can be quickly coupled with languages that aren't everyone's strong suit, like C++. This is actually what made writing Python libraries so fun, with RDKit, Indigo and Open Babel being the three top ones.
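Here is a rough sketch of what working with low-level data can look like using only the standard library (memoryview and struct). The packet layout (four little-endian 32-bit floats) and the tiny 2x3 "image" are assumptions invented for this example, not the format of any real instrument.

```python
import struct

# Pretend this buffer arrived from a lab thermometer: four little-endian
# 32-bit floats (degrees Celsius). The layout is an assumption.
packet = struct.pack("<4f", 20.5, 21.0, 21.5, 22.0)

view = memoryview(packet)      # a window onto the bytes, no copy made
last_two = view[8:]            # slicing a memoryview does not copy either

readings = struct.unpack("<4f", view)
tail = struct.unpack("<2f", last_two)
average = sum(readings) / len(readings)

# A tiny 2x3 grayscale "image" viewed as a two-dimensional array of numbers.
pixels = bytes([0, 64, 128, 255, 128, 64])
image = memoryview(pixels).cast("B", shape=[2, 3])
brightest = max(max(row) for row in image.tolist())

print(readings, tail, average, brightest)
```

For anything bigger than a toy buffer, a NumPy array would be the usual choice; the standard library version is shown here only because it runs without installing anything.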
And while we're on the topic of libraries, we should also take a second to talk about code readability. Python has clearly communicated rules of how to write (and not write) in it. Actually, it's one of the few languages with official clean code guidelines (the PEP 8 style guide). Not only that, but the simple syntax makes reading Python almost like reading a novel thanks to, among others, well-named functions, pseudocode-inspired syntax and brace-free code blocks.
Open source, scraping and no bucks needed
Python is an open source and free-to-use language. That means that its lively user community, chemistry scientists among them, influences its development and shapes the language so that it serves their needs best. There are actually a lot of different implementations of Python adjusted to specific users (e.g. CPython, or Jython, which supports multi-threading and is compatible with Java libraries). All of that comes for free, with no licence fees. No wonder this programming language is so loved in startups, academic projects and wherever else every buck counts.
Python code is largely independent of the operating system and environment. That's how it found a wide variety of applications: from microcontrollers and devices like Arduino and Raspberry Pi, through personal computers, to application servers and supercomputers. What's more, Python code written in one environment should run in a different one without any problems, assuming the user didn't use any functionality strongly tied to a specific type of system.
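One everyday example of staying system-independent is file handling. The sketch below assumes nothing beyond the standard library; the folder and file names are made up. pathlib builds the path with the right separator for whatever system the script runs on.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # The "/" operator joins path parts with the separator of the current OS.
    results = Path(tmp) / "results" / "run_01.txt"
    results.parent.mkdir(parents=True, exist_ok=True)

    results.write_text("yield: 87%\n", encoding="utf-8")
    content = results.read_text(encoding="utf-8")
    name = results.name

print(name, content)
```

The same script should behave identically on Windows, macOS and Linux, because no separator or temporary-folder location is hard-coded.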
Finally, scraping (the process of retrieving data from a fully or partially text-based format) is really pleasant in Python. The language has a lot of libraries that make such a task easy, as well as great input/output mechanisms for downloading, saving and processing data. It's a very useful feature for cheminformaticians, who often work on Excel files prepared by some kind of scientific software.
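To give a feel for pulling numbers out of partially structured text, here is a minimal sketch using the standard re module. The report format and the sample names are invented for the example; real instrument output will need its own pattern.

```python
import re

# A made-up report mixing useful lines with noise.
report = """\
Sample A: conc = 0.125 mol/L
# instrument warm-up complete
Sample B: conc = 0.250 mol/L
Sample C: conc = n/a
"""

# Capture the sample name and a numeric concentration; other lines are skipped.
pattern = re.compile(r"^Sample (\w+): conc = ([0-9.]+) mol/L$", re.MULTILINE)
concentrations = {name: float(value) for name, value in pattern.findall(report)}

print(concentrations)
```

Lines that don't fit the pattern, such as the comment and the "n/a" entry, are simply ignored, which is often the behaviour you want for a quick first pass over messy output.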
Hopefully, the above clarified why Python is so widely used and appreciated in the scientific and, more specifically, cheminformatics community. Don’t let anyone fool you into thinking that it’s just a tradition. :)
Useful resources and further reading:
NumPy – scientific multitool library with support for numerical computations, array and matrix manipulation and more
pandas – Python-based tool for data manipulation and analysis
scikit-learn – state-of-the-art machine learning library
TensorFlow – rich and mature machine learning platform
Biopython – Python library for representation, manipulation and analysis of gene sequences with integrated database and AI tools
SciPy – integrated scientific and engineering environment with many of the above-mentioned tools already preinstalled
Why Python does so well in scientific computing – blog post by Konrad Hinsen on the origins of Python success in scientific applications
Why scientists should learn to program in Python – article by Brian H. Toby and his colleagues presenting benefits of using Python in science
Here’s why you should use Python for scientific research – comparison of Python and its most important scientific competitors (Matlab, Fortran and C/C++) by Vinay R. Rao