How Open are Commercial Scientific Software Packages?

A revised version of this post has been published as a Viewpoint in the Journal of Physical Chemistry Letters, DOI: 10.1021/acs.jpclett.5b02609.

Most scientific research nowadays relies on some kind of software. This is particularly true in fields such as my own, quantum chemistry. Such software is used both in applications to study various problems in chemistry, often in close connection with experiment, and as a platform for the development of new theoretical and computational methods.

In quantum chemistry, many program packages are available [1] that differ in functionality, usability (from easy to use by non-specialists to usable only by the person who wrote it), and computational efficiency. What basically all available codes have in common is that they have been developed with public (i.e., taxpayers’) money. Nevertheless, the terms under which they are made available differ very significantly: Some program packages are available under open-source licenses (meaning that anyone can “study, change, and distribute the software to anyone and for any purpose” [2]), while others are owned by commercial companies who sell them to both academic groups and industry users for a small or large fee. Intermediate models (free, but not open source) also exist, such as closed-source software that is distributed for free by academic groups [3] or software for which the source code is available to academic users, but with license terms that prohibit changes or redistribution.

Pushing towards Open-Source in Science

Open-source scientific software offers a number of advantages for science as a whole [4]. The most important one is that publicly funded scientific software should be available for everyone to use and extend. This has led some funding agencies, in particular in the US, to require software developed under certain grants to be open source. Recently, Krylov et al. published a viewpoint in J. Phys. Chem. Lett. that criticizes such open-source mandates [5].

The piece is written by eminent scientists, whose work in quantum-chemical method and software development I admire. What all of them have in common is that, besides being professors at research universities, they are co-owners of companies selling quantum-chemical software packages [6]. I find many of the arguments put forward in this opinion piece flawed, its use of terminology inconsistent (free as in speech vs. free as in beer), and its statements often contradictory.

Perspective of the Method Developer

Here, I want to focus on one particular perspective: The one I have as a developer of new quantum-chemical methods. To develop, test, and finally use a new idea, it needs to be implemented in software. Usually, this requires using a lot of well-established tools, such as integral codes, basic methods developed many decades ago, and advanced numerical algorithms. All of these are prerequisites for new developments, but not “interesting” by themselves anymore today. Even though all these tools are well documented in the scientific literature, recreating them would be a major effort that cannot be repeated every time and by every research group – because both time and funding are limited resources, especially for young researchers with rather small groups such as myself.

Therefore, method developers in quantum chemistry need some existing program package as a “development platform”. Both open-source and commercial codes can offer such a platform. Open-source codes have the advantage that there is no barrier to access. Anyone can download the source code and start working on a new method. I have so far mostly contributed my developments to commercial codes. These also offer a lot of advantages: For successful codes, the revenue from selling licenses can be used by the companies owning them to employ software developers who maintain and document the code. These can further improve code contributed by academic groups in order to make it maintainable, efficient, and easily extendable. This can speed up new developments and improve the quality and efficiency of the resulting new software.

Commercial Codes as “Open Teamware”?

The authors of the opinion piece in Ref. [5] argue that there is no need for open-source development platforms because many commercial codes, such as Q-Chem [7] and others, operate under what they call an “open teamware” model. As they point out, many commercial codes have assembled rather large communities of academic developers.

However, I would argue that access to commercial codes as a development platform is not as open as the authors of Ref. [5] claim. First of all, it is subject to signing a developer agreement, the terms of which are dictated by the companies owning the source code and drafted to protect their commercial interests. Usually, they include a transfer of intellectual property rights for the new developments to these companies as well as non-disclosure clauses concerning the source code and the algorithms implemented in it. (Here and in the following, I am not talking about specific software packages, because the terms of developer agreements are usually covered by non-disclosure clauses themselves. Therefore, I either do not know the precise terms, or I am not allowed to reveal them.)

Often, such developer agreements require exclusiveness, meaning that new source code cannot be contributed to different commercial packages. Sometimes, developers are even banned from using competing program packages [8]. Such requirements for exclusiveness prevent scientific collaborations. I have encountered this on several occasions, when fellow scientists told me that they would love to collaborate, but that they cannot do so because we are contributing to competing packages. Thus, the commercial interests of software companies lead to a segregation of the scientific community based on affiliation with certain codes. Often, methods developed in one program package are reinvented in others because scientists cannot collaborate or use each others’ software.

Perpetuating Power Structures

The use of commercial codes as a development platform also puts the few scientists owning the corresponding companies into a gatekeeper position. It is up to them to decide who is allowed to contribute new ideas and developments. The policies of different companies may differ significantly. However, all of them require developers to reveal novel research ideas to the scientists in these gatekeeper positions. These will in many cases be competing scientists, who might reject access because the ideas run counter to their own “scientific beliefs” or because they might interfere with their own lines of research.

These mechanisms perpetuate power structures that put very few individual scientists, the owners of commercial software packages, in control of most method development in our field. It should be pointed out here that many of the authors of Ref. [5] are not the original developers of the commercial codes they now own, but have inherited them from their academic teachers. Such decisions will certainly have been based on scientific achievements, but they were not taken by the academic community as a whole through peer review and funding panels, but by the few pioneers who developed the software infrastructure our whole field relies on today.

This contradicts the merit-based access to scientific resources that the authors of Ref. [5] so keenly advertise. The possibility to carry out new method developments should be based only on the quality of new ideas (as judged by grant reviewers and panels of funding agencies) and not on whether or not a scientist is part of a certain school. The “track record of productivity” [5] rewarded by funding agencies with grant money should have been established with competitive ideas, not because of access to a software infrastructure built by a researcher’s academic ancestors. (Again, let me point out that I admire the track record of all of the authors of Ref. [5] – but I think that the playing field has to be leveled for the next generation of scientists.)

Finally, I have to admit that, at least in part, the problems discussed above also exist for open-source program packages. Often, these codes are less well documented and maintained (because of the lack of revenue from selling licenses), with the consequence that the barrier to contributing to them can be significant. Often, it can only be overcome by collaborating with one of the lead authors of such open-source codes, which again puts them into a similar gatekeeper position. In addition, open-source code is often not immediately released to the public in order to maintain a competitive advantage over scientists who might want to improve or build upon new methods.

Possible Solutions

A first step towards a solution would be to remove the conflict of interest faced by the many scientists owning and running scientific software companies. If these companies are run by businessmen instead of active scientists, then decisions to grant access to new external developers will be based on the possible merits for the (paying) users of the software packages and will not be influenced by fear of scientific competition. Some commercial codes use such a model [9]. At the least, the policies underlying decisions whether or not to grant access to external developers should be made transparent.

Second, I believe that funding initiatives to create open-source packages and to sustain their maintenance are an important piece in creating truly open platforms for method development. Apparently, such initiatives are being implemented in the US, both through national laboratories and via funding agencies [5]. They provide a means to level the playing field by making funding available to open-source packages comparable to what commercial codes obtain via their revenues from selling licenses to academic and industrial users. Such initiatives should, of course, not destroy commercial codes, but merely level the playing field. In fact, there are also funding opportunities that are exclusively available to commercial codes, such as technology grants. In Europe, many programs under the Horizon2020 framework encourage or require the involvement of small or medium enterprises, and some quantum chemistry software companies have been very successful in securing such grants [10].

Concerning funding for fundamental research, open-source mandates might indeed have severe consequences for commercial codes because they would be cut off from academic method development. This could be mitigated by requiring such codes – if they want to profit from public funding for basic research – to implement a truly open platform strategy that allows non-discriminatory access to the source code for interested developers. Even with strict open-source mandates, commercial codes would still have the possibility to benefit from new developments in the form of modular libraries released under open-source licenses.

Conclusions

I have focused here on the perspective of the quantum-chemical method developer. Of course, there are other aspects of this discussion that are equally relevant, such as that of the users and the global perspective of science as a whole. Related discussions on open-access and open-data policies are often mixed with those on open-source software, which I find detrimental because the players are very different (small software companies run by scientists in the case of open source vs. huge publishers with monopolies in the case of open access). In any case, I want to repeat that this blog post only records some of my personal thoughts, and I welcome any comments and discussions.

Conflict-of-Interest Statement I am a university professor in theoretical chemistry whose research depends on funding by public money – via government funding of our university and via funding agencies. Our research is also supported by industry grants from Volkswagen AG, Wolfsburg.
Most of my past method development has been contributed to the commercial software package ADF, owned by Scientific Computing and Modeling (SCM) B.V., Amsterdam, under a developer agreement. I also have access to the Turbomole program package under a developer agreement with Turbomole GmbH, Karlsruhe. I have no financial assets in SCM, Turbomole, or other scientific software companies, and I did not receive direct or indirect financial compensation for these contributions. I have also contributed to the Dirac and Dalton packages, which are free for academic users, but not open source (yet). Some software developed in my research group is – or will soon be – available under open-source licenses.

References

[1] https://en.wikipedia.org/wiki/List_of_quantum_chemistry_and_solid-state_physics_software
[2] https://en.wikipedia.org/wiki/Open-source_software
[3] see e.g., the ORCA code, http://www.cec.mpg.de/forschung/mts-forschungsprojekte/orca-prof-frank-neese-dr-frank-wennmohs.html
[4] J. D. Gezelter, “Open Source and Open Data Should Be Standard Practices”, J. Phys. Chem. Lett. 6, 1168−1169 (2015). DOI: 10.1021/acs.jpclett.5b00285
[5] A. I. Krylov, J. M. Herbert, F. Furche, M. Head-Gordon, P. J. Knowles, R. Lindh, F. R. Manby, P. Pulay, C.-K. Skylaris, H.-J. Werner, J. Phys. Chem. Lett. 6, 2751-2754 (2015). DOI: 10.1021/acs.jpclett.5b01258
[6] see the “conflict-of-interest statements” at the end of Ref. [5]
[7] http://www.q-chem.com/
[8] http://www.bannedbygaussian.org/
[9] http://www.scm.com/
[10] http://www.scm.com/EUprojects/

Nothing is more important than backups

There is one single most important thing for running a research group as a young scientist. It is not finding good students or writing high-impact papers (of course, you should do that, too), but having good backups. Of everything. And having more than one backup. In different places. In short: be completely paranoid about your backups, and it will someday pay off.

Luckily, I have always been somewhat paranoid about my backups, but still not paranoid enough. Let me tell you two recent stories …

Part 1: The Macbook

A few months ago, I had to give a rather important talk. I spent a whole week preparing my slides. As usual, I worked on my beloved Macbook for that. While putting the final touches on my talk and doing a few practice runs on the afternoon before I had to leave, my Macbook started to show only a blue screen. It did not boot anymore, I could not log in, and I could not access any data on it.

Luckily, I have backups, you would think. I use an external hard disk with Time Machine, but this backup is only done when I connect the external disk. Usually I do this once a week, but sometimes I forget, especially when there are many (apparently) more important things to do (like preparing talks). So the Time Machine backup did not know anything about my talk and did not help. My secondary backup is done by Backblaze. This works online and continuously backs up new files in the background [1]. As a last resort, I copy important files (like upcoming talks) to Dropbox. So I was able to recover my talk from Backblaze and Dropbox and could give the talk using my iPad. Thank you, Backup!

Part 2: The Fileserver

All data of my research group is on a file server that is part of our computing cluster. Really, all data: raw data from all quantum-chemical calculations done in the last four years; all manuscripts, notes, etc.; all source code developed in the last four years; and probably much more that I cannot think of right now. It seems that backup is important here …

All this data is stored on a RAID-5 consisting of six disks. This means that one disk can fail without any data loss. This is the first line of defense. A few weeks ago, one disk failed. So far so good, no big problem: we ordered a new disk and replaced it, the RAID then rebuilds itself (which takes a few hours), and afterwards another disk can again fail without data loss. All this can be done without switching off the server. Actually, without doing anything other than replacing the disk.

However, what happened was that while the RAID was rebuilding, another disk failed. So we were completely screwed. All data on our file server was lost. Again, luckily, we do have another backup: All data is backed up every night to the disks in our desktop computers using a home-built solution based on duplicity. So, we could restore basically everything from our backup [2].

While this all sounds good, things were actually more difficult than expected: It turned out that for one directory, the backup had failed a few weeks ago because the desktop computer used for this backup had been switched off. So we were missing a few weeks of data there. Second, restoring data with duplicity was more difficult than backing up: For some files it failed with error messages or because the network connection was interrupted, and there is no option to resume restoring. Therefore, a lot of manual work was necessary, but it was successful in the end.
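To make this concrete, here is a rough sketch of what such a nightly duplicity setup can look like. Note that the paths, the GPG key ID, and the schedule below are made-up examples for illustration, not our actual configuration:

```shell
# Hypothetical crontab entries on a backup desktop machine.
# All paths, the key ID, and the schedule are illustrative examples.

# Nightly at 02:00: encrypted incremental backup of the file server mount,
# with a fresh full backup forced once a month.
0 2 * * * duplicity --encrypt-key ABCD1234 --full-if-older-than 1M /mnt/fileserver file:///backup/fileserver

# Weekly on Sunday at 04:00: compare the archive against the source data,
# so silent backup failures surface before a restore is actually needed.
0 4 * * 0 duplicity verify file:///backup/fileserver /mnt/fileserver

# Occasional restore drill (run by hand, not from cron): recover a single
# directory into a scratch location and spot-check the recovered files.
#   duplicity restore --file-to-restore projects file:///backup/fileserver /tmp/restore-test
```

The `verify` and restore-drill entries are exactly the steps that would have caught our two problems early: a backup silently failing for weeks, and restore errors that only showed up under pressure.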

Lessons

  1. Backup your data! Backup everything. Have more than one backup. Have backups in different places. In short: be completely paranoid about your backups, and it will someday pay off!
  2. Backups should always be your number one priority. If something seems wrong (like a desktop computer that is used for backups being switched off), fix it. Immediately.
  3. Practice restoring your backups! The moment you really rely on your backup is not the best time to discover problems with your backup software. This is like a fire drill: you should practice it before there is an actual fire.
  4. Disks that you bought at the same time might start to fail at the same time. Do not rely on your RAID as a “backup”.

[1] We could discuss data security now, but for the moment I trust Backblaze. All data is encrypted before it is transmitted and should be accessible only by me with my encryption passphrase (which differs from the password). (Side note: Do not store your password and encryption passphrase only on your computer, or you will be screwed.)

[2] In fact, we could save most data on our RAID by forcing one failed disk to go online again. This somehow screwed up the filesystem, but after doing a repair with some apparently pretty dangerous options, most files seemed to be there again, so that we could at least get the server up and running. Afterwards, we still did a full restore from the duplicity backup.

OpenAccess: A Young Scientist’s View

Even though I wanted this blog to be mainly about science, I had some discussions about (science) politics on Twitter and Facebook in the last few days. So I decided to share some of my thoughts and hope that the discussion will continue here.

I am working as an independent young chemist in the southern German state of Baden-Württemberg. In Germany, the states are responsible for the universities, and Baden-Württemberg is currently planning a new university law. There are many points in the current proposal that are worth being criticized by scientists. I will not go into details here; that would be another post, and I am not an expert on many of these aspects. However, instead of discussing these big issues, lots of criticism is focused on one small provision concerning OpenAccess publication. Examples can be found in the newspaper FAZ [1] and in an article in the newspaper of the Hochschulverband – the largest association of German professors [2].

What is all the fuss about? The way science works is that I do stuff (science), and when I have found something more or less interesting, I write an article about it and send it to a scientific journal. The editor there sends it to colleagues, who read it carefully (most of the time) and write some feedback to help the editor decide whether my article should be published in this journal. If the feedback is positive, the article is edited, formatted, and published online. However, before publication I have to sign a form in which I hand over more or less all rights to my article to the publisher of the journal. Usually this happens already when submitting the article – meaning that without signing such a copyright transfer agreement, nobody will ever look at or read my work.

Once my article is published online, you can only download it if you have subscribed to this journal. These subscriptions are usually bought by the universities (for a lot of money). Depending on the terms of the copyright agreement, I am usually not allowed to make my article available in any other way, for example by posting it on my website. So if I do not like the terms, I could pick another publisher. However, the choice of journal is determined by many other considerations (readership of the journal, topic of the article, and – unfortunately – also the journal’s impact factor). This would be a topic for another blog post. Often, there is not much choice for a given article – in particular for young scientists who have to “play by the rules”. Finally, while there are slight differences, the copyright agreements of different publishers are in general very similar.

The government of Baden-Württemberg now wants to make it mandatory for scientists funded by the state that, when publishing their results in a scientific journal, they retain the right to post their article on their own websites or in university repositories six months after the journal publication. To me, this sounds like a good thing. More colleagues will be able to download and read my articles, even if their universities do not have the money to subscribe to the journal in question. Hopefully, these additional readers will cite my article, which boosts my ego and is also good for my career. The more political argument for such a law is that if my research has been funded by the taxpayers’ money, the results produced with this money should also be available to the public.

This all sounds reasonable, right? So why is this policy criticized at all? The main argument I read is that it undermines the “freedom of science”. In fact, this is partly true: As a scientist, I cannot choose just any journal anymore, but only those that agree to the terms dictated by the state of Baden-Württemberg. However, I think that this is a necessary step to give scientists back the freedom to make their work freely available.

In the end, publishers rely on scientists who write articles for their journals. If enough scientists demand a change of their copyright agreements, they will change their policy. But on my own, I have no power to demand such a change. I have to publish in the most suitable journals no matter whether I like their policy or not – otherwise I would damage my career. Therefore, a critical mass has to be reached somehow. This could be a coordinated effort by scientists – for instance, through their scientific societies. But these societies usually publish journals themselves and depend on the revenues from subscription fees.

This leaves those who fund science in the position to put pressure on publishers – and this is what Baden-Württemberg is trying to do now. In fact, it has already been shown that this approach works: The National Institutes of Health (NIH), as the biggest funder of research in the biomedical sciences, introduced a similar policy a few years ago. As a consequence, basically all publishers now offer a suitable option for scientists funded by the NIH. One could argue that Baden-Württemberg is much smaller, so that it is too small to put pressure on big, international publishers, and eventually its scientists will suffer. However, Baden-Württemberg is not alone: The EU adopted a similar policy in its new research program Horizon2020, and federal German institutions like the Max Planck Society and the DFG are considering such rules as well. Hopefully, other German states will follow the example of Baden-Württemberg [3].

It should be mentioned that the requirements considered now in Baden-Württemberg (also known as “Green OpenAccess”) are pretty mild. Articles on the journal websites do not have to be freely available, and journals can still sell subscriptions. And articles are posted on the scientists’ websites or in university repositories only after six months. In fact, quite a few journals already comply with the proposed rules [4]. This is, for example, the case for Nature and Science and for many journals in physics-related fields.

Finally, it should be mentioned that some journals might charge the authors money if they want to retain the right to make their articles freely available, in order to make up for the loss in subscription revenues. Whether this is a valid argument for the mild requirements discussed here is another discussion, but if stricter OpenAccess rules are enforced, it will certainly be justified. In this case, it will be important that those who set these rules (i.e., the funders) also provide scientists with the necessary money for publishing costs.

To summarize: I believe that to make scientific results freely available, coordinated efforts by funding agencies or enforced by law are the only feasible way. This does not undermine the freedom of science, but eventually restores it.

[1] “Droht Wissenschaftlern der Zwang zum Selbstverlag?” FAZ, 5.2.2014. No link provided here because of a ridiculous German law called “Leistungsschutzrecht“.
[2] Jörg Michael Kastl, “Neue Steuerung” in Forschung und Lehre 12/2013, p. 996.
[3] The German federalism, in which each state is responsible for its own universities, is not really helpful for a coordinated effort here. So someone has to start.
[4] The SHERPA/RoMEO database provides an overview of the copyright policies of different scientific journals.

Is Theoretical Chemistry Really a Dead End?

It took me some time, but now I want to write my first “real” blog post. As I already announced, I wanted to write a response to Mario’s blog post on the state of computational and theoretical chemistry. Mario paints a quite depressing picture: in his view, computational chemistry is an (academically) exciting field, but there is no future for you once you have gotten into it. I am doing theoretical chemistry myself, and even though Mario raises a number of important points, I tend to disagree with his conclusions.

But before getting to that, let’s consider his points. First, he is worried that there is no real job market for computational chemists outside of academia. Second, he states that most work done by PhD students in academic groups is routine work that could be outsourced to service departments or specialized CompChem service companies. Examples of such routine work are “running calculations” and “implementing new properties”. What would then be left for academic groups is to interpret results and draw interesting conclusions from them.

I will start with the second point: Is computational and theoretical chemistry really only routine work nowadays? Is our field just becoming a service branch of chemistry? I have heard quite a few other (also well established) people in the field express such thoughts. Of course, theoretical chemists have not developed methods in the past decades just for the fun of it. Instead, the field has tried to provide tools that are useful for chemistry in general. Today, we are at the point where this has become true, and hardly any experimental project can get along without some input from theory. Naturally, applying methods that have been developed before to new problems is to a large extent routine work. This is one part of what Mario describes.

However, I do not think that this is what dedicated academic groups in theoretical and computational chemistry should be doing, at least not exclusively. The more routine this kind of work becomes, the easier it will be for experimental groups to do the computational work themselves – which is happening more and more. And for the interpretation, experimentalists usually have a better idea of what is going on in their systems anyway. If we as computational chemists rely on doing only service work, we will make ourselves superfluous in the long run.

So, if routine work is to be avoided, what is there left to do? I believe that there are lots of challenges ahead of us, both in method development and in applying quantum chemistry. In fact, my own experience tells me that when collaborating with experimentalists on real problems, you will quickly hit the limits of what is currently possible and notice that new and improved methods are required for solving their problems. Some good thoughts can also be found in Ivan’s comment on Mario’s blog. I do not want to elaborate on the specific challenges ahead of theoretical chemistry here, but some thoughts can be found in this editorial by Walter Thiel, in this special topic in Science [Paywall], and in this piece I wrote two years ago for the journal of the German Chemistry Olympiad [PDF Faszination Chemie – in German].

Now, what about the PhD students in computational chemistry? Are they only doing routine work? This, of course, depends on their supervisors and the science they are doing. Therefore, my advice to future PhD students would be to carefully check whether the group they want to join is doing real science. If you are supposed to work on implementing the n-th variant of method X and your potential supervisor cannot tell you what scientific problem this method is solving, look for some other topic. Similarly, if you are only supposed to run calculations “for experimentalists” and the scientific challenges to be addressed are unclear, there might also be better projects around.

However, even in scientifically challenging projects there are (sometimes quite large) parts that are routine work. Calculations need to be run, programs need to be debugged, and someone has to administrate your computer system. But this is the case in every branch of chemistry. Personally, I do not want to know how much time a PhD student in organic chemistry spends repeating syntheses of precursor compounds and purifying brown reaction products on columns. This is just an essential part of science that cannot be avoided, but it should not be the only part of PhD research.

Finally, what about the job market for theoretical and computational chemists? First, of course, the academic job market is difficult, and not every PhD student or postdoc will eventually become a professor. But this is a topic for a different blog post. Second, I agree that there are very few jobs in industry that are really for computational chemists. However, I believe that this is going to change in the next few years (and is already changing). In the same way that theoretical chemistry has become an essential part of academic research, it will also become an essential part of industrial research, and experts will be required in industry. But this might only be my (too) optimistic outlook.

Nevertheless, lots of PhD students in theoretical chemistry will end up doing something else in their professional careers. This is something a supervisor should tell his/her PhD students from the start. During a PhD in theoretical chemistry you will acquire lots of skills in large-scale computational modeling and in (scientific) software development that are also relevant in other industries. Every one of my former colleagues who decided to leave the academic world actually found a good job somewhere, even though these jobs are – with a few exceptions – outside of computational and theoretical chemistry. Again, I believe that this is not specific to our branch of chemistry. None of my friends who went into experimental chemistry is actually still working in the lab.

I think there is a bright future for theoretical and computational chemistry – otherwise I would not be doing it. There are a lot of scientific challenges ahead of us, and if you are a future PhD student and want to contribute to solving exciting problems, then do so. But make sure that you are not “abused” for (only) routine work, and do not expect to find a job in industry where you can continue to do what you did in your PhD.

To Blog or Not To Blog

So, I decided to start a blog. This blog will be mainly about science, in particular theoretical chemistry, which is what I do for a living (and because it is an exciting and fun subject). The title of this blog refers to the Frozen Density Embedding method, which is one of the main areas of my work.

I had been thinking about starting a blog for some time because blogs are fun and I enjoy reading other (chemistry and non-chemistry) blogs. However, I never really knew whether I have anything to tell the world that could fill my own blog. I still don’t know that, but at least I have a few ideas. Let’s see where this is going …

The main driving force for really setting up something NOW was a blog post by Mario Barbatti on his (rather negative) view on the future of theoretical and computational chemistry. I promised Mario to post my thoughts as a comment on his blog, but then realized that I need more space. My first real blog post on the (bright) future of theoretical chemistry is not ready yet, but will hopefully appear here in the next few days.
