SteinBlog

JChemPaint turns three …

… point zero. Soon. Hopefully.

In 1997 I decided to start my own structure editor and called it JChemPaint. Soon after, documented here, Egon joined and a lot of open cheminformatics developed afterwards. 🙂

Now my team and the Bioclipse guys at Uppsala are at work designing and implementing JChemPaint 3.0, which will have a modular design, various renderers and a production-quality applet. Or so the theory goes.

Gilleain Torrance has posted a number of blog items, also referencing the work of our collaborators in Uppsala, thinking about design patterns which allow us to make the architecture flexible enough to produce a fully fledged structure editor on one side and a less than 100 kb drawing applet on the other side.


Happy birthday, Chem-bla-gon

Chem-bla-ics, the blog of my long-term collaborator Egon Willighagen, is celebrating its third birthday. I guess it is not only the birthday of the blog but also of Egon as a blogger, so all the best, Chem-bla-gon, from me and my team. Egon has made a lot of nice and interesting contributions to chemical blog space, whose formalization, by the way, is another of his achievements – done when he did his postdoc in my group in Cologne.


Faster Fingerprints for the CDK

Mark Rijnbeek, who moved to my team last month to work on the chemistry search engine for our new chemogenomics data, has given Rajarshi’s new fingerprint implementation a test. Mark was bored to hell by the performance of the version he had in hand and it turned out that it was my old one, which had served us well for quite a while but turned out to be unusable for the amount of data we are testing now.

So he downloaded CDK 1.04, just released a few days ago, and gave it a shot.

Mark wrote:

“Here’s what happens: fetch 1000 molfile clobs from Oracle, put them in a list, create a list of Molecule objects from that, and lastly calculate fingerprints on that last list.
Below is Java system output, each CDK version tested against a 1000 compounds, twice.
The numbers are milliseconds [passed] since program start.
The performance increase is very significant; the older CDK fingerprinter took about a minute (see below) for 1000 fingerprints, the new one about 7 seconds.”

The numbers for the “old code”:

0 - Start benchmark 1000 compounds.
84 - Fingerprinter set up
531 - Connected to database
120 - Resultset opened
1706 - Molfile strings retrieved from database, stored in list
3231 - Molecule objects list built
64202 - Fingerprints calculated

And then CDK 1.04:

0 - Start benchmark 1000 compounds.
77 - Fingerprinter set up
536 - Connected to database
118 - Resultset opened
909 - Molfile strings retrieved from database, stored in list
2360 - Molecule objects list built
9900 - Fingerprints calculated

These numbers are just one representative instance from multiple runs performed by Mark. They do not quite fit the numbers reported by Rajarshi, but the conditions were too different to be comparable. In our case, the achieved speed-up is 8-fold, which is a nice success and even better than Rajarshi’s reported 4-fold speed-up.

We plan to report soon on a benchmark with a much larger dataset.
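For reference, the 8-fold figure follows from the two listings above if the fingerprinting time is taken as the difference between the last two cumulative timestamps of each run. The class below is only a sketch of that arithmetic, not the code Mark ran; the class and method names are made up for illustration.

```java
/** Sketch: derive the fingerprinting time and speed-up from the cumulative timestamps. */
final class FingerprintSpeedUp {

    /** Duration of one stage, given cumulative timestamps in ms since program start. */
    static long stageMillis(long previousStamp, long stageStamp) {
        return stageStamp - previousStamp;
    }

    /** Ratio of the old duration to the new duration. */
    static double speedUp(long oldMillis, long newMillis) {
        return (double) oldMillis / newMillis;
    }

    public static void main(String[] args) {
        // Stamps copied from the listings: "Molecule objects list built" and
        // "Fingerprints calculated" for the old code and for CDK 1.04.
        long oldFingerprinting = stageMillis(3231, 64202);
        long newFingerprinting = stageMillis(2360, 9900);
        System.out.println("old fingerprinting ms: " + oldFingerprinting);
        System.out.println("new fingerprinting ms: " + newFingerprinting);
        System.out.println("speed-up: " + speedUp(oldFingerprinting, newFingerprinting));
    }
}
```

That is roughly 61 seconds against 7.5 seconds for the fingerprint stage alone, a ratio of about 8.1.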

Thanks, Rajarshi. Great stuff!


Creating and Reviewing Patches in the Chemistry Development Kit (CDK)

In order to prevent major turbulence in the main source code development line of the Chemistry Development Kit (CDK), we decided a while ago to have separate branches in our Subversion source code management system for each developer and each of his or her subprojects. Once a project has been finalized by a developer in her branch, she would then publish a patch in the CDK patch tracker system and ask for it to be reviewed by posting to the CDK developers mailing list. A CDK senior developer would then assign the patch to himself or another senior developer.

I have just been assigned the task to review the recent Iterator/Iterable patch for CDK and will document the process here for future reference. The patch was published on the CDK patch tracker.

The executive summary of the reviewing task goes like:

  1. browse the code
  2. mark up code you think is buggy
  3. note missing unit tests
  4. note missing JavaDoc
  5. warn for subjected PMD warnings
  6. optionally note other problems
  7. optionally any other comment you have

So, let’s see how it went:

Browse the Code

I got the gzipped archive with Egon’s patch and looked at the code. A large part of the changes involve

removing       public Iterator<IIsotope> isotopes() {
and adding    public Iterable<IIsotope> isotopes() {

to enable things like

double overallCharge = 0.0;
for (IAtom atom : molecule.atoms()) {
    overallCharge += atom.getCharge();
}

In order to implement Iterable, one needs to have methods returning an Iterator, so a lot of code essentially implements those.

Remove:       public java.util.Iterator atoms() {
and add:      public Iterable<IAtom> atoms() {
                  logger.debug("Getting atoms iterator");
                  return super.atoms();
              }

And then there is code actually using those iterators and all of these instances had to be adapted too (I’m just giving the patch syntax):

for(IReactionScheme rm : scheme.reactionSchemes()){
-                       for(Iterator<IAtomContainer> iter = getAllMolecules(rm, molSet).atomContainers(); iter.hasNext(); ){
-                       IAtomContainer ac = iter.next();
-                       boolean contain = false;
-                       for(Iterator<IAtomContainer> it2 = molSet.molecules();it2.hasNext();){
-                               if(it2.next().equals(ac)){
-                               contain = true;
-                               break;
-                       }
-                       }
-                       if(!contain)
-                               molSet.addMolecule((IMolecule)(ac));
-                       }
+                for (IAtomContainer ac : getAllMolecules(rm, molSet).atomContainers()) {
+                    boolean contain = false;
+                    for (IAtomContainer atomContainer : molSet.molecules()) {
+                        if (atomContainer.equals(ac)) {
+                            contain = true;
+                            break;
+                        }
+                    }
+                    if (!contain)
+                        molSet.addMolecule((IMolecule) (ac));
+                }

Overall, the patch affected 288 classes including test classes, with almost 2000 lines of code changed.
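To make the Iterator-to-Iterable change concrete outside of the CDK code base, here is a minimal, self-contained sketch. SketchAtom and SketchMolecule are hypothetical stand-ins for IAtom and a molecule class, not CDK types; the only point is that a method returning an Iterable can be used directly in a for-each loop, while the Iterable itself still has to hand out an Iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a CDK atom; it only carries a charge. */
final class SketchAtom {
    private final double charge;

    SketchAtom(double charge) {
        this.charge = charge;
    }

    double getCharge() {
        return charge;
    }
}

/** Hypothetical stand-in for a molecule that exposes its atoms as an Iterable. */
final class SketchMolecule {
    private final List<SketchAtom> atoms = new ArrayList<>();

    void addAtom(SketchAtom atom) {
        atoms.add(atom);
    }

    /** Post-patch style: returns an Iterable, which in turn supplies the Iterator. */
    Iterable<SketchAtom> atoms() {
        return new Iterable<SketchAtom>() {
            @Override
            public Iterator<SketchAtom> iterator() {
                return atoms.iterator();
            }
        };
    }

    /** The for-each loop from the text, made possible by the Iterable return type. */
    static double overallCharge(SketchMolecule molecule) {
        double overallCharge = 0.0;
        for (SketchAtom atom : molecule.atoms()) {
            overallCharge += atom.getCharge();
        }
        return overallCharge;
    }

    public static void main(String[] args) {
        SketchMolecule molecule = new SketchMolecule();
        molecule.addAtom(new SketchAtom(1.0));
        molecule.addAtom(new SketchAtom(-0.5));
        System.out.println(overallCharge(molecule));
    }
}
```

With a method that returns a bare Iterator instead, the caller has to write the hasNext()/next() loop by hand, which is exactly what the removed lines in the diff above do.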

Mark up code you think is buggy

That is impossible for me to do for such a large set of changes, so here one must rely on the unit tests.

Note missing unit tests

Egon had earlier posted some notes about comparing failing and passing unit tests, but we also need an automatic check for unit test coverage. And yes, of course, there are limits to what such an automated coverage tool can do.

With regard to failing unit tests, the “iterable” branch did not have any more failures and errors than the head branch.
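The comparison itself is just a set difference. As a sketch, assuming the names of the failing tests have been collected per branch by some means of your choice (the class name and the test names used with it are hypothetical, not CDK tooling):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: find tests that fail on a branch but do not fail on head. */
final class TestRegressionCheck {

    /** Returns the failures that are new on the branch, sorted by name. */
    static Set<String> newFailures(Set<String> headFailures, Set<String> branchFailures) {
        Set<String> result = new TreeSet<>(branchFailures);
        result.removeAll(headFailures);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical test names, for illustration only.
        Set<String> head = new TreeSet<>(Arrays.asList("FooTest.testA", "BarTest.testB"));
        Set<String> branch = new TreeSet<>(Arrays.asList("FooTest.testA", "BazTest.testC"));
        System.out.println("New failures on branch: " + newFailures(head, branch));
    }
}
```

An empty result means the branch introduces no new failures, which is the criterion used here.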

Note missing JavaDoc

We’ve got DocCheck results on our CDK nightly pages but nothing tells you whether a patched method is missing necessary JavaDoc. Presumably, we could “grep” the patched class names into a DocCheck input file and get customized info about it.
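As a rough sketch of that “grep” step, the class below pulls class names out of the "+++ " file headers of a unified diff. It assumes the sources live under a directory such as "src/" (an assumption about the layout, passed in as a parameter) and says nothing about what a DocCheck input file actually has to look like; it only produces the list of patched classes.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: collect the classes touched by a patch from its unified-diff headers. */
final class PatchedClasses {

    /**
     * Reads "+++ path/to/File.java" headers and turns the part of the path after
     * sourceRoot into a dotted class name. Paths containing spaces are not handled.
     */
    static Set<String> fromDiff(List<String> diffLines, String sourceRoot) {
        Set<String> classes = new LinkedHashSet<>();
        for (String line : diffLines) {
            if (!line.startsWith("+++ ")) {
                continue;
            }
            // Drop the "+++ " prefix and any trailing revision info after whitespace.
            String path = line.substring(4).split("\\s+")[0];
            int root = path.indexOf(sourceRoot);
            if (root < 0 || !path.endsWith(".java")) {
                continue;
            }
            String relative = path.substring(root + sourceRoot.length(),
                                             path.length() - ".java".length());
            classes.add(relative.replace('/', '.'));
        }
        return classes;
    }

    public static void main(String[] args) {
        // Hypothetical diff header lines, for illustration only.
        List<String> diff = Arrays.asList(
            "--- src/org/openscience/cdk/AtomContainer.java\t(revision 1)",
            "+++ src/org/openscience/cdk/AtomContainer.java\t(working copy)");
        System.out.println(fromDiff(diff, "src/"));
    }
}
```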

Warn for subjected PMD warnings

PMD is a tool for checking code with respect to adherence to certain coding standards. Again, the CDK nightly page contains all PMD reports on the CDK code, generated in nightly runs. The same can be achieved for each branch with an “ant -f pmd.xml” on your local copy of the branch.

Optionally note other problems

I love optional things and tend to let them be optional.

Optionally any other comment you have

Ditto.

So, overall I would like to conclude that according to the best of my knowledge, the Iterable patch should be safe and can be applied to the HEAD branch.


Linus on GIT on Google TechTalks

I’m a big fan of Google TechTalks and watch a lot of them during flights. This week I enjoyed the recording of Linus Torvalds insulting all kinds of people, including the whole SVN development team, while introducing his distributed source code management system GIT. Egon had pointed me to GIT quite a while ago, but seeing Linus himself discuss the issue made a difference. While CDK is still considerably smaller than the Linux kernel, I can see a lot of commonalities, and I think that with our current practice of having our fellow co-admins review important patches and branches, GIT sounds like a much easier way to do it.

In GIT the source code is distributed – there is no concept of a central source repository. Developers commit their changes to their local GIT systems, with all the advantages of versioning and source code history. Other developers pull code from you if they think that the changes you’ve advertised via your favourite communication channels are interesting. In theory, this allows for a very democratic and evolutionary code development. In addition to being distributed, GIT seems to be very fast when it comes to merging. Linus reports that he does hundreds of full merges per day and nothing takes longer than 5 secs.

In practice, as Linus points out in his talk, there will always be one or very few repositories that people pull from – for the Linux kernel it will be Linus’s machine. In CDK it will very likely be Egon’s. Sorry Egon, you’ve got to be online all day 🙂

The last sentence already brings me to the point. I wonder if we should give GIT a try for CDK development. The advantages do sound enormous. OK, there are disadvantages too, such as losing the central web browsing of the SVN repository on SF. There may be ways around this, as Egon described here, but this seems like not using the real thing.

This is a brief impression dump after watching Linus’ talk today and I’m happy to hear your opinions 🙂


La Mimosa, an Italian Restaurant in Cambridge, UK

For the second time I took guests to La Mimosa in Cambridge and it was as flawless as the first evening.

La Mimosa has both a beautiful location and a very nice menu – solid, no-nonsense Italian food.

In the summer you can sit outside directly at Jesus Green at Thompsons Lane, Fen Ditton, Cambridge, CB5 8AQ, UK, +44 1223 362525. The interior is very cosy, the staff enjoyable and friendly and I’m more than happy to recommend the place. Today we shared some mixed antipasti consisting of a bit too much salad, prosciutto, salami, buffalo mozzarella, sun-dried tomatoes and olives. Nothing fancy but solid quality.

I had a Risotto ai funghi porcini as the main course. Well done!

And concluded with a traditional Italian dessert – so boring that I’m ashamed to say which one.

Then Double Espresso and a Grappa, as usual.

So, good food, nice people running the place. A great evening for the second time, which I really credit to the people taking care of us.


Breaking News: Open access to large-scale drug discovery data at EBI

Very exciting things have just happened here at EBI in the area of chemoinformatics and drug discovery:

The Wellcome Trust has awarded £4.7 million to the European Bioinformatics Institute (EBI) to support the transfer of a large collection of information on the properties and activities of drugs and a large set of drug-like small molecules from BioFocus DPI, part of the publicly listed company Galapagos, to the public domain.

Here are the press releases of EBI and Galapagos.

The databases will be incorporated into EBI’s collection of open-access data resources for biomedical research and will be maintained by a newly established team of scientists at the EMBL-EBI. These data lie at the heart of translating information from the human genome into successful new drugs in the clinic.

The databases to be brought into the public domain include DrugStore™ (database of known drugs), StARLITe™ (database of known compounds and their effects), Strudle™ (binding site drugability), and Kinase SARfari™ and GPCR SARfari™ (informatics systems for the most widely used target classes in drug discovery).
The main database, StARLITe, on Drug-Target interactions alone has hundreds of thousands of interaction data points manually curated from the medicinal chemistry literature.

A new team leader will be appointed to support the new resource and my group will provide the chemoinformatics expertise to move the underlying analysis software into the open source world. The Chemistry Development Kit (CDK) will play a major role both in freeing the QSAR code as well as in providing (sub-) structure and similarity searching to the database.

The transfer will empower academia to participate in the first stages of drug discovery for all therapeutic areas, including major diseases of the developing world. In future it could also result in improved prediction of drug side-effects and spark all kinds of new academic research directions.

We are thrilled 🙂


The Crowne Plaza, Koulova 15, in Prague

Not much to say here. For a hotel of this chain, a disaster! Avoid it, if you can. Take the Marriott or the Hilton in the center, or even better, one of the small, more private hotels in the area.

The Crowne Plaza’s staff was, with the usual exceptions, crabby and unfriendly, and – even worse – horrendously disorganized. The regular category of rooms (the stuff that you get when you do not ask for anything smart) was loud, small and ugly. Trying to get something decent away from the street was possible but hard. The breakfasts on the following days were characterized by being treated like a fraudster because they did not manage to put our names paired with the correct room number on their list of people to be admitted to the restaurant for breakfast.

It was impossible to get a correct recommendation from the concierge on how to get to the centre. As it turned out later, staff tried to sell us overpriced tickets. An attempt to get our hands on our voucher (delivered to staff at check-in) to check the pick-up time for the airport resulted in 30 min of hectic searching by the people behind the counter. Yet without success.

So, again, avoid this place, even if you have business in the area.


The Trouble with Physics

I do not normally recommend books that I read to a wider public. Partly because I’m disappointed if someone dislikes a book that I loved, partly because I do not think that my taste is of interest to anyone, partly because 90% of my reading was written by Terry Pratchett. In addition, when it comes to Science books, I rarely finish any of them, partly because I don’t understand them, partly because they bore me to death.

But this one was different and since this is a blog partly concerned with science in general, I would like to recommend to you “The Trouble with Physics” by Lee Smolin. The book’s subtitle is “The Rise of String Theory, the Fall of a Science, and What Comes Next”. You, likely to be a member of some molecular informatics community, may ask who cares about String Theory, and you may be right at first glance. This book, however, is about science in general, the mechanisms that drive it, and about what can go wrong. It is as much a book about science, physics and String Theory as it is about the people in science and how they shape the fate of their field and, more importantly, of the rest of the pack working in it. It is fascinating to follow Smolin’s account of the rise of String Theory as one of the leading theories in physics. It is frightening to realize that 20 years or more of String Theory has, if we believe the book, not led to a single, novel, unique prediction that has been verified afterwards by experiment.

But while the account of the rise of String Theory was very entertaining and informative to read, the rest of the book, dealing with questions of the alternative theories, of science sociology, what science really is and what we can do to keep a finite possibility of fundamental revolutions, is even better.

I’m going to stop here because I feel that this book, once you’ve started reading it, will speak for itself.