SteinBlog

Reviewing the CDK VFLIB patch

It is my duty today to review a patch for some piece of CDK code again. I blogged about this process earlier.

This particular patch was submitted by Mark Rijnbeek in my team via the CDK patch tracker. I first go to the patch page, assign the patch to myself and set the patch status to “pending” to indicate that it is being worked on. It patches a piece of code which uses the VFLIB graph matching library to provide substructure searches with an Ullmann and a VF2 algorithm.

Again, for my own records:

The executive summary of the reviewing task goes like:

  1. browse the code
  2. mark up code you think is buggy
  3. note missing unit tests
  4. note missing JavaDoc
  5. flag any PMD warnings the patch triggers
  6. optionally note other problems
  7. optionally any other comment you have

And this is how it went – I’m leaving out things that were not applicable:

Browse the code and mark-up buggy parts

Egon had made an older version of this code available via Git. I checked it out and looked at the code, which looked horrible because it was a 1:1 translation of horrible-looking C code. Clearly, decent variable names would greatly improve the code, but I remember a statement that the translator himself could not make sense of it, so the original author is to blame :-). I do not get the impression that this problem can be rectified quickly. In fact, it took Mark a few days to debug this code by adding a rich collection of debug messages. I’m not sure that this is how it should be. The code is essentially unreadable.

Note missing unit tests and javadoc

Mark and Rajarshi supplied a number of unit tests for the code and they all pass. The code itself has JavaDoc and there are usage examples – very good.

Meanwhile, Egon, our CDK Uberworker, has posted the following:

Hi Rajarshi, Mark,

I have had a look at the vflib branch, and note that the code is aimed
at the standard module; like all new code, but for code in this module
in particular, should adhere to CDK's 'stable' standards...

(BTW, there is a Nightly at [0] which has been running on an older
version of the patch)

The below are some guidelines, please feel free to ask me or search
the cdk-devel archives for the details.

1. clean JavaDoc

You can use DocCheck to check that your code has clean JavaDoc:

ant -f javadoc.xml doccheck

A common error is missing periods at the end of first sentences in the
JavaDoc. The first sentence is important to get right, per JavaDoc
standards.

2. no PMD warning (or with a good excuse)

ant -f pmd.xml

3. unit test coverage

Each module has a test suite MfooTests, which points to a Test class
doing coverage testing... new unit tests classes must be added to this
suite, MstandardTests for the vflib patch. The coverage testing class
will then check that all new code is tested.

I note missing tests of NodePair and State.

When these issues have been resolved, I'll look at the
code/functionality itself.

Right – so that is what happens when you are not fast enough. And I stopped being fast enough long ago, so I guess I’ll just leave it where Egon is leaving it. So, guys, go and fix that stuff and then we’ll look at it again 🙂
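For my own records, the coverage check Egon describes in point 3 boils down to something like the sketch below. It is written in Python for brevity and assumes a hypothetical naming convention in which a method `foo` counts as covered when a test named `testFoo` exists. The method names are made up for illustration, and this is not the actual CDK coverage tester.

```python
def untested_methods(class_methods, test_methods):
    """Return the methods that have no matching test, sorted by name.

    Assumed convention (hypothetical): method `foo` is covered
    when a test called `testFoo` exists.
    """
    tests = set(test_methods)
    missing = []
    for name in class_methods:
        expected = "test" + name[0].upper() + name[1:]
        if expected not in tests:
            missing.append(name)
    return sorted(missing)


# Made-up method and test names, for illustration only.
state_methods = ["isGoal", "isDead", "nextCandidate"]
state_tests = ["testIsGoal"]

for method in untested_methods(state_methods, state_tests):
    print("no test found for:", method)
```

A module-level suite would run a check like this over every class in the module and fail if the list is not empty.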

And now I go for a run.


Do a cheminformatics PhD thesis at a world-class institution

The European Bioinformatics Institute (EBI)

If the pompous title caught your attention, and you are ashamed of that: Don’t worry. It is all true. My cheminformatics and metabolism group at the European Bioinformatics Institute (EBI) is looking for a PhD student this year and all you need to do is apply through the regular route. The range of possible topics is wide open, going from metabolomics via automated structure elucidation of metabolites to mining chemical information from the printed literature, and more. Your own suggestions are of course welcome.

The EBI is the world’s largest open provider of biological and chemical information. We are located, together with the Sanger Institute for Genome Research, on the beautiful campus of Hinxton Hall, a few miles south of Cambridge.

One of the small lakes on the Wellcome Trust Campus in Hinxton

Our PhD students are enrolled with the University of Cambridge.

The important part for now: The application deadline for the Fall PhD selection is July 15, 2009. And: Please drop me a note if you applied.


Quarterly report for the EBI industry programme

I’m giving my quarterly report on progress in my area for the EBI industry programme. Here is what I will elaborate on:


New NMRShiftDB node at EBI

NMRShiftDB is a database of organic compounds and their nuclear magnetic resonance (NMR) data. At present, we hold 30,000 compounds and 1D NMR spectra for carbon, proton and some other nuclei.

NMRShiftDB developer Stefan Kuhn, in collaboration with the EBI systems group, has now established an NMRShiftDB node at the European Bioinformatics Institute (EBI), which brings us up to four working nodes in the NMRShiftDB network again.

The main URL, http://www.nmrshiftdb.org is now routed through the EBI.


ChEBI release 57, now with links to NMRShiftDB

Congratulations to the ChEBI team for publishing ChEBI version 57.

ChEBI Release 57 now contains links to NMRShiftDB. Search ChEBI for “caffeine”, for example, and you find the link to the carbon NMR spectrum of caffeine on the “automatic XREFs” page of ChEBI, in the “Small Molecules” section.

ChEBI now contains just under 17,963 manually annotated entries, of which 108 have been submitted via the ChEBI Submission tool (www.ebi.ac.uk/chebi/submissions). The next ChEBI Release will be on 24 June 2009.

See our entity of the month, Oseltamivir.

All data are also available on the public FTP site:
ftp://ftp.ebi.ac.uk/pub/databases/chebi/


ChEBI chemistry ontology development funded by BBSRC

We received our official award letter from the BBSRC Tools and Resources Fund today for the ChEBI ontology development grant. Needless to say, we are thrilled. We are now going to work together with Michael Ashburner’s group at the University of Cambridge to align ChEBI with other OBO Foundry ontologies by adopting the Basic Formal Ontology (BFO) and the Relationship Types Ontology (RO).

This will include the extensive annotation of the ChEBI ontology that is required after adoption of BFO and RO. Adopting the BFO will require a major reorganisation of the upper levels of the ChEBI ontology so that it can align to the BFO. This reorganisation can only be achieved by manual annotation, although some semi-automatic means will be employed to aid the curator. In addition to the reorganisation of the upper levels, new relationships will be introduced semi-automatically, but as the ChEBI ethos requires that all data is manually checked to maintain ChEBI’s high standards of data quality, we expect a major annotation task. The project is funded for three years. Stay tuned. We’ll report on our progress on a regular basis.


ChemSpider acquired by the Royal Society of Chemistry

It’s going to be all over the place soon anyway, so I’ll make it short: The Royal Society of Chemistry has announced that it has acquired ChemSpider. This is great news and I’m confident that it will be a move to even more openness in chemistry and cheminformatics. It will also allow the RSC to use Tony’s fantastic tools for even more semantic markup of articles. I’m looking forward to talking to everyone about the implications. For now, congratulations, Tony, and congratulations, RSC, on this great deal.


ChEBI behind the scenes

With ChEBI release 56 behind us, I thought I’d share some insight into how ChEBI is created and what we do to prepare a release. In recent years, the ChEBI team has on average consisted of two software engineers maintaining and improving the software and two to three curators doing the data entry and curation. It is remarkable that, by now, the question of which chemical compounds make it into ChEBI is completely community driven. Requests to enter compounds are submitted by users and other database maintainers via the ChEBI curator request tracker on SourceForge. Besides increasing the public knowledge of mankind, the biggest benefit and driving force for submitters is the assignment of a stable ChEBI identifier which can then be cited and linked to from other resources.

With ChEBI release 55 we introduced the new submission tool, which now allows our submitters to create ChEBI datasets themselves. This a) gives our users more control over what they want to see in ChEBI and b) saves our curators some duplicate work.

In preparation for a release, here is what the ChEBI team does.

  • Create automatic cross-references to PubChem, UniProt, IntEnz, BRENDA, SABIO-RK, ArrayExpress, IntAct, Patents etc. These are all run a week before the release and are based on ChEBI identifier matching or text matching.
  • Annotation of entity of the month
  • Submissions deposited directly into the database by users are processed by our annotators.

On the release day:

  • Data is exported overnight into multiple formats: OBO, SDF, Oracle data dumps and PostgreSQL/MySQL dumps.
  • Public web site updated with the entity of the month.
  • Statistics generated and stored.
  • Sitemaps are generated to be used by search engines like Google for indexing.
  • Finally, data is deposited into PubChem and the EB-eye search engine is updated.
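
To give an idea of what the sitemap step involves, here is a minimal sketch in Python. It follows the public sitemaps.org XML format. The `entry_url` helper and its URL pattern are placeholders I made up for illustration, and this is not the code the ChEBI team actually runs.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def entry_url(chebi_id):
    """Build the page URL for one entry (hypothetical URL pattern)."""
    return "https://example.org/chebi/entry?id=CHEBI:%d" % chebi_id


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> element per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Made-up identifiers, for illustration only.
print(build_sitemap([entry_url(i) for i in (1, 2, 3)]))
```

A search engine that reads the resulting file learns about every entry page without having to crawl its way there through links.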