SteinBlog

Egon’s introductory talk about getting started with CDK

After my opening talk at the CDK workshop, Egon Willighagen gave an introduction on how to get started with the CDK. He uses the scripting environment Groovy to demonstrate things.

Egon has prepared a LaTeX document with his teaching material, as well as the code examples, at http://pele.farmbio.uu.se/groovy. It turns out that Groovy scripting is a really nice environment for writing CDK code. You can say things like

import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.*;
import org.openscience.cdk.atomtype.*;
import org.openscience.cdk.config.*;
import org.openscience.cdk.tools.manipulator.*;
import javax.vecmath.Point3d;

// Build a molecule consisting of a single carbon atom
molecule = new Molecule();
atom = new Atom(Elements.CARBON);
molecule.addAtom(atom);

// Perceive the CDK atom type of that atom
matcher = CDKAtomTypeMatcher.getInstance(
    DefaultChemObjectBuilder.getInstance()
);
type = matcher.findMatchingAtomType(molecule, atom);

// Copy the atom type's properties onto the atom and print the type name
AtomTypeManipulator.configure(atom, type);
println "Atom type: $type.atomTypeName"

As you can see, you do not need to declare data types or handle exceptions explicitly. Groovy, like most other scripting environments, will take care of that for you.

If CDK.jar is in your CLASSPATH, you can run this code inside GroovyConsole and you’ll get “Atom type: C.sp3” as output.
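To make the point about exceptions and data types a bit more concrete, here is a rough sketch of what plain Java asks for and Groovy lets you skip. It is not CDK code: it uses java.net.URI from the standard library as a stand-in so that it runs without any extra jars, and the class name ExplicitJava and the method hostOf are made up for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ExplicitJava {

    // Java wants a declared type for the parameter, the return value and every local.
    static String hostOf(String address) {
        try {
            // The URI constructor declares a checked exception, so the
            // compiler insists on a try/catch (or a throws clause).
            URI uri = new URI(address);
            return uri.getHost();
        } catch (URISyntaxException e) {
            return "not a valid address: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://pele.farmbio.uu.se/groovy"));
    }
}
```

In a Groovy script the same lookup should shrink to something like println new URI("http://pele.farmbio.uu.se/groovy").host, with no declared types and no try/catch. The CDK calls in the script above benefit in the same way.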

Egon’s course material has many more examples.


CDK Workshop 2009 kick-off talk

I’m collecting some thoughts for my CDK workshop kick-off talk on Monday, and I guess I’ll go for the boring regular version: an introduction to CDK history, followed by some statistical figures, and ending with an explanation of the format for the developers workshop on Tuesday afternoon.

As anyone can read on our CDK homepage, the CDK was started because my old compchem Java classes needed a rewrite. I had written them for the first version of my structure elucidator SENECA, and they were also the basis for the first JChemPaint releases (they were actually not bad and pretty fast, but not really object-oriented). By that time, my friend Egon had come on board, so we took advantage of a joint visit to Jmol creator Dan Gezelter at the University of Notre Dame to start the redesign. Unfortunately, we lost the two screenshots from Dan Gezelter’s whiteboard at nd.edu where we sketched the object hierarchy, which is still largely in place today.

So, here is the current status of CDK in numbers:

  • In September 2009, CDK will turn 9 years old and we are already starting to plan the 10th anniversary workshop.
  • Its code base is more than 90,000 lines of code in more than 900 classes and over 9000 methods.
  • As of today, 20/04/2009, CDK on SourceForge has 67 registered developers.
  • 86 people are subscribed to the cdk-devel mailing list and 111 to cdk-users.
  • According to Ohloh, CDK is now worth $4.6M and took an effort of 84 person-years to create.
  • The CDK article published in 2003 has been cited 68 times according to Google Scholar (I could not get my VPN to work, so I do not know what Web of Science says about our current citation count).
  • According to SourceForge, CDK has been downloaded 90,000 times since 2001.

On the first workshop day, we’ll have tutorials on

followed by the workshop dinner and a second day with scientific talks by John van Drie, Asad Rahman and Oliver Karch.

The final discussion will include a presentation by Mark Forster, Syngenta, on his observations about the usability of CDK in industry. We also owe Mark a lot for creating the freely distributable Linux image with the CDK and all the CDK-related software used at the workshop.

On Tuesday afternoon we’ll have a developers workshop, which, traditionally, has the format of an unconference. The ideas for this format are taken from a concept called “Open Space Technology” (OST), which we only discovered after having practised it for more than 5 years :-). To be fair, OST conceptualizes things nicely and relieves one from figuring out how to run meetings as openly as possible. Citing from the Wikipedia articles linked above, the idea is to have a facilitated, participant-driven conference centered around a theme or purpose, in a self-organising process; participants construct the agenda and schedule during the meeting itself.

A facilitator (me in this case) introduces the principles of OST to the participants. Participants then write the title of a session they would be interested in on a piece of paper, walk to the front of the auditorium and announce the title. If a participant proposes a topic, he or she should be passionate enough about the topic to lead the respective session.

OST philosophy is based on four rules and a law:

  1. Whoever comes is the right people: this alerts the participants that attendees of a session class as “right” simply because they care to attend
  2. Whatever happens is the only thing that could have: this tells the attendees to pay attention to events of the moment, instead of worrying about what could possibly happen
  3. Whenever it starts is the right time: clarifies the lack of any given schedule or structure and emphasises creativity and innovation
  4. When it’s over, it’s over: encourages the participants not to waste time, but to move on to something else when the fruitful discussion ends

There also exists another tentative “law”, usually referred to as the Law of Two Feet (or “The Law of Mobility”), which reads as follows: If at any time during our time together you find yourself in any situation where you are neither learning nor contributing, use your two feet. Go to some other place where you may learn and contribute.

Having said all this, I hope that this largest CDK workshop ever will be a success and, most importantly, fun for everyone. We’ll keep you informed at the CDK 2009 Workshop wiki page.


3rd International Biocuration Conference in Berlin

Berlin Dahlem-Dorf tube station

I’m attending the 3rd International Biocuration Conference in Berlin, which looks like a pretty successful meeting in terms of attendance: it seems there are somewhere between 100 and 200 participants. It looks like the time has come for biocuration and curated biological resources to get recognition. The International Society for Biocuration was inaugurated yesterday. People from publishing companies such as Nature are attending.

Janet Thornton, director of EBI, gave the opening keynote yesterday evening, recapping some of the history of biocuration and looking ahead to securing funding for biocuration through the Elixir project.

I’m now listening to Philip Bourne talking about “Changes in Scholarly Communication and the Potential Impact on Biocuration”. Among many other things, he talks about authors embedding semantic information into the original manuscript and introduces part of his own work with Microsoft on a plug-in for Word to do this enrichment.

There is nothing overly particular about this meeting, but it strengthens my feeling that the time has finally come for the idea of preserving the information in the first place, in the scientific document. Both Dietrich’s semantic enrichment conference and this one were well attended by publishers; Elsevier and Nature were at both. This scientific document can then become both a scientific article and one or many database entries.

Another notion that has come up a couple of times is the question of reward for authors to make and submit semantically rich documents. One of the ideas is fast-tracking those documents – publishing them faster.


Cheminformatics/Metabolism PhD position at EBI

Image courtesy of emhuwar

The Cheminformatics and Metabolism group at the European Bioinformatics Institute (EBI) in Hinxton, Cambridge, UK, has an opening for a PhD position. The EBI is one of four outstations of the European Molecular Biology Laboratory (EMBL) and is a great place to do research in chemistry, cheminformatics and drug discovery. In all of these areas, the really exciting stuff is done at the boundaries of molecular biology, chemistry, nanoscience, etc., and EMBL is the place for such interdisciplinary research. The really cool thing and a well-hidden secret: the successful candidate will be enrolled at the University of Cambridge and eventually get a PhD from UCam. Back to the records:
The Steinbeck group does research in metabolism, natural products and cheminformatics algorithm development. The successful candidate is free to choose from a range of topics (see below) or suggest his or her own project.
Applications need to be submitted by 15 July 2009 through the EMBL PhD programme, and ideally the candidate should clearly indicate a preference to work with us here at the EBI.

Here are two suggestions for potential projects:

a) Structure elucidation of unknown metabolites based on mass spectrometry or proton NMR. This topic is of great importance for metabolomics and metabolism research in general, both of which are current hot topics in molecular biology. We have a long-standing history of research in this area (see our publication list) and the candidate will be able to build on existing knowledge.

b) Intelligent systems for extracting information from the chemical literature
A vast amount of knowledge is hidden in more than 100 years of chemical literature, knowledge which needs to be semantically annotated and made discoverable and interpretable by computational algorithms. Now, a scientific article in chemistry is a complex mixture of different information and data types. It contains plain text, analytical numerical data in various flavours, and graphics of all kinds. Methods have been developed to extract or rediscover information from each of these areas. The project here aims at combining information extracted from text, tables and graphics, and at using each of the areas to validate data from the others.


ChEBI at the Fall 2009 ACS meeting in Washington

I’ve been invited to present our ChEBI ontology at the 2009 Fall Meeting of the American Chemical Society. Here is our abstract:

ChEBI – An open ontology for Chemical Entities of Biological Interest

Paula de Matos (1), Kirill Degtyarenko (2), Marcus Ennis (1), Janna Hastings (1), Inma Spiteri (1) and Christoph Steinbeck (1)

(1) European Bioinformatics Institute, Hinxton, Cambridge, UK
(2) European Patent Office, The Hague, The Netherlands

Chemical Entities of Biological Interest (ChEBI) is a freely available, manually annotated resource providing data such as chemical nomenclature, an ontology and chemical structures. The ChEBI ontology imposes meaning onto the data according to four subontologies: molecular structure, application, biological role and subatomic particle. As a cheminformatics resource it provides chemical substructure and similarity searching using the Chemistry Development Kit (CDK). ChEBI annotates structures with various properties such as charge and mass and names including brand names and International Nonproprietary Name (INN). This extended coverage is complemented by manually annotated names appearing in Patents and Patent identifiers. In addition names can now appear in French, German, Latin and Spanish. Acting as a chemoinformatics portal to other bioinformatics resources, ChEBI has introduced automatically generated links to resources such as UniProtKB, IntAct, ArrayExpress, SABIO-RK or PubChem. ChEBI lives at http://www.ebi.ac.uk/chebi/, where it is also available for download in a variety of formats and accessible via webservices.


ACS Meeting Salt Lake City

Downtown Salt Lake City, Image courtesy of A4GPA

I’ve just arrived at the ACS meeting in Salt Lake City. The trip was a real nuisance, 19 hours or so, and I always ask myself why I do this stuff. Still, after a fantastic breakfast in my even more fantastic hotel, the Grand America Hotel (review pending), and now being at the meeting, it is again great to meet all the usual faces and see some great talks. For cheminformatics and chemical information, there is really no other meeting in the world with such broad attendance from the community.

My first impression of the meeting is that it is probably a little less well attended than the previous ones I’ve been to and that the number of exhibitors has also dropped significantly. This may be partly because of the problematic economic situation, but there have also been emails from companies expressing discontent with how the ACS has treated them in the past.

My program: I will

  • give two talks: one about open source for chemistry teaching in one of the CHED sessions on Monday morning, and another in CINF general papers on Thursday morning on machine learning methods for proton prediction,
  • participate in the CSA Trust Board meeting on Sunday,
  • participate in the InChI subcommittee meeting on Monday,
  • participate in the CIC-CINF working group meeting on Wednesday,
  • co-organize the Blue Obelisk Dinner on Wednesday evening,
  • go to each and every CINF reception and Harry’s party 🙂

It looks like there is not much time for anything else, but there are a number of interesting sessions and individual talks here. The meeting is emphasizing chemistry’s role in nanotechnology, which of course is a very timely and exciting topic.


Open Access Journal of Cheminformatics now live!

I’m delighted to announce that the first open access journal of our field, the Journal of Cheminformatics, is now live and has published its first articles.  Journal of Cheminformatics is a new open access journal from Chemistry Central publishing peer-reviewed research in all aspects of cheminformatics and molecular modelling.  It is run by Editors-in-Chief David J. Wild (Indiana University) and myself (European Bioinformatics Institute).
Amongst the launch articles are an Editorial by David J. Wild on Grand challenges for cheminformatics, a Commentary by Steven M. Bachrach on Chemistry publication – making the revolution and last but not least an article by Tony Williams and coauthors on Computer Assisted Structure Elucidation (CASE), one of my own fields of research.
You can view articles and submit your manuscripts at www.jcheminf.com.  Please share this information with your colleagues working in the field of chemical information who may be interested in this new journal.


“new open source era … for better drugs”

As we learn from a rather poorly written article over at xconomy, “Biology has never really had a social-networking movement like open-source computing, where thousands of loosely-affiliated people around the world pool brainpower to make better software”. If you translate that into what biology (or chemistry) would need according to the xconomy author, you get a “social-networking movement where thousands of loosely affiliated people around the world pool brainpower to make better biology”. Now, I leave it to you extremely bright guys out there to figure out why I think that already exists and what it is called.

At least the article informs us about lots of money being pumped into another collaborative effort to exploit systems biology to make better drugs, which might be a good thing. I’m having a hard time understanding the fundamental difference between this and existing open approaches, but I’m happy to learn. Part of this lack of understanding comes from a lack of meat in this article. Maybe one of my readers can comment.