Faster Fingerprints for the CDK

October 7, 2008

Mark Rijnbeek, who moved to my team last month to work on the chemistry search engine for our new chemogenomics data, has put Rajarshi's new fingerprint implementation to the test. Mark was bored to hell by the performance of the version he had in hand, which turned out to be my old one. It had served us well for quite a while but is unusable for the amount of data we are testing now.

So he downloaded CDK 1.0.4, released just a few days ago, and gave it a shot.

Mark wrote:

“Here’s what happens: fetch 1000 molfile clobs from Oracle, put them in a list, create a list of Molecule objects from that, and lastly calculate fingerprints on that last list.
Below is Java system output, each CDK version tested against 1000 compounds, twice.
The numbers are milliseconds [passed] since program start.
The performance increase is very significant; the older CDK fingerprinter took about a minute (see below) for 1000 fingerprints, the new one about 7 seconds.”
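A stage-by-stage timing harness of the kind Mark describes could look roughly like the sketch below. It is not Mark's code: the class name `StageTimer` is made up, the database steps are replaced by in-memory strings, and `toyFingerprint` is a hypothetical placeholder that hashes character windows into a `BitSet` instead of calling a CDK fingerprinter. It prints the milliseconds spent in each stage.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class StageTimer {
    private long last = System.nanoTime();
    private final List<String> log = new ArrayList<>();

    /** Prints and returns the milliseconds spent since the previous mark. */
    public long mark(String label) {
        long now = System.nanoTime();
        long ms = (now - last) / 1_000_000;
        last = now;
        String line = ms + " - " + label;
        log.add(line);
        System.out.println(line);
        return ms;
    }

    public List<String> log() {
        return log;
    }

    /** Hypothetical stand-in for a real fingerprinter: hashes 3-character windows into 1024 bits. */
    static BitSet toyFingerprint(String molfile) {
        BitSet bits = new BitSet(1024);
        for (int i = 0; i + 3 <= molfile.length(); i++) {
            bits.set(Math.floorMod(molfile.substring(i, i + 3).hashCode(), 1024));
        }
        return bits;
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        timer.mark("Start benchmark 1000 compounds.");

        // Assumption: stands in for fetching molfile CLOBs from the database.
        List<String> molfiles = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            molfiles.add("fake molfile text for compound " + i);
        }
        timer.mark("Molfile strings stored in list");

        List<BitSet> fingerprints = new ArrayList<>();
        for (String molfile : molfiles) {
            fingerprints.add(toyFingerprint(molfile));
        }
        timer.mark("Fingerprints calculated");
    }
}
```

Because every `mark` call resets the clock, each printed number is the duration of one stage, not a running total.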

The numbers for the “old code”:

0 - Start benchmark 1000 compounds.
84 - Fingerprinter set up
531 - Connected to database
120 - Resultset opened
1706 - Molfile strings retrieved from database, stored in list
3231 - Molecule objects list built
64202 - Fingerprints calculated

And then CDK 1.0.4:

0 - Start benchmark 1000 compounds.
77 - Fingerprinter set up
536 - Connected to database
118 - Resultset opened
909 - Molfile strings retrieved from database, stored in list
2360 - Molecule objects list built
9900 - Fingerprints calculated

These numbers are just one representative instance from multiple runs performed by Mark. They do not quite match the numbers reported by Rajarshi, but the conditions were too different to be comparable. In our case, the achieved speed-up is 8-fold, which is a nice success and even better than Rajarshi's reported 4-fold speed-up.

We plan to report soon on benchmarks with a much larger dataset.

Thanks, Rajarshi. Great stuff!

Categorised as: Open Science

Egon Willighagen says:

October 8, 2008 at 10:22 am

Mark, could you please also compare using Molecule and NNMolecule? I would expect the difference between the old code and CDK 1.0.4 to be smaller when using NNMolecule. The latter does not use IChemObjectListeners, which you do not need anyway.

Even if the ratio old/new stays the same, I guess the 7 seconds still goes down to 5 or 6 using NNMolecule.

October 8, 2008 at 10:46 am

Oh, and rather interesting, of course… I’d welcome some statistics on the filtering success of the fingerprints on the database… Maybe compare the path-based-only fingerprint with Stefan’s extended fingerprint… and, rather interesting, with Rajarshi’s cdk.fingerprint.MACCSFingerprinter…
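For readers less familiar with the term: "filtering success" presumably refers to using fingerprints as a prefilter for substructure search, where a target molecule can only contain the query if every bit set in the query fingerprint is also set in the target fingerprint. A minimal sketch with `java.util.BitSet` follows; the class name `FingerprintScreen`, its methods, and the bit patterns are all made up for illustration and are not CDK fingerprints or CDK API.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class FingerprintScreen {
    /** True if every bit set in the query is also set in the target. */
    static boolean mayContain(BitSet target, BitSet query) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target); // bits the query has but the target lacks
        return missing.isEmpty();
    }

    /** Fraction of database fingerprints that pass the screen (lower means better filtering). */
    static double passRate(List<BitSet> database, BitSet query) {
        if (database.isEmpty()) {
            return 0.0;
        }
        int passed = 0;
        for (BitSet target : database) {
            if (mayContain(target, query)) {
                passed++;
            }
        }
        return (double) passed / database.size();
    }

    /** Helper to build a BitSet from bit positions. */
    static BitSet bits(int... positions) {
        BitSet b = new BitSet();
        for (int p : positions) {
            b.set(p);
        }
        return b;
    }

    public static void main(String[] args) {
        // Made-up fingerprints, for illustration only.
        List<BitSet> database = Arrays.asList(bits(1, 2, 3), bits(1, 2), bits(3));
        BitSet query = bits(1, 3);
        System.out.println("pass rate: " + passRate(database, query));
    }
}
```

Comparing fingerprinters on this basis would mean running the same queries against the same database and seeing which fingerprint lets fewer false candidates through to the expensive substructure match.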

Mark Rijnbeek says:

October 9, 2008 at 1:28 pm

First of all, sorry about my mistake regarding the timing information: the times in ms are absolute, not relative to the start time. So the calculation of fingerprints took 64s for the old code and 9.9s for the new. That's about 6 times faster.
I re-ran the class today and got the same factor out, although timings were different (overall slower) than above.
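The two speed-up figures can be checked from the numbers in the original post. The sketch below (class and method names are made up) computes the ratio both ways: taking each number as the duration of its own stage, and taking the numbers as running totals, which appears to be how the "about a minute" versus "about 7 seconds" estimate was reached.

```java
import java.util.Locale;

public class SpeedUp {
    /** Ratio of old to new duration; values above 1 mean the new code is faster. */
    static double ratio(double oldMs, double newMs) {
        return oldMs / newMs;
    }

    public static void main(String[] args) {
        // Fingerprint stage taken as a duration on its own.
        double perStage = ratio(64202, 9900);
        // Fingerprint stage taken as a running total minus the previous line.
        double cumulative = ratio(64202 - 3231, 9900 - 2360);

        System.out.println(String.format(Locale.ROOT, "per-stage reading:  %.1fx", perStage));
        System.out.println(String.format(Locale.ROOT, "cumulative reading: %.1fx", cumulative));
    }
}
```

The per-stage reading gives roughly 6.5, the cumulative reading roughly 8.1, which accounts for the gap between the 8-fold figure in the post and the factor of about 6 here.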

There doesn’t really seem to be a performance difference between NNMolecule and Molecule. On my computer it takes around 1.3-1.5 seconds to build a list of 1000 for both classes.

The benchmark was already done using the extended fingerprinter. I now also measured the MACCSFingerprinter, but only had that class available in the old code; the cdk104 jar file does not seem to contain it.

It took 187 seconds to calculate 1000 fingerprints, so pretty slow compared to the extended fingerprinter in the original post:
0 – Start benchmark 1000 compounds.
111 – Fingerprinter set up
341 – Connected to database
124 – Resultset opened
817 – Molfile strings retrieved from database, stored in list
1597 – Molecule objects list built
187210 – Fingerprints calculated

October 9, 2008 at 1:31 pm

To be clear, that 187s was for the MACCSFingerprinter

SteinBlog
