Faster Fingerprints for the CDK
Mark Rijnbeek, who moved to my team last month to work on the chemistry search engine for our new chemogenomics data, has given Rajarshi’s new fingerprint implementation a test. Mark was bored to hell by the performance of the version he had in hand, which turned out to be my old one. It had served us well for quite a while but is unusable for the amount of data we are testing now.
So he downloaded CDK 1.0.4, released just a few days ago, and gave it a shot.
Mark wrote:
“Here’s what happens: fetch 1000 molfile clobs from Oracle, put them in a list, create a list of Molecule objects from that, and lastly calculate fingerprints on that last list.
Below is the Java system output; each CDK version was tested against 1000 compounds, twice.
The numbers are milliseconds [passed] since program start.
The performance increase is very significant; the older CDK fingerprinter took about a minute (see below) for 1000 fingerprints, the new one about 7 seconds.”
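The steps Mark describes (fetch, build a list, fingerprint) can be timed with a small harness along the following lines. This is only a sketch, not Mark's benchmark code: the class name, `timed`, `fetchMolfiles` and `fingerprintAll` are hypothetical, and the database read and the fingerprint calculation are replaced by trivial stand-ins so that the example runs without Oracle or the CDK on the classpath. It prints the duration of each step on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class FingerprintBenchmark {

    /** Runs one step, prints its wall-clock duration in ms, and returns the step's result. */
    static <T> T timed(String label, Supplier<T> step) {
        long start = System.nanoTime();
        T result = step.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(elapsedMs + " - " + label);
        return result;
    }

    /** Hypothetical stand-in for reading molfile CLOBs from the database. */
    static List<String> fetchMolfiles(int count) {
        List<String> molfiles = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            molfiles.add("molfile-" + i);
        }
        return molfiles;
    }

    /** Hypothetical stand-in for fingerprinting: hashes each string instead of a molecule. */
    static List<Integer> fingerprintAll(List<String> molecules) {
        List<Integer> fingerprints = new ArrayList<>();
        for (String molecule : molecules) {
            fingerprints.add(molecule.hashCode());
        }
        return fingerprints;
    }

    public static void main(String[] args) {
        System.out.println("0 - Start benchmark 1000 compounds.");
        List<String> molfiles =
            timed("Molfile strings retrieved, stored in list", () -> fetchMolfiles(1000));
        List<Integer> fingerprints =
            timed("Fingerprints calculated", () -> fingerprintAll(molfiles));
        System.out.println(fingerprints.size() + " fingerprints");
    }
}
```

In a real run, the two stand-in methods would be swapped for the actual database read and the fingerprinter under test; the timing wrapper stays the same.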
The numbers for the “old code”:
0 - Start benchmark 1000 compounds.
84 - Fingerprinter set up
531 - Connected to database
120 - Resultset opened
1706 - Molfile strings retrieved from database, stored in list
3231 - Molecule objects list built
64202 - Fingerprints calculated
And then CDK 1.0.4:
0 - Start benchmark 1000 compounds.
77 - Fingerprinter set up
536 - Connected to database
118 - Resultset opened
909 - Molfile strings retrieved from database, stored in list
2360 - Molecule objects list built
9900 - Fingerprints calculated
These numbers are just one representative instance from multiple runs performed by Mark. They do not quite fit the numbers reported by Rajarshi, but the conditions were too different to be comparable. In our case, the achieved speed-up is 8-fold, which is a nice success and even better than Rajarshi’s reported 4-fold speed-up.
We plan to report soon on benchmarks with a much larger dataset.
Thanks, Rajarshi. Great stuff!
Categorised as: Open Science
Mark, could you please also compare using Molecule and NNMolecule? I would expect the difference between the old code and CDK 1.0.4 to be smaller when using NNMolecule. The latter does not use IChemObjectListeners, which you do not need anyway.
Even if the ratio old/new stays the same, I guess the 7 seconds would still go down to 5 or 6 using NNMolecule.
Oh, and rather interesting, of course… I’d welcome some statistics on the filtering success of the fingerprints on the database… Maybe compare the path-based-only fingerprint with Stefan’s extended fingerprint… and, rather interesting, with Rajarshi’s cdk.fingerprint.MACCSFingerprinter…
First of all, sorry about my mistake regarding the timing information: each time in ms is the duration of that step on its own, not the time elapsed since program start. So the calculation of fingerprints took 64 s for the old code and 9.9 s for the new. That’s about 6.5 times faster.
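The two factors quoted so far follow from the same log lines read in two different ways, which the short sketch below works through. The class and method names are illustrative only; the four numbers are taken from the logs in the original post.

```java
import java.util.Locale;

class SpeedupCheck {

    /** Ratio of the old duration to the new one. */
    static double speedup(double oldMs, double newMs) {
        return oldMs / newMs;
    }

    public static void main(String[] args) {
        // Reading 1: each log value is the duration of its own step.
        double perStep = speedup(64202, 9900);

        // Reading 2: each log value is the time since program start, so the
        // fingerprinting step is the last value minus the one before it.
        double sinceStart = speedup(64202 - 3231, 9900 - 2360);

        System.out.println(String.format(Locale.ROOT,
            "per-step reading: %.1f-fold, since-start reading: %.1f-fold",
            perStep, sinceStart));
    }
}
```

The per-step reading gives roughly 6.5, the since-start reading roughly 8.1, which is where the 8-fold figure in the post came from.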
I re-ran the class today and got the same factor out, although timings were different (overall slower) than above.
There doesn’t really seem to be a performance difference between NNMolecule and Molecule. On my computer it takes around 1.3-1.5 seconds to build a list of 1000 for both classes.
The benchmark was already done using the extended fingerprinter. I have now also measured the MACCSFingerprinter, but only had that class available in the old code; the cdk104 jar file does not seem to contain it.
It took 187 seconds to calculate 1000 fingerprints, so pretty slow compared to the extended fingerprinter in the original post.
0 – Start benchmark 1000 compounds.
111 – Fingerprinter set up
341 – Connected to database
124 – Resultset opened
817 – Molfile strings retrieved from database, stored in list
1597 – Molecule objects list built
187210 – Fingerprints calculated
To be clear, that 187 s was for the MACCSFingerprinter.