Currently there are a few limitations, which we'll get out of the way first:
- Only the CDK ECFP4 fingerprint is supported, folded to a length of 1024 bits. This should give a close approximation to what Matt used in MongoDB (RDKit Morgan FP). Other fingerprints and foldings could be used, but generating path-based fingerprints in the CDK is currently (painfully) slow.
- The index is built in memory. Since 1,000,000 × 1024-bit fingerprints take only 122 MiB, you can easily build indexes of fewer than 10 million fingerprints on modern hardware.
- During searching the entire index is memory mapped. Setting the chunks system property (see the GitHub README) avoids this at a slight performance cost.
- Results return the ID in the index (an indirection); to get the original ID one needs to resolve it with another file (generated by mkidx).
- Updating the index is not supported; it has to be rebuilt.
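The memory figure in the list above is simple arithmetic, and a small sketch makes it easy to check for other dataset sizes. The function name `index_size_mib` is hypothetical, and the calculation assumes the fingerprints are stored as packed bits with no per-entry overhead, which may not match the real index layout exactly.

```python
def index_size_mib(n_fingerprints: int, n_bits: int = 1024) -> float:
    """Raw size in MiB of n_fingerprints packed bit vectors of n_bits each.

    Assumption: bits are packed 8 per byte with no header or padding.
    """
    total_bytes = n_fingerprints * n_bits // 8
    return total_bytes / 2**20


print(f"{index_size_mib(1_000_000):.1f} MiB")   # prints "122.1 MiB"
print(f"{index_size_mib(10_000_000):.1f} MiB")  # prints "1220.7 MiB"
```

So even 10 million fingerprints need only about 1.2 GiB for the raw bits.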
$ ./smi2fps /data/chembl_19.smi chembl_19.fps # ~5 mins
$ ./mkidx chembl_19.fps chembl_19.idx # seconds
fpsscan does a linear search, computing every Tanimoto coefficient and outputting the lines that are above a certain threshold. The simmer and toper utils use the index: simmer filters by a similarity threshold and toper returns the top k results. They can take multiple SMILES via the command line or from a file.
$ ./fpsscan /data/chembl_19.fps 'c1cc(c(cc1CCN)O)O' 0.7 # ~ 1 second
$ ./simmer chembl_19.idx 0.7 'c1cc(c(cc1CCN)O)O' # < 1 second
$ ./toper chembl_19.idx 50 'c1cc(c(cc1CCN)O)O' # < 1 second (top 50)
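To make the two kinds of query concrete, here is a minimal Python sketch of a threshold scan and a top k search over fingerprints held as plain integers. It is not the CDK code behind the tools above: the function names and the toy 4-bit fingerprints are made up for illustration, and it scores every fingerprint, whereas the index-based tools are meant to avoid exactly that.

```python
import heapq


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union


def scan(fps, query, threshold):
    """Linear scan: yield (id, score) for each fingerprint at or above threshold."""
    for idx, fp in enumerate(fps):
        score = tanimoto(query, fp)
        if score >= threshold:
            yield idx, score


def top_k(fps, query, k):
    """Return the k highest-scoring (score, id) pairs, best first."""
    scored = ((tanimoto(query, fp), idx) for idx, fp in enumerate(fps))
    return heapq.nlargest(k, scored)


# Toy 4-bit fingerprints (hypothetical data, not from ChEMBL).
fps = [0b1111, 0b1100, 0b0011, 0b1110]
query = 0b1110

print(list(scan(fps, query, 0.7)))  # prints [(0, 0.75), (3, 1.0)]
print(top_k(fps, query, 2))         # prints [(1.0, 3), (0.75, 0)]
```

The ids returned are positions in the list, which mirrors the indirection mentioned earlier: mapping them back to the original identifiers needs a separate lookup.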
Using the same queries from the MongoDB search I get the following distribution of search times for different thresholds.
| Threshold | Median time (ms) |
It's interesting to see that the times seem to flatten out. By plotting how many fingerprints the search had to check we observe that below a certain threshold we are essentially checking the entire dataset.
This is potentially due to the sparseness of circular fingerprints. Examining the result file (see the GitHub README), we can estimate that on average we're calculating 23,556,103 Tanimoto coefficients per second. This also means that retrieving the top k results isn't bad either. For example, k = 10,000 gives a median time (Code 3) of 72 ms.
$ ./toper chembl_19.idx 10000 queries.smi
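One way to see why the number of fingerprints checked saturates at low thresholds is the standard bit-count bound on the Tanimoto coefficient. The sketch below assumes the index prunes candidates by their number of set bits; that is an assumption on my part, not something stated above, and the names `popcount_bounds` and `fraction_checked` and the toy bit counts are hypothetical. For a query with q bits set and threshold t, a fingerprint with c bits set can only reach t if q·t ≤ c ≤ q/t, because the coefficient can never exceed min(q, c) / max(q, c).

```python
import math


def popcount_bounds(query_bits: int, threshold: float, n_bits: int = 1024):
    """Range of candidate bit counts that can still reach the threshold."""
    if threshold <= 0:
        return 0, n_bits  # nothing can be pruned
    lo = math.ceil(query_bits * threshold)
    hi = min(n_bits, math.floor(query_bits / threshold))
    return lo, hi


def fraction_checked(popcounts, query_bits, threshold):
    """Fraction of a dataset whose bit counts fall inside the bounds."""
    lo, hi = popcount_bounds(query_bits, threshold)
    return sum(lo <= c <= hi for c in popcounts) / len(popcounts)


# Hypothetical dataset: sparse fingerprints with 20 to 80 bits set.
popcounts = list(range(20, 81))

for t in (0.9, 0.7, 0.5, 0.25):
    print(t, popcount_bounds(40, t), fraction_checked(popcounts, 40, t))
```

With sparse fingerprints the bit counts sit in a narrow band, so once the threshold is low enough the allowed range covers the whole band and every fingerprint has to be scored.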
Next I'll look at some like-for-like comparisons.