New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Patch/maccs166opt #291
Patch/maccs166opt #291
Conversation
… one molecule. In future this should be handed off to the user to manage, this will be easier when AtomRef is introduced.
… the grammar was not successful.
…rings. In the case of R0 we can also check connectivity. [!R0] = [R]
…out changing the meaning.
… and bi18 to also match the aromatic atoms.
…lly only mean benzene/pyridine - using the correct arom model this change has no change of results.
…ut was actually matching >= 8. Removing this and only checking for 8M rings affects a large number of molecules.
…e patterns should usually be matched, e.g. bit118: ACH2CH2A is not equivalent to *1~[#6H2]~[#6H2]~1. This is confirmed by other MACCS implementations (T.J. O'Donnell, Design and use of relational databases in chemistry). There are a lot of changes in MACCS FP generate because recursive SMARTS were incorrectly used: *~[#6H2]~[#6H2]~* matches the SMILES OCCO once uniquely, whilst [$(*~[#6H2]~[#6H2]~*)] matches it twice.
One for @rajarshi, I think he did the original port... |
There may be more improvements coming in future to improve the benchmark score slight but I'll wait till after the paper is done to do that. |
Very thorough analysis - I went over most of the modified SMARTS and they look OK to me. I think this is good to go |
@rajarshi, you don't have rights to push the "Merge pull request" button? |
Oh, excellent! Good to see you have proper merge rights :) |
Thanks for looking over it. |
When running the benchmarks for the paper there was a significant improvement (
~10x
) in the MACCS fingerprint speed from 1.4.19. Whilst good a quick look showed lots of optimizations that could be made. Some of these were non-destructive e.g*~*~*~[S]
starts at all possible atoms and trys to find a path of length 3 to a Sulphur. Whilst[S]~*~*~*
finds a sulphur and then a path of length 3. Some SMARTS compilers do optimisations like this most don't. This patch optimizes the patterns defined in maccs.txt in two steps, non-changed/changed. The non-changed are safe, the changed are where I think the implementation is wrong/redundant.As Andrew Dalke points out there is not exact agreement between implementations.
Changes in the second type fall under whether we match and 8 member (M) ring or an 8M+ ring. The Accelrys doc clearly says, 5M, 6M, 7M, 8M so we match 8M as exactly 8 now. The second type: when the current patterns were written for RDKit they used recursive SMARTS to match alternatives for some bits. This a) slows down the matching, b) adds cases I don't think the MDL impl matched and c) has and odd quirk with counting.
With regards to b), the MACCS key 118 and 147 is ACH2CH2A, the implementation has this as both
*~[CH2]~[CH2]~*
and*1~[CH2]~[CH2]~1
. The second one 3 member ring is clearly very rare. If you did want to match you could do: [CH2D2]~[CHD2].With regards to c) and the above keys
*~[CH2]~[CH2]~*
matches once CHEMBL411976 but[$(*~[CH2]~[CH2]~*)]
matches twice. Consider the simpler caseOCCO
to see why.Once
Twice
Now of course what I think really matter is if the fingerprint performs the same, here are the results on the popular Briem Lessel benchmark:
CDK 1.4.19
CDK 1.5.x
CDK 2.0 (this patch)
Some slight change between classes but the same total score.
Bib
T.J. O'Donnell, DESIGN and USE of RELATIONAL DATABASES in CHEMISTRY (Appendix)
http://accelrys.com/products/pdf/keys-to-keyset-technology.pdf
http://www.dalkescientific.com/writings/diary/archive/2014/10/17/maccs_key_44.html