Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch/maccs166opt #291

Merged
merged 14 commits into from Apr 1, 2017
Merged

Patch/maccs166opt #291

merged 14 commits into from Apr 1, 2017

Conversation

johnmay
Copy link
Member

@johnmay johnmay commented Mar 30, 2017

When running the benchmarks for the paper there was a significant improvement (~10x) in the MACCS fingerprint speed from 1.4.19. Whilst good a quick look showed lots of optimizations that could be made. Some of these were non-destructive e.g *~*~*~[S] starts at all possible atoms and trys to find a path of length 3 to a Sulphur. Whilst [S]~*~*~* finds a sulphur and then a path of length 3. Some SMARTS compilers do optimisations like this most don't. This patch optimizes the patterns defined in maccs.txt in two steps, non-changed/changed. The non-changed are safe, the changed are where I think the implementation is wrong/redundant.

As Andrew Dalke points out there is not exact agreement between implementations.
Changes in the second type fall under whether we match and 8 member (M) ring or an 8M+ ring. The Accelrys doc clearly says, 5M, 6M, 7M, 8M so we match 8M as exactly 8 now. The second type: when the current patterns were written for RDKit they used recursive SMARTS to match alternatives for some bits. This a) slows down the matching, b) adds cases I don't think the MDL impl matched and c) has and odd quirk with counting.

With regards to b), the MACCS key 118 and 147 is ACH2CH2A, the implementation has this as both *~[CH2]~[CH2]~* and *1~[CH2]~[CH2]~1. The second one 3 member ring is clearly very rare. If you did want to match you could do: [CH2D2]~[CHD2].

With regards to c) and the above keys *~[CH2]~[CH2]~* matches once CHEMBL411976 but [$(*~[CH2]~[CH2]~*)] matches twice. Consider the simpler case OCCO to see why.

Once
Twice

Now of course what I think really matter is if the fingerprint performs the same, here are the results on the popular Briem Lessel benchmark:

CDK 1.4.19

[duster/benchmark] k=10, measure=Tanimoto, 
[duster/benchmark] Loaded fingerprints in 0s5ms
[duster/benchmark] 5 classes:  ACE (39) TXA2 (48) HMG (109) PAF (132) 5HT3 (47)
[duster/benchmark] Scored fingerprints in 0s45ms
ACE: 68%
TXA2: 61%
HMG: 79%
PAF: 62%
5HT3: 48%
[duster/benchmark] maccs166: 65.7% (63.6% avg-of-avg)

CDK 1.5.x

[duster/benchmark] k=10, measure=Tanimoto, 
[duster/benchmark] Loaded fingerprints in 0s4ms
[duster/benchmark] 5 classes:  ACE (39) TXA2 (48) HMG (109) PAF (132) 5HT3 (47)
[duster/benchmark] Scored fingerprints in 0s38ms
ACE: 69%
TXA2: 61%
HMG: 80%
PAF: 63%
5HT3: 50%
[duster/benchmark] cdk/macc: 66.5% (64.6% avg-of-avg)

CDK 2.0 (this patch)

[duster/benchmark] k=10, measure=Tanimoto, 
[duster/benchmark] Loaded fingerprints in 0s4ms
[duster/benchmark] 5 classes:  ACE (39) TXA2 (48) HMG (109) PAF (132) 5HT3 (47)
[duster/benchmark] Scored fingerprints in 0s55ms
ACE: 67%
TXA2: 60%
HMG: 80%
PAF: 63%
5HT3: 52%
[duster/benchmark] maccs166: 66.5% (64.4% avg-of-avg)

Some slight change between classes but the same total score.

Bib

T.J. O'Donnell, DESIGN and USE of RELATIONAL DATABASES in CHEMISTRY (Appendix)
http://accelrys.com/products/pdf/keys-to-keyset-technology.pdf
http://www.dalkescientific.com/writings/diary/archive/2014/10/17/maccs_key_44.html

… one molecule. In future this should be handed off to the user to manage, this will be easier when AtomRef is introduced.
…rings. In the case of R0 we can also check connectivity. [!R0] = [R]
…lly only mean benzene/pyridine - using the correct arom model this change has no change of results.
…ut was actually matching >= 8. Removing this and only checking for 8M rings affects a large number of molecules.
…e patterns should usually be matched, e.g. bit118: ACH2CH2A is not equivalent to *1~[#6H2]~[#6H2]~1. This is confirmed by other MACCS implementations (T.J. O'Donnell, Design and use of relational databases in chemistry). There are a lot of changes in MACCS FP generate because recursive SMARTS were incorrectly used: *~[#6H2]~[#6H2]~* matches the SMILES OCCO once uniquely, whilst [$(*~[#6H2]~[#6H2]~*)] matches it twice.
@johnmay
Copy link
Member Author

johnmay commented Mar 30, 2017

One for @rajarshi, I think he did the original port...

@johnmay
Copy link
Member Author

johnmay commented Mar 30, 2017

There may be more improvements coming in future to improve the benchmark score slight but I'll wait till after the paper is done to do that.

@egonw egonw requested a review from rajarshi March 31, 2017 06:43
@rajarshi
Copy link
Member

Very thorough analysis - I went over most of the modified SMARTS and they look OK to me. I think this is good to go

@egonw
Copy link
Member

egonw commented Apr 1, 2017

@rajarshi, you don't have rights to push the "Merge pull request" button?

@rajarshi rajarshi merged commit a9bc279 into master Apr 1, 2017
@egonw
Copy link
Member

egonw commented Apr 1, 2017

Oh, excellent! Good to see you have proper merge rights :)

@johnmay
Copy link
Member Author

johnmay commented Apr 1, 2017

Thanks for looking over it.

@johnmay johnmay deleted the patch/maccs166opt branch April 2, 2017 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants