New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Patch/fpfixup #297
Patch/fpfixup #297
Conversation
…l allow us to avoid holding all paths in memory in future.
…turn the inverse of the lexicographic lowest because it was being re-reveresed later.
…here this is going.
… only done on first atom previously.
…eudo atoms. Typically you would skip all paths with pseudo atoms in them anyways.
…s C-eF instead of C-Fe. This does change the fingerprint for those atoms but is now more correct.
Whoops I meant to squish the "Fixup" commits into previous ones... I'll do that first thing tomorrow. Please don't merge before that. |
put("Se", "E"); | ||
put("Na", "G"); | ||
put("Ca", "J"); | ||
put("Al", "A"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how the code dealt with Fe/eF reversing issues :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Happy to see this code updated.
Yes, so, about those the GraphOnly and Hybridization FPs... the first exists to find similar skeletons, but not sure if it is still used... the second has an important purpose... aromaticity is a pain... Mind you, I think the code was generally ignoring hydrogens... so, to make that work, you had to ignore whether something was actually aromatic, but focus on being delocalized... (just that you understand what people may ask about...)
return "A"; | ||
} | ||
return atom.getSymbol(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the comment here that Fe is not included... it would need extension to all two-char element symbols, but I'm sure you are changing it in a later patch anyway?
case TRIPLE: | ||
return "#"; | ||
default: | ||
return ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice example of maintenance... "switch" did not exist when Chris originally wrote this code now close to 20 years ago :)
BTW, is the switch seriously enough faster to notice?? Or is it just not doing the getOrder() so often?
(But I won't merge in until you tell me it's ready... PS, nice speed up!) |
Will comment here because the push will wipe the commits:
Yes I presumed that was the case.
Yep it gets called a lot, and always switch. It's is minimal for something this size but switches are implemented as jumps in assembly vs branches which are conditional jumps. For branching https://en.wikipedia.org/wiki/Branch_predictor helps a lot in most cases but in this case we're hoping between conditions single, double, a lot. See https://en.wikipedia.org/wiki/Branch_(computer_science)#Performance_problems_with_branch_instructions also, So conditional branches can cause "stalls" in which the pipeline has to be restarted on a different part of the program. The biggest speed up was actually from eliminating the path encoding as strings (i.e the StringBuilder). Typically I've always (re)used StringBuilder in such situations but it really was being bottlenecked by the String creation. You can go either faster if we remove the backwards compatibility.
And then if you remove the uniqueness requirement you can generate the hash in the traversal.. as we traverse the next hash is the previous hash plus the new visited atom. I really don't think the uniqueness is needed - without it you set twice as many bits, however Daylight actually used to encode multiple bits per pattern based on the path length. |
…lower without actually generating it or reversing atoms (we still need to reverse for the generation ATM).
…ction is now very minimal and easy to reverse.
…aromatic flags by default this example won't work. Temporary work around is not not wipe aromatic flags for pseudo atoms. Ultimately the fingerprint should be doing aromaticity perception (GraphOnly/HybridisationFingerprint is an option here inelegant).
…9 atoms) means you generate many paths for some caged structures, we have an exception thrown for this cases telling users to decrease the length.
Oh nice githup keeps the original commits now :-). All fixed up. |
OK, I'll wait for Travis to finish and if all is good, I'll merge it in... |
Going to have a quick look at subclasses as I think Hybridisation FP etc can be improved also (i.e. they use the old getHashes method). |
@johnmay, the Hybridization FP is identical to the path FP except that it does a s/aromaticity/sp2Hybrid/g... So, I recommend to just overwrite the code with your new path FP code, and then replace the checks for aromaticity with a check if the atom is sp2 hybridized... |
See #298 |
I've been meaning to do this for a while. Ultimately I wanted to rewrite the entire fingerprint stack it's all a bit backwards at the moment but I digress. The default path based (i.e. Fingerprint) has some really odd quirks in it. I did the patches so it should be easy to step through the thought process and easy to follow. It will go even faster once I add in adjacency list but there's already enough here for a decent patch.
You can see in the commits but essentially it used to generate the path, reverse it, test uniqueness, reverse it again, test uniqueness, add it to an int[], then add it to a BitSet (unique set by design). On top of all of that the reversing wasn't even correct, the uniqueness is over-rated but it was meant to mean you only encode one bit for a path (e.g. NC=O and O=CN). However since the reversing was done by string manipulation you get some fun situations with two letter atom symbols.
FeC
should reverse toCFe
(but they'reCeF
andeFC
reversed) Doh!I also made it so psuedo atoms are always encoded as '*' instead of max atomic number + 1 (this recently changed - :D). Oh and you don't normally want to include these in the fingerprint anyways so I added an option which skips all pseudo atoms.
After all these changes (and some very crafty cleverness) the fingerprint uses a lot less memory and is close to optimal without breaking backwards compatibility. Here are the times on my laptop for 10k molecules in ChEMBL 22, also shows there were no changes to the generated fingerprint.
Even faster