Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch/fpfixup #297

Merged
merged 19 commits into from Apr 9, 2017
Merged

Patch/fpfixup #297

merged 19 commits into from Apr 9, 2017

Conversation

johnmay
Copy link
Member

@johnmay johnmay commented Apr 8, 2017

I've been meaning to do this for a while. Ultimately I wanted to rewrite the entire fingerprint stack it's all a bit backwards at the moment but I digress. The default path based (i.e. Fingerprint) has some really odd quirks in it. I did the patches so it should be easy to step through the thought process and easy to follow. It will go even faster once I add in adjacency list but there's already enough here for a decent patch.

You can see in the commits but essentially it used to generate the path, reverse it, test uniqueness, reverse it again, test uniqueness, add it to an int[], then add it to a BitSet (unique set by design). On top of all of that the reversing wasn't even correct, the uniqueness is over-rated but it was meant to mean you only encode one bit for a path (e.g. NC=O and O=CN). However since the reversing was done by string manipulation you get some fun situations with two letter atom symbols. FeC should reverse to CFe (but they're CeF and eFC reversed) Doh!

I also made it so psuedo atoms are always encoded as '*' instead of max atomic number + 1 (this recently changed - :D). Oh and you don't normally want to include these in the fingerprint anyways so I added an option which skips all pseudo atoms.

After all these changes (and some very crafty cleverness) the fingerprint uses a lot less memory and is close to optimal without breaking backwards compatibility. Here are the times on my laptop for 10k molecules in ChEMBL 22, also shows there were no changes to the generated fingerprint.

[sovereign ~/workspace/github/cdk/cdk-paper-3/benchmark/cdk-2.0]: time head -n 10000 /data/chembl_22.smi | ./cdk fpgen --ifmt=smi --type=path -p > old.fps
Processing STDIN
[INFO] 10000 (261 per second)
Finished fpgen --ifmt=smi --type=path -p
 Elapsed Time: 38325
 Records       10000
 Processed:    10000
 Skipped:      0

real	0m38.980s
user	0m52.669s
sys	0m0.959s
[sovereign ~/workspace/github/cdk/cdk-paper-3/benchmark/cdk-2.0]: time head -n 10000 /data/chembl_22.smi | ./cdk fpgen --ifmt=smi --type=path -p > new.fps
Processing STDIN
[INFO] 10000 (2113 per second)
Finished fpgen --ifmt=smi --type=path -p
 Elapsed Time: 4734
 Records       10000
 Processed:    10000
 Skipped:      0

real	0m5.070s
user	0m13.885s
sys	0m0.375s
[sovereign ~/workspace/github/cdk/cdk-paper-3/benchmark/cdk-2.0]: diff old.fps new.fps 
5c5
< #date=2017-04-98T11:57:51
---
> #date=2017-04-98T11:59:32

Even faster

@johnmay
Copy link
Member Author

johnmay commented Apr 8, 2017

Whoops I meant to squish the "Fixup" commits into previous ones... I'll do that first thing tomorrow. Please don't merge before that.

put("Se", "E");
put("Na", "G");
put("Ca", "J");
put("Al", "A");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how the code dealt with Fe/eF reversing issues :)

Copy link
Member

@egonw egonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Happy to see this code updated.

Yes, so, about those the GraphOnly and Hybridization FPs... the first exists to find similar skeletons, but not sure if it is still used... the second has an important purpose... aromaticity is a pain... Mind you, I think the code was generally ignoring hydrogens... so, to make that work, you had to ignore whether something was actually aromatic, but focus on being delocalized... (just that you understand what people may ask about...)

return "A";
}
return atom.getSymbol();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the comment here that Fe is not included... it would need extension to all two-char element symbols, but I'm sure you are changing it in a later patch anyway?

case TRIPLE:
return "#";
default:
return "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice example of maintenance... "switch" did not exist when Chris originally wrote this code now close to 20 years ago :)

BTW, is the switch seriously enough faster to notice?? Or is it just not doing the getOrder() so often?

@egonw
Copy link
Member

egonw commented Apr 9, 2017

(But I won't merge in until you tell me it's ready... PS, nice speed up!)

@johnmay
Copy link
Member Author

johnmay commented Apr 9, 2017

Will comment here because the push will wipe the commits:

This is how the code dealt with Fe/eF reversing issues :)

Yes I presumed that was the case.

BTW, is the switch seriously enough faster to notice?? Or is it just not doing the getOrder() so often?

Yep it gets called a lot, and always switch. It's is minimal for something this size but switches are implemented as jumps in assembly vs branches which are conditional jumps. For branching https://en.wikipedia.org/wiki/Branch_predictor helps a lot in most cases but in this case we're hoping between conditions single, double, a lot. See https://en.wikipedia.org/wiki/Branch_(computer_science)#Performance_problems_with_branch_instructions also, So conditional branches can cause "stalls" in which the pipeline has to be restarted on a different part of the program.

The biggest speed up was actually from eliminating the path encoding as strings (i.e the StringBuilder). Typically I've always (re)used StringBuilder in such situations but it really was being bottlenecked by the String creation.

You can go either faster if we remove the backwards compatibility.

int hashFor = hashPath(..);
int hashRev = hashPath(..);
return Math.min(hashFor, hashRev);

And then if you remove the uniqueness requirement you can generate the hash in the traversal.. as we traverse the next hash is the previous hash plus the new visited atom. I really don't think the uniqueness is needed - without it you set twice as many bits, however Daylight actually used to encode multiple bits per pattern based on the path length.

…lower without actually generating it or reversing atoms (we still need to reverse for the generation ATM).
…ction is now very minimal and easy to reverse.
…aromatic flags by default this example won't work. Temporary work around is not not wipe aromatic flags for pseudo atoms. Ultimately the fingerprint should be doing aromaticity perception (GraphOnly/HybridisationFingerprint is an option here inelegant).
…9 atoms) means you generate many paths for some caged structures, we have an exception thrown for this cases telling users to decrease the length.
@johnmay
Copy link
Member Author

johnmay commented Apr 9, 2017

Oh nice githup keeps the original commits now :-). All fixed up.

@egonw
Copy link
Member

egonw commented Apr 9, 2017

OK, I'll wait for Travis to finish and if all is good, I'll merge it in...

@egonw egonw merged commit 1bd50d6 into master Apr 9, 2017
@johnmay
Copy link
Member Author

johnmay commented Apr 9, 2017

Going to have a quick look at subclasses as I think Hybridisation FP etc can be improved also (i.e. they use the old getHashes method).

@egonw
Copy link
Member

egonw commented Apr 9, 2017

@johnmay, the Hybridization FP is identical to the path FP except that it does a s/aromaticity/sp2Hybrid/g... So, I recommend to just overwrite the code with your new path FP code, and then replace the checks for aromaticity with a check if the atom is sp2 hybridized...

@johnmay
Copy link
Member Author

johnmay commented Apr 9, 2017

See #298

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants