Patch/fpfixup #297

johnmay · 2017-04-08T23:31:35Z

I've been meaning to do this for a while. Ultimately I want to rewrite the entire fingerprint stack, since it's all a bit backwards at the moment, but I digress. The default path-based fingerprint (i.e. Fingerprinter) has some really odd quirks in it. I split the work into small patches so the thought process should be easy to step through and follow. It will go even faster once I add in an adjacency list, but there's already enough here for a decent patch.

You can see the details in the commits, but essentially it used to generate the path, reverse it, test uniqueness, reverse it again, test uniqueness again, add it to an int[], and then add it to a BitSet (a unique set by design). On top of all that, the reversing wasn't even correct. The uniqueness is overrated, but it was meant to ensure you only encode one bit for a path read in either direction (e.g. NC=O and O=CN). However, since the reversing was done by string manipulation, you get some fun situations with two-letter atom symbols: FeC and CFe should reverse to each other, but reversed as strings they come out as CeF and eFC. Doh!
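The two-letter symbol problem can be reproduced in a few lines. This is a minimal sketch, not the Fingerprinter code: the class and method names are made up for illustration, and the path is modelled as a plain list of atom symbols.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PathReversal {
    // Reverses the encoded path character by character (the buggy approach).
    static String reverseAsString(String encoded) {
        return new StringBuilder(encoded).reverse().toString();
    }

    // Reverses the path symbol by symbol, then joins the symbols.
    static String reverseAsPath(List<String> symbols) {
        List<String> copy = new ArrayList<>(symbols);
        Collections.reverse(copy);
        return String.join("", copy);
    }

    public static void main(String[] args) {
        List<String> path = Arrays.asList("Fe", "C");
        System.out.println(reverseAsString(String.join("", path))); // CeF
        System.out.println(reverseAsPath(path));                    // CFe
    }
}
```

Reversing the list of symbols keeps each symbol intact, which is the idea behind reversing the path rather than the string.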

I also made it so pseudo atoms are always encoded as '*' instead of max atomic number + 1 (this recently changed - :D). Oh, and you don't normally want to include these in the fingerprint anyway, so I added an option that skips all pseudo atoms.

After all these changes (and some very crafty cleverness) the fingerprint uses a lot less memory and is close to optimal without breaking backwards compatibility. Here are the times on my laptop for 10k molecules from ChEMBL 22; the diff at the end also shows there were no changes to the generated fingerprints.

[sovereign ~/workspace/github/cdk/cdk-paper-3/benchmark/cdk-2.0]: time head -n 10000 /data/chembl_22.smi | ./cdk fpgen --ifmt=smi --type=path -p > old.fps
Processing STDIN
[INFO] 10000 (261 per second)
Finished fpgen --ifmt=smi --type=path -p
 Elapsed Time: 38325
 Records       10000
 Processed:    10000
 Skipped:      0

real	0m38.980s
user	0m52.669s
sys	0m0.959s
[sovereign ~/workspace/github/cdk/cdk-paper-3/benchmark/cdk-2.0]: time head -n 10000 /data/chembl_22.smi | ./cdk fpgen --ifmt=smi --type=path -p > new.fps
Processing STDIN
[INFO] 10000 (2113 per second)
Finished fpgen --ifmt=smi --type=path -p
 Elapsed Time: 4734
 Records       10000
 Processed:    10000
 Skipped:      0

real	0m5.070s
user	0m13.885s
sys	0m0.375s
[sovereign ~/workspace/github/cdk/cdk-paper-3/benchmark/cdk-2.0]: diff old.fps new.fps 
5c5
< #date=2017-04-98T11:57:51
---
> #date=2017-04-98T11:59:32

Even faster

johnmay · 2017-04-08T23:36:07Z

Whoops, I meant to squash the "Fixup" commits into previous ones... I'll do that first thing tomorrow. Please don't merge before that.

egonw · 2017-04-09T05:53:05Z

base/standard/src/main/java/org/openscience/cdk/fingerprint/Fingerprinter.java

-                                                                          put("Se", "E");
-                                                                          put("Na", "G");
-                                                                          put("Ca", "J");
-                                                                          put("Al", "A");


This is how the code dealt with Fe/eF reversing issues :)

egonw

Looks good to me. Happy to see this code updated.

Yes, so, about the GraphOnly and Hybridization FPs... the first exists to find similar skeletons, but I'm not sure if it is still used... the second has an important purpose... aromaticity is a pain... Mind you, I think the code was generally ignoring hydrogens... so, to make that work, you had to ignore whether something was actually aromatic and focus on it being delocalized... (just so that you understand what people may ask about...)

egonw · 2017-04-09T05:54:12Z

base/standard/src/main/java/org/openscience/cdk/fingerprint/Fingerprinter.java

+                    return "A";
+            }
+            return atom.getSymbol();
+        }


With the comment here that Fe is not included... it would need extension to all two-char element symbols, but I'm sure you are changing it in a later patch anyway?

egonw · 2017-04-09T06:00:37Z

base/standard/src/main/java/org/openscience/cdk/fingerprint/Fingerprinter.java

+            case TRIPLE:
+                return "#";
+            default:
+                return "";


This is a nice example of maintenance... "switch" did not exist when Chris originally wrote this code now close to 20 years ago :)

BTW, is the switch seriously enough faster to notice?? Or is it just not doing the getOrder() so often?

egonw · 2017-04-09T06:23:00Z

(But I won't merge in until you tell me it's ready... PS, nice speed up!)

johnmay · 2017-04-09T07:58:09Z

Will comment here because the push will wipe the commits:

> This is how the code dealt with Fe/eF reversing issues :)

Yes I presumed that was the case.

> BTW, is the switch seriously enough faster to notice?? Or is it just not doing the getOrder() so often?

Yep, it gets called a lot, and I always use a switch. The difference is minimal for something this size, but switches are implemented as jumps in assembly, whereas if/else branches are conditional jumps. For branching, https://en.wikipedia.org/wiki/Branch_predictor helps a lot in most cases, but here we're hopping between conditions (single, double) a lot. Also see https://en.wikipedia.org/wiki/Branch_(computer_science)#Performance_problems_with_branch_instructions for how conditional branches can cause "stalls" in which the pipeline has to be restarted on a different part of the program.
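For readers who have not seen the two forms side by side, here is a rough sketch of the if/else chain versus the switch. The `Order` enum is a hypothetical stand-in rather than the CDK type, and the "-" and "=" symbols are assumptions; only the "#" and "" returns appear in the diff above. Whether the switch is measurably faster depends on the JIT and the data, so treat this as an illustration of the shape, not a benchmark.

```java
public class BondSymbols {
    // Hypothetical stand-in for a bond order enum; not the CDK type.
    enum Order { SINGLE, DOUBLE, TRIPLE, QUADRUPLE }

    // Chain of comparisons: each test is a separate conditional branch.
    static String symbolIfChain(Order order) {
        if (order == Order.SINGLE) return "-";
        else if (order == Order.DOUBLE) return "=";
        else if (order == Order.TRIPLE) return "#";
        else return "";
    }

    // One switch on the enum value; the compiler may emit a jump table.
    static String symbolSwitch(Order order) {
        switch (order) {
            case SINGLE: return "-";
            case DOUBLE: return "=";
            case TRIPLE: return "#";
            default:     return "";
        }
    }

    public static void main(String[] args) {
        // Both forms give the same symbol for every order.
        for (Order o : Order.values())
            System.out.println(o + " '" + symbolIfChain(o) + "' '" + symbolSwitch(o) + "'");
    }
}
```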

The biggest speed-up actually came from eliminating the path encoding as strings (i.e. the StringBuilder). I've typically (re)used a StringBuilder in such situations, but here it really was being bottlenecked by the String creation.
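To illustrate how a hash can be computed without building the string, here is a small sketch. It assumes the hash in question follows the documented `String.hashCode()` recurrence (`h = 31 * h + c`); the class and method names are invented for the example, and the actual hashing in Fingerprinter may differ.

```java
public class IncrementalHash {
    // Folds the characters of one piece into a running hash using the
    // same recurrence as String.hashCode(): h = 31 * h + c.
    static int append(int hash, CharSequence piece) {
        for (int i = 0; i < piece.length(); i++)
            hash = 31 * hash + piece.charAt(i);
        return hash;
    }

    public static void main(String[] args) {
        String[] pieces = {"N", "-", "C", "=", "O"};
        int hash = 0;
        for (String piece : pieces)
            hash = append(hash, piece);
        // Same value as hashing the concatenated string, with no
        // StringBuilder or intermediate String allocated per path.
        System.out.println(hash == "N-C=O".hashCode()); // true
    }
}
```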

You can go even faster if we remove the backwards compatibility.

int hashFor = hashPath(..);
int hashRev = hashPath(..);
return Math.min(hashFor, hashRev);

And then, if you remove the uniqueness requirement, you can generate the hash during the traversal: as we traverse, the next hash is the previous hash plus the newly visited atom. I really don't think the uniqueness is needed. Without it you set twice as many bits; however, Daylight actually used to encode multiple bits per pattern based on the path length.
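A rough sketch of the forward/reverse idea follows. The `hashPath` signature here is hypothetical (the arguments are elided in the snippet above), the path is modelled as a list of atom and bond symbols, and the hash recurrence is an assumption chosen for the example.

```java
import java.util.Arrays;
import java.util.List;

public class DirectionlessHash {
    // Hypothetical path hash: walks the symbols forwards or backwards and
    // folds each whole symbol into a running hash, so "Fe" is never split.
    static int hashPath(List<String> symbols, boolean reversed) {
        int hash = 0;
        int n = symbols.size();
        for (int i = 0; i < n; i++) {
            String symbol = symbols.get(reversed ? n - 1 - i : i);
            for (int j = 0; j < symbol.length(); j++)
                hash = 31 * hash + symbol.charAt(j);
        }
        return hash;
    }

    // Taking the smaller of the two hashes gives the same value
    // whichever end of the path the traversal started from.
    static int hashEitherDirection(List<String> symbols) {
        int hashFor = hashPath(symbols, false);
        int hashRev = hashPath(symbols, true);
        return Math.min(hashFor, hashRev);
    }

    public static void main(String[] args) {
        List<String> amide = Arrays.asList("N", "C", "=", "O");
        List<String> reversedAmide = Arrays.asList("O", "=", "C", "N");
        System.out.println(hashEitherDirection(amide) == hashEitherDirection(reversedAmide)); // true
    }
}
```

Dropping the `Math.min` step and hashing each direction as it is traversed is what would set roughly twice as many bits, as noted above.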

johnmay · 2017-04-09T08:05:13Z

Oh nice, GitHub keeps the original commits now :-). All fixed up.

egonw · 2017-04-09T08:06:56Z

OK, I'll wait for Travis to finish and if all is good, I'll merge it in...

johnmay · 2017-04-09T08:56:05Z

Going to have a quick look at subclasses as I think Hybridisation FP etc can be improved also (i.e. they use the old getHashes method).

egonw · 2017-04-09T09:05:02Z

@johnmay, the Hybridization FP is identical to the path FP except that it does a s/aromaticity/sp2Hybrid/g... So, I recommend to just overwrite the code with your new path FP code, and then replace the checks for aromaticity with a check if the atom is sp2 hybridized...

johnmay · 2017-04-09T09:06:15Z

See #298

johnmay added 9 commits April 8, 2017 18:25

Localize path encode to a seperate method, little benefit now but wil…

6cb2980

…l allow us to avoid holding all paths in memory in future.

Avoid the separate clean step, not doing any work. Note we have to re…

15b0955

…turn the inverse of the lexicographic lowest because it was being re-reveresed later.

Of course we don't need to store the strings are all... you can see w…

b8553de

…here this is going.

Just set the values in the bitset, by definition the set it unique.

b8ed70d

Encode atoms in a specific method - note the pseudo atom handling was…

ed53d4b

… only done on first atom previously.

Fingerprint will change as we get more elements - just use '*' for ps…

214361c

…eudo atoms. Typically you would skip all paths with pseudo atoms in them anyways.

Better cache performance.

6676d2a

Reverse the path rather than the string, Fe-C, reversed as a string i…

6d011bc

…s C-eF instead of C-Fe. This does change the fingerprint for those atoms but is now more correct.

Rename some vars and encapsulate method ready for low-mem traversal.

cd59c82

egonw reviewed Apr 9, 2017

johnmay added 10 commits April 9, 2017 09:03

Low memory path traversal.

8a3cbee

Fast bond symbol encoding.

99c09be

Use the bonds captured during the traversal to encode the path.

d916006

Be clever and work out what generated path will be lexicographically …

e078b08

…lower without actually generating it or reversing atoms (we still need to reverse for the generation ATM).

Encode the hash of a concatenated string without actually creating th…

2a4c2e0

…e string.

Might as well avoid the reversing, not expensive but the hashPath fun…

cb2d296

…ction is now very minimal and easy to reverse.

Better error message.

9b853c2

Allow fingerprint for pseudo atoms to be a subset of a structure they…

5b569a3

… would match.

Decrease default path length to 7 bonds (same as Daylight). 8 bonds (…

6c59dfb

…9 atoms) means you generate many paths for some caged structures, we have an exception thrown for this cases telling users to decrease the length.

johnmay force-pushed the patch/fpfixup branch from cf6c7a3 to 6c59dfb Compare April 9, 2017 08:04

egonw merged commit 1bd50d6 into master Apr 9, 2017

egonw mentioned this pull request Apr 23, 2017

Reuse parts from base path fingerprint. #298

Merged
