Patch/faster atom types #175

egonw · 2015-11-14T22:20:03Z

John, a series of patches to reuse calculated properties. It's not really a significant performance improvement, but some 20-30%, I think.

Your thoughts please...

egonw · 2015-11-15T14:26:41Z

I now ran it on nine SMILES strings of varying complexity, with enough nitrogens. Here are the numbers:

600727.211 ±(99.9%) 101636.546 ns/op [Average]
171571.599 ±(99.9%) 14104.917 ns/op [Average]

That's a lot better than my guess... about 71% less time per op, i.e. roughly 3.5x faster.

johnmay · 2015-11-15T14:40:06Z

Looks good, but there are still lots of bits which could be much, much faster. What are your benchmarks measuring (ns/op?) Something more relatable is mols per second.

I would try and remove more calls to methods like getMaxBondOrder(). I like the reuse of adjacency lists but consider that you're currently doing this:

for (IAtom atom : mol.atoms()) {
  List<IBond> bonds = new ArrayList<>(4);
  for (IBond bond : mol.bonds()) {
     if (bond.contains(atom))
        bonds.add(bond);
  }
}

Much better is the following, which is literally n times better!

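  // one pass over the bonds: file each bond under both of its atoms
  Map<IAtom, List<IBond>> bonds = new HashMap<>();
  for (IBond bond : mol.bonds()) {
     bonds.computeIfAbsent(bond.getAtom(0), k -> new ArrayList<>(4)).add(bond);
     bonds.computeIfAbsent(bond.getAtom(1), k -> new ArrayList<>(4)).add(bond);
  }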
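To make the difference concrete, here is a minimal runnable sketch of both approaches. `Atom` and `Bond` are hypothetical stand-ins for CDK's `IAtom`/`IBond` so that it compiles without the CDK on the classpath; it is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types; the real code would use CDK's IAtom and IBond.
final class AdjacencySketch {
    static final class Atom {
        final String symbol;
        Atom(String symbol) { this.symbol = symbol; }
    }

    static final class Bond {
        final Atom begin, end;
        Bond(Atom begin, Atom end) { this.begin = begin; this.end = end; }
        boolean contains(Atom a) { return a == begin || a == end; }
    }

    // Quadratic: every atom scans every bond.
    static Map<Atom, List<Bond>> perAtomScan(List<Atom> atoms, List<Bond> bonds) {
        Map<Atom, List<Bond>> result = new HashMap<>();
        for (Atom atom : atoms) {
            List<Bond> connected = new ArrayList<>(4);
            for (Bond bond : bonds)
                if (bond.contains(atom)) connected.add(bond);
            result.put(atom, connected);
        }
        return result;
    }

    // Linear: one pass over the bonds, each bond filed under both of its atoms.
    static Map<Atom, List<Bond>> singlePass(List<Atom> atoms, List<Bond> bonds) {
        Map<Atom, List<Bond>> result = new HashMap<>();
        for (Atom atom : atoms) result.put(atom, new ArrayList<>(4)); // keeps isolated atoms in the map
        for (Bond bond : bonds) {
            result.get(bond.begin).add(bond);
            result.get(bond.end).add(bond);
        }
        return result;
    }

    public static void main(String[] args) {
        Atom c1 = new Atom("C"), c2 = new Atom("C"), o = new Atom("O");
        List<Atom> atoms = List.of(c1, c2, o);
        List<Bond> bonds = List.of(new Bond(c1, c2), new Bond(c2, o));
        Map<Atom, List<Bond>> map = singlePass(atoms, bonds);
        System.out.println(map.get(c2).size());                    // 2
        System.out.println(map.equals(perAtomScan(atoms, bonds))); // true
    }
}
```

The scan does atoms × bonds `contains` checks, while the single pass does two list appends per bond, which is where the factor of n comes from.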

Also, you can go much faster with some fundamental properties: if you calculate the connectivity (x), valence (v), and charge (q), not only can you switch on the values very quickly, but they also bound other, more expensive ops.

v == x;  // all single bonds
v > x; // at least one double/triple bond
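A rough sketch of how x and v could be computed and compared, assuming integer (Kekulé) bond orders and an implicit hydrogen count per atom. The class and method names are made up for illustration and are not CDK API; the charge q is left out here.

```java
// Hypothetical helper, not part of the CDK: summarises an atom's bonds as x and v.
final class BondSummarySketch {
    // x: number of neighbours, counting implicit hydrogens
    static int connectivity(int[] bondOrders, int implicitH) {
        return bondOrders.length + implicitH;
    }

    // v: sum of bond orders, each implicit hydrogen counting as one single bond
    static int valence(int[] bondOrders, int implicitH) {
        int v = implicitH;
        for (int order : bondOrders) v += order;
        return v;
    }

    // v - x is the total bond order in excess of single bonds,
    // so a cheap switch replaces iterating over the bonds again.
    static String describe(int x, int v) {
        switch (v - x) {
            case 0:  return "all single bonds";
            case 1:  return "one double bond";
            case 2:  return "one triple bond or two double bonds";
            default: return "other";
        }
    }

    public static void main(String[] args) {
        // Carbonyl carbon of acetaldehyde: C-C single, C=O double, one implicit H.
        int[] orders = {1, 2};
        int x = connectivity(orders, 1); // 3
        int v = valence(orders, 1);      // 4
        System.out.println(describe(x, v)); // one double bond
    }
}
```

Because implicit hydrogens add the same amount to both x and v, the difference does not depend on whether hydrogens are explicit or implicit.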

egonw · 2015-11-15T14:45:26Z

I fully agree with your observation... I did not have time for this idea yet... OK, let me try it...

BTW, do realize this complication... the code must be fairly flexible... it must work with missing information... the missing info you already solved for SMILES and possibly MDL format readers too, by implementing the implicit atom type models there... but this is not always present... e.g. in JChemPaint where there is no implicit atom type model... so, not sure the latter will work well...

johnmay · 2015-11-15T15:01:01Z

Looking good, testing this patch with 100,000 random molecules from ChEMBL 20:
Before: 100000 7.46 s (13410.19 s-1)
After: 100000 3.46 s (28869.05 s-1)

Should be easy to get to close to 100,000 per second though.
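For comparing the two kinds of numbers in this thread, a small conversion sketch (the class name is made up). Note the thread does not say whether one JMH op types a single molecule or all nine SMILES, so ops per second is not necessarily molecules per second.

```java
// Hypothetical helper for converting between the units used in this thread.
final class ThroughputSketch {
    // JMH average time (ns/op) to operations per second
    static double opsPerSecond(double nanosPerOp) {
        return 1e9 / nanosPerOp;
    }

    // item count and wall-clock seconds to items per second
    static double perSecond(long count, double seconds) {
        return count / seconds;
    }

    public static void main(String[] args) {
        System.out.printf("%.0f ops/s%n", opsPerSecond(171571.599)); // roughly 5.8 thousand
        System.out.printf("%.0f mols/s%n", perSecond(100_000, 3.46)); // roughly 29 thousand
    }
}
```

The rates computed from the rounded times differ slightly from the ones quoted above, presumably because those were derived from unrounded timings.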

johnmay · 2015-11-15T15:04:45Z

Something I find confusing is the tests for single electrons and lone pairs... where does this info come from? Most molecule formats won't have this info, right? (Maybe radicals, I guess, but those are super rare.)

johnmay · 2015-11-15T15:06:14Z

Just tested switch, no difference really because it's C/N/O/P/S most of the time but it is neater code.

egonw · 2015-11-15T15:14:13Z

single, lone pairs -> it's needed for calculating Gasteiger (pi) charges...

johnmay · 2015-11-15T15:17:11Z

But doesn't it add these in itself..? Here it uses the single electron/lone pair count to derive atom types, but that info is missing in the first place (i.e. not provided).

Also, for JChemPaint you only really need the H count, and then only for common atoms. As long as the interface allows one to specify the valence/H count, that's all that's needed for drawing.

Also, the Gasteiger Pi charges aren't great, right? Graphs - http://www.eyesopen.com/quacpac

egonw · 2015-11-15T15:21:31Z

[single electron/lone pair] not sure what you mean... atom type perception should indeed not add such information; it should only perceive atom types...

[Gasteiger Pi charges aren't great] I know... but we don't integrate QM approaches in the CDK, and many properties depend on some property...

johnmay · 2015-11-15T15:25:17Z

Yes, it doesn't add it so where do the lone pairs come from? CML?

egonw · 2015-11-15T15:27:43Z

Oh... umm... there is code in the CDK that adds them... Miguel's thesis work in Chris' group in Cologne. CML supports it too, yes. MDL molfiles support radicals (right?), but not sure if the code currently reads that...

egonw · 2015-11-15T15:29:22Z

OK, I implemented your caching of connected bonds... another reasonable speed up (more than I expected; love to see your stats on 100k random ChEMBL; BTW, can you plz blog that?):

Benchmark                          Mode  Cnt       Score       Error  Units
CDKBenchmark.testPerceiveAtomType  avgt   10  140366.748 ± 31514.608  ns/op
CDKBenchmark.testPerceiveOneByOne  avgt   10  169363.249 ± 11576.977  ns/op

Here's the patch, but plz wait a sec, so that I can do some more things: master...egonw:patch/evenFasterAtomTypes

johnmay · 2015-11-15T15:32:05Z

> MDL molfiles support radicals

True, but no lone pairs. And what proportion of molecules have radicals? :-)

johnmay · 2015-11-15T15:33:59Z

Anyway, it's a moot point about the single electrons/lone pairs, as I like your quick check to skip the iterators. It's just a curiosity that there's a lot of code and radical atom types.

egonw · 2015-11-15T15:41:07Z

Well, current cheminformatics is very biased towards neutral drug-like compounds. But the CDK is a chemistry development kit, not a drug-discovery development kit :)

BTW, the second patch uses the precomputed bonds for sulphurs, hydrogens, and phosphorus too:

Benchmark                          Mode  Cnt       Score       Error  Units
CDKBenchmark.testPerceiveAtomType  avgt   10  120095.246 ±  8394.922  ns/op
CDKBenchmark.testPerceiveOneByOne  avgt   10  190666.575 ± 64073.884  ns/op

Apparently still quite a few of them... OK, let me try the halogens too...

egonw · 2015-11-15T15:44:59Z

Oh, carp... I need to just pass the full map... OK, more speed ups pending!

johnmay · 2015-11-15T15:52:32Z

> Well, current cheminformatics is very biased towards neutral drug-like compounds. But the CDK is a chemistry development kit, not a drug-discovery development kit :)

Hmm, I would say it's more biased towards organic chemistry than just drug-like molecules, but they're one and the same, I guess. The trouble is that handling more exotic things (inorganics, polymers, formulations, etc.) and doing it well requires a completely different paradigm, away from Lewis structures. I'm not completely sure having a universal model of both is compatible or even useful.

egonw added 14 commits November 14, 2015 17:23

Reduce the number of (specific) iterators created

ee2d8e4

Make more reuse of once-calculated connected bonds.

1eeb025

Removed more methods that do not take precalculated bond lists

873cf4d

Removed an unused method

f9b9d0e

Got rid of more old methods

dc7ab81

Simplified the code

b92b7b2

More reuse of info

59f46fc

Removed dead code

1a3d45a

Fail early

ebdd217

Reuse connected bonds for hydrogens

94d361b

Hydrogens are more frequent

94a7112

Ring searching is relative expensive: postpone it as much as possible…

6ce62d7

…, and be ready for reusing the data

Calculate rings only once if we type the full molecules

cc22713

Run ring search on the whole molecule

81c9b07

johnmay added a commit that referenced this pull request Nov 15, 2015

Merge pull request #175 from egonw/patch/fasterAtomTypes

6e5c9cf

Patch/faster atom types

johnmay merged commit 6e5c9cf into cdk:master Nov 15, 2015

egonw deleted the patch/fasterAtomTypes branch November 16, 2015 07:42
