Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch/faster atom types #175

Merged
merged 14 commits into from Nov 15, 2015
Merged

Patch/faster atom types #175

merged 14 commits into from Nov 15, 2015

Conversation

egonw
Copy link
Member

@egonw egonw commented Nov 14, 2015

John, a series of patches to reuse calculated properties. It's not really a significant performance improvement, but some 20-30%, I think.

Your thoughts please...

@egonw
Copy link
Member Author

egonw commented Nov 15, 2015

I now ran it on nine SMILES strings of varying complexity, with enough nitrogens. Here are the numbers:

600727.211 ±(99.9%) 101636.546 ns/op [Average]
171571.599 ±(99.9%) 14104.917 ns/op [Average]

That's a lot better than my guesses... a speed up of about 70%.

@johnmay
Copy link
Member

johnmay commented Nov 15, 2015

Looks good, but still lots of bits which could be much much faster. What are you benchmarks (ns/op?) something more relatable is mols per second.

I would try and remove more calls to methods like getMaxBondOrder(). I like the reuse of adjacency lists but consider that you're currently doing this:

for (IAtom atom : mol.atoms()) {
  List<IBond> bonds = new ArrayList(4);
  for (IBond bond : mol.bonds()) {
     if (bond.contains(atom))
        bonds.add(bond);
  }
}

Much better is the following, which literally n times better!

  Map<IAtom,List<IBond> bonds = new HashMap();
  for (IBond bond : mol.bonds()) {
     bonds.put(bond.getAtom(0), bond);
     bonds.put(bond.getAtom(1), bond);
  }
}

Also you can do much faster with some fundamental properties, if you calculate the connectivity (x), valence (v), and charge (q) not only can you switch on the values very fast but it bounds other more expensive ops.

v == x;  // all single bonds
v > x; // at least one double/triple bond

johnmay added a commit that referenced this pull request Nov 15, 2015
@johnmay johnmay merged commit 6e5c9cf into cdk:master Nov 15, 2015
@egonw
Copy link
Member Author

egonw commented Nov 15, 2015

I fully agree with your observation... I did not have time for this idea yet... OK, let me try it...

BTW, do realize this complication... the code must be fairly flexible... it must work with missing information... the missing info you already solved for SMILES and possibly MDL format readers too, by implementing the implicit atom type models there... but this is not always present... e.g. in JChemPaint where there is no implicit atom type model... so, not sure the latter will work well...

@johnmay
Copy link
Member

johnmay commented Nov 15, 2015

Looking good, testing this patch with 100,000 random from ChEMBL 20:
Before: 100000 7.46 s (13410.19 s-1)
After: 100000 3.46 s (28869.05 s-1)

Should be easy to get to close to 100,000 per second though.

@johnmay
Copy link
Member

johnmay commented Nov 15, 2015

Something I find confusing is the tests for single and lone pairs... where does this info come from? Most molecule formats won't have this info right (maybe radicals I guess but super rare).

@johnmay
Copy link
Member

johnmay commented Nov 15, 2015

Just tested switch, no difference really because it's C/N/O/P/S most of the time but it is neater code.

@egonw
Copy link
Member Author

egonw commented Nov 15, 2015

single, lone pairs -> it's needed for calculating Gasteiger (pi) charges...

@johnmay
Copy link
Member

johnmay commented Nov 15, 2015

But doesn't it add these in its self..? here it uses the single/lone pair count to derive atom types but that info is missing in the first place (i.e. not provided).

Also for the JChemPaint you only really need H count, and then only for common atoms. As long as the interface allows one to specify the valence/h count that's all that's needed for drawing.

Also the Gasteiger Pi charges aren't great right? Graphs - http://www.eyesopen.com/quacpac

@egonw
Copy link
Member Author

egonw commented Nov 15, 2015

[single electron/lone pair] not sure what you mean... atom type perception should indeed not add such information; it should only perceive atom types...

[Gasteiger Pi charges aren't great] I know... but we don't integrate QM approaches in the CDK, and many properties depend on some property...

@johnmay
Copy link
Member

johnmay commented Nov 15, 2015

Yes, it doesn't add it so where do the lone pairs come from? CML?

@egonw
Copy link
Member Author

egonw commented Nov 15, 2015

Oh... umm... there is code in the CDK that adds them... Miguel's thesis work in Chris' group in Cologne. CML supports it too, yes. MDL molfiles support radicals (right?), but not sure if the code currently reads that...

@egonw
Copy link
Member Author

egonw commented Nov 15, 2015

OK, I implemented your caching of connected bonds... another reasonable speed up (more than I expected; love to see your stats on 100k random ChEMBL; BTW, can you plz blog that?):

CDKBenchmark.testPerceiveAtomType avgt 10 140366.748 ± 31514.608 ns/op
CDKBenchmark.testPerceiveOneByOne avgt 10 169363.249 ± 11576.977 ns/op

Here's the patch, but plz wait a sec, so that I can do some more things: master...egonw:patch/evenFasterAtomTypes

@johnmay
Copy link
Member

johnmay commented Nov 15, 2015

MDL molfiles support radicals

True, no lone pairs though, what proportion of molecules have radicals :-)

@johnmay
Copy link
Member

johnmay commented Nov 15, 2015

Anyways mute point about the single/lone electrons as I like you quick check to skip the iterators. Just a curiosity that there's a lot of code and radical atom types.

@egonw
Copy link
Member Author

egonw commented Nov 15, 2015

Well, current cheminformatics is very biased towards neutral drug-like compounds. But the CDK is a chemistry development kit, not a drug-discovery development kit :)

BTW, second patch uses the precomputed bonds for sulphurs, hydrogens, and phosphors too:

CDKBenchmark.testPerceiveAtomType avgt 10 120095.246 ± 8394.922 ns/op
CDKBenchmark.testPerceiveOneByOne avgt 10 190666.575 ± 64073.884 ns/op

Apparently still quite a few of them... OK, let me try to halogens too...

@egonw
Copy link
Member Author

egonw commented Nov 15, 2015

Oh, carp... I need to just pass the full map... OK, more speed ups pending!

@johnmay
Copy link
Member

johnmay commented Nov 15, 2015

Well, current cheminformatics is very biased towards neutral drug-like compounds. But the CDK is a chemistry development kit, not a drug-discovery development kit :)

Hmm, I would say more biased towards organic chemistry that just drug like molecules but they're one and the same I guess. The trouble is to handle more exotic things inorganics, polymers, formulations, etc and do it well it's a completely different paradigm away from lewis structures. I'm not completely sure having a universal model of both is compatible or even useful.

@egonw egonw deleted the patch/fasterAtomTypes branch November 16, 2015 07:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants