Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patched a bug in IsotopePatternGenerator #196

Merged
merged 2 commits into from Feb 16, 2016
Merged

Conversation

tomas-pluskal
Copy link
Contributor

I found this curious bug - running IsotopePatternGenerator on C20H30Fe2P2S4Cl4 returned sometimes 34 and sometimes 35 entries. Such non-deterministic behavior is not ideal for a library.

By tracking deeper I found the bug is caused by this: MolecularFormula.isotopes() returns an iterator of a Set, which is by definition unordered, and iterates the elements in random order (not sure how JRE manages this internally). The elements are therefore added to the growing isotope pattern in non-deterministic order. See https://github.com/cdk/cdk/blob/master/tool/formula/src/main/java/org/openscience/cdk/formula/IsotopePatternGenerator.java#L102

The method than does strange heuristics to decide whether to remove certain low-abundance isotopes. See
https://github.com/cdk/cdk/blob/master/tool/formula/src/main/java/org/openscience/cdk/formula/IsotopePatternGenerator.java#L224
However, whether the isotopes are low-abundant will depend on the order in which these are added to the calculation, which is random. Hence, the result is non-deterministic.

I added a junit test, which fails roughly in 50% of runs with CDK 1.5.12. I also added a patch for this bug, by reducing the threshold of what is considered low-abundance. Note: this does not fix the bug, only reduces its visibility (at the cost of increased computation time). Basically, I think the heuristics that is used is wrong, and the results of IsotopePatternGenerator are probably not 100% correct.

@johnmay
Copy link
Member

johnmay commented Feb 16, 2016

Just to clarify, the test will still fail 50% of the time? Any ideas on a better method.

  • John

@tomas-pluskal
Copy link
Contributor Author

With the updates, the test method should not fail.

@tomas-pluskal
Copy link
Contributor Author

For a better method, see e.g. http://www.sciencedirect.com/science/article/pii/S1044030507004308

Actually, calculation of exact isotope patterns is a non-trivial problem, as MW gets larger. Not sure if CDK also wants to cover the area of large molecules like proteins, though.

@johnmay
Copy link
Member

johnmay commented Feb 16, 2016

Okay thanks for info and patch.

johnmay added a commit that referenced this pull request Feb 16, 2016
Patched a bug in IsotopePatternGenerator
@johnmay johnmay merged commit 30e8d67 into cdk:master Feb 16, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants