Patched a bug in IsotopePatternGenerator #196
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I found this curious bug - running IsotopePatternGenerator on C20H30Fe2P2S4Cl4 returned sometimes 34 and sometimes 35 entries. Such non-deterministic behavior is not ideal for a library.
By tracking deeper I found the bug is caused by this: MolecularFormula.isotopes() returns an iterator of a Set, which is by definition unordered, and iterates the elements in random order (not sure how JRE manages this internally). The elements are therefore added to the growing isotope pattern in non-deterministic order. See https://github.com/cdk/cdk/blob/master/tool/formula/src/main/java/org/openscience/cdk/formula/IsotopePatternGenerator.java#L102
The method than does strange heuristics to decide whether to remove certain low-abundance isotopes. See
https://github.com/cdk/cdk/blob/master/tool/formula/src/main/java/org/openscience/cdk/formula/IsotopePatternGenerator.java#L224
However, whether the isotopes are low-abundant will depend on the order in which these are added to the calculation, which is random. Hence, the result is non-deterministic.
I added a junit test, which fails roughly in 50% of runs with CDK 1.5.12. I also added a patch for this bug, by reducing the threshold of what is considered low-abundance. Note: this does not fix the bug, only reduces its visibility (at the cost of increased computation time). Basically, I think the heuristics that is used is wrong, and the results of IsotopePatternGenerator are probably not 100% correct.