Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement round robin algorithm for decomposing molecular formulas #210

Merged
merged 5 commits into from Aug 1, 2016

Conversation

kaibioinfo
Copy link
Contributor

Hi,

I talked with Thomas Pluskal about implementing the molecular formula decomposer from SIRIUS into the CDK.

Motivation:
SIRIUS is using the round robin algorithm to decompose formulas. It can be around 10 to 1000 times faster than the current algorithm (depending on the input parameters). The key idea is to first compute a dynamic programming (DP) table that memoizes which masses are decomposable for a certain alphabet. Afterwards, enumerating all formulas with a certain mass is just backtracking this DP table. In comparison to the current enumeration algorithm, this will only check molecular formulas which really have the given mass (and, therefore, the running time depends only on the number of formulas in the result set and not on the mass itself). Round Robin is using an extended residue table (ERT), which can store this kind of information with few kilobytes of memory.

See
http://dx.doi.org/10.1093/bioinformatics/btm631
and
http://dx.doi.org/10.1007/978-3-642-40453-5_5

API changes:

  • I tried to not change the public API of the MolecularFormulaGenerator. So far the only change is that the getFinishedPercentage method is not working anymore (and, therefore, a test fails). It returns 0 if the decomposition is running and 1 if it is canceled or done.

Implementation Details:

  • I used the code directly from SIRIUS (which is LGPL, but I'm the author anyways) and changed a few details, in especially I added an iterator variant to match the current API in CDK. I tried to change as less as possible, therefore most of the classes are moved into a decomp package and some classes are redundant with CDK classes.
  • The decomp classes are not fully documented. I will work on that. The important classes for CDK are fully documented.
  • the current implementation cannot deal with some ill-posed inputs, e.g. when only one element is given or masses are incredibly large. Therefore, I kept the original decomposer by Thomas and wrote a wrapper class that dynamically decide which implementation it want to choose
  • the method getFinishedPercentage is implemented as dummy. There is no implementation that would make sense
  • as the extended residue table has to be computed only once, it is currently stored in the DecomposerFactory. I implemented a very simple cache that just stores the decomposers for the last 10 alphabets.

A Bugfix to the current API:

  • I know, that should be a separate pull request, but it's already in this commit, my apologize.
  • The flag searchRunning have to be volatile, otherwise the cancel method might be useless when called from another thread

*
* @param <T> type of a single character in the alphabet
*/
interface Alphabet<T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there likely to be different Alphabets? I'd be tempted to pull these methods up to the single ChemicalAlphabet implementation and drop the interface.

@johnmay
Copy link
Member

johnmay commented Jul 28, 2016

Looks really great and nicely gets called automatically for existing uses, thanks! There's a test regression and some style/location adaptions would be appreciated.

  1. There's a regression in the test suite to do with testGetFinishedPercentage.
  2. Remove Alhabet interface and pull methods up to the concrete ChemicalAlhabet
  3. More package-private, we've currently got a huge API for historical reasons so as much as we can hide as possible is best. Two options here:
    a) [Preferred] move everything in org.openscience.cdk.decomp.* to org.openscience.cdk.formula.* and package private scope all new classes, interfaces need to then be abstract classes.
    b) move RoundRobin to decomp and keep it public, make everything else package private as above

I'm happy to merge once these are resolved. Egon might also want to take a look.

@tomas-pluskal
Copy link
Contributor

Thanks, Kai!

Is there really no way to estimate the progress of this algorithm? I tried briefly running it on a larger mass range (10,000 +- 0.003 Da), it took ~30 min to complete. Such runtimes with no feedback of progress are obviously quite annoying for GUI apps.

Which method does the algorithm spend most time in? Is it integerDecompose()?

@kaibioinfo
Copy link
Contributor Author

kaibioinfo commented Jul 28, 2016

@tomas-pluskal : The running time depends on the number of molecular formulas that are returned (so, basicially, the number of #getNextFormula calls). If you wait 30 minutes for decomposing a mass you will also get millions of formulas back. Writing them to file or display them in some GUI should take even more time. The question is why you would want to decompose millions of formulas.
However, you can, of course, just guess a number. I just don't really see a use case for that.

@johnmay : I kept the classes to keep the code mostly identical to the original code. However, there is no further reason to do so. I'm fine with removing the alphabet class and moving the other classes with package visibility into the formula package.

@tomas-pluskal
Copy link
Contributor

@kaibioinfo: The 30 min run I mentioned did not result in millions of formulas, but only in 1510. And that is just a list of "raw" formulas, which can be further reduced by seven golden rules, isotope matching, etc. Therefore, I'd say it totally makes sense to let the tool run for 30 min and get all possible candidates.

@kaibioinfo
Copy link
Contributor Author

@tomas-pluskal hm, that sounds like a bug =( Can you give me the number and alphabet you used for that?

@tomas-pluskal
Copy link
Contributor

Sure, I used this as input:

mass=10000.0
tolerance(Da)=0.003
elements=C[0-100]H[0-100]N[0-100]O[0-100]S[0-100]P[0-100]

@kaibioinfo
Copy link
Contributor Author

kaibioinfo commented Jul 28, 2016

Ah, okay, I see the problem: The algorithm is input-sensitive when using no bounds. Especially bounds on hydrogen atoms don't give you anything (as it is in the current implementation of yours) as hydrogens are added at very last to the formula. So if you decompose some mass and it generate millions of formulas it might take half an hour. But if you set the hydrogen bound to some very small number such that 99,99% of the generated formulas do not satisfy this bound, you will still have to wait half an hour for the formulas but won't get that many back.
The same is, as you said, for additional rules like seven golden rules: You might generate millions of formulas but only keep a tiny subset due to filtering.

You could somehow simulate the existing getFinishedPercentage method. But it would be very inaccurate. And overall it just slows down your code because you synchronize every iteration step just for a counter... As said, other CDK method (like CycleFinger) don't use a progress bar, too.

For mass spectrometry use cases you would not want to fully decompose that large masses, anyways.

@tomas-pluskal
Copy link
Contributor

Well, yes, I understand all these arguments, but as a developer and a user of a GUI app, I'd say that progress feedback is a very important feature, often overlooked in libraries. And yes, I think CDK should have such feedback for other methods as well, especially if their runtime is substantial.

It's true that synchronizing threads might slow things down a bit, but I doubt it will be a significant difference (we could test that of course). Anyway, I'll take a closer look at how it is actually implemented, and I'll let you know if I get any idea. If not, let's leave it as it is.

@kaibioinfo
Copy link
Contributor Author

Okay, I removed all sort of classes and wrote everything into the RoundRobinDecomposer. Currently, I use top-level local classes. Is it better to use nested private static classes?

Regarding the getPercentage method: I now use the same implementation as the current decomposer and return how much of the search space we have already covered. However, this will be very inaccurate and is probably not much better than just guessing numbers. However, it will give you an increasing progress bar and it now pass all the tests.

@tomas-pluskal
Copy link
Contributor

Awesome, thanks!!

I wonder if it would make sense to leave FullEnumerationFormulaGenerator and RoundRobinFormulaGenerator as public classes, to let people choose one or the other for whatever reason they might have. What do you think, @johnmay ?

@johnmay
Copy link
Member

johnmay commented Aug 1, 2016

Currently, I use top-level local classes. Is it better to use nested private static classes?

Either is fine and updated patch looks good

I wonder if it would make sense to leave FullEnumerationFormulaGenerator and RoundRobinFormulaGenerator as public classes, to let people choose one or the other for whatever reason they might have.

The way I would personally preferred to do that is to have a parameter in the method. Can be done at a later stage but an enum with something like (both of you may have better naming): Automatic (default), Full, RoundRobin.

@johnmay johnmay merged commit 868adff into cdk:master Aug 1, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants