New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement round robin algorithm for decomposing molecular formulas #210
Conversation
* | ||
* @param <T> type of a single character in the alphabet | ||
*/ | ||
interface Alphabet<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there likely to be different Alphabets? I'd be tempted to pull these methods up to the single ChemicalAlphabet implementation and drop the interface.
Looks really great and nicely gets called automatically for existing uses, thanks! There's a test regression and some style/location adaptions would be appreciated.
I'm happy to merge once these are resolved. Egon might also want to take a look. |
Thanks, Kai! Is there really no way to estimate the progress of this algorithm? I tried briefly running it on a larger mass range (10,000 +- 0.003 Da), it took ~30 min to complete. Such runtimes with no feedback of progress are obviously quite annoying for GUI apps. Which method does the algorithm spend most time in? Is it |
@tomas-pluskal : The running time depends on the number of molecular formulas that are returned (so, basicially, the number of #getNextFormula calls). If you wait 30 minutes for decomposing a mass you will also get millions of formulas back. Writing them to file or display them in some GUI should take even more time. The question is why you would want to decompose millions of formulas. @johnmay : I kept the classes to keep the code mostly identical to the original code. However, there is no further reason to do so. I'm fine with removing the alphabet class and moving the other classes with package visibility into the formula package. |
@kaibioinfo: The 30 min run I mentioned did not result in millions of formulas, but only in 1510. And that is just a list of "raw" formulas, which can be further reduced by seven golden rules, isotope matching, etc. Therefore, I'd say it totally makes sense to let the tool run for 30 min and get all possible candidates. |
@tomas-pluskal hm, that sounds like a bug =( Can you give me the number and alphabet you used for that? |
Sure, I used this as input:
|
Ah, okay, I see the problem: The algorithm is input-sensitive when using no bounds. Especially bounds on hydrogen atoms don't give you anything (as it is in the current implementation of yours) as hydrogens are added at very last to the formula. So if you decompose some mass and it generate millions of formulas it might take half an hour. But if you set the hydrogen bound to some very small number such that 99,99% of the generated formulas do not satisfy this bound, you will still have to wait half an hour for the formulas but won't get that many back. You could somehow simulate the existing getFinishedPercentage method. But it would be very inaccurate. And overall it just slows down your code because you synchronize every iteration step just for a counter... As said, other CDK method (like CycleFinger) don't use a progress bar, too. For mass spectrometry use cases you would not want to fully decompose that large masses, anyways. |
Well, yes, I understand all these arguments, but as a developer and a user of a GUI app, I'd say that progress feedback is a very important feature, often overlooked in libraries. And yes, I think CDK should have such feedback for other methods as well, especially if their runtime is substantial. It's true that synchronizing threads might slow things down a bit, but I doubt it will be a significant difference (we could test that of course). Anyway, I'll take a closer look at how it is actually implemented, and I'll let you know if I get any idea. If not, let's leave it as it is. |
…t everything into one class with local access modifier
Okay, I removed all sort of classes and wrote everything into the RoundRobinDecomposer. Currently, I use top-level local classes. Is it better to use nested private static classes? Regarding the getPercentage method: I now use the same implementation as the current decomposer and return how much of the search space we have already covered. However, this will be very inaccurate and is probably not much better than just guessing numbers. However, it will give you an increasing progress bar and it now pass all the tests. |
Awesome, thanks!! I wonder if it would make sense to leave FullEnumerationFormulaGenerator and RoundRobinFormulaGenerator as public classes, to let people choose one or the other for whatever reason they might have. What do you think, @johnmay ? |
Either is fine and updated patch looks good
The way I would personally preferred to do that is to have a parameter in the method. Can be done at a later stage but an enum with something like (both of you may have better naming): Automatic (default), Full, RoundRobin. |
Hi,
I talked with Thomas Pluskal about implementing the molecular formula decomposer from SIRIUS into the CDK.
Motivation:
SIRIUS is using the round robin algorithm to decompose formulas. It can be around 10 to 1000 times faster than the current algorithm (depending on the input parameters). The key idea is to first compute a dynamic programming (DP) table that memoizes which masses are decomposable for a certain alphabet. Afterwards, enumerating all formulas with a certain mass is just backtracking this DP table. In comparison to the current enumeration algorithm, this will only check molecular formulas which really have the given mass (and, therefore, the running time depends only on the number of formulas in the result set and not on the mass itself). Round Robin is using an extended residue table (ERT), which can store this kind of information with few kilobytes of memory.
See
http://dx.doi.org/10.1093/bioinformatics/btm631
and
http://dx.doi.org/10.1007/978-3-642-40453-5_5
API changes:
Implementation Details:
A Bugfix to the current API: