New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read Optimised Data Structures #278
Conversation
I had a go at this 2 or 3 times... at the time I tried doing it an an IAtomContainer level, which generally was a pain, because of the invalidation during making changes to the container... also, and more importantly, it never really gave an significant speed up at computation level... that said, all those approaches did not collapse things first into a connectivity matrix, so perhaps not surprising... the issue was, in these approaches, that the overhead of extra bookkeeping was really less than the win of the better scalability for most data sets I played with... (do you have performance stats?) So, I'm happy to hear you're having a go at this. Downstream algorithms still need to know if they need to call the AtomRef.getAtomRefs() again, or is that the idea in the first place? Is that fast enough? |
Nice patch. It looks like a good alternative to a full rewrite. I can definitely see the utility in avoiding unnecessary bookkeeping code, and if the refs are keeping track of indices and adjacent bonds, I'm guessing that will improve performance. It'd be interesting to see some benchmarks. But even if there is no improvement (and no degradation) I'd support it on the basis of cleaner code. Nice work and I support bringing it up on the list. |
Yes that would be a pain, just to clarify as documented this does no book keeping and is specifically designed to be read only. Writes are v. rare - but do happen - as you say the user needs to know what to call. I would see this as an expert mode for those that need the read access speed.
I'll write a ring membership algorithm using one and then the other to show the times.
So I think what I do at the moment with the int[][] will still be optimised better - mainly due to memory layout but the win here is much cleaner code. My actual use case them prompted the clean up of how I do it currently was I need to add a custom comparator to the canonicalisation: Comparator<AtomRef> comp = new Comparator<AtomRef>(){
int compare(AtomRef a, AtomRef b) {
// compare bonds etc..
}
} |
I agree that this would be expert mode. However, I'm a fan of the 'one good way to do it' philosophy to toolkit/language design. Having two modes goes against that, but I can see why it might be necessary. I assume that writes will fail when using these ref objects? |
By writes I mean data structure write i.e. adding/deleting bonds. Set attributes is safe and supported. As Egon says detecting if something goes out of sync is pain - although technically possible. I might investigate, at the least I could have something similar to "concurrent modification exception". |
Runnable JAR: Results (+blog post) to follow: /*
* Copyright (c) 2017 John May <jwmay@users.sf.net>
*
* Contact: cdk-devel@lists.sourceforge.net
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU Lesser General Public License as published by
* the Free Software Foundation; either version 2.1 of the License, or (at
* your option) any later version. All we ask is that proper credit is given
* for our work, which includes - but is not limited to - adding the above
* copyright notice to the beginning of your source code files, and to any
* copyright notice that you may distribute with programs based on this work.
*
* This program is distributed in the hope that it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
* License for more details.
*
* You should have received a copy of the GNU Lesser General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
package org.openscience.cdk;
import joptsimple.OptionParser;
import joptsimple.OptionSet;
import joptsimple.OptionSpec;
import org.openscience.cdk.exception.InvalidSmilesException;
import org.openscience.cdk.graph.GraphUtil;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.interfaces.IBond;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
public class Benchmark {
private static OptionParser optpar = new OptionParser();
private static OptionSpec<Integer> stepSpec = optpar.accepts("step").withRequiredArg().ofType(Integer.class).defaultsTo(100000);
private static OptionSpec<String> filterSpec = optpar.accepts("filter").withRequiredArg().ofType(String.class).defaultsTo(".+");
private static OptionSpec<File> filesSpec = optpar.nonOptions().ofType(File.class);
private static abstract class MeasureAlgorithm {
long tInit = 0;
long tRun = 0;
long check = 0;
String desc;
public MeasureAlgorithm(String desc) {
this.desc = desc;
}
abstract void process(IAtomContainer mol);
}
private static MeasureAlgorithm[] algs = new MeasureAlgorithm[]{
new MeasureAlgorithm("AtomContainer\tDepthFirst\t") {
private void dfs(IAtomContainer mol, IAtom src, IBond prev) {
check++;
src.setFlag(CDKConstants.VISITED, true);
for (IBond bond : mol.getConnectedBondsList(src)) {
if (bond == prev)
continue;
IAtom dst = bond.getConnectedAtom(src);
if (!dst.getFlag(CDKConstants.VISITED))
dfs(mol, dst, bond);
}
}
@Override
void process(IAtomContainer mol) {
long t0 = System.nanoTime();
for (IAtom atom : mol.atoms())
atom.setFlag(CDKConstants.VISITED, false);
for (IAtom atom : mol.atoms())
if (!atom.getFlag(CDKConstants.VISITED))
dfs(mol, atom, null);
long t1 = System.nanoTime();
tRun += t1 - t0;
}
},
new MeasureAlgorithm("int[][]\tDepthFirst\tVisitFlag") {
private int dfs(IAtomContainer mol, int[][] adjlist, int src, int prev) {
check++;
IAtom atm = mol.getAtom(src);
atm.setFlag(CDKConstants.VISITED, true);
for (int dst : adjlist[src]) {
if (dst == prev)
continue;
IAtom nbr = mol.getAtom(dst);
if (!nbr.getFlag(CDKConstants.VISITED))
dfs(mol, adjlist, dst, src);
}
return 0;
}
@Override
void process(IAtomContainer mol) {
long t0 = System.nanoTime();
GraphUtil.EdgeToBondMap bmap = GraphUtil.EdgeToBondMap.withSpaceFor(mol);
int[][] adjlist = GraphUtil.toAdjList(mol, bmap);
long t1 = System.nanoTime();
for (IAtom atom : mol.atoms())
atom.setFlag(CDKConstants.VISITED, false);
for (int i = 0; i < mol.getAtomCount(); i++) {
IAtom atom = mol.getAtom(i);
if (!atom.getFlag(CDKConstants.VISITED))
dfs(mol, adjlist, i, i);
}
long t2 = System.nanoTime();
tInit += t1 - t0;
tRun += t2 - t1;
}
},
new MeasureAlgorithm("int[][]\tDepthFirst\tVisitArray") {
private int dfs(IAtomContainer mol, int[][] adjlist, int src, int prev, boolean[] visit) {
check++;
visit[src] = true;
for (int dst : adjlist[src]) {
if (dst == prev)
continue;
if (!visit[dst])
dfs(mol, adjlist, dst, src, visit);
}
return 0;
}
@Override
void process(IAtomContainer mol) {
long t0 = System.nanoTime();
GraphUtil.EdgeToBondMap bmap = GraphUtil.EdgeToBondMap.withSpaceFor(mol);
int[][] adjlist = GraphUtil.toAdjList(mol, bmap);
long t1 = System.nanoTime();
boolean[] visit = new boolean[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
if (!visit[i]) dfs(mol, adjlist, i, i, visit);
}
long t2 = System.nanoTime();
tInit += t1 - t0;
tRun += t2 - t1;
}
},
new MeasureAlgorithm("AtomRef\tDepthFirst\tVisitFlag") {
private final int dfs(AtomRef src, BondRef prev) {
check++;
src.setFlag(CDKConstants.VISITED, true);
for (BondRef bond : src.getBonds()) {
if (bond == prev)
continue;
AtomRef dst = bond.getConnectedAtom(src);
if (!dst.getFlag(CDKConstants.VISITED))
dfs(dst, bond);
}
return 0;
}
@Override
void process(IAtomContainer mol) {
long t0 = System.nanoTime();
AtomRef[] arefs = AtomRef.getAtomRefs(mol);
long t1 = System.nanoTime();
for (IAtom atom : mol.atoms())
atom.setFlag(CDKConstants.VISITED, false);
for (AtomRef aref : arefs)
if (!aref.getFlag(CDKConstants.VISITED))
dfs(aref, null);
long t2 = System.nanoTime();
tInit += t1 - t0;
tRun += t2 - t1;
}
},
new MeasureAlgorithm("AtomRef\tDepthFirst\tVisitArray") {
private int dfs(AtomRef src, BondRef prev, boolean[] visit) {
check++;
visit[src.getIndex()] = true;
for (BondRef bond : src.getBonds()) {
if (bond == prev)
continue;
AtomRef dst = bond.getConnectedAtom(src);
if (!visit[dst.getIndex()])
dfs(dst, bond, visit);
}
return 0;
}
@Override
void process(IAtomContainer mol) {
long t0 = System.nanoTime();
AtomRef[] arefs = AtomRef.getAtomRefs(mol);
long t1 = System.nanoTime();
boolean[] visit = new boolean[mol.getAtomCount()];
for (AtomRef aref : arefs)
if (!visit[aref.getIndex()])
dfs(aref, null, visit);
long t2 = System.nanoTime();
tInit += t1 - t0;
tRun += t2 - t1;
}
},
new MeasureAlgorithm("AtomContainer\tRelaxation\t") {
@Override
void process(IAtomContainer mol) {
long t0 = System.nanoTime();
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) {
for (int j = 0; j < mol.getAtomCount(); j++) {
IAtom atom = mol.getAtom(j);
for (IBond bond : mol.getConnectedBondsList(atom)) {
IAtom nbr = bond.getConnectedAtom(atom);
next[j] += prev[mol.getAtomNumber(nbr)];
}
}
System.arraycopy(next, 0, prev, 0, next.length);
}
long t1 = System.nanoTime();
tRun += t1 - t0;
for (int aNext : next) check += aNext;
}
},
new MeasureAlgorithm("AtomContainer\tRelaxation\tImproved") {
@Override
void process(IAtomContainer mol) {
long t0 = System.nanoTime();
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) {
for (IBond bond : mol.bonds()) {
IAtom beg = bond.getAtom(0);
IAtom end = bond.getAtom(1);
int begIdx = mol.getAtomNumber(beg);
int endIdx = mol.getAtomNumber(end);
next[begIdx] += prev[endIdx];
next[endIdx] += prev[begIdx];
}
System.arraycopy(next, 0, prev, 0, next.length);
}
long t1 = System.nanoTime();
tRun += t1 - t0;
for (int aNext : next) check += aNext;
}
},
new MeasureAlgorithm("AtomRef\tRelaxation\t") {
@Override
void process(IAtomContainer mol) {
long t0 = System.nanoTime();
AtomRef[] arefs = AtomRef.getAtomRefs(mol);
long t1 = System.nanoTime();
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) {
for (AtomRef aref : arefs) {
int idx = aref.getIndex();
for (BondRef bond : aref.getBonds()) {
next[idx] += prev[bond.getConnectedAtom(aref).getIndex()];
}
}
System.arraycopy(next, 0, prev, 0, next.length);
}
long t2 = System.nanoTime();
tInit += t1 - t0;
tRun += t2 - t1;
for (int aNext : next) check += aNext;
}
},
new MeasureAlgorithm("BondRef\tRelaxation\t") {
@Override
void process(IAtomContainer mol) {
long t0 = System.nanoTime();
BondRef[] bonds = BondRef.getBondRefs(mol);
long t1 = System.nanoTime();
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) {
for (BondRef bond : bonds) {
int begIdx = bond.getAtom(0).getIndex();
int endIdx = bond.getAtom(1).getIndex();
next[begIdx] += prev[endIdx];
next[endIdx] += prev[begIdx];
}
System.arraycopy(next, 0, prev, 0, next.length);
}
long t2 = System.nanoTime();
tInit += t1 - t0;
tRun += t2 - t1;
for (int aNext : next) check += aNext;
}
}
};
public static void main(String[] args) {
long t0 = System.nanoTime();
OptionSet optset = optpar.parse(args);
int step = optset.valueOf(stepSpec);
String filter = optset.valueOf(filterSpec);
int pos = 0;
int num = 0;
Pattern pattern = Pattern.compile(filter);
while (pos < algs.length) {
if (pattern.matcher(algs[pos].desc).find()) {
algs[num++] = algs[pos];
}
pos++;
}
algs = Arrays.copyOf(algs, num);
System.err.println("Filter: " + filter);
System.err.printf("%20s\t%20s\t%-35s\t%15s\t%15s\t%15s\t%15s\t%15s\t%s\n",
"DataSet",
"DataRange",
"DataStruct\tAlgorithm\tTweak",
"tElap",
"tElapSlower",
"tInit",
"tRun",
"tRunSlower",
"CheckSum");
for (File file : filesSpec.values(optset))
try (InputStream in = new FileInputStream(args[0]);
Reader rdr = new InputStreamReader(in, StandardCharsets.UTF_8);
BufferedReader brdr = new BufferedReader(rdr)) {
SmilesParser smipar = new SmilesParser(SilentChemObjectBuilder.getInstance());
smipar.kekulise(false); // not needed here
int count = 0;
int linenum = 0;
String line;
while ((line = brdr.readLine()) != null) {
++count;
++linenum;
if (count == step) {
report(file, count, linenum);
count=0;
}
try {
IAtomContainer mol = smipar.parseSmiles(line);
for (MeasureAlgorithm alg : algs) {
alg.process(mol);
}
} catch (InvalidSmilesException e) {
System.err.println("SMILES ERROR: " + e.getMessage());
}
}
report(file, count, linenum);
} catch (IOException e) {
}
long t1 = System.nanoTime();
System.err.printf("Total Elapsed Time: %.2f s\n", (t1 - t0) / 1e9);
}
private static void report(File file, int count, int linenum) {
long bestElap = Long.MAX_VALUE;
long bestRun = Long.MAX_VALUE;
for (MeasureAlgorithm alg : algs) {
if (alg.tRun < bestRun)
bestRun = alg.tRun;
if (alg.tInit+alg.tRun < bestElap)
bestElap = alg.tInit+alg.tRun;
}
for (MeasureAlgorithm alg : algs) {
System.err.printf("%20s\t%20s\t%-35s\t%15s\t%15s\t%15s\t%15s\t%15s\t%d\n",
file.getName(),
String.format("%d-%d", linenum - count, linenum),
alg.desc,
String.format("%.1f", (alg.tInit + alg.tRun) / 1e6),
String.format("%.1fx", (alg.tInit + alg.tRun) / (double) bestElap),
String.format("%.1f", alg.tInit / 1e6),
String.format("%.1f", alg.tRun / 1e6),
String.format("%.1fx", alg.tRun / (double) bestRun),
alg.check);
alg.tInit = 0;
alg.tRun = 0;
}
}
} |
Impressive! Looking forward to the blog post |
It's a win over even for trivial algorithms like DFS. In reality much more traversal is happening and since you only need to do the init once it's huge wins for algorithms like canonicalisation, substructure search, layout, etc... The Relaxation algorithm is morgan/canonical labelling. Summary: DepthFirstSearch
Scroll Right-->
Relaxation
Scroll Right-->
|
The main thing I wanted to test here was how much slower AtomRef/BondRef was than the int[][]. It turns out to be slightly faster to process with int[][] but slower to initialize (I could speed it up - I don't actually use the |
Nice! |
…ntainerRef that provides improved performance on the existing API.
4683f4e
to
c92ff06
Compare
Looks good. I've also had to use the approach of pre-computing a table of connections and this looks like a much neater way of doing it. |
Having thought more about this I think I do some API acrobatics and fit it in portably with the current IAtom/IAtomContainer... more to follow (will probably be a different pull request) |
Replaced by #368... ek this was back in March! |
IMPORTANT: Proof of concept only (do not apply)
The CDK data structures are write optimised - neither atoms or bonds no about the molecule which they belong and their index. This is efficient for generating structures as atoms and bonds can be re-usued. However this makes traversal painfully inefficient |E|^2 instead of |V|.
In practise object creation is cheap having to re-create atoms/bonds should not be optimised - unfortunately this would be a very destructive change (/rewrite). To avoid the inefficient traversals in the past I introduced and the GraphUtil API. This is still less efficient than just storing the connection table like this but reduces and |E|^2 traversal to an |E|+|V| (|E|=build cost).
This has worked successfully for many algorithms but is clunky to use, accessing bonds was also nasty by a secondary map instance:
This patch introduces two new class
AtomRef
andBondRef
that reference atoms and bonds in anAtomContainer
instance. Each ref stores the index in the container and the adjacency of the bonds. The class implement theIBond
andIAtom
interfaces and pass all methods through to an underlying implementation. This means you can use the atoms and bonds with existing method.I think I also want to integrate stereo chemistry and may well also provide an
AtomContainerRef
.@egonw @rajarshi let me know your thoughts then will put it to mailing list. I know in particularly Till of Scaffold Hunter use to have a fork with faster traversal - this should help him but providing a standard API to do it.