New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
extracting the substructure patterns of CircularFingerprint. #231
Conversation
Made some minor clean up, some other quick questions. For uncharged atoms it should probably output '+0' for the charge layer. This makes all the atoms 'complex' SMARTS expressions but is more correct. Likewise, a single bond should be output between aromatic atoms if it's not aromatic. |
For CCC shouldn't the SMARTS produced be:
are the same? Explicit SMARTS might capture more the features:
|
@ntk73 knows better, but we might not have all the info in place (from the internal CircularFingerprints structures) in order to generate SMARTS with some level of details. Regarding explicit SMARTS |
Keeping it simple seems to be easier, though some patterns might be mapped to the wrong molecule, e.g. depending if CCC is a chain or might contain a branch. Then again, since also all methods learning on such patterns would not distinguish, this is what the generalization is about. Thus, the simpler version might just do. If more details are required I would provide an option to put it into the $(environment) of the SMARTS matching routine and make it an optional choice. |
Okay so a quick question, are these SMARTS meant to be used for building models or the matching molecules? @vedina I realised at the gym, that the internal structures are actually cause a bit of a problem. Is the intention these patterns are searchable because that's not possible? Alex implemented the class as a standalone and thus uses it's own internal aromaticity model. Here are the SMARTS for indole:
Although in theory you can use any aromaticity definition in SMARTS, it's best to use Daylight's definition. This ensures portability between implementations and is the default in CDK, am guessing AMBIT's SMARTS? You can work around it but it's a pain. Separate this out would decouple it from the FP internals and still be correct. @joergkurtwegner circular fingerprints do better on benchmarks than path based fingerprints precisely because they distinguish branches. In fact you can encode heavy degree in a path based fingerprint it will perform similarly to circular ones! Side note: OEChem has a nice API for doing this: Given the more precise SMARTS it can be transformed to the more loose form but not vice versa. In it's current implementation the SMARTS will actually distinguish branching: e.g. CC(C)C. |
Here's the SMARTS if you use daylight atom:
|
@johnmay the use case is that models are typically built directly with (ECFP) fingerprints, and SMARTS are for matching molecules after the models identify the "best" fingerprints (i.e. making the models interpretable). Regarding internal structures - neither me nor @ntk73 are original authors of the (this thread aside, I agree one can get comparable performance with path fingerprints if encoding branching :) ) |
Then using the molecular rather than the fingerprint aromaticity is preferable I think? |
I made the modification to print the output listed above. This also means it no longer needs to be in the same class. |
John, do I understand right you already have the modification to use the CDK aromaticity when printing the SMARTS ( could you point to the code) ? May be it's best to have both as different options? Otherwise, fine to have the printing in a separate class. (having in mind that SMARTS may be in principle visualised by arbitrary tools, not necessarily CDK, things might get even more inconsistent than discussed here) |
If you use the Do not use this! It is not correct and does not match exactly SmilesParser smipar = new SmilesParser(SilentChemObjectBuilder.getInstance());
IAtomContainer mol = smipar.parseSmiles("[nH]1ccc2c1cccc2");
// XXX! doesn't not match the semantics of CircularFingerprint aromticity model
Aromaticity arom = new Aromaticity(ElectronDonation.piBonds(),
Cycles.all(6));
// set up SMARTS invariants
SmartsMatchers.prepare(mol, true);
Pattern ptrn = Pattern.findSubstructure(SMARTSParser.parse("N1C=Cc(c1c:a):a", null));
System.err.println(ptrn.matches(mol));
arom.apply(mol);
System.err.println(ptrn.matches(mol)); I really thing the correct option is just to write out the aromaticity flags on the mol and not in the fingerprint. Otherwise it's somewhat of a house of cards. |
@johnmay - I am probably missing something, but how one can use either the |
Sorry I realised I didn't answer the question - that example was if in the current form what you'd have to do to use the SMARTS produced. Here's the changes: in
in
|
John, I think that it is possible (although rare) to have some side effects from you last changes. We intentionally inclined to use the locally defined aromaticity within SMARTS generation because it is used for the fragments identification as well. |
... or to have the same fingerprint associated with different SMARTS for different molecules. (but as said noisy descriptors are not a big deal. and we are not going to solve the different chemistry across all toolkits anyway, so let's have what's more appropriate from the CDK point of view. ) |
@ntk73 Indeed but I think this is actually what you want. Fingerprints can set multiple bits for the same parts, Daylight for example set a variable number depending on the size of the fragment, RDKIt sets a fixed number. It's actually kind of a bug that CDK doesn't do it in the path based fingerprint. Andrew Dalke has a write up somewhere on this... |
Hi all, sorry for being late to the party... when we pulled in the CircularFingerprint, they expressed a very strong wish to not swap internal algorithms, so that their validated implementation matches exactly the upstream implementation. Changing SMARTS, the aromaticity models, etc, clearly would change the calculated results, and making it break from the expected (upstream) results... when they pushed, I indicated to prefer it would reuse existing algorithms from the CDK library for various properties, but they insisted on using their own for the above reasons. Please make sure to include Alex Clark in this discussion! |
OK, @aclarkxyz , please comment (I hope it's the right github account). We are not changing the aromaticity model (or anything else) of generating fingerprints, just assigning SMARTS for each fingerprint for interpretability. I am afraid if we want the generated SMARTS to follow the internal |
What @egonw said. It's absolutely critical that the CircularFingerprint results never change: if they start giving different results, then the entire value proposition goes down the drain. Because they don't make use of any potentially evolving library algorithms like the general purpose aromaticity detector code, and means you don't have any additional restrictions on what you can or can't do with them. |
@aclarkxyz this pull request implements an extension, which does not change the CircularFingerprint, but adds a function to generate SMARTS per fingerprint. Could you please comment on this? If it's not OK to add the function to this class, we may move the extension into a new subclass. |
Wouldn't it be a much better idea to put the SMARTS features into a different file? The circular fingerprinter implementation is already fairly long, is tightly self-contained, and does exactly one thing (and meticulously carefully, might I add). Adding a bunch of peripheral functionality to it doesn't seem like the best call, especially since there's no way to make sure that the additions are actually describing the same concepts ("close enough" is as close as you will ever get, so I don't think it qualifies as core content). |
Nina/Nikolay are you happy for me move the code to a separate class and use the models aromaticity flags: |
Unless I'm missing something from only having skim-read the code, what you have is basically {molecule, atom indices} -> {SMARTS string}, with a side-note that the output should try to be compatible with the CircularFingerprinter's aromaticity model, if possible (which it isn't, so I wouldn't worry too much about it). That sounds like a very generic algorithm that doesn't necessarily need to be associated with this particular fingerprint at all, except by historical coincidence. |
@aclarkxyz kind of... locking in valence, connectivity, and charge as discussed above, it could be general. However it it's current implementation it is specific to circular fingerprints as includes the next iterations neighbour information. For example for benzene the iteration (ECFP0) will have a single atom index but SMARTS will have three atoms.
|
In principle SMARTS generation is a generic algorithm, but in this case it's indeed quite locked to the internal representation. Otherwise it would be nice to have a common interface for such substructure generating functionality for other fingerprints as well, I'm happy to help if we go this way. |
Already extracted Nina, but I'm now wondering with Alex's comment if this is a bit specific and is better kept down stream, JCompoundMapper/AMBIT? I can integrate the SMARTS generation parts which would simplify the method but since the API in the CircularFingerprint already exposes which atom indices are encoded it's a simple transformation from that information to the SMARTs. It's difficult to judge whether this is library functionality or application specific. Thoughts? |
Well, I don't think it specific, it's core functionality of a fingerprint, to be able to see the substructure of each fingerprint (I admit this is not always possible). I definitely think this is NOT to be handled downstream. JCompoundMapper has own implementation of both circular fingerprint and SMARTS generation, so it is not at all interested in such functionality. I would say it's vice versa, theoretically one can port the many JCompoundMapper fingerprints into CDK and ensure they all have the "interpretability" feature. Otherwise we have few other fingerprints in mind where we would like to make them "interpretable" and I was initially looking for implementing a common interface, but learned no such interface exist at the moment in CDK. |
The circular fingerprints are already interpretable: atom indices for each fingerprint means you already have the information you need to do some pretty interesting things (e.g. https://cheminf20.org/2015/08/31/visualisation-of-structure-activity-models-fudging-it-with-a-widget/). That's the kind of low level minimalism you get from the algorithm itself, because it can't be done anywhere else, and doesn't add anything that isn't essential. Converting that into SMARTS is very much an interoperability feature that's not at all part of the core functionality... and there could be dozens of other use cases for it. Seems to me like it should be named accordingly, e.g. "SubfragmentQuery" or somesuch. If other fingerprints could be minimally augmented so that they reveal atom indices, then they could be channelled into the new feature. (The reason for this abstraction would be the the molecule+indices definition is literally correct; whereas shoehorning that into SMARTS is technically incorrect - for this algorithm, at least - and is a high level interpretation with a failure rate that may or may not be acceptable for a given use case... that's a decision to be made elsewhere, not inside the fingerprinter library itself.) |
@johnmay , I have slightly modified the SubstructureSmarts extractor to optionally also include information about valency/degree, implicit H count and atom-mapping, so it can be used to extract reaction rules. Let me just find out how I can share it, as I have not used github yet to push code. |
…e class. Copies exact functionality ATM.
…bonds, it is redundant.
…to find (start with 'Smarts' prefix). Introduce the MODE_EXACT to specify setting the exact valence. We use int constants rather than enums to avoid adding two classes to the public API.
59dd8aa
to
144f4d7
Compare
Sorry for delay - all done now, rebased on master, renamed the class, and added the option (the default) to produce correct SMILES. To match the behaviour discussed in this thread MODE_JCOMPOUNDMAPPER can be set. |
@egonw you okay to do the merge on this one? - I did a lot of modifying in the end. |
I'll look at it today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only major thing is the missing CLW header in one of the files. Others I leave to the authors to decide on.
@@ -0,0 +1,126 @@ | |||
package org.openscience.cdk.fingerprint; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the copyright header here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done - missed that on the first time through
* needs to match the nitrogen. | ||
* | ||
* <p><b>Basic Usage:</b></p> | ||
* <pre>{@code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "{@code"? Should we start using that for all JavaDoc? If so, sounds like a nice Junior Job...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be already used. Main advantage IIRC is you can write:
<pre>{@code
List<IAtom> atoms = new ArrayList<IAtom>();
}</pre>
instead of
<pre>
List<IAtom> atoms = new ArrayList<IAtom>();
</pre>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant, should we check all code examples and add it where it is not used yet? (not for this PR!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the only difference, you can find them with this:
find ~/workspace/github/cdk -name '*.java' -exec grep -H '<' {} \;
/Users/john/workspace/github/cdk/cdk/app/depict/src/test/java/org/openscience/cdk/depict/SvgDrawVisitorTest.java: + " <text x='50.0' y='50.0' fill='#FF0000' text-anchor='middle'>PNG < EPS < SVG</text>\n"
org/openscience/cdk/graph/AllPairsShortestPaths.java: * for (int i = 0; i < benzene.getAtomCount(); i++) {
org/openscience/cdk/graph/AllPairsShortestPaths.java: * for (int j = i + 1; j < benzene.getAtomCount(); j++) {
org/openscience/cdk/graph/PathTools.java: * If number of atoms in or below sphere x<max and number of atoms in or below sphere x+1>max then
org/openscience/cdk/graph/TripletShortCycles.java: * ESSSR <br/>(fingerprints only store cycles |C| <=
org/openscience/cdk/isomorphism/Mappings.java: * Iterable<String> strs = mappings.map(new Function<int[], String>() {
org/openscience/cdk/isomorphism/Mappings.java: * for (int i = 0; i < input.length; i++) {
org/openscience/cdk/isomorphism/Mappings.java: * for (Map<IAtom,IAtom> map : mappings.toAtomMap()) {
org/openscience/cdk/isomorphism/Mappings.java: * for (Map.Entry<IAtom,IAtom> e : map.entrySet()) {
org/openscience/cdk/isomorphism/Mappings.java: * for (Map<IBond,IBond> map : mappings.toBondMap()) {
org/openscience/cdk/isomorphism/Mappings.java: * for (Map.Entry<IBond,IBond> e : map.entrySet()) {
org/openscience/cdk/isomorphism/Mappings.java: * for (Map<IChemObject,IChemObject> map : mappings.toBondMap()) {
org/openscience/cdk/isomorphism/Mappings.java: * for (Map.Entry<IChemObject,IChemObject> e : map.entrySet()) {
org/openscience/cdk/reaction/ReactionSpecification.java: * expected to be <dictionaryNameSpace>:<entryID>.
org/openscience/cdk/qsar/DescriptorSpecification.java: * expected to be <dictionaryNameSpace>:<entryID>.
org/openscience/cdk/qsar/DescriptorSpecification.java: * expected to be <dictionaryNameSpace>:<entryID>.
org/openscience/cdk/signature/MoleculeSignature.java: * List<Orbit> orbits = moleculeSignature.calculateOrbits();
/Users/john/workspace/github/cdk/cdk/doc/javadoc/source/XMIDoclet.java: * into <UML:Class> elements.
/Users/john/workspace/github/cdk/cdk/doc/javadoc/source/XMIDoclet.java: * Method that serializes ClassDoc objects into <UML:Class> elements.
/Users/john/workspace/github/cdk/cdk/doc/javadoc/source/XMIDoclet.java: * into <listitem> elements (Umbrello specific)?.
/Users/john/workspace/github/cdk/cdk/doc/javadoc/source/XMIDoclet.java: * Method that serializes ClassDoc objects into <listitem> elements
org/openscience/cdk/math/qm/GaussiansBasis.java: * S = <phi_i|phi_j><br>
org/openscience/cdk/math/qm/GaussiansBasis.java: * J = <d/dr phi_i | d/dr phi_j><br>
org/openscience/cdk/math/qm/GaussiansBasis.java: * V = <phi_i | 1/r | phi_j><br>
org/openscience/cdk/math/qm/IBasis.java: * Calculate the overlap integral S = <phi_i|phi_j>.
org/openscience/cdk/math/qm/IBasis.java: * Calculates the impulse J = -<d/dr chi_i | d/dr chi_j>.
org/openscience/cdk/math/qm/IBasis.java: * Calculates the potential V = <chi_i | 1/r | chi_j>.
org/openscience/cdk/geometry/alignment/KabschAlignment.java: * for (int i = 0; i < ac1.getAtomCount(); i++) {
org/openscience/cdk/io/MDLRXNWriter.java: * > <key><br>
org/openscience/cdk/io/MDLV2000Reader.java: * not currently return field numbers (e.g. DT<n>).
org/openscience/cdk/io/ReaderFactory.java: * StringReader stringReader = "<molecule/>";
/Users/john/workspace/github/cdk/cdk/storage/io/src/test/java/org/openscience/cdk/io/cml/JmolTest.java: * <ul><li> <crystal></li></ul>
org/openscience/cdk/io/FormatFactory.java: * StringReader stringReader = new StringReader("<molecule/>");
org/openscience/cdk/libio/cml/Convertor.java: * @param useCMLIDs Uses object IDs like 'a1' instead of 'a<hash>'.
org/openscience/cdk/normalize/Normalizer.java: * <replace-set><br>
org/openscience/cdk/normalize/Normalizer.java: * <replace>O=N=O</replace><br>
org/openscience/cdk/normalize/Normalizer.java: * <replacement>[O-][N+]=O</replacement><br>
org/openscience/cdk/normalize/Normalizer.java: * </replace-set><br>
org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java: * System with numbers wrapped in <sub></sub> tags. Useful for
org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java: * System with numbers wrapped in <sub></sub> tags and the
org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java: * isotope of each Element in <sup></sup> tags and the total
org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java: * charge of IMolecularFormula in <sup></sup> tags. Useful for
org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java: * wrapped in <sub></sub> tags and the isotope of each Element
org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java: * in <sup></sup> tags and the total showCharge of IMolecularFormula
org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java: * in <sup></sup> tags. Useful for displaying formulae in Swing
org/openscience/cdk/smiles/smarts/SMARTSQueryTool.java: * for (int i = 0; i < nmatch; i++) {
org/openscience/cdk/smiles/smarts/SMARTSQueryTool.java: * by Craig James the <code>h<n></code> SMARTS pattern should not be used. It was included in the Daylight spec
org/openscience/cdk/smiles/smarts/SMARTSQueryTool.java: * for backwards compatibility. To match hydrogens, use the <code>H<n></cod> pattern.</li> <li>The wild card
net/sf/cdk/tools/bibtex/BibTeXMLFile.java: * Returns an Iterator<BibTeXMLEntry>.
public final class SmartsFragmentExtractor { | ||
|
||
/** | ||
* Sets the mode of the extractor to produce SMARTS similar to JCompoundMapper. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a @cdk.cite to the JCompoundMapper paper.
+ license header + citation + resync bib with upstream
Clean up of #224