Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AtomPairs2DFingerprint #417

Merged
merged 4 commits into from Feb 15, 2018
Merged

Conversation

DataSciBurgoon
Copy link
Contributor

Adds new fingerprint from PaDEL -- AtomPairs2DFingerprint. My collaborators want to use CDK, but they have a need for this particular fingerprint to be present. In order to get this to build and test under maven I had to change the version in the pom.xml to LATEST. That also required downloading of other items -- which are here. I'm more than happy to drop some of these unnecessary files from the repo if you want.

@johnmay
Copy link
Member

johnmay commented Feb 5, 2018

I'll have to review this offline as the diff is too big for github. For starters can you fix the build (don't use LATEST in the parent version of cdk-fingerprint/pom.xml) also delete the OS X .DS_store files from finder.

@johnmay
Copy link
Member

johnmay commented Feb 5, 2018

Okay

  • looks like you've accidentally unzipped the circularvalidation.zip
  • please replace TopologicalMatrix, use AllPairsShortestPaths. You then don't need to the dependency on qsar-molecular and it will be faster
  • do you need it to match the PADEL fingerprint exactly... I can write a similar but much better one if you'd like?

@johnmay
Copy link
Member

johnmay commented Feb 5, 2018

Oh and also license file needs to be LGPL.

@johnmay
Copy link
Member

johnmay commented Feb 5, 2018

@egonw will be able to clarify if we can add include the comment about it being public domain as well, open sources licenses are complicated with this respect.

@DataSciBurgoon
Copy link
Contributor Author

Got it. I'll make the changes.

Unfortunately, it does need to match PaDEL exactly -- all of my collab's historical QSAR models use it, but their client wants them to 1) stop using the original PaDEL software, 2) use CDK, and 3) still use that same fingerprint from PaDEL even though it's not in CDK...and then compare its performance to things that are in CDK.

@johnmay
Copy link
Member

johnmay commented Feb 5, 2018

Okay I might add an option to improve the hashing then... running on the Briem-Lessel activity benchmark (https://link.springer.com/article/10.1023%2FA%3A1008793325522) it does quite poorly. Here are the results, for comparison Daylight Path (Fingeprint.java in CDK) get ~65%, ECFP4 (CircularFingerprint in CDK) gets about ~78%.

$ duster benchmark -p atompair_fp.fps 
[duster/benchmark] k=10, measure=Tanimoto, 
[duster/benchmark] Loaded fingerprints in 0s9ms
[duster/benchmark] 5 classes:  ACE (39) TXA2 (48) HMG (109) PAF (132) 5HT3 (47)
[duster/benchmark] Scored fingerprints in 0s45ms
ACE: 51%
     Score(i): 0.63, 0.60, 0.53, 0.55, 0.50, 0.43, 0.48, 0.60, 0.43, 0.38
 CumuScore(i): 0.63, 0.61, 0.58, 0.58, 0.56, 0.54, 0.53, 0.54, 0.52, 0.51
  Dist(i) Min: 0.66, 0.65, 0.74, 0.62, 0.72, 0.76, 0.75, 0.60, 0.76, 0.67
  Dist(i) Max: 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 0.97, 0.95, 0.95
  Dist(i) Avg: 0.94, 0.92, 0.92, 0.90, 0.91, 0.94, 0.92, 0.87, 0.91, 0.90
TXA2: 50%
     Score(i): 0.63, 0.53, 0.59, 0.61, 0.45, 0.35, 0.47, 0.57, 0.47, 0.37
 CumuScore(i): 0.63, 0.58, 0.59, 0.59, 0.56, 0.53, 0.52, 0.53, 0.52, 0.50
  Dist(i) Min: 0.66, 0.66, 0.68, 0.66, 0.56, 0.65, 0.54, 0.62, 0.53, 0.53
  Dist(i) Max: 1.00, 1.00, 0.91, 0.90, 0.89, 0.89, 0.89, 0.88, 0.87, 0.86
  Dist(i) Avg: 0.91, 0.86, 0.83, 0.83, 0.84, 0.81, 0.80, 0.79, 0.75, 0.76
HMG: 77%
     Score(i): 0.84, 0.80, 0.76, 0.78, 0.71, 0.78, 0.81, 0.74, 0.79, 0.73
 CumuScore(i): 0.84, 0.82, 0.80, 0.80, 0.78, 0.78, 0.78, 0.78, 0.78, 0.77
  Dist(i) Min: 0.74, 0.74, 0.68, 0.65, 0.63, 0.63, 0.63, 0.66, 0.66, 0.44
  Dist(i) Max: 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 0.96, 0.96, 0.96
  Dist(i) Avg: 0.95, 0.93, 0.91, 0.90, 0.89, 0.88, 0.88, 0.87, 0.87, 0.86
PAF: 48%
     Score(i): 0.68, 0.55, 0.50, 0.41, 0.46, 0.47, 0.45, 0.45, 0.41, 0.44
 CumuScore(i): 0.68, 0.61, 0.57, 0.53, 0.52, 0.51, 0.50, 0.50, 0.49, 0.48
  Dist(i) Min: 0.66, 0.66, 0.67, 0.66, 0.66, 0.61, 0.61, 0.58, 0.58, 0.58
  Dist(i) Max: 1.00, 0.96, 0.96, 0.96, 0.94, 0.94, 0.94, 0.94, 0.93, 0.93
  Dist(i) Avg: 0.91, 0.88, 0.87, 0.84, 0.84, 0.81, 0.80, 0.81, 0.79, 0.78
5HT3: 53%
     Score(i): 0.65, 0.60, 0.58, 0.54, 0.54, 0.50, 0.56, 0.48, 0.38, 0.44
 CumuScore(i): 0.65, 0.63, 0.61, 0.59, 0.58, 0.57, 0.57, 0.56, 0.54, 0.53
  Dist(i) Min: 0.80, 0.75, 0.71, 0.73, 0.72, 0.70, 0.74, 0.71, 0.68, 0.67
  Dist(i) Max: 1.00, 1.00, 1.00, 0.91, 0.91, 0.88, 0.87, 0.87, 0.84, 0.84
  Dist(i) Avg: 0.94, 0.90, 0.89, 0.85, 0.84, 0.82, 0.82, 0.80, 0.78, 0.77
[duster/benchmark] null: 57.8% (55.9% avg-of-avg)

@DataSciBurgoon
Copy link
Contributor Author

DataSciBurgoon commented Feb 5, 2018 via email

@DataSciBurgoon
Copy link
Contributor Author

RE: LGPL and Public Domain: My understanding is that due to the nature of public domain software, it can be included in anything (b/c it has 0 restrictions). So, we can include public domain code into a LGPL licensed product, so long as the public domain part is identified as such. What we can't do is go the other way and put LGPL into public domain and then treat it as public domain.

This link discusses this with respect to GPL, but I wouldn't think the differences would be significant in this respect: https://www.gnu.org/licenses/old-licenses/gpl-2.0-faq.en.html#GPLUSGovAdd

@johnmay
Copy link
Member

johnmay commented Feb 6, 2018

Almost there, could you squish down the commits (see http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html) then I could annotate specific comments via GitHUb. Also have attach a patch file optimising and fixing the comments below.

Patch File: apfp.diff
Apply with: git apply apfp.txt

  • not quite sure on the halogen logic, I think the original code would map ClC as bother Cl and X
  • if the molecule is less than 11 atoms you're using getSymbol() otherwise getAtomTypeName(), the GetSymbol is more correct, but factor this out to an encode() method
  • simplified loop and produced a raw FP, easier for debugging
  • added test for the halogens, I think the logic matches PaDEL there
  • the count fingerprint API was not completely used correctly
  • made it thread safe
  • change alAtomPairs to a map
  • the loop logic is not correct, in the AllPairsShortestPaths the second argument is distance
  • minor: some unused variables
  • minor: formatting using tabs, I can clean that up later

@egonw
Copy link
Member

egonw commented Feb 7, 2018

It's fine if the code is Public Domain (it must still have a clear header the authors and that it is PD). But any PD code can be used in any LGPL project. We have more public domain bits in the CDK.

Copy link
Member

@egonw egonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor things left, I think... and what John said, please remove the .DS_Store/ folders :)

I support the idea of the squashing of commits.

* Note: I adapted this fingerprint from Yap Chun Wei's PaDEL source code, which can be found here:
* http://www.yapcwsoft.com/dd/padeldescriptor/
*
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a clear copyright/license statement. See many other source code files, but it must include:

  • who wrote the code (so, that includes at least Lyle and Yap)
  • the original code's license applies, but since that is public domain (PaDEL website: "The source code is released as public domain."), you actually in this rare occasion the right to set a license (but PD is fine)

return dist + "_" + a.getSymbol() + "_" + b.getSymbol();
}

private static String encodeHalPath(int dist, IAtom a, IAtom b) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally like to have some brief documentation on private methods too, to support maintenance of the code...

/* This work is the product of a US Government employee as part of his/her regular duties
* and is thus in the public domain.
*
* Author: Lyle D. Burgoon, Ph.D.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a community standard to add your email address too. The reason for that is, if someone would even like to change the license (or resolve some copyright question), there is some guidance how to reach the author. There are legal obligations to do ones best to contact original authors, and this custom is to help out a bit.

@DataSciBurgoon
Copy link
Contributor Author

I'm making changes now, and then I'll squash the commits again. Probably have something up later today/this evening.

@johnmay
Copy link
Member

johnmay commented Feb 12, 2018

Sorry been busy.. if not already done could you fix the test failure as well. I introduced it when I change the counting fingerprint implementation to be more inline with what the API intends. Otherwise just the bits Egon's said then all good to go in:

``
[ERROR] testGetCountFingerprint(org.openscience.cdk.fingerprint.AtomPairs2DFingerprintTest) Time elapsed: 0.012 s <<< FAILURE!
java.lang.AssertionError: expected:<780> but was:<9>
at org.openscience.cdk.fingerprint.AtomPairs2DFingerprintTest.testGetCountFingerprint(AtomPairs2DFingerprintTest.java:78)

@DataSciBurgoon
Copy link
Contributor Author

DataSciBurgoon commented Feb 12, 2018 via email

@DataSciBurgoon
Copy link
Contributor Author

Made the change in the test case. Made Egon's changes.

@johnmay johnmay merged commit 01e43f4 into cdk:master Feb 15, 2018
@johnmay
Copy link
Member

johnmay commented Feb 15, 2018

Thanks, looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants