
story 71751536. rdf:type triples are duplicated in output #421

Merged: 1 commit merged into fcrepo:master on Jul 23, 2014

Conversation

mohideen (Contributor)

awoods pushed a commit that referenced this pull request Jul 23, 2014
story 71751536. rdf:type triples are duplicated in output
@awoods awoods merged commit fd8eccd into fcrepo:master Jul 23, 2014
```java
/**
 * Removes duplicate triples.
 */
public void removeDuplicates() {
    final LinkedHashSet triplesLHS = new LinkedHashSet(Lists.newArrayList(triples));
```
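(For context on what the LinkedHashSet constructor buys here: it drops duplicates while preserving first-insertion order, but it does so by materializing every element. A minimal, self-contained illustration, using plain Strings in place of Jena Triples:)

```java
import java.util.Arrays;
import java.util.LinkedHashSet;

public class LinkedHashSetDemo {
    public static void main(final String[] args) {
        // LinkedHashSet drops duplicates while preserving first-seen
        // order -- but the constructor must materialize every element,
        // which is the memory concern raised below.
        final LinkedHashSet<String> deduped =
                new LinkedHashSet<>(Arrays.asList("a", "b", "a", "c", "b"));
        System.out.println(deduped); // [a, b, c]
    }
}
```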
Contributor

Doesn't this mean we have to load every triple in RAM in order to de-dupe, or is there something clever happening under the hood?

Contributor

Why do we need to de-dupe? (And yes, @cbeer, putting these triples in an ArrayList or LinkedHashSet will pull them all into memory. Pretty much anything from the basic Java Collections API that implements Collection is an in-memory construct.)

Contributor

I think we should de-dupe the rdf:type statements -- it looks sloppy to repeat them.

Ideally, we would be able to avoid adding the duplicate types in the first place. But if that doesn't work, then a better pattern would be an Iterator implementation that wraps the triples and suppresses duplicate rdf:type triples. We would only need the rdf:type triples in memory, which would avoid the typical cases where this would be a problem (e.g. large numbers of children, etc.).
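(The wrapping-iterator idea above could be sketched as follows. This is an illustration, not the project's code: Strings stand in for Jena Triple objects, the class name is hypothetical, and the substring check stands in for inspecting a real triple's predicate. Only rdf:type triples are held in memory; everything else streams through.)

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

/** Wraps a triple iterator, suppressing repeated rdf:type triples. */
public class TypeDedupIterator implements Iterator<String> {

    private final Iterator<String> source;
    // Only the rdf:type triples are remembered, bounding memory use.
    private final Set<String> seenTypeTriples = new HashSet<>();
    private String next;

    public TypeDedupIterator(final Iterator<String> source) {
        this.source = source;
        advance();
    }

    private void advance() {
        next = null;
        while (source.hasNext()) {
            final String candidate = source.next();
            // Non-type triples pass through untouched; rdf:type triples
            // are emitted only the first time they appear.
            if (!candidate.contains(" rdf:type ") || seenTypeTriples.add(candidate)) {
                next = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        final String result = next;
        advance();
        return result;
    }

    public static void main(final String[] args) {
        final Iterator<String> in = Arrays.asList(
                "<s> rdf:type <Container>",
                "<s> <p> \"o\"",
                "<s> rdf:type <Container>").iterator();
        final TypeDedupIterator out = new TypeDedupIterator(in);
        while (out.hasNext()) {
            System.out.println(out.next()); // the duplicate type triple is skipped
        }
    }
}
```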

Contributor

It would be much better to do some kind of limited filtering. The extant filter() method could be reused for this. The whole purpose of RdfStream was originally precisely to avoid pulling all the triples into heap, because we saw severe performance degradations when that happened.
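(The filter-based approach could look something like this sketch. It assumes a filter that accepts a predicate; the real RdfStream API may differ, the class and method names here are hypothetical, and Strings stand in for triples. The predicate is stateful, remembering only rdf:type triples, so the rest of the stream is never buffered.)

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TypeFilterDemo {

    /**
     * Predicate that admits a triple unless it is an rdf:type statement
     * we have already seen. Note: stateful predicates like this are only
     * safe on sequential, single-pass streams.
     */
    static Predicate<String> dropRepeatedTypes() {
        final Set<String> seenTypeTriples = new HashSet<>();
        return t -> !t.contains(" rdf:type ") || seenTypeTriples.add(t);
    }

    public static void main(final String[] args) {
        final List<String> out = Stream.of(
                        "<s> rdf:type <Container>",
                        "<s> <p> \"o\"",
                        "<s> rdf:type <Container>")
                .filter(dropRepeatedTypes())
                .collect(Collectors.toList());
        System.out.println(out); // only two triples survive
    }
}
```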


@cbeer, @ajs6f, are you ok with duplicate triples? Should we find another technique for maintaining uniqueness or should this ticket be dropped?

Contributor

I don't think we should merge this as is. It would be good to understand why there are duplicate triples in the first place. It'd surprise me if that was necessary.


I believe the ticket describes why (or rather "when") the duplicate triples are introduced.
https://www.pivotaltracker.com/story/show/71751536


The idea of not introducing duplicates in the first place is appealing, as it is not clear to me how we would filter duplicates without building up the triples in memory.

Contributor

  1. I'm actually fine with duplicate triples. I understand @escowles's point, but it just doesn't bother me that much. RDF is a machine format.

  2. If we can understand how to avoid the duplicates at some closer-to-the-JCR layer, that is without question the right thing to do. It shouldn't be that hard. It currently seems to be a matter of maintaining some in-flight state in the subclasses of NodeRdfContext.

  3. NodeRdfContext is a piece of junk. In fact, the whole triple generation subsystem should be reworked to bring far more of the action out into the type system. And yes, I wrote a huge amount of that and that's one reason I'm so sure about it.

@awoods commented Jul 23, 2014

This commit has been reverted.

@peichman-umd peichman-umd deleted the rdf-dedup branch December 10, 2014 21:47

5 participants