Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add artist and recording MSID checks to dedup #212

Merged
merged 4 commits into from Jul 5, 2017

Conversation

paramsingh
Copy link
Collaborator

To allow rows with the same timestamp, we have to add artist_msid and recording_msid to the influx tags too.

This test started failing, I don't know exactly why, but the
number of listens came out to be 200 instead of 100. Moving the
insert from setUp to the test fixed it, and to be fair, that's
where the insert should be anyways.
Forgot about this
@@ -63,5 +56,9 @@ def test_user_load_by_id(self):
self.assertDictEqual(user.__dict__, self.user.__dict__)

def test_user_get_play_count(self):
date = datetime(2015, 9, 3, 0, 0, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you move this block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test had started failing for me randomly :(

I had added the reason for moving this block to the commit message 7056af2

@@ -179,6 +179,8 @@ def to_influx(self, measurement):
'time' : self.ts_since_epoch,
'tags' : {
'user_name' : escape(self.user_name),
'artist_msid' : self.artist_msid,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you using these tags to carry these msids?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

influx cannot store two rows with the same tags and time, so if we want tracks with the same timestamp but different metadata to be stored, we have to add artist_msid and recording_msid to tags as well. Otherwise the old listen will just get overwritten by the new one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Sadly this will bloat our data for little gain, but I can't think of a much better way of doing this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, finally coming back to this. Why do we need both artist_msid and recording_msid? Shouldn't just recording_msid be sufficient?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you're right, recording_msid should suffice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: do tags not get returned when we do select count(*) from measurement? Asking this because I removed them both from fields and added into tags, and the listen count test here started to fail because there wasn't a count_recording_msid field in the results of the query.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that is correct.

So, given all of the above, all fields should be restored as they were and then only recording_msid added to tags. Then we should be ready to merge this -- this now makes sense to me.

Also, only use it for dedup.
Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🍫 🔪 📌 🧀 !!!

@mayhem mayhem merged commit 62999cb into metabrainz:master Jul 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants