New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add artist and recording MSID checks to dedup #212
Conversation
This test started failing, I don't know exactly why, but the number of listens came out to be 200 instead of 100. Moving the insert from setUp to the test fixed it, and to be fair, that's where the insert should be anyways.
Forgot about this
@@ -63,5 +56,9 @@ def test_user_load_by_id(self): | |||
self.assertDictEqual(user.__dict__, self.user.__dict__) | |||
|
|||
def test_user_get_play_count(self): | |||
date = datetime(2015, 9, 3, 0, 0, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you move this block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test had started failing for me randomly :(
I had added the reason for moving this block to the commit message 7056af2
listenbrainz/listen.py
Outdated
@@ -179,6 +179,8 @@ def to_influx(self, measurement): | |||
'time' : self.ts_since_epoch, | |||
'tags' : { | |||
'user_name' : escape(self.user_name), | |||
'artist_msid' : self.artist_msid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you using these tags to carry these msids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
influx cannot store two rows with the same tags and time, so if we want tracks with the same timestamp but different metadata to be stored, we have to add artist_msid
and recording_msid
to tags as well. Otherwise the old listen will just get overwritten by the new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Sadly this will bloat our data for little gain, but I can't think of a much better way of doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, finally coming back to this. Why do we need both artist_msid and recording_msid? Shouldn't just recording_msid be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, you're right, recording_msid
should suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: do tags not get returned when we do select count(*) from measurement
? Asking this because I removed them both from fields and added into tags, and the listen count test here started to fail because there wasn't a count_recording_msid
field in the results of the query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that is correct.
So, given all of the above, all fields should be restored as they were and then only recording_msid added to tags. Then we should be ready to merge this -- this now makes sense to me.
Also, only use it for dedup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🍫 🔪 📌 🧀 !!!
To allow rows with the same timestamp, we have to add
artist_msid
andrecording_msid
to the influx tags too.