New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LB-176: Create a stats module and add functions to run queries on Google BigQuery #202
Conversation
Right now, |
The stats module is gonna be similar to the db module. All the queries we make to BigQuery should be contained here.
run_query takes a query and its parameters and then returns the results of the that query. get_top_tracks is an example query that gets the top tracks of the user over a particular interval of time.
Also, add a non-parametrized query for getting the sitewide total artist count to check if it works.
Rebased after merge of #192. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about the MSIDs,but the query limits are a must.
listenbrainz/stats/user.py
Outdated
[ | ||
{ | ||
"track_name" (str) | ||
"recording_msid" (uuid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be fetching MBIDs or MSIDs here? Both? Using MBIDs is useful for linking to MB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both sounds good to me.
listenbrainz/stats/user.py
Outdated
|
||
filter_clause = "" | ||
if time_interval: | ||
filter_clause = "AND listened_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {})".format(time_interval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interval is not defined anywhere that I can see. Should it be inclusive of the boundaries? (I think so).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interval is a parameter to be passed to the function. But yes, it should be inclusive.
listenbrainz/stats/user.py
Outdated
{time_filter_clause} | ||
GROUP BY recording_msid, track_name, artist_name, artist_msid | ||
ORDER BY listen_count DESC | ||
""".format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll want to LIMIT the data for each and every query in order to keep the amount of traffic low.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would the ideal value of the limit be? Would top 100 artists / songs be more than enough, I think it would be. Should this value be in the config file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100 seems fine for now -- leave it in the module for now. Easy enough to change once we know what it is we really want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just nitpicks now...
listenbrainz/stats/user.py
Outdated
'artist_name' (str) | ||
'artist_msid' (uuid) | ||
'artist_mbids' (string of comma-seperated uuids) | ||
'listen_count' (int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recording_mbid is missing above, but in the code below.
listenbrainz/stats/user.py
Outdated
{time_filter_clause} | ||
GROUP BY recording_msid, track_name, artist_name, artist_msid | ||
ORDER BY listen_count DESC | ||
""".format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100 seems fine for now -- leave it in the module for now. Easy enough to change once we know what it is we really want.
listenbrainz/stats/user.py
Outdated
@@ -136,6 +144,7 @@ def get_top_releases(musicbrainz_id, time_interval=None): | |||
{ | |||
'artist_name' (str), | |||
'artist_msid' (uuid), | |||
'artist_mbids' (string of comma seperated uuids), | |||
'release_name' (str), | |||
'release_msid' (uuid), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
release_mbid is not listed above, but is listed in the code.
Add release_mbid and recording_mbid to docstrings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🌗 🌜 🍾 !!
This PR adds a stats module that contains
run_query
- a function that takes queries in BigQuery Standard SQL and parameters and then runs them as a synchronous job. It doesn't really depend on #192 but I was waiting on that to get merged before opening this, but now this seemed like a significant amount of work and I thought it best to get some reviews first before continuing further. I've also added a function that usesrun_query
to get the top tracks for a user. Other statistics would be added in a similar way.Hopefully, we can ignore the schema changes in this PR for now. Once #192 gets merged, I'll rebase and remove those changes from this PR.