Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LB-176: Create a stats module and add functions to run queries on Google BigQuery #202

Merged
merged 14 commits into from Aug 17, 2017

Conversation

paramsingh
Copy link
Collaborator

This PR adds a stats module that contains run_query - a function that takes queries in BigQuery Standard SQL and parameters and then runs them as a synchronous job. It doesn't really depend on #192 but I was waiting on that to get merged before opening this, but now this seemed like a significant amount of work and I thought it best to get some reviews first before continuing further. I've also added a function that uses run_query to get the top tracks for a user. Other statistics would be added in a similar way.

Hopefully, we can ignore the schema changes in this PR for now. Once #192 gets merged, I'll rebase and remove those changes from this PR.

@paramsingh
Copy link
Collaborator Author

Right now, run_query doesn't take non-parameterized queries, working on that and thinking about what the ideal way to run the stats calculation at uniform intervals of time is. I think adding jobs to webserver/scheduler.py would be okay? I'm thinking of making a file in the stats module that can define a function that makes queries and saves the results and then add it as a job to webserver/scheduler.py

The stats module is gonna be similar to the db module. All the
queries we make to BigQuery should be contained here.
run_query takes a query and its parameters and then returns the
results of the that query.

get_top_tracks is an example query that gets the top tracks of
the user over a particular interval of time.
Also, add a non-parametrized query for getting the sitewide
total artist count to check if it works.
@paramsingh
Copy link
Collaborator Author

Rebased after merge of #192.

Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the MSIDs,but the query limits are a must.

[
{
"track_name" (str)
"recording_msid" (uuid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be fetching MBIDs or MSIDs here? Both? Using MBIDs is useful for linking to MB.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both sounds good to me.


filter_clause = ""
if time_interval:
filter_clause = "AND listened_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {})".format(time_interval)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interval is not defined anywhere that I can see. Should it be inclusive of the boundaries? (I think so).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interval is a parameter to be passed to the function. But yes, it should be inclusive.

{time_filter_clause}
GROUP BY recording_msid, track_name, artist_name, artist_msid
ORDER BY listen_count DESC
""".format(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll want to LIMIT the data for each and every query in order to keep the amount of traffic low.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the ideal value of the limit be? Would top 100 artists / songs be more than enough, I think it would be. Should this value be in the config file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100 seems fine for now -- leave it in the module for now. Easy enough to change once we know what it is we really want.

Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just nitpicks now...

'artist_name' (str)
'artist_msid' (uuid)
'artist_mbids' (string of comma-seperated uuids)
'listen_count' (int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recording_mbid is missing above, but in the code below.

{time_filter_clause}
GROUP BY recording_msid, track_name, artist_name, artist_msid
ORDER BY listen_count DESC
""".format(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100 seems fine for now -- leave it in the module for now. Easy enough to change once we know what it is we really want.

@@ -136,6 +144,7 @@ def get_top_releases(musicbrainz_id, time_interval=None):
{
'artist_name' (str),
'artist_msid' (uuid),
'artist_mbids' (string of comma seperated uuids),
'release_name' (str),
'release_msid' (uuid),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

release_mbid is not listed above, but is listed in the code.

Add release_mbid and recording_mbid to docstrings
Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🌗 🌜 🍾 !!

@mayhem mayhem merged commit ea77c4f into metabrainz:master Aug 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants