LB-176: Create a stats module and add functions to run queries on Google BigQuery #202

paramsingh · 2017-06-21T13:08:40Z

This PR adds a stats module containing run_query, a function that takes a BigQuery Standard SQL query and its parameters and runs it as a synchronous job. It doesn't really depend on #192; I was waiting for that to be merged before opening this, but this has grown into a significant amount of work and I thought it best to get some reviews before continuing further. I've also added a function that uses run_query to get the top tracks for a user. Other statistics will be added in a similar way.

Hopefully, we can ignore the schema changes in this PR for now. Once #192 gets merged, I'll rebase and remove those changes from this PR.
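For readers following along, here is a minimal sketch of how a query and its parameters could be turned into a request body for BigQuery's synchronous jobs.query REST call. It is not the code from this PR: the helper name build_query_body and the input shape (a list of dicts with "name", "type" and "value" keys) are assumptions made for illustration. Only the body fields (query, useLegacySql, parameterMode, queryParameters) come from the BigQuery REST API.

```python
def build_query_body(query, parameters=None):
    """Build a request body for a synchronous BigQuery query.

    `parameters` is an optional list of dicts with "name", "type" and
    "value" keys. This input shape is hypothetical, chosen for the sketch.
    """
    body = {
        "kind": "bigquery#queryRequest",
        "query": query,
        "useLegacySql": False,  # run the query as Standard SQL
    }
    if parameters:
        # Named parameters are referenced in the SQL text as @name.
        body["parameterMode"] = "NAMED"
        body["queryParameters"] = [
            {
                "name": param["name"],
                "parameterType": {"type": param["type"]},
                "parameterValue": {"value": param["value"]},
            }
            for param in parameters
        ]
    return body


# A parameterized query and a non-parameterized one go through the same helper.
with_params = build_query_body(
    "SELECT COUNT(*) FROM some_table WHERE user_name = @musicbrainz_id",
    [{"name": "musicbrainz_id", "type": "STRING", "value": "some_user"}],
)
without_params = build_query_body("SELECT 1")
```

Keeping the parameters out of the SQL string means user-supplied values are never interpolated into the query text, and leaving `parameters` optional lets the same helper serve non-parameterized queries.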

paramsingh · 2017-07-01T10:11:35Z

Right now, run_query doesn't take non-parameterized queries; I'm working on that, and also thinking about the ideal way to run the stats calculation at regular intervals. I think adding jobs to webserver/scheduler.py would be okay? I'm thinking of adding a file to the stats module that defines a function which makes the queries and saves the results, and then adding that function as a job in webserver/scheduler.py.

The stats module is gonna be similar to the db module. All the queries we make to BigQuery should be contained here.

run_query takes a query and its parameters and then returns the results of that query. get_top_tracks is an example query that gets the top tracks of the user over a particular interval of time.

Also, add a non-parametrized query for getting the sitewide total artist count to check if it works.

paramsingh · 2017-08-01T16:01:03Z

Rebased after merge of #192.

mayhem

Not sure about the MSIDs, but the query limits are a must.

mayhem · 2017-08-16T15:12:10Z

listenbrainz/stats/user.py

+            [
+                {
+                    "track_name" (str)
+                    "recording_msid" (uuid)


Should we be fetching MBIDs or MSIDs here? Both? Using MBIDs is useful for linking to MB.

Both sounds good to me.

mayhem · 2017-08-16T15:13:10Z

listenbrainz/stats/user.py

+
+    filter_clause = ""
+    if time_interval:
+        filter_clause = "AND listened_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {})".format(time_interval)


The interval is not defined anywhere that I can see. Should it be inclusive of the boundaries? (I think so).

Interval is a parameter to be passed to the function. But yes, it should be inclusive.
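As an illustration of the inclusive boundary agreed on above, the filter clause could be built along these lines. This is a sketch, not the code that was merged: the helper name build_time_filter and the interval validation (a whole number followed by one of a few date parts) are assumptions added here, since the interval is formatted straight into the SQL text.

```python
import re

# Date parts accepted by this sketch; the whitelist is an assumption,
# not something taken from the PR.
ALLOWED_DATE_PARTS = {"SECOND", "MINUTE", "HOUR", "DAY"}


def build_time_filter(time_interval=None):
    """Return an inclusive listened_at filter clause, or "" if no interval is given.

    `time_interval` is a string such as "7 DAY".
    """
    if not time_interval:
        return ""
    match = re.fullmatch(r"(\d+) ([A-Z]+)", time_interval)
    if match is None or match.group(2) not in ALLOWED_DATE_PARTS:
        raise ValueError("unsupported time interval: {!r}".format(time_interval))
    # ">=" rather than ">" so that a listen exactly on the boundary is counted.
    return (
        "AND listened_at >= "
        "TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {})".format(time_interval)
    )


clause = build_time_filter("7 DAY")
```

Rejecting anything that does not look like an interval keeps the formatted string from carrying arbitrary SQL into the query.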

mayhem · 2017-08-16T15:14:07Z

listenbrainz/stats/user.py

+                {time_filter_clause}
+             GROUP BY recording_msid, track_name, artist_name, artist_msid
+             ORDER BY listen_count DESC
+            """.format(


You'll want to LIMIT the data for each and every query in order to keep the amount of traffic low.

What would the ideal value of the limit be? I think the top 100 artists / songs would be more than enough. Should this value be in the config file?

100 seems fine for now -- leave it in the module for now. Easy enough to change once we know what it is we really want.
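A sketch of what keeping the limit in the module might look like. The constant name TOP_ENTITY_LIMIT, the helper build_top_tracks_query, the table name, and the user_name column are all hypothetical; only the GROUP BY and ORDER BY columns and the value 100 come from the discussion above.

```python
# Module-level limit, easy to change later or move into the config file.
TOP_ENTITY_LIMIT = 100

# Hypothetical table reference, used only for this sketch.
LISTEN_TABLE = "`hypothetical_dataset.listens`"


def build_top_tracks_query(time_filter_clause=""):
    """Build a top-tracks query that always caps the number of returned rows."""
    return """SELECT recording_msid, track_name, artist_name, artist_msid,
                     COUNT(recording_msid) AS listen_count
                FROM {table}
               WHERE user_name = @musicbrainz_id
                {time_filter_clause}
            GROUP BY recording_msid, track_name, artist_name, artist_msid
            ORDER BY listen_count DESC
               LIMIT {limit}
           """.format(
        table=LISTEN_TABLE,
        time_filter_clause=time_filter_clause,
        limit=TOP_ENTITY_LIMIT,
    )


query = build_top_tracks_query()
```

Because the limit is applied after ORDER BY listen_count DESC, the query still returns the most-listened entries while bounding how many rows come back.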

mayhem

Just nitpicks now...

mayhem · 2017-08-16T21:44:39Z

listenbrainz/stats/user.py

+                    'artist_name' (str)
+                    'artist_msid' (uuid)
+                    'artist_mbids' (string of comma-seperated uuids)
+                    'listen_count' (int)


recording_mbid is missing from the docstring above, but it is present in the code below.

mayhem · 2017-08-16T21:45:51Z

listenbrainz/stats/user.py

+                {time_filter_clause}
+             GROUP BY recording_msid, track_name, artist_name, artist_msid
+             ORDER BY listen_count DESC
+            """.format(


100 seems fine for now -- leave it in the module for now. Easy enough to change once we know what it is we really want.

mayhem · 2017-08-16T21:46:31Z

listenbrainz/stats/user.py

@@ -136,6 +144,7 @@ def get_top_releases(musicbrainz_id, time_interval=None):
                    {
                        'artist_name' (str),
                        'artist_msid' (uuid),
+                        'artist_mbids' (string of comma seperated uuids),
                        'release_name' (str),
                        'release_msid' (uuid),


release_mbid is not listed above, but is listed in the code.

Add release_mbid and recording_mbid to docstrings

mayhem

🌗 🌜 🍾 !!

paramsingh added 10 commits August 1, 2017 21:29

Create a stats module

f615f74

The stats module is gonna be similar to the db module. All the queries we make to BigQuery should be contained here.

Add a test for BigQuery initialization in stats

f0ba517

Fix name of bigquery variable

a8c2792

Add run_query method and get_top_tracks function for users

3762c7f

run_query takes a query and its parameters and then returns the results of that query. get_top_tracks is an example query that gets the top tracks of the user over a particular interval of time.

Remove limit on how many rows can be returned in a BQ request

f64445d

Format BigQuery results to a better format

1a60c3d

Fix filter clause in get_top_tracks

423f229

Wrap API calls in try-catch blocks and write test for run_query

6b680c5

Add functions to get top artists and releases

8390c9d

Make run_query work with non-parametrized queries

250ba94

Also, add a non-parametrized query for getting the sitewide total artist count to check if it works.

paramsingh force-pushed the user-stats branch from 6b6f77b to 250ba94 August 1, 2017 16:00

paramsingh mentioned this pull request Aug 8, 2017

Schedule calculations and show artist count on user page. #244

Merged

mayhem requested changes Aug 16, 2017

paramsingh added 3 commits August 16, 2017 23:55

Limit top entity stats calculation

7067522

Make time filter clause in stats query inclusive

5afadb1

Get MBIDs along with MSIDs in queries

07d2e0c

mayhem requested changes Aug 16, 2017

Fix docstrings of functions in stats.user

a18b20a

Add release_mbid and recording_mbid to docstrings

mayhem approved these changes Aug 17, 2017

mayhem merged commit ea77c4f into metabrainz:master Aug 17, 2017
