Schedule calculations and show artist count on user page. #244

paramsingh · 2017-08-08T17:28:35Z

This PR is based on #202.

It does the following things:

Removes unneeded code in webserver.scheduler left over from when listens were stored in Postgres.
Adds a function to get recently logged-in users (the timeframe is defined by a variable in config.py).
Makes the last_updated field in the stats tables NOT NULL (schema change).
Adds functions that calculate user stats and store them in the db.
Schedules these functions according to a variable in config.py.
Shows artist counts as a proof-of-concept on the user page.

paramsingh · 2017-08-08T17:30:58Z

Now that I look at it, this PR could be broken into smaller (more easily reviewable) parts, but I would prefer to start with breaking #202 into smaller parts first. I'm gonna keep this open for now, so that the work that has been done is not cooped up in some branch in my fork, but I think I'll eventually break parts of this PR into smaller ones as those would get reviewed more easily.

paramsingh · 2017-08-08T17:31:39Z

Artist count btw :)

paramsingh · 2017-08-17T11:00:51Z

Gonna rebase this now, with the merge of #202. Hopefully that makes it more manageable. If not, then I'll try to split this into different PRs.

paramsingh · 2017-08-17T11:30:06Z

So, I was calculating the artist count by just taking the length of the top artists field in the JSON, but now that we've started limiting the top artists, that will have to be a separate query.

Also, just a note: you can trigger a manual calculation of stats by running python -m listenbrainz.stats.calculate

I'm gonna start showing these stats on the user page now. I've left a bunch of XXX comments that I would like second opinions on.

mayhem · 2017-08-17T11:32:53Z

Ok, I'll have a look soon. I think the stats need better formatting too -- perhaps start with a table that is similar to the one on the current-status page?

mayhem · 2017-08-18T06:32:42Z

admin/sql/updates/2017-08-03-make-stats-updated-not-null.sql

+BEGIN;
+
+-- XXX(param): What should values that are null already be set to here now?
+-- timestamp 0 or NOW() ?


timestamp 0

mayhem · 2017-08-18T06:34:36Z

listenbrainz/db/stats.py

+from listenbrainz import db
+
+
+# XXX(param): think about the names of these stats variables


Which variables are you referring to?

https://github.com/metabrainz/listenbrainz-server/pull/244/files/b9e6120692cc688b3d8e1867918911350893a66e#diff-d4071fbd1594577acc0c16faf1b47578R14

Which is more appropriate, artists or artist_stats?

Methinks artist_stats, but the column in the db is artists.

Ugh. DB table names should always be singular. Can we still change this?

Yeah, I did this before I became aware of the singular names thing. Can be easily changed, I'll make an issue for it. https://tickets.metabrainz.org/browse/LB-201

mayhem · 2017-08-18T06:36:29Z

listenbrainz/db/stats.py

+
+    # put all artist stats into one dict which will then be inserted
+    # into the artist column of the stats.user table
+    # XXX(param): This makes the schema of the stats very variable though.


Please elaborate.

We run two artist-related queries for users right now, and the results of both go into the same jsonb field in the db. The jsonb looks like:

{ "artist_count": 67, "all_time": [ { "name": XXX, "listen_count": XXX} ] }

If someone were to add a new artist stat for users, it would go in the same field, so I guess we should document the format of these fields somewhere.
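
One lightweight way to document the format would be to route every artist stat through a small builder and validator. This is only a sketch: build_artist_stats and validate_artist_stats are hypothetical names, not functions from this PR, the keys are the ones from the example above, and the artist names are placeholder data.

```python
import json


def build_artist_stats(artist_count, top_artists):
    """Combine the results of both artist queries into one dict.

    Hypothetical helper: the key names follow the jsonb example in this thread.
    """
    return {
        "artist_count": artist_count,
        "all_time": [
            {"name": name, "listen_count": listen_count}
            for name, listen_count in top_artists
        ],
    }


def validate_artist_stats(stats):
    """Raise ValueError if the dict does not match the documented format."""
    if not isinstance(stats.get("artist_count"), int):
        raise ValueError("artist_count must be an integer")
    for entry in stats.get("all_time", []):
        if set(entry) != {"name", "listen_count"}:
            raise ValueError("unexpected keys in all_time entry: %r" % (entry,))
    return stats


# Placeholder data, not real listens.
artist_stats = validate_artist_stats(
    build_artist_stats(67, [("Artist A", 120), ("Artist B", 45)])
)
serialized = json.dumps(artist_stats)  # the string that would go into the jsonb field
```

A new artist stat would then mean adding one key to the builder and one check to the validator, which keeps the format written down in a single place.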

mayhem · 2017-08-18T06:36:54Z

listenbrainz/db/tests/test_stats.py

+    def path_to_data_file(self, filename):
+        # XXX(param): we have this function in IntegrationTestCase also,
+        # maybe find some way to share it
+        # ListenBrainzTestCase?


That's probably ok.

paramsingh · 2017-09-01T09:55:40Z

Ready for review, it is a bit big, so if needed, I'll break it into smaller parts.

mayhem

OK, given the one caveat about errors and whatnot, this is fine for now. I'm interested in merging this soon so we don't have too many things piling up.

mayhem · 2017-10-02T12:15:11Z

listenbrainz/stats/calculate.py

+def calculate_stats():
+    calculate_user_stats()
+
+if __name__ == '__main__':


https://tickets.metabrainz.org/browse/LB-214

Overall, this is too simplistic for production.

alastair

Looks good to me

alastair · 2017-10-04T16:06:06Z

listenbrainz/stats/calculate.py

+def calculate_user_stats():
+    for user in db_user.get_recently_logged_in_users():
+        recordings = stats_user.get_top_recordings(musicbrainz_id=user['musicbrainz_id'])
+        artists    = stats_user.get_top_artists(musicbrainz_id=user['musicbrainz_id'])


don't align equals signs
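
For reference, a sketch of the same loop with PEP 8 spacing (a single space on each side of the assignment). The data sources are passed in as callables only so that the sketch runs without a database or BigQuery; that injection and the fake_* stand-ins are assumptions of this example, not how the PR wires things up.

```python
def calculate_user_stats(get_users, get_top_recordings, get_top_artists):
    """Collect top recordings and top artists for every user returned by get_users."""
    results = {}
    for user in get_users():
        musicbrainz_id = user['musicbrainz_id']
        # One space around "=", no column alignment.
        recordings = get_top_recordings(musicbrainz_id=musicbrainz_id)
        artists = get_top_artists(musicbrainz_id=musicbrainz_id)
        results[musicbrainz_id] = {'recordings': recordings, 'artists': artists}
    return results


# Hypothetical stand-ins for db_user and stats_user.
def fake_users():
    return [{'musicbrainz_id': 'example_user'}]


def fake_top_recordings(musicbrainz_id):
    return ['recording 1', 'recording 2']


def fake_top_artists(musicbrainz_id):
    return ['artist 1']


stats = calculate_user_stats(fake_users, fake_top_recordings, fake_top_artists)
```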

alastair · 2017-10-04T17:06:45Z

listenbrainz/webserver/scheduler.py

-    def shutdown(self):
-        self.scheduler.shutdown()
+        """Add the jobs that need to be run to the scheduler"""
+        self.scheduler.add_job(calculate_stats, 'interval', days=self.conf['STATS_CALCULATION_INTERVAL'])


I never really liked this - it means that we're doing calculations inside the webserver that really should be done in a separate thread. I'll open a ticket for this, we should move it to a separate container running with cron or some other periodic runner.

K, this seems like something I should do in a new PR instead of here. LB-215
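
A rough sketch of what a standalone entry point for a periodic runner might look like, assuming the job is handed in as a callable. run_stats_once, the logger name, and the two stand-in jobs are hypothetical; nothing here is taken from LB-215.

```python
import logging

logger = logging.getLogger("stats_runner")  # hypothetical logger name


def run_stats_once(calculate_stats):
    """Run one stats calculation and return a process exit code.

    Intended to be invoked by cron (or another periodic runner) in its own
    container rather than from the webserver's scheduler.
    """
    try:
        calculate_stats()
    except Exception:
        # Log the traceback and report failure to the caller.
        logger.exception("Stats calculation failed")
        return 1
    return 0


# Stand-ins for the real calculate_stats function.
def succeeding_job():
    pass


def failing_job():
    raise RuntimeError("simulated failure")


ok_code = run_stats_once(succeeding_job)
error_code = run_stats_once(failing_job)
```

A cron entry could then call the existing manual entry point, python -m listenbrainz.stats.calculate, on whatever schedule is chosen, and a non-zero exit code would surface failed runs.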

alastair · 2017-10-04T17:08:07Z

listenbrainz/stats/user.py

+        }
+    ]
+
+    return stats.run_query(query, parameters)[0]['artist_count']


It looks like we create a BigQuery object at least twice, for the stats module and in the bigquery writer. Can we abstract this into a single place like the database engine?
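
A minimal sketch of what a single shared initialization point could look like. The names create_bigquery_object and NoCredentialsFileException are borrowed from the traceback further down in this thread; the caching, the credentials_path argument, and the injected build_client callable are assumptions of this sketch, not the merged implementation.

```python
import os
import tempfile


class NoCredentialsFileException(Exception):
    """Raised when the BigQuery credentials file is missing."""


_bigquery = None  # module-level cache shared by all callers


def create_bigquery_object(credentials_path, build_client):
    """Return one shared BigQuery object, creating it on first use.

    build_client is a hypothetical callable wrapping whatever client library
    is in use; it is injected so this sketch has no third-party dependencies.
    """
    global _bigquery
    if _bigquery is None:
        if not os.path.exists(credentials_path):
            raise NoCredentialsFileException(credentials_path)
        _bigquery = build_client(credentials_path)
    return _bigquery


calls = []


def fake_build_client(path):
    calls.append(path)
    return object()


# A missing credentials file is reported before anything is cached.
try:
    create_bigquery_object("/nonexistent/credentials.json", fake_build_client)
    missing_detected = False
except NoCredentialsFileException:
    missing_detected = True

# Both the stats code and the writer would get the same object back.
with tempfile.NamedTemporaryFile() as credentials:
    first = create_bigquery_object(credentials.name, fake_build_client)
    second = create_bigquery_object(credentials.name, fake_build_client)
```

Callers that can run without BigQuery, such as a local development server, could catch NoCredentialsFileException and skip the stats jobs instead of failing at startup.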

alastair · 2017-10-04T17:09:54Z

listenbrainz/stats/user.py

+    if time_interval:
+        filter_clause = "AND listened_at >= TIMESTAMP_SUB(CURRENT_TIME(), INTERVAL {})".format(time_interval)
+
+    query = """SELECT COUNT(DISTINCT(artist_msid)) as artist_count


just thinking out loud here...
these are bigquery statements, not SQL. Does it make sense to have them here, or is it worth having a generic "bigquery" module like we have for db?

The stats module is where I intended to keep all BigQuery SQL statements. I think that is what you mean by the "bigquery" module? Should I maybe rename it? (or keep a bigquery module inside the stats module?)
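
Wherever the statements end up living, the filter clause from the hunk above could be built by a small helper. This is a sketch under two assumptions that are not in the PR: the interval string is validated before being interpolated, and CURRENT_TIMESTAMP() is used in place of CURRENT_TIME(), on the understanding that TIMESTAMP_SUB expects a TIMESTAMP argument in BigQuery standard SQL. build_time_filter is a hypothetical name.

```python
import re

# Accept only "<number> <unit>", e.g. "7 DAY"; the unit list is an assumption.
_INTERVAL_RE = re.compile(r"^\d+ (DAY|HOUR|MINUTE|SECOND)$")


def build_time_filter(time_interval=None):
    """Return the extra WHERE clause for a time window, or '' for all time."""
    if not time_interval:
        return ""
    if not _INTERVAL_RE.match(time_interval):
        raise ValueError("invalid time interval: %r" % (time_interval,))
    return (
        "AND listened_at >= "
        "TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {})".format(time_interval)
    )


all_time_clause = build_time_filter()
last_week_clause = build_time_filter("7 DAY")
```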

alastair · 2017-10-04T17:10:24Z

listenbrainz/db/user.py

+
+
+def get_recently_logged_in_users():
+    """Returns a list of users who have logged-in in the last X days, X is


you can say
Returns a list of users who have logged-in in the last config.STATS_CALCULATION_LOGIN_TIME days
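
A sketch of the function with the suggested docstring wording. It filters an in-memory list rather than querying Postgres, so the users argument, the now argument, the last_login key, and the value 30 are all assumptions made to keep the example runnable; in the PR the function takes no arguments.

```python
from datetime import datetime, timedelta, timezone

# Placeholder value; in the PR this comes from config.py.
STATS_CALCULATION_LOGIN_TIME = 30


def get_recently_logged_in_users(users, now=None):
    """Returns a list of users who have logged-in in the last
    STATS_CALCULATION_LOGIN_TIME days.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=STATS_CALCULATION_LOGIN_TIME)
    return [user for user in users if user["last_login"] >= cutoff]


reference_time = datetime(2017, 10, 4, tzinfo=timezone.utc)
example_users = [
    {"musicbrainz_id": "recent_user", "last_login": datetime(2017, 10, 1, tzinfo=timezone.utc)},
    {"musicbrainz_id": "stale_user", "last_login": datetime(2017, 1, 1, tzinfo=timezone.utc)},
]
recent = get_recently_logged_in_users(example_users, now=reference_time)
```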

mayhem

Ok, let's get this merged. If we find other problems with it, make a new ticket!

jwflory · 2017-10-11T21:01:09Z

@mayhem @paramsingh This didn't break your development environments, did it? I can't start my development environment now because of an issue with BigQuery credentials being initialized (maybe in 650d079). I'm still working through it, but will try to take a closer look soon. For now, here's my stacktrace.

web_1             | 2017-10-11 20:58:23,228 ERROR The BigQuery credentials file does not exist, cannot connect to BigQuery
web_1             | Traceback (most recent call last):
web_1             |   File "/code/listenbrainz/manage.py", line 196, in <module>
web_1             |     cli()
web_1             |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 664, in __call__
web_1             |     return self.main(*args, **kwargs)
web_1             |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 644, in main
web_1             |     rv = self.invoke(ctx)
web_1             |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 991, in invoke
web_1             |     return _process_result(sub_ctx.command.invoke(sub_ctx))
web_1             |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 837, in invoke
web_1             |     return ctx.invoke(self.callback, **ctx.params)
web_1             |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 464, in invoke
web_1             |     return callback(*args, **kwargs)
web_1             |   File "/code/listenbrainz/manage.py", line 28, in runserver
web_1             |     webserver.schedule_jobs(application)
web_1             |   File "/code/listenbrainz/listenbrainz/webserver/__init__.py", line 29, in schedule_jobs
web_1             |     app.scheduledJobs = ScheduledJobs(app.config)
web_1             |   File "/code/listenbrainz/listenbrainz/webserver/scheduler.py", line 19, in __init__
web_1             |     stats.init_bigquery_connection()
web_1             |   File "/code/listenbrainz/listenbrainz/stats/__init__.py", line 19, in init_bigquery_connection
web_1             |     bigquery = create_bigquery_object()
web_1             |   File "/code/listenbrainz/listenbrainz/bigquery.py", line 18, in create_bigquery_object
web_1             |     raise NoCredentialsFileException
web_1             | listenbrainz.bigquery.NoCredentialsFileException

Note: I had to import logging to get this to work, otherwise I get an error about missing module.

paramsingh · 2017-10-12T18:34:52Z

@jflory7 yeah, you found a bug. :)

paramsingh · 2017-10-12T18:37:56Z

I've fixed this in #267.

paramsingh added 10 commits August 17, 2017 16:52

Refactor scheduler code to remove unneeded code

d999917

Also add some docstrings.

Add function for getting recently logged in users

c3404f3

Add stats calculation variable to config.py.sample

cade177

Write code for calculation of user stats

6a4ff2d

Make last_updated columns in stats tables NOT NULL

fe850d3

Comment to think about

3b4606c

Add code for inserting calculated stats into db

8b6f3ec

First cut of adding jobs to scheduler

0a9a3f4

Add a script that can be run to manually calculate stats

1252824

Also fix a couple bugs in the code

Show artist count on user page

f91488f

Ignore the styling for now, this is more of a proof-of-concept.

paramsingh force-pushed the schedule-calculations branch from 5d25670 to f91488f Compare August 17, 2017 11:25

Remove extra line that came from rebase

addea60

Add different stat function for getting artist_count

b9e6120

mayhem reviewed Aug 18, 2017

paramsingh added 5 commits August 22, 2017 14:34

Set NULL values of last_updated in stat tables to 0

7362d80

Show stats in a table on user page

5b6ff06

Fix error if stats are not calculated for user

f6eacfd

Change indentation to spaces in profile.html

3b9b043

Address TODO comments and add docstrings to new functions

4eb5b5c

Also improve tests and variable names a bit

paramsingh changed the title ~~[WIP] Schedule calculations and show artist count on user page.~~ Schedule calculations and show artist count on user page. Sep 1, 2017

mayhem approved these changes Oct 2, 2017

alastair reviewed Oct 4, 2017

paramsingh added 2 commits October 5, 2017 21:48

Don't align equals signs

54e80b6

Change docstring to better english

18de196

Put the bigquery initialization code into a module

650d079

This way, both bigquery-writer and the stats stuff can use the same initialization code.

mayhem approved these changes Oct 10, 2017

mayhem merged commit ea34b89 into metabrainz:master Oct 10, 2017

paramsingh deleted the schedule-calculations branch October 12, 2017 18:35
