Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schedule calculations and show artist count on user page. #244

Merged
merged 20 commits into from Oct 10, 2017

Conversation

paramsingh
Copy link
Collaborator

This PR is based on #202 .

It does the following things.

  • Removes unneeded code from a time when listens were store in postgres in webserver.scheduler.
  • Add a function to get recently logged in users (timeframe defined by a variable in config.py)
  • Makes the last_updated field in the stats tables NOT NULL (schema change)
  • Adds functions that calculate user stats and store them in the db
  • Schedule these functions according to a variable in config.py
  • Show artist counts as proof-of-concept on the user page.

@paramsingh
Copy link
Collaborator Author

Now that I look at it, this PR could be broken into smaller (more easily reviewable) parts, but I would prefer to start with breaking #202 into smaller parts first. I'm gonna keep this open for now, so that the work that has been done is not cooped up in some branch in my fork, but I think I'll eventually break parts of this PR into smaller ones as those would get reviewed more easily.

@paramsingh
Copy link
Collaborator Author

Artist count btw :)
screenshot_2017-08-08_22-48-54

@paramsingh
Copy link
Collaborator Author

Gonna rebase this now, with the merge of #202. Hopefully that makes it more manageable. If not, then I'll try to split this into different PRs.

@paramsingh
Copy link
Collaborator Author

So, I was calculating artist count by just taking the length of the top artists field in the json, but now that we've started limiting the top artists, that will have to be a different query.

Also, just a note: you can trigger manual calculation of stats by python -m listenbrainz.stats.calculate

I'm gonna start with showing these stats on the user page now, I've left a bunch of XXX comments that I would like second opinions on.

@mayhem
Copy link
Member

mayhem commented Aug 17, 2017

Ok, I'll have a look soon. I think the stats need better formatting too -- perhaps start with a table that is similar to the one on the current-status page?

BEGIN;

-- XXX(param): What should values that are null already be set to here now?
-- timestamp 0 or NOW() ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timestamp 0

from listenbrainz import db


# XXX(param): think about the names of these stats variables
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which variables are you referring to?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

me thinks artist_stats but the column in the db is artists

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. DB table names should always be singular. Can we still change this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I did this before I became aware of the singular names thing. Can be easily changed, I'll make an issue for it. https://tickets.metabrainz.org/browse/LB-201


# put all artist stats into one dict which will then be inserted
# into the artist column of the stats.user table
# XXX(param): This makes the schema of the stats very variable though.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please elaborate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have two artist related queries we run for users right now, the results of both of these queries go into the same jsonb field in the db. The jsonb looks like:

{
    "artist_count": 67,
    "all_time": [
        { "name": XXX, "listen_count": XXX}
     ]
}

If someone were to add a new artist stat for users, it would go in the same field, so I guess we should document the format of these fields somewhere.

def path_to_data_file(self, filename):
# XXX(param): we have this function in IntegrationTestCase also,
# maybe find some way to share it
# ListenBrainzTestCase?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats probably ok.

@paramsingh paramsingh changed the title [WIP] Schedule calculations and show artist count on user page. Schedule calculations and show artist count on user page. Sep 1, 2017
@paramsingh
Copy link
Collaborator Author

Ready for review, it is a bit big, so if needed, I'll break it into smaller parts.

Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, given the one caveat about errors and whatnot, this is fine for now. I'm interested in merging this soon so we don't have too many things piling up.

def calculate_stats():
calculate_user_stats()

if __name__ == '__main__':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://tickets.metabrainz.org/browse/LB-214

Overall, this is too simplistic for production.

Copy link
Collaborator

@alastair alastair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

def calculate_user_stats():
for user in db_user.get_recently_logged_in_users():
recordings = stats_user.get_top_recordings(musicbrainz_id=user['musicbrainz_id'])
artists = stats_user.get_top_artists(musicbrainz_id=user['musicbrainz_id'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't align equals signs

def shutdown(self):
self.scheduler.shutdown()
"""Add the jobs that need to be run to the scheduler"""
self.scheduler.add_job(calculate_stats, 'interval', days=self.conf['STATS_CALCULATION_INTERVAL'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never really liked this - it means that we're doing calculations inside the webserver that really should be done in a separate thread. I'll open a ticket for this, we should move it to a separate container running with cron or some other periodic runner.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K, this seems like something I should do in a new PR instead of here. LB-215

}
]

return stats.run_query(query, parameters)[0]['artist_count']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we create a BigQuery object at least twice, for the stats module and in the bigquery writer. Can we abstract this into a single place like the database engine?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

if time_interval:
filter_clause = "AND listened_at >= TIMESTAMP_SUB(CURRENT_TIME(), INTERVAL {})".format(time_interval)

query = """SELECT COUNT(DISTINCT(artist_msid)) as artist_count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just thinking out loud here...
these are bigquery statements, not SQL. Does it make sense to have them here, or is it worth having a generic "bigquery" module like we have for db?

Copy link
Collaborator Author

@paramsingh paramsingh Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stats module is where I intended to keep all BigQuery SQL statements. I think that is what you mean by the "bigquery" module? Should I maybe rename it? (or keep a bigquery module inside the stats module?)



def get_recently_logged_in_users():
"""Returns a list of users who have logged-in in the last X days, X is
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can say
Returns a list of users who have logged-in in the last config.STATS_CALCULATION_LOGIN_TIME days

This way, both bigquery-writer and the stats stuff can use the
same initialization code.
Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, lets get this merged. If we find other problems with it, make a new ticket!

@mayhem mayhem merged commit ea34b89 into metabrainz:master Oct 10, 2017
@jwflory
Copy link
Contributor

jwflory commented Oct 11, 2017

@mayhem @paramsingh This didn't break your development environments, did it? I can't start my development environment now because of an issue with BigQuery credentials being initialized (maybe in 650d079). I'm still working through it, but will try to take a closer look soon. For now, here's my stacktrace.

web_1             | 2017-10-11 20:58:23,228 ERROR The BigQuery credentials file does not exist, cannot connect to BigQuery
web_1             | Traceback (most recent call last):
web_1             |   File "/code/listenbrainz/manage.py", line 196, in <module>
web_1             |     cli()
web_1             |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 664, in __call__
web_1             |     return self.main(*args, **kwargs)
web_1             |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 644, in main
web_1             |     rv = self.invoke(ctx)
web_1             |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 991, in invoke
web_1             |     return _process_result(sub_ctx.command.invoke(sub_ctx))
web_1             |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 837, in invoke
web_1             |     return ctx.invoke(self.callback, **ctx.params)
web_1             |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 464, in invoke
web_1             |     return callback(*args, **kwargs)
web_1             |   File "/code/listenbrainz/manage.py", line 28, in runserver
web_1             |     webserver.schedule_jobs(application)
web_1             |   File "/code/listenbrainz/listenbrainz/webserver/__init__.py", line 29, in schedule_jobs
web_1             |     app.scheduledJobs = ScheduledJobs(app.config)
web_1             |   File "/code/listenbrainz/listenbrainz/webserver/scheduler.py", line 19, in __init__
web_1             |     stats.init_bigquery_connection()
web_1             |   File "/code/listenbrainz/listenbrainz/stats/__init__.py", line 19, in init_bigquery_connection
web_1             |     bigquery = create_bigquery_object()
web_1             |   File "/code/listenbrainz/listenbrainz/bigquery.py", line 18, in create_bigquery_object
web_1             |     raise NoCredentialsFileException
web_1             | listenbrainz.bigquery.NoCredentialsFileException

Note: I had to import logging to get this to work, otherwise I get an error about missing module.

@paramsingh
Copy link
Collaborator Author

@jflory7 yeah, you found a bug. :)

@paramsingh paramsingh deleted the schedule-calculations branch October 12, 2017 18:35
@paramsingh
Copy link
Collaborator Author

I've fixed this in #267.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants