New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Schedule calculations and show artist count on user page. #244
Conversation
Now that I look at it, this PR could be broken into smaller (more easily reviewable) parts, but I would prefer to start with breaking #202 into smaller parts first. I'm gonna keep this open for now, so that the work that has been done is not cooped up in some branch in my fork, but I think I'll eventually break parts of this PR into smaller ones as those would get reviewed more easily. |
Gonna rebase this now, with the merge of #202. Hopefully that makes it more manageable. If not, then I'll try to split this into different PRs. |
Also add some docstrings.
Also fix a couple bugs in the code
Ignore the styling for now, this is more of a proof-of-concept.
5d25670
to
f91488f
Compare
So, I was calculating artist count by just taking the length of the top artists field in the json, but now that we've started limiting the top artists, that will have to be a different query. Also, just a note: you can trigger manual calculation of stats by I'm gonna start with showing these stats on the user page now, I've left a bunch of |
Ok, I'll have a look soon. I think the stats need better formatting too -- perhaps start with a table that is similar to the one on the current-status page? |
BEGIN; | ||
|
||
-- XXX(param): What should values that are null already be set to here now? | ||
-- timestamp 0 or NOW() ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timestamp 0
listenbrainz/db/stats.py
Outdated
from listenbrainz import db | ||
|
||
|
||
# XXX(param): think about the names of these stats variables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which variables are you referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is more appropriate artists
or artist_stats
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
me thinks artist_stats but the column in the db is artists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh. DB table names should always be singular. Can we still change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I did this before I became aware of the singular names thing. Can be easily changed, I'll make an issue for it. https://tickets.metabrainz.org/browse/LB-201
listenbrainz/db/stats.py
Outdated
|
||
# put all artist stats into one dict which will then be inserted | ||
# into the artist column of the stats.user table | ||
# XXX(param): This makes the schema of the stats very variable though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please elaborate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have two artist related queries we run for users right now, the results of both of these queries go into the same jsonb field in the db. The jsonb looks like:
{
"artist_count": 67,
"all_time": [
{ "name": XXX, "listen_count": XXX}
]
}
If someone were to add a new artist stat for users, it would go in the same field, so I guess we should document the format of these fields somewhere.
listenbrainz/db/tests/test_stats.py
Outdated
def path_to_data_file(self, filename): | ||
# XXX(param): we have this function in IntegrationTestCase also, | ||
# maybe find some way to share it | ||
# ListenBrainzTestCase? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats probably ok.
Also improve tests and variable names a bit
Ready for review, it is a bit big, so if needed, I'll break it into smaller parts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, given the one caveat about errors and whatnot, this is fine for now. I'm interested in merging this soon so we don't have too many things piling up.
def calculate_stats(): | ||
calculate_user_stats() | ||
|
||
if __name__ == '__main__': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://tickets.metabrainz.org/browse/LB-214
Overall, this is too simplistic for production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
listenbrainz/stats/calculate.py
Outdated
def calculate_user_stats(): | ||
for user in db_user.get_recently_logged_in_users(): | ||
recordings = stats_user.get_top_recordings(musicbrainz_id=user['musicbrainz_id']) | ||
artists = stats_user.get_top_artists(musicbrainz_id=user['musicbrainz_id']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't align equals signs
def shutdown(self): | ||
self.scheduler.shutdown() | ||
"""Add the jobs that need to be run to the scheduler""" | ||
self.scheduler.add_job(calculate_stats, 'interval', days=self.conf['STATS_CALCULATION_INTERVAL']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never really liked this - it means that we're doing calculations inside the webserver that really should be done in a separate thread. I'll open a ticket for this, we should move it to a separate container running with cron or some other periodic runner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K, this seems like something I should do in a new PR instead of here. LB-215
} | ||
] | ||
|
||
return stats.run_query(query, parameters)[0]['artist_count'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we create a BigQuery object at least twice, for the stats module and in the bigquery writer. Can we abstract this into a single place like the database engine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
if time_interval: | ||
filter_clause = "AND listened_at >= TIMESTAMP_SUB(CURRENT_TIME(), INTERVAL {})".format(time_interval) | ||
|
||
query = """SELECT COUNT(DISTINCT(artist_msid)) as artist_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just thinking out loud here...
these are bigquery statements, not SQL. Does it make sense to have them here, or is it worth having a generic "bigquery" module like we have for db?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stats module is where I intended to keep all BigQuery SQL statements. I think that is what you mean by the "bigquery" module? Should I maybe rename it? (or keep a bigquery
module inside the stats module?)
listenbrainz/db/user.py
Outdated
|
||
|
||
def get_recently_logged_in_users(): | ||
"""Returns a list of users who have logged-in in the last X days, X is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can say
Returns a list of users who have logged-in in the last config.STATS_CALCULATION_LOGIN_TIME days
This way, both bigquery-writer and the stats stuff can use the same initialization code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, lets get this merged. If we find other problems with it, make a new ticket!
@mayhem @paramsingh This didn't break your development environments, did it? I can't start my development environment now because of an issue with BigQuery credentials being initialized (maybe in 650d079). I'm still working through it, but will try to take a closer look soon. For now, here's my stacktrace.
Note: I had to |
@jflory7 yeah, you found a bug. :) |
I've fixed this in #267. |
This PR is based on #202 .
It does the following things.
webserver.scheduler
.