LB-352: Add script to fetch artist MBIDs provided recording MBIDs from MusicBrainz. #36
Conversation
Can you please leave a detailed description when you open a PR? |
What exactly is the thinking behind storing artist_mbids in a new table instead of just querying the musicbrainz_db when the mbids are needed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that schema changes should include a schema-update script too. See examples here: https://github.com/metabrainz/listenbrainz-server/tree/master/admin/sql/updates
I haven't tested this yet, I'll do it in the next round of reviews, once we know if the schema change is definitely needed or not.
manage.py
Outdated
if reset: | ||
db.init_db_engine(config.SQLALCHEMY_DATABASE_URI) | ||
with db.engine.begin() as connection: | ||
query = text("""TRUNCATE TABLE recording_artist""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All db queries should work as functions in the db module.
messybrainz/fetch_artist_mbids.py
Outdated
from sqlalchemy import text | ||
|
||
|
||
def check_valid_uuid(s): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other projects, we just create a uuid.UUID
object and return False if it raises a ValueError. I think that is a better method when compared to this long regex.
See here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Never reinvent the wheel if you don't have to -- that means you don't have to debug the wheel when inevitable bugs appear.
messybrainz/fetch_artist_mbids.py
Outdated
""" | ||
with db.engine.begin() as connection: | ||
for artist_mbid in artist_mbids: | ||
query = text("""INSERT INTO recording_artist (recording_mbid, artist_mbid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take a look at how queries are formatted, SQL keywords should be in upper-case and right aligned.
See example here: https://github.com/metabrainz/guidelines/blob/master/Python.md#sql
messybrainz/fetch_artist_mbids.py
Outdated
|
||
# Get a list of all distinct recording MBIDs from recording_json table | ||
with db.engine.begin() as connection: | ||
query = text("""SELECT DISTINCT data ->> 'recording_mbid' AS recording_mbid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be in the db module.
admin/sql/create_tables.sql
Outdated
CREATE TABLE recording_artist ( | ||
recording_mbid UUID NOT NULL, | ||
artist_mbid UUID NOT NULL | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also have a
updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW
column on this table.
manage.py
Outdated
|
||
result = fetch_artist_mbids_for_all_recording_mbids() | ||
|
||
print("Total recording MBIDs processed: {0}.".format(result[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code could be more readable by assigning the return value of fetch_artist_mbids_for_all_recording_mbids() to total_recordings_processed and total_recordings_added. While this isn't a big deal in this context, it makes code nicer to read and will become more important in other contexts later.
messybrainz/fetch_artist_mbids.py
Outdated
from sqlalchemy import text | ||
|
||
|
||
def check_valid_uuid(s): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Never reinvent the wheel if you don't have to -- that means you don't have to debug the wheel when inevitable bugs appear.
messybrainz/fetch_artist_mbids.py
Outdated
"artist_mbid": artist_mbid, | ||
}) | ||
|
||
return is_recording_mbid_present(recording_mbid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is strange -- you insert data and then you query for the data again. This means that for every insert there is an unnecessary query to the DB to fetch the data again. Your call to add_artist_mbids() should catch exceptions and add accordingly -- this means that you can trust the INSERT statement to execute correctly and insert the data as you expect.
messybrainz/fetch_artist_mbids.py
Outdated
""" Fetches artist MBIDs from the MusicBrainz database for the recording MBID. | ||
And inserts the artist MBIDs into the recording_artist table. | ||
""" | ||
try : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is more pythonic to catch this exception where this function is called, rather than here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for the case if someone submitted a wrong UUID in listen which doesn't exist in MusicBrainz database.
messybrainz/fetch_artist_mbids.py
Outdated
for artist in recording['artists']: | ||
artist_mbids.append(artist['id']) | ||
|
||
result = add_artist_mbids(recording_mbid, artist_mbids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a function called fetch_artist_mbids() -- from a caller's perspective, this function should not alter the database. If this is how you want the function to really work, you should change the name to something that indicates that the database state my be changed. fetch_and_insert_artist_mbids() could be a better name... But, more likely you should examine your algorithmic flow and see if this is perhaps better broken into two separate functions...
messybrainz/fetch_artist_mbids.py
Outdated
recording MBIDs that were added to the recording_artist table. | ||
""" | ||
# Init databases | ||
db.init_db_engine(config.SQLALCHEMY_DATABASE_URI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are the init steps being done in a function to fetch artist mbids? It should be done by the main function that sets up the correct environment...
27d045d
to
997f7e8
Compare
messybrainz/fetch_artist_mbids.py
Outdated
""" | ||
|
||
with db.engine.begin() as connection: | ||
if reset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, for the next set of improvements to this PR, we should adapt these features to work in an incremental way. By running this script it should examine any data that hasn't been matched yet and attempt to match it. Then we should expose a separate function for truncating the tables -- this then gives the user the ability run this program once or repeatedly and if choosing to start over, the tables can be truncated.
messybrainz/fetch_artist_mbids.py
Outdated
num_recording_mbids_added = 0 | ||
num_recording_mbids_processed = recording_mbids.rowcount | ||
for recording_mbid in recording_mbids: | ||
if is_valid_uuid(recording_mbid[0]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this step is even needed -- we can't insert invalid UUIDs into the database... But, if we keep it, then we should at least report an error to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These UUIDs are inserted in JSON fields of recording_json table. And we don't have any check in messybrainz submit_listen to check if recording_mbid is valid UUID or not. But we get recordings only from LB which checks for valid UUIDs. So, as you said it's not possible to have invalid UUIDs. I'll remove this check. And in case we need it we should add it to submit_listen function.
messybrainz/fetch_artist_mbids.py
Outdated
|
||
def is_recording_mbid_present(connection, recording_mbid): | ||
""" | ||
Check if recording MBID is already present in recording_artist table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the table name in the comment.
messybrainz/fetch_artist_mbids.py
Outdated
num_recording_mbids_processed = recording_mbids.rowcount | ||
for recording_mbid in recording_mbids: | ||
if is_valid_uuid(recording_mbid[0]): | ||
if not is_recording_mbid_present(connection, recording_mbid[0]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the same pattern problem again -- you could write one query to combining the fetching of unique recording_mbids AND to figure out which ones are not in clusters yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting closer!
messybrainz/fetch_artist_mbids.py
Outdated
those recording MBIDs. | ||
""" | ||
|
||
query = text("""SELECT DISTINCT data ->> 'recording_mbid' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is better -- however, this approach will still get into trouble with large tables. You should be able to rewrite this query with a LEFT JOIN query that allows you to find rows that exist in one table, but not the other.
manage.py
Outdated
db.init_db_engine(config.SQLALCHEMY_DATABASE_URI) | ||
musicbrainz_db.init_db_engine(config.MB_DATABASE_URI) | ||
|
||
num_recording_mbids_processed, num_recording_mbids_added = fetch_and_store_artist_mbids_for_all_recording_mbids() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to catch exceptions and report errors to the user here.
00f10cd
to
5004e35
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we plan to add tests for the new db functions in a new PR?
@@ -0,0 +1,92 @@ | |||
# Script to fetch artist MBIDs from MusicBrainz Database using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this would be better as messybrainz/db/artist.py
, also this isn't a script but a module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On IRC, we realized that msb doesn't have a db
module yet. Should probably do this in a different PR then.
|
||
result = connection.execute(query) | ||
|
||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that directly returning result
here returns an sqlalchemy ResultProxy element. Maybe we should do a fetchall and only return a list of uuids?
num_recording_mbids_processed = recording_mbids.rowcount | ||
for recording_mbid in recording_mbids: | ||
try: | ||
fetch_and_store_artist_mbids(connection, recording_mbid[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recording_mbid[0]
would be much more readable as just recording_mbid
. This would be possible if we returned a list of mbids instead of the direct result of the query in fetch_recording_mbids_not_in_recording_artist_join
.
@@ -11,6 +11,7 @@ SECRET_KEY = '''{{template "KEY" "secret_key"}}''' | |||
SQLALCHEMY_DATABASE_URI = "postgresql://messybrainz@{{.Address}}:{{.Port}}/messybrainz" | |||
TEST_SQLALCHEMY_DATABASE_URI = "postgresql://msb_test@{{.Address}}:{{.Port}}/msb_test" | |||
POSTGRES_ADMIN_URI="postgresql://postgres@{{.Address}}:{{.Port}}/template1" | |||
MB_DATABASE_URI = "postgresql://musicbrainz_ro@{{.Address}}:{{.Port}}/musicbrainz_db" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes messybrainz connect to the master postgres server. Please open a ticket saying that MessyBrainz should use pgbouncer-slave and assign to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should now move the new db functions to the db module. |
This script will fetch artist MBIDs from the MusicBrainz Database for the recording MBIDs present in the table.
Queries should not be in manage.py scripts and query to fetch recording_mbids from recording_json should be in data.py. Cache is not to be used so remove its initialization.
If UUID is valid or not can be checked by creating an UUID object and checking for value error.
Format queries according to https://github.com/metabrainz/guidelines/blob/master/Python.md#sql. Move init code to manage.py and remove unused imports and code.
Renaming table recording_artist to recording_artist_join and add a column updated to store information on when mbids are added to this table.
Rename recording_artist to recording_artist_json in comments.
Remove checks for valid UUID as submitted recording_json contain only valid UUIDs. This is ensured by ListenBrainz which is the only source of data currently in MessyBrainz. Remove check for recording MBIDs is present in recording_artist_json table or not. As one query while selecting MBIDs does the trick.
It is efficient to use only one insert query instead of multiple queries for each value we want to insert.
Use a JOIN query to get recording MBIDs present in recording_json table but not in recording_artist_join table. And show exceptions triggered by the script to the user running the script.
Rename fetch_artist_mbids.py to fetch_and_store_artist_mbids.py as fetch_artist_mbids doesnot give user clue about the fact that this script also stores the fetched MBIDs.
@paramsingh I'll create a different PR for tests. I plan to use test data from LB-367 for testing. |
Fabric == 1.10.2 | ||
Flask-Testing == 0.4.2 | ||
Flask-SQLAlchemy==2.0 | ||
Flask-UUID == 0.2 | ||
Jinja2 == 2.8 | ||
SQLAlchemy==1.0.8 | ||
SQLAlchemy==1.2.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should update this here too: https://github.com/metabrainz/messybrainz-server/blob/master/setup.py#L16
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Move fetch_and_store_artist_mbids.py to db module and rename it to artist.py
Remove fetch_and_store_artist_mbids and add fetch_artist_mbids which does the fetching part and insert_artist_mbids does the storing part. And add the function get_artist_mbids_for_recording_mbid to get artist MBIDs using recording MBID.
Add valid_recordings_with_recording_mbids.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me now, just a few questions.
messybrainz/db/tests/test_artist.py
Outdated
artist_mbids = artist.fetch_artist_mbids(connection, recording_mbid) | ||
artist.insert_artist_mbids(connection, recording_mbid, artist_mbids) | ||
artist_mbids_from_join = artist.get_artist_mbids_for_recording_mbid(connection, recording_mbid) | ||
self.assertEqual(set(artist_mbids), set(artist_mbids_from_join)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be assertSetEqual
?
requirements.txt
Outdated
@@ -18,3 +18,4 @@ ujson==1.33 | |||
pytest==3.3.1 | |||
pytest-cov==2.5.1 | |||
unittest2==1.1.0 | |||
mock == 2.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be using this https://docs.python.org/3/library/unittest.mock.html instead of installing via pip.
messybrainz/db/tests/test_artist.py
Outdated
artist.fetch_and_store_artist_mbids_for_all_recording_mbids() | ||
|
||
artist_mbids = artist.get_artist_mbids_for_recording_mbid(connection, "cad174ad-d683-4858-a205-7bdc4175fff7") | ||
self.assertSetEqual(set(artist_mbids), set(artist_mbids_fetched[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why exactly are you creating a set here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using sets for assertions because we can get multiple artist MBIDs for a single recording MBID, but the order of retrieval is not known.
Add tests to test the functionality to fetch artist MBIDs using recording MBID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\ø/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This PR has 2 major parts.
and then we check for valid MBIDs and is MBID already present in MessyBrainz database(recording_artist table (a new table to store recording artist MBIDs mapping)). In accordance with the results of our checks, we either put the recording MBIDs into the table or go to next one.
This depends on BU-11: Add MusicBrainz database module to brainzutils brainzutils-python#14 for access to MusicBrainzDB.