Skip to content
This repository has been archived by the owner on May 1, 2020. It is now read-only.

LB-352: Add script to fetch artist MBIDs provided recording MBIDs from MusicBrainz. #36

Merged
merged 21 commits into from Jun 11, 2018

Conversation

kartikeyaSh
Copy link
Contributor

@kartikeyaSh kartikeyaSh commented May 6, 2018

This PR has 2 major parts.

  1. Setup for accessing MusicBrainz Database: This way we can access the image of MusicBrainz directly.
  2. A script which uses MusicBrainz database to fetch artist MBIDs using the recording MBIDs present in the database. In this script we first get all the recording MBIDs from the recording_json table.
    and then we check for valid MBIDs and is MBID already present in MessyBrainz database(recording_artist table (a new table to store recording artist MBIDs mapping)). In accordance with the results of our checks, we either put the recording MBIDs into the table or go to next one.
    This depends on BU-11: Add MusicBrainz database module to brainzutils brainzutils-python#14 for access to MusicBrainzDB.

@mayhem mayhem requested a review from paramsingh May 14, 2018 15:26
@mayhem
Copy link
Member

mayhem commented May 14, 2018

Can you please leave a detailed description when you open a PR?

@paramsingh paramsingh self-assigned this May 18, 2018
@paramsingh
Copy link
Collaborator

What exactly is the thinking behind storing artist_mbids in a new table instead of just querying the musicbrainz_db when the mbids are needed?

Copy link
Collaborator

@paramsingh paramsingh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that schema changes should include a schema-update script too. See examples here: https://github.com/metabrainz/listenbrainz-server/tree/master/admin/sql/updates

I haven't tested this yet, I'll do it in the next round of reviews, once we know if the schema change is definitely needed or not.

manage.py Outdated
if reset:
db.init_db_engine(config.SQLALCHEMY_DATABASE_URI)
with db.engine.begin() as connection:
query = text("""TRUNCATE TABLE recording_artist""")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All db queries should work as functions in the db module.

from sqlalchemy import text


def check_valid_uuid(s):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other projects, we just create a uuid.UUID object and return False if it raises a ValueError. I think that is a better method when compared to this long regex.

See here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Never reinvent the wheel if you don't have to -- that means you don't have to debug the wheel when inevitable bugs appear.

"""
with db.engine.begin() as connection:
for artist_mbid in artist_mbids:
query = text("""INSERT INTO recording_artist (recording_mbid, artist_mbid)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at how queries are formatted, SQL keywords should be in upper-case and right aligned.

See example here: https://github.com/metabrainz/guidelines/blob/master/Python.md#sql


# Get a list of all distinct recording MBIDs from recording_json table
with db.engine.begin() as connection:
query = text("""SELECT DISTINCT data ->> 'recording_mbid' AS recording_mbid
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be in the db module.

CREATE TABLE recording_artist (
recording_mbid UUID NOT NULL,
artist_mbid UUID NOT NULL
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also have a

updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW

column on this table.

manage.py Outdated

result = fetch_artist_mbids_for_all_recording_mbids()

print("Total recording MBIDs processed: {0}.".format(result[0]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code could be more readable by assigning the return value of fetch_artist_mbids_for_all_recording_mbids() to total_recordings_processed and total_recordings_added. While this isn't a big deal in this context, it makes code nicer to read and will become more important in other contexts later.

from sqlalchemy import text


def check_valid_uuid(s):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Never reinvent the wheel if you don't have to -- that means you don't have to debug the wheel when inevitable bugs appear.

"artist_mbid": artist_mbid,
})

return is_recording_mbid_present(recording_mbid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is strange -- you insert data and then you query for the data again. This means that for every insert there is an unnecessary query to the DB to fetch the data again. Your call to add_artist_mbids() should catch exceptions and add accordingly -- this means that you can trust the INSERT statement to execute correctly and insert the data as you expect.

""" Fetches artist MBIDs from the MusicBrainz database for the recording MBID.
And inserts the artist MBIDs into the recording_artist table.
"""
try :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is more pythonic to catch this exception where this function is called, rather than here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the case if someone submitted a wrong UUID in listen which doesn't exist in MusicBrainz database.

for artist in recording['artists']:
artist_mbids.append(artist['id'])

result = add_artist_mbids(recording_mbid, artist_mbids)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a function called fetch_artist_mbids() -- from a caller's perspective, this function should not alter the database. If this is how you want the function to really work, you should change the name to something that indicates that the database state my be changed. fetch_and_insert_artist_mbids() could be a better name... But, more likely you should examine your algorithmic flow and see if this is perhaps better broken into two separate functions...

recording MBIDs that were added to the recording_artist table.
"""
# Init databases
db.init_db_engine(config.SQLALCHEMY_DATABASE_URI)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the init steps being done in a function to fetch artist mbids? It should be done by the main function that sets up the correct environment...

"""

with db.engine.begin() as connection:
if reset:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, for the next set of improvements to this PR, we should adapt these features to work in an incremental way. By running this script it should examine any data that hasn't been matched yet and attempt to match it. Then we should expose a separate function for truncating the tables -- this then gives the user the ability run this program once or repeatedly and if choosing to start over, the tables can be truncated.

num_recording_mbids_added = 0
num_recording_mbids_processed = recording_mbids.rowcount
for recording_mbid in recording_mbids:
if is_valid_uuid(recording_mbid[0]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this step is even needed -- we can't insert invalid UUIDs into the database... But, if we keep it, then we should at least report an error to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These UUIDs are inserted in JSON fields of recording_json table. And we don't have any check in messybrainz submit_listen to check if recording_mbid is valid UUID or not. But we get recordings only from LB which checks for valid UUIDs. So, as you said it's not possible to have invalid UUIDs. I'll remove this check. And in case we need it we should add it to submit_listen function.


def is_recording_mbid_present(connection, recording_mbid):
"""
Check if recording MBID is already present in recording_artist table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the table name in the comment.

num_recording_mbids_processed = recording_mbids.rowcount
for recording_mbid in recording_mbids:
if is_valid_uuid(recording_mbid[0]):
if not is_recording_mbid_present(connection, recording_mbid[0]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the same pattern problem again -- you could write one query to combining the fetching of unique recording_mbids AND to figure out which ones are not in clusters yet.

Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting closer!

those recording MBIDs.
"""

query = text("""SELECT DISTINCT data ->> 'recording_mbid'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better -- however, this approach will still get into trouble with large tables. You should be able to rewrite this query with a LEFT JOIN query that allows you to find rows that exist in one table, but not the other.

manage.py Outdated
db.init_db_engine(config.SQLALCHEMY_DATABASE_URI)
musicbrainz_db.init_db_engine(config.MB_DATABASE_URI)

num_recording_mbids_processed, num_recording_mbids_added = fetch_and_store_artist_mbids_for_all_recording_mbids()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to catch exceptions and report errors to the user here.

Copy link
Collaborator

@paramsingh paramsingh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we plan to add tests for the new db functions in a new PR?

@@ -0,0 +1,92 @@
# Script to fetch artist MBIDs from MusicBrainz Database using
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this would be better as messybrainz/db/artist.py, also this isn't a script but a module.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On IRC, we realized that msb doesn't have a db module yet. Should probably do this in a different PR then.


result = connection.execute(query)

return result
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that directly returning result here returns an sqlalchemy ResultProxy element. Maybe we should do a fetchall and only return a list of uuids?

num_recording_mbids_processed = recording_mbids.rowcount
for recording_mbid in recording_mbids:
try:
fetch_and_store_artist_mbids(connection, recording_mbid[0])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recording_mbid[0] would be much more readable as just recording_mbid. This would be possible if we returned a list of mbids instead of the direct result of the query in fetch_recording_mbids_not_in_recording_artist_join.

@@ -11,6 +11,7 @@ SECRET_KEY = '''{{template "KEY" "secret_key"}}'''
SQLALCHEMY_DATABASE_URI = "postgresql://messybrainz@{{.Address}}:{{.Port}}/messybrainz"
TEST_SQLALCHEMY_DATABASE_URI = "postgresql://msb_test@{{.Address}}:{{.Port}}/msb_test"
POSTGRES_ADMIN_URI="postgresql://postgres@{{.Address}}:{{.Port}}/template1"
MB_DATABASE_URI = "postgresql://musicbrainz_ro@{{.Address}}:{{.Port}}/musicbrainz_db"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes messybrainz connect to the master postgres server. Please open a ticket saying that MessyBrainz should use pgbouncer-slave and assign to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paramsingh
Copy link
Collaborator

We should now move the new db functions to the db module.

This script will fetch artist MBIDs from the MusicBrainz Database
for the recording MBIDs present in the table.
Queries should not be in manage.py scripts
and query to fetch recording_mbids from recording_json
should be in data.py.
Cache is not to be used so remove its initialization.
If UUID is valid or not can be checked by creating
an UUID object and checking for value error.
Format queries according to https://github.com/metabrainz/guidelines/blob/master/Python.md#sql.
Move init code to manage.py and remove unused imports and code.
Renaming table recording_artist to recording_artist_join
and add a column updated to store information on when mbids
are added to this table.
Rename recording_artist to recording_artist_json in comments.
Remove checks for valid UUID as submitted recording_json
contain only valid UUIDs. This is ensured by ListenBrainz
which is the only source of data currently in MessyBrainz.
Remove check for recording MBIDs is present in recording_artist_json
table or not. As one query while selecting MBIDs does the trick.
It is efficient to use only one insert query
instead of multiple queries for each value we
want to insert.
Use a JOIN query to get recording MBIDs present in
recording_json table but not in recording_artist_join table.
And show exceptions triggered by the script to the user running
the script.
Rename fetch_artist_mbids.py to fetch_and_store_artist_mbids.py as
fetch_artist_mbids doesnot give user clue about the fact that this
script also stores the fetched MBIDs.
@kartikeyaSh
Copy link
Contributor Author

@paramsingh I'll create a different PR for tests. I plan to use test data from LB-367 for testing.

Fabric == 1.10.2
Flask-Testing == 0.4.2
Flask-SQLAlchemy==2.0
Flask-UUID == 0.2
Jinja2 == 2.8
SQLAlchemy==1.0.8
SQLAlchemy==1.2.5
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Move fetch_and_store_artist_mbids.py to db module
and rename it to artist.py
Remove fetch_and_store_artist_mbids and add fetch_artist_mbids
which does the fetching part and insert_artist_mbids does the
storing part. And add the function get_artist_mbids_for_recording_mbid
to get artist MBIDs using recording MBID.
Add valid_recordings_with_recording_mbids.json
Copy link
Collaborator

@paramsingh paramsingh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now, just a few questions.

artist_mbids = artist.fetch_artist_mbids(connection, recording_mbid)
artist.insert_artist_mbids(connection, recording_mbid, artist_mbids)
artist_mbids_from_join = artist.get_artist_mbids_for_recording_mbid(connection, recording_mbid)
self.assertEqual(set(artist_mbids), set(artist_mbids_from_join))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be assertSetEqual?

requirements.txt Outdated
@@ -18,3 +18,4 @@ ujson==1.33
pytest==3.3.1
pytest-cov==2.5.1
unittest2==1.1.0
mock == 2.0.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be using this https://docs.python.org/3/library/unittest.mock.html instead of installing via pip.

artist.fetch_and_store_artist_mbids_for_all_recording_mbids()

artist_mbids = artist.get_artist_mbids_for_recording_mbid(connection, "cad174ad-d683-4858-a205-7bdc4175fff7")
self.assertSetEqual(set(artist_mbids), set(artist_mbids_fetched[0]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exactly are you creating a set here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using sets for assertions because we can get multiple artist MBIDs for a single recording MBID, but the order of retrieval is not known.

Add tests to test the functionality to fetch artist MBIDs using
recording MBID.
Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\ø/

Copy link
Collaborator

@paramsingh paramsingh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@paramsingh paramsingh merged commit 11c9542 into metabrainz:master Jun 11, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants