LB-352: Add script to fetch artist MBIDs provided recording MBIDs from MusicBrainz. #36

kartikeyaSh · 2018-05-06T17:56:29Z

This PR has two major parts:

1. Setup for accessing the MusicBrainz database, so that we can query the MusicBrainz database image directly.
2. A script that uses the MusicBrainz database to fetch artist MBIDs for the recording MBIDs present in the MessyBrainz database. The script first gets all the recording MBIDs from the recording_json table. It then checks that each MBID is valid and whether it is already present in the recording_artist table (a new table that stores the recording MBID to artist MBID mapping). Depending on the results of these checks, it either adds the recording MBID to the table or moves on to the next one.

This depends on BU-11: Add MusicBrainz database module to brainzutils (brainzutils-python#14) for access to the MusicBrainz database.

mayhem · 2018-05-14T15:37:26Z

Can you please leave a detailed description when you open a PR?

paramsingh · 2018-05-18T22:30:39Z

What exactly is the thinking behind storing artist_mbids in a new table instead of just querying the musicbrainz_db when the mbids are needed?

paramsingh

Note that schema changes should include a schema-update script too. See examples here: https://github.com/metabrainz/listenbrainz-server/tree/master/admin/sql/updates

I haven't tested this yet; I'll do it in the next round of reviews, once we know whether the schema change is definitely needed.

paramsingh · 2018-05-18T22:31:37Z

manage.py

+    if reset:
+        db.init_db_engine(config.SQLALCHEMY_DATABASE_URI)
+        with db.engine.begin() as connection:
+            query = text("""TRUNCATE TABLE recording_artist""")


All db queries should work as functions in the db module.

paramsingh · 2018-05-18T22:32:56Z

messybrainz/fetch_artist_mbids.py

+from sqlalchemy import text
+
+
+def check_valid_uuid(s):


In other projects, we just create a uuid.UUID object and return False if it raises a ValueError. I think that is a better method when compared to this long regex.

See here
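
A minimal sketch of that approach, assuming the MBID arrives as a string; `is_valid_uuid` is an illustrative name here, not necessarily the helper used in the other projects:

```python
import uuid


def is_valid_uuid(value):
    """Return True if value parses as a UUID, False otherwise."""
    try:
        uuid.UUID(value)
    except ValueError:
        # uuid.UUID raises ValueError for badly formed strings.
        return False
    return True


print(is_valid_uuid("cad174ad-d683-4858-a205-7bdc4175fff7"))
print(is_valid_uuid("not-a-uuid"))
```

Note that `uuid.UUID` is lenient about layout (it also accepts the 32-digit hex form without hyphens), so this checks that the value can be parsed rather than that it is in canonical form.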

paramsingh · 2018-05-18T22:33:55Z

messybrainz/fetch_artist_mbids.py

+    """
+    with db.engine.begin() as connection:
+        for artist_mbid in artist_mbids:
+            query = text("""INSERT INTO recording_artist (recording_mbid, artist_mbid)


Please take a look at how queries are formatted: SQL keywords should be in upper case and right-aligned.

See example here: https://github.com/metabrainz/guidelines/blob/master/Python.md#sql
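
For illustration, the INSERT from this diff might be laid out as follows under that convention. This is a sketch of one reading of the guideline, not a quote from it. The standard-library `sqlite3` module stands in for PostgreSQL so the snippet runs on its own, and the artist MBID is a placeholder value.

```python
import sqlite3

# Keywords are upper-case and right-aligned, so their right edges line up.
INSERT_QUERY = """
    INSERT INTO recording_artist (recording_mbid, artist_mbid)
         VALUES (:recording_mbid, :artist_mbid)
"""

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE recording_artist ("
    "recording_mbid TEXT NOT NULL, artist_mbid TEXT NOT NULL)"
)
connection.execute(INSERT_QUERY, {
    "recording_mbid": "cad174ad-d683-4858-a205-7bdc4175fff7",
    "artist_mbid": "00000000-0000-0000-0000-000000000001",  # placeholder
})

row_count = connection.execute(
    "SELECT COUNT(*) FROM recording_artist"
).fetchone()[0]
print(row_count)
```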

paramsingh · 2018-05-18T22:34:41Z

messybrainz/fetch_artist_mbids.py

+
+    # Get a list of all distinct recording MBIDs from recording_json table
+    with db.engine.begin() as connection:
+        query = text("""SELECT DISTINCT data ->> 'recording_mbid' AS recording_mbid


This should be in the db module.

mayhem · 2018-05-17T15:37:31Z

admin/sql/create_tables.sql

+CREATE TABLE recording_artist (
+  recording_mbid UUID NOT NULL,
+  artist_mbid UUID NOT NULL
+);


I think we should also have a

updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()

column on this table.

mayhem · 2018-05-28T09:59:51Z

manage.py

+
+    result = fetch_artist_mbids_for_all_recording_mbids()
+
+    print("Total recording MBIDs processed: {0}.".format(result[0]))


This code could be more readable by assigning the return value of fetch_artist_mbids_for_all_recording_mbids() to total_recordings_processed and total_recordings_added. While this isn't a big deal in this context, it makes code nicer to read and will become more important in other contexts later.
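
A small sketch of the suggested unpacking. The function body below is a hypothetical stand-in that returns fixed counts, and the second message is an assumed wording:

```python
def fetch_artist_mbids_for_all_recording_mbids():
    """Hypothetical stand-in that returns (processed, added) counts."""
    return 10, 7


# Unpack the tuple into named values instead of indexing result[0], result[1].
total_recordings_processed, total_recordings_added = (
    fetch_artist_mbids_for_all_recording_mbids()
)

message = "Total recording MBIDs processed: {0}.".format(total_recordings_processed)
print(message)
print("Total recording MBIDs added: {0}.".format(total_recordings_added))
```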

mayhem · 2018-05-28T10:00:38Z

messybrainz/fetch_artist_mbids.py

+from sqlalchemy import text
+
+
+def check_valid_uuid(s):


+1

Never reinvent the wheel if you don't have to -- that means you don't have to debug the wheel when inevitable bugs appear.

mayhem · 2018-05-28T10:10:07Z

messybrainz/fetch_artist_mbids.py

+            "artist_mbid": artist_mbid,
+            })
+
+    return is_recording_mbid_present(recording_mbid)


This part is strange -- you insert data and then you query for the data again. This means that for every insert there is an unnecessary query to the DB to fetch the data again. Your call to add_artist_mbids() should catch exceptions and add accordingly -- this means that you can trust the INSERT statement to execute correctly and insert the data as you expect.
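
A rough sketch of that pattern, with the standard-library `sqlite3` module standing in for the real database and made-up MBIDs: the insert function returns nothing, and the caller counts a recording as added unless the INSERT raises.

```python
import sqlite3


def add_artist_mbids(connection, recording_mbid, artist_mbids):
    """Insert one row per artist MBID and let database errors propagate."""
    connection.executemany(
        "INSERT INTO recording_artist (recording_mbid, artist_mbid) VALUES (?, ?)",
        [(recording_mbid, artist_mbid) for artist_mbid in artist_mbids],
    )


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE recording_artist ("
    "recording_mbid TEXT NOT NULL, artist_mbid TEXT NOT NULL)"
)

num_added = 0
for recording_mbid, artist_mbids in [
    ("rec-1", ["artist-a", "artist-b"]),
    ("rec-2", [None]),  # violates NOT NULL, so this INSERT fails
]:
    try:
        with connection:  # commits on success, rolls back on error
            add_artist_mbids(connection, recording_mbid, artist_mbids)
        num_added += 1  # no follow-up SELECT needed to confirm the insert
    except sqlite3.IntegrityError as error:
        print("Could not store {0}: {1}".format(recording_mbid, error))

stored_rows = connection.execute(
    "SELECT COUNT(*) FROM recording_artist"
).fetchone()[0]
```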

mayhem · 2018-05-28T10:11:56Z

messybrainz/fetch_artist_mbids.py

+    """ Fetches artist MBIDs from the MusicBrainz database for the recording MBID.
+        And inserts the artist MBIDs into the recording_artist table.
+    """
+    try :


It is more pythonic to catch this exception where this function is called, rather than here.

This is for the case where someone submitted a wrong UUID in a listen, one that doesn't exist in the MusicBrainz database.

mayhem · 2018-05-28T10:14:30Z

messybrainz/fetch_artist_mbids.py

+    for artist in recording['artists']:
+        artist_mbids.append(artist['id'])
+
+    result = add_artist_mbids(recording_mbid, artist_mbids)


You have a function called fetch_artist_mbids() -- from a caller's perspective, this function should not alter the database. If this is how you want the function to really work, you should change the name to something that indicates that the database state may be changed. fetch_and_insert_artist_mbids() could be a better name... But, more likely you should examine your algorithmic flow and see if this is perhaps better broken into two separate functions...
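
One way the split could look, as a sketch only: a dictionary stands in for the MusicBrainz lookup, `sqlite3` stands in for the MessyBrainz database, and `insert_artist_mbids` is an assumed name for the writing half.

```python
import sqlite3

# Illustrative stand-in for the MusicBrainz database lookup.
MUSICBRAINZ_RECORDINGS = {
    "rec-1": {"artists": [{"id": "artist-a"}, {"id": "artist-b"}]},
}


def fetch_artist_mbids(recording_mbid):
    """Read-only: return the artist MBIDs for a recording MBID."""
    recording = MUSICBRAINZ_RECORDINGS[recording_mbid]
    return [artist["id"] for artist in recording["artists"]]


def insert_artist_mbids(connection, recording_mbid, artist_mbids):
    """Write-only: store the recording/artist pairs."""
    connection.executemany(
        "INSERT INTO recording_artist (recording_mbid, artist_mbid) VALUES (?, ?)",
        [(recording_mbid, artist_mbid) for artist_mbid in artist_mbids],
    )


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE recording_artist ("
    "recording_mbid TEXT NOT NULL, artist_mbid TEXT NOT NULL)"
)

artist_mbids = fetch_artist_mbids("rec-1")  # no database write happens here
insert_artist_mbids(connection, "rec-1", artist_mbids)

stored = [
    row[0]
    for row in connection.execute(
        "SELECT artist_mbid FROM recording_artist ORDER BY artist_mbid"
    )
]
```

With this shape, a caller can fetch without side effects, and the name of each function says whether the database state can change.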

mayhem · 2018-05-28T10:15:40Z

messybrainz/fetch_artist_mbids.py

+        recording MBIDs that were added to the recording_artist table.
+    """
+    # Init databases
+    db.init_db_engine(config.SQLALCHEMY_DATABASE_URI)


Why are the init steps being done in a function to fetch artist mbids? It should be done by the main function that sets up the correct environment...

mayhem · 2018-05-29T13:18:28Z

messybrainz/fetch_artist_mbids.py

+    """
+
+    with db.engine.begin() as connection:
+        if reset:


OK, for the next set of improvements to this PR, we should adapt these features to work in an incremental way. When this script runs, it should examine any data that hasn't been matched yet and attempt to match it. Then we should expose a separate function for truncating the tables -- this gives the user the ability to run this program once or repeatedly and, if they choose to start over, the tables can be truncated.

mayhem · 2018-05-29T13:21:31Z

messybrainz/fetch_artist_mbids.py

+        num_recording_mbids_added = 0
+        num_recording_mbids_processed = recording_mbids.rowcount
+        for recording_mbid in recording_mbids:
+            if is_valid_uuid(recording_mbid[0]):


Not sure if this step is even needed -- we can't insert invalid UUIDs into the database... But, if we keep it, then we should at least report an error to the user.

These UUIDs are stored in JSON fields of the recording_json table, and submit_listen in MessyBrainz does not check whether recording_mbid is a valid UUID. But we only get recordings from LB, which checks for valid UUIDs. So, as you said, it's not possible to have invalid UUIDs. I'll remove this check. If we ever need it, we should add it to the submit_listen function.

mayhem · 2018-05-29T13:22:20Z

messybrainz/fetch_artist_mbids.py

+
+def is_recording_mbid_present(connection, recording_mbid):
+    """
+        Check if recording MBID is already present in recording_artist table.


Fix the table name in the comment.

mayhem · 2018-05-29T13:23:33Z

messybrainz/fetch_artist_mbids.py

+        num_recording_mbids_processed = recording_mbids.rowcount
+        for recording_mbid in recording_mbids:
+            if is_valid_uuid(recording_mbid[0]):
+                if not is_recording_mbid_present(connection, recording_mbid[0]):


Here is the same pattern problem again -- you could write one query that combines fetching the unique recording_mbids AND figuring out which ones are not in clusters yet.

mayhem

Getting closer!

mayhem · 2018-05-31T12:02:26Z

messybrainz/fetch_artist_mbids.py

+        those recording MBIDs.
+    """
+
+    query = text("""SELECT DISTINCT data ->> 'recording_mbid'


This is better -- however, this approach will still get into trouble with large tables. You should be able to rewrite this query with a LEFT JOIN query that allows you to find rows that exist in one table, but not the other.
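
A sketch of such a LEFT JOIN (an anti-join), under simplifying assumptions: `sqlite3` stands in for PostgreSQL, and recording_json has a plain recording_mbid column here, whereas the real table extracts it from JSON with `data ->> 'recording_mbid'`. Rows with no match on the right side come back with NULL, so filtering on that NULL keeps only the unmatched recordings.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE recording_json (recording_mbid TEXT);
    CREATE TABLE recording_artist_join (recording_mbid TEXT, artist_mbid TEXT);
    INSERT INTO recording_json
         VALUES ('rec-1'), ('rec-2'), ('rec-2'), ('rec-3'), (NULL);
    INSERT INTO recording_artist_join
         VALUES ('rec-1', 'artist-a');
""")

QUERY = """
    SELECT DISTINCT rj.recording_mbid
               FROM recording_json AS rj
          LEFT JOIN recording_artist_join AS raj
                 ON rj.recording_mbid = raj.recording_mbid
              WHERE rj.recording_mbid IS NOT NULL
                AND raj.recording_mbid IS NULL
           ORDER BY rj.recording_mbid
"""

# Only recordings that have no row in recording_artist_join remain.
unmatched = [row[0] for row in connection.execute(QUERY)]
print(unmatched)
```

Because already-matched recordings are filtered out by the database, rerunning the script only touches new data, which also fits the incremental behaviour requested earlier in the review.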

mayhem · 2018-05-31T12:03:19Z

manage.py

+    db.init_db_engine(config.SQLALCHEMY_DATABASE_URI)
+    musicbrainz_db.init_db_engine(config.MB_DATABASE_URI)
+
+    num_recording_mbids_processed, num_recording_mbids_added = fetch_and_store_artist_mbids_for_all_recording_mbids()


You need to catch exceptions and report errors to the user here.

paramsingh

Do we plan to add tests for the new db functions in a new PR?

paramsingh · 2018-06-07T16:56:59Z

messybrainz/fetch_and_store_artist_mbids.py

@@ -0,0 +1,92 @@
+# Script to fetch artist MBIDs from MusicBrainz Database using


I feel like this would be better as messybrainz/db/artist.py. Also, this isn't a script but a module.

On IRC, we realized that msb doesn't have a db module yet. Should probably do this in a different PR then.

paramsingh · 2018-06-07T17:10:12Z

messybrainz/fetch_and_store_artist_mbids.py

+
+    result = connection.execute(query)
+
+    return result


Note that directly returning result here returns a SQLAlchemy ResultProxy object. Maybe we should do a fetchall and return only a list of UUIDs?

paramsingh · 2018-06-07T17:11:02Z

messybrainz/fetch_and_store_artist_mbids.py

+        num_recording_mbids_processed = recording_mbids.rowcount
+        for recording_mbid in recording_mbids:
+            try:
+                fetch_and_store_artist_mbids(connection, recording_mbid[0])


recording_mbid[0] would be much more readable as just recording_mbid. This would be possible if we returned a list of mbids instead of the direct result of the query in fetch_recording_mbids_not_in_recording_artist_join.
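
A minimal sketch of returning a plain list, using the standard-library `sqlite3` cursor in place of SQLAlchemy's ResultProxy (both expose `fetchall()`); the function name and sample rows are illustrative:

```python
import sqlite3


def fetch_recording_mbids(connection):
    """Return recording MBIDs as a plain list of strings."""
    result = connection.execute("""
        SELECT recording_mbid
          FROM recording_json
      ORDER BY recording_mbid
    """)
    # Each row is a one-element tuple; keep only the MBID itself.
    return [row[0] for row in result.fetchall()]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE recording_json (recording_mbid TEXT)")
connection.executemany(
    "INSERT INTO recording_json VALUES (?)", [("rec-1",), ("rec-2",)]
)

recording_mbids = fetch_recording_mbids(connection)
for recording_mbid in recording_mbids:  # no recording_mbid[0] needed
    print(recording_mbid)
```

One trade-off: a list holds every MBID in memory at once, and its length replaces the `rowcount` used in the diff.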

paramsingh · 2018-06-07T17:14:49Z

consul_config.py.ctmpl

@@ -11,6 +11,7 @@ SECRET_KEY = '''{{template "KEY" "secret_key"}}'''
 SQLALCHEMY_DATABASE_URI = "postgresql://messybrainz@{{.Address}}:{{.Port}}/messybrainz"
 TEST_SQLALCHEMY_DATABASE_URI = "postgresql://msb_test@{{.Address}}:{{.Port}}/msb_test"
 POSTGRES_ADMIN_URI="postgresql://postgres@{{.Address}}:{{.Port}}/template1"
+MB_DATABASE_URI = "postgresql://musicbrainz_ro@{{.Address}}:{{.Port}}/musicbrainz_db"


This makes messybrainz connect to the master postgres server. Please open a ticket saying that MessyBrainz should use pgbouncer-slave and assign to me.

https://tickets.metabrainz.org/browse/LB-371 Done!

paramsingh · 2018-06-08T16:17:30Z

We should now move the new db functions to the db module.

Remove checks for valid UUIDs, as the submitted recording_json rows contain only valid UUIDs. This is ensured by ListenBrainz, which is currently the only source of data in MessyBrainz. Also remove the check for whether a recording MBID is already present in the recording_artist_json table, as a single query does this while selecting the MBIDs.

kartikeyaSh · 2018-06-08T18:21:27Z

@paramsingh I'll create a different PR for tests. I plan to use test data from LB-367 for testing.

paramsingh · 2018-06-11T11:18:32Z

requirements.txt

 Fabric == 1.10.2
 Flask-Testing == 0.4.2
 Flask-SQLAlchemy==2.0
 Flask-UUID == 0.2
 Jinja2 == 2.8
-SQLAlchemy==1.0.8
+SQLAlchemy==1.2.5


Should update this here too: https://github.com/metabrainz/messybrainz-server/blob/master/setup.py#L16

Remove fetch_and_store_artist_mbids and add fetch_artist_mbids, which does the fetching, and insert_artist_mbids, which does the storing. Also add the function get_artist_mbids_for_recording_mbid to get artist MBIDs for a recording MBID.

paramsingh

Looks good to me now, just a few questions.

paramsingh · 2018-06-11T16:21:48Z

messybrainz/db/tests/test_artist.py

+            artist_mbids = artist.fetch_artist_mbids(connection, recording_mbid)
+            artist.insert_artist_mbids(connection, recording_mbid, artist_mbids)
+            artist_mbids_from_join = artist.get_artist_mbids_for_recording_mbid(connection, recording_mbid)
+            self.assertEqual(set(artist_mbids), set(artist_mbids_from_join))


Should this be assertSetEqual?

paramsingh · 2018-06-11T16:22:06Z

requirements.txt

@@ -18,3 +18,4 @@ ujson==1.33
 pytest==3.3.1
 pytest-cov==2.5.1
 unittest2==1.1.0
+mock == 2.0.0


Should be using this https://docs.python.org/3/library/unittest.mock.html instead of installing via pip.
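
On Python 3, `unittest.mock` ships with the standard library, so nothing extra needs to be installed. A short sketch, where `count_artists` is a hypothetical helper invented for this example:

```python
from unittest.mock import MagicMock


def count_artists(fetch, recording_mbid):
    """Hypothetical helper that calls an injected fetch function."""
    return len(fetch(recording_mbid))


# The mock replaces the real database lookup and records how it was called.
fake_fetch = MagicMock(return_value=["artist-a", "artist-b"])
artist_count = count_artists(fake_fetch, "cad174ad-d683-4858-a205-7bdc4175fff7")

fake_fetch.assert_called_once_with("cad174ad-d683-4858-a205-7bdc4175fff7")
print(artist_count)
```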

paramsingh · 2018-06-11T16:24:21Z

messybrainz/db/tests/test_artist.py

+            artist.fetch_and_store_artist_mbids_for_all_recording_mbids()
+
+            artist_mbids = artist.get_artist_mbids_for_recording_mbid(connection, "cad174ad-d683-4858-a205-7bdc4175fff7")
+            self.assertSetEqual(set(artist_mbids), set(artist_mbids_fetched[0]))


Why exactly are you creating a set here?

Using sets for assertions because we can get multiple artist MBIDs for a single recording MBID, but the order of retrieval is not known.
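
A small sketch of why an order-insensitive assertion is needed, using made-up MBIDs. `assertCountEqual` is shown as an alternative that ignores order without collapsing duplicates; it is a suggestion, not something the PR uses.

```python
import unittest


class ArtistMBIDOrderTestCase(unittest.TestCase):

    def test_order_is_ignored(self):
        fetched = ["artist-a", "artist-b"]
        stored = ["artist-b", "artist-a"]  # same MBIDs, different order

        # A plain list comparison is order-sensitive, so these lists differ.
        self.assertNotEqual(fetched, stored)
        # Comparing sets ignores the retrieval order.
        self.assertSetEqual(set(fetched), set(stored))
        # assertCountEqual also ignores order but still counts duplicates.
        self.assertCountEqual(fetched, stored)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArtistMBIDOrderTestCase)
result = unittest.TextTestRunner().run(suite)
```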

mayhem

\ø/

paramsingh

👍

mayhem requested a review from paramsingh May 14, 2018 15:26

paramsingh self-assigned this May 18, 2018

paramsingh reviewed May 18, 2018

mayhem suggested changes May 28, 2018

kartikeyaSh force-pushed the mba branch 2 times, most recently from 27d045d to 997f7e8 Compare May 29, 2018 12:08

mayhem suggested changes May 29, 2018

mayhem suggested changes May 31, 2018

kartikeyaSh force-pushed the mba branch 2 times, most recently from 00f10cd to 5004e35 Compare June 3, 2018 13:13

paramsingh reviewed Jun 7, 2018

kartikeyaSh added 14 commits June 8, 2018 22:59

Setup for accessing MusicBrainz Database.

6c19c93

Add script to fetch artist MBIDs for the recording MBIDs.

3b5b126

This script will fetch artist MBIDs from the MusicBrainz Database for the recording MBIDs present in the table.

Move queries to the correct places and remove unused code

865bb92

Queries should not be in manage.py scripts, and the query to fetch recording_mbids from recording_json should be in data.py. The cache is not used, so remove its initialization.

Remove lengthy regex check for UUID

7b34efa

Whether a UUID is valid can be checked by creating a UUID object and catching the ValueError.

Format and move code to correct places

86568a3

Format queries according to https://github.com/metabrainz/guidelines/blob/master/Python.md#sql. Move init code to manage.py and remove unused imports and code.

Modify table to store artist MBIDs for recordings.

e7c21be

Rename table recording_artist to recording_artist_join and add a column, updated, to record when MBIDs are added to this table.

Add database updates

4785538

Move exception handling to the calling function

b6b3b08

Rename table name in comments.

50d92c5

Rename recording_artist to recording_artist_json in comments.

Use single insert query to insert multiple values

1fe7af6

It is more efficient to use a single insert query than a separate query for each value we want to insert.

Add a separate function to truncate recording_artist_join table

f9c5266

Use efficient query and catch exceptions

e405908

Use a JOIN query to get the recording MBIDs present in the recording_json table but not in the recording_artist_join table. Also show exceptions triggered by the script to the user running it.

Rename fetch_artist_mbids.py

c9356b2

Rename fetch_artist_mbids.py to fetch_and_store_artist_mbids.py, as fetch_artist_mbids does not give the user any clue that this script also stores the fetched MBIDs.

kartikeyaSh force-pushed the mba branch from b60e8ba to 59ab1b0 Compare June 8, 2018 17:30

paramsingh reviewed Jun 11, 2018

kartikeyaSh added 6 commits June 11, 2018 20:20

Update requirements

44e7f8e

Return list of recording MBIDs not ResultProxy

d09a603

Catch and show exceptions

89dc1d6

Move and rename fetch_and_store_artist_mbids.py

e1dab08

Move fetch_and_store_artist_mbids.py to db module and rename it to artist.py

Add testdata

ec180e6

Add valid_recordings_with_recording_mbids.json

kartikeyaSh force-pushed the mba branch from 59ab1b0 to 2b3795e Compare June 11, 2018 14:50

paramsingh reviewed Jun 11, 2018

kartikeyaSh force-pushed the mba branch from 2b3795e to b4b29df Compare June 11, 2018 17:26

Add tests for artist.py

f749c3f

Add tests to test the functionality to fetch artist MBIDs using recording MBID.

kartikeyaSh force-pushed the mba branch from b4b29df to f749c3f Compare June 11, 2018 17:32

mayhem approved these changes Jun 11, 2018

paramsingh approved these changes Jun 11, 2018

paramsingh merged commit 11c9542 into metabrainz:master Jun 11, 2018

