LB-372: MessyBrainz: Create artist_credit clusters using artist MBIDs in database #41
Conversation
Add convert_json_array_to_sorted_uuid_array which converts a given JSON array to a sorted psql array of type UUID and array_sort which returns a sorted UUID array.
Add artist_credit_redirect_array table to store cluster_id, artist_mbids_array pairs.
Here is a gist to test run: https://gist.github.com/kartikeyaSh/f80ce28b0d29ccee61fbe235f5acddc3 |
fd45542
to
b2b44f1
Compare
Add functionality to create clusters of artist_credit using artist_mbids present in the recording_json table.
Add artist_credit_redirect_array table
b2b44f1
to
4631d20
Compare
What exactly do you mean by |
@paramsingh A single MSID pointing to multiple MBIDs. e.g. James Morrison and James Morrison. Another way to put it is: Any |
First fetch all unique cluster_id for a list of artist_gids instead of fetching cluster_id for artist_gid as this may cause duplicate inserts in artist_credit_redirect table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undo the dropping of that table and then I think we should merge and move on.
@@ -0,0 +1,5 @@ | |||
BEGIN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this is really needed since we haven't updated production yet -- so that table will not exist there...
4aecf0a
to
fc91ae4
Compare
messybrainz/db/artist.py
Outdated
|
||
Args: | ||
connection: the sqlalchemy db connection to be used to execute queries | ||
artist_credit_mbids (list): a list of sorted artist MBIDs for which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen if this list is not sorted? We should consider taking any order and sorting it ourselves inside the function.
messybrainz/db/artist.py
Outdated
|
||
|
||
def create_artist_credit_clusters_without_considering_anomalies(connection): | ||
"""Creates cluster for artist_credit without considering anamolies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please specify in the comment what you mean by anomalies here (and in other places).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also anamolies -> anomalies
messybrainz/db/artist.py
Outdated
@@ -102,3 +103,297 @@ def fetch_and_store_artist_mbids_for_all_recording_mbids(): | |||
pass | |||
|
|||
return num_recording_mbids_processed, num_recording_mbids_added | |||
|
|||
|
|||
def fetch_distinct_artist_credit_mbids(connection): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this name should be fixed, there is no indication in the function name that it only fetches those mbids which do not have msids in the artist_credit_cluster table.
messybrainz/db/artist.py
Outdated
@@ -1,6 +1,7 @@ | |||
|
|||
import brainzutils.musicbrainz_db.recording as mb_recording | |||
import json | |||
import uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this import being used somewhere?
@@ -0,0 +1,5 @@ | |||
BEGIN; | |||
|
|||
DROP TABLE IF EXISTS artist_credit_redirect CASCADE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just move this to the other update script, keeping multiple schema change scripts in one PR doesn't really make sense.
messybrainz/db/tests/test_artist.py
Outdated
|
||
|
||
def test_link_artist_mbids_to_artist_credit_cluster_id(self): | ||
"""Tests if artist MBIDs are linked to correctly. Links are created in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
linked to cluster
Rename artist_credit_redirect to artist_credit_redirect and alter the column to store artist MBIDs to array of UUIDs.
Rename fetch_distinct_artist_credit_mbids to fetch_unclustered_distinct_artist_credit_mbids. This gives indication that only those MBIDs are fetched which have not been clustered.
f762c15
to
8c7af9a
Compare
to multiple MBIDs in entity_redirect table). | ||
|
||
Args: | ||
connection: the sqlalchemy db connection to be used to execute queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taking a connection as input here and then creating a connection in the function as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done:). removed creating a connection in the function.
From IRC
I don't think we need tests for the generic functions. They'll only get called by each of the entity clustering functions and tested there. |
These functions are common to both artist and release when creating clusters.
Use function declared in db_common to create clusters.
c8c64cb
to
a9ccca8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥇
This makes it easy to understand how clusters are formed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logging output option should also be added to the other clustering functions in manage.py, but we can do that in the other PRs. For now, let's merge!
Summary
This PR includes a script for creating artist_credit clusters.
Problem
We need to create clusters and associate MSIDs to MBIDs.
Solution
Add a script which will create clusters for artists that have artist MBIDs in recording_json table.
This is somewhat similar to what I've written in my proposal
Additional things apart from the proposal: