Skip to content
This repository has been archived by the owner on May 1, 2020. It is now read-only.

LB-372: MessyBrainz: Create artist_credit clusters using artist MBIDs in database #41

Merged
merged 17 commits into from Jul 2, 2018

Conversation

kartikeyaSh
Copy link
Contributor

Summary

This PR includes a script for creating artist_credit clusters.

  • This is a…
    • Bug fix
    • Feature addition
    • Refactoring
    • Minor / simple change (like a typo)
    • Other

Problem

We need to create clusters and associate MSIDs to MBIDs.

  • JIRA ticket (optional): LB-372

Solution

Add a script which will create clusters for artists that have artist MBIDs in recording_json table.
This is somewhat similar to what I've written in my proposal

Additional things apart from the proposal:

  1. New table artist_credit_redirect_array which stores artist MBIDs as arrays, and is represented by an artist_credit_cluster_id.
  2. The clusters are created in 2 phases in phase-1 simple cases are handled and in phase-2 anomalies are handled.

Add convert_json_array_to_sorted_uuid_array which converts
a given JSON array to a sorted psql array of type UUID and
array_sort which returns a sorted UUID array.
Add artist_credit_redirect_array table to store
cluster_id, artist_mbids_array pairs.
@kartikeyaSh
Copy link
Contributor Author

Here is a gist to test run: https://gist.github.com/kartikeyaSh/f80ce28b0d29ccee61fbe235f5acddc3
I'll write tests for this soon.

@kartikeyaSh kartikeyaSh changed the title Artist cluster LB-372: MessyBrainz: Create artist_credit clusters using artist MBIDs in database Jun 13, 2018
Add functionality to create clusters of artist_credit
using artist_mbids present in the recording_json table.
Add artist_credit_redirect_array table
@paramsingh
Copy link
Collaborator

What exactly do you mean by anomaly here?

@kartikeyaSh
Copy link
Contributor Author

@paramsingh A single MSID pointing to multiple MBIDs. e.g. James Morrison and James Morrison. Another way to put it is: Any artist_credit_cluster_id which appears more than once in artist_credit_redirect_array table is an anomaly. You can see that in the above gist that cluster ID 1ff80901-f7e9-4f2a-ac37-3ed0ce216362 points to 4 different MBIDs in artist_credit_redirect_array.

First fetch all unique cluster_id for a list of artist_gids
instead of fetching cluster_id for artist_gid as this may
cause duplicate inserts in artist_credit_redirect table.
Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undo the dropping of that table and then I think we should merge and move on.

@@ -0,0 +1,5 @@
BEGIN;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is really needed since we haven't updated production yet -- so that table will not exist there...


Args:
connection: the sqlalchemy db connection to be used to execute queries
artist_credit_mbids (list): a list of sorted artist MBIDs for which
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if this list is not sorted? We should consider taking any order and sorting it ourselves inside the function.



def create_artist_credit_clusters_without_considering_anomalies(connection):
"""Creates cluster for artist_credit without considering anamolies.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please specify in the comment what you mean by anomalies here (and in other places).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also anamolies -> anomalies

@@ -102,3 +103,297 @@ def fetch_and_store_artist_mbids_for_all_recording_mbids():
pass

return num_recording_mbids_processed, num_recording_mbids_added


def fetch_distinct_artist_credit_mbids(connection):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this name should be fixed, there is no indication in the function name that it only fetches those mbids which do not have msids in the artist_credit_cluster table.

@@ -1,6 +1,7 @@

import brainzutils.musicbrainz_db.recording as mb_recording
import json
import uuid
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this import being used somewhere?

@@ -0,0 +1,5 @@
BEGIN;

DROP TABLE IF EXISTS artist_credit_redirect CASCADE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just move this to the other update script, keeping multiple schema change scripts in one PR doesn't really make sense.



def test_link_artist_mbids_to_artist_credit_cluster_id(self):
"""Tests if artist MBIDs are linked to correctly. Links are created in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

linked to cluster

Rename artist_credit_redirect to artist_credit_redirect
and alter the column to store artist MBIDs to array of UUIDs.
Rename fetch_distinct_artist_credit_mbids to
fetch_unclustered_distinct_artist_credit_mbids. This
gives indication that only those MBIDs are fetched which
have not been clustered.
to multiple MBIDs in entity_redirect table).

Args:
connection: the sqlalchemy db connection to be used to execute queries.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taking a connection as input here and then creating a connection in the function as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done:). removed creating a connection in the function.

@paramsingh
Copy link
Collaborator

From IRC

iliekcomputers: as you mentioned in #44 I wrote generic functions. I'm not sure If I need to write tests for those functions cz those get tested when testing create_artist_credit_clusters, create_artist_credit_clusters_without_anomalies........ And I can't test those functions without using an entity like artist/release.

I don't think we need tests for the generic functions. They'll only get called by each of the entity clustering functions and tested there.

These functions are common to both artist and
release when creating clusters.
Use function declared in db_common to
create clusters.
Copy link
Collaborator

@paramsingh paramsingh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

This makes it easy to understand how clusters are formed.
Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logging output option should also be added to the other clustering functions in manage.py, but we can do that in the other PRs. For now, let's merge!

@mayhem mayhem merged commit fc1da45 into metabrainz:master Jul 2, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants