Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CB-231: Use direct database access for the Release Group entity #122

Merged
merged 5 commits into from Aug 7, 2017

Conversation

ferbncode
Copy link
Member

Added fetch_multiple_release_group and get_release_group_by_id for fetching release group info.

@ferbncode ferbncode changed the title [WIP]: Use direct database access for entity: release_group [WIP]: CB-231: Use direct database access for entity: release_group Jul 5, 2017
@ferbncode ferbncode force-pushed the release_group_work branch 5 times, most recently from 2c182cc to df07ede Compare July 7, 2017 13:33
@ferbncode ferbncode changed the title [WIP]: CB-231: Use direct database access for entity: release_group CB-231: Use direct database access for entity: release_group Jul 7, 2017
@ferbncode ferbncode force-pushed the release_group_work branch 5 times, most recently from 54b2426 to f231a37 Compare July 12, 2017 19:02
Copy link
Collaborator

@alastair alastair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think we should have a talk now about number of db queries per method call before we go much further, but otherwise this seems to be going well.

from brainzutils import cache


DEFAULT_CACHE_EXPIRATION = 12 * 60 * 60 # seconds (12 hours)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this shared between all musicbrainz_db module files? It might be better to set it only once, unless you think there's a reason that we should refresh some entities more often

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All entities are updated after the same time. Updated as suggested.


if 'artists' in includes:
for release_group in release_groups:
artist_credit_names = release_group.artist_credit.artists
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you looked into how many individual SQL queries a single fetch_ method makes? I'm looking at this method and my guess is at least 7 per mbid (main query, artist - releases - rg rel - url rel - work rel - tags includes, one more if the mbid is a redirect).
There's no rule of thumb for the number of queries that we should make, but each query is going to require a round trip to the sql server (and in the case of production CB, it will be a network trip to a different physical server).

Looking at this method I can see a few basic optimisations:

  • If there are many mbids, get them all at once. This may require a modified version of get_something_by_gid
  • You can join into the artist_credit item of the releasegroup at the initial get query

For the other entity relations I'm not sure if there is any further optimisation that we can do, as each query will be against a different table. It's possible we could do some kind of UNION query but I'm not sure that will give us any benefit of speed.

Copy link
Member Author

@ferbncode ferbncode Jul 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested, I've added a get_entities_by_gid function for fetching multiple entities. Also, added joinedload for artist_credit_names, artists and release_group_meta for saving db queries. A single fetch_ method now would take atmost 7 queries to the db (for any number of mbids). (Previously, it was making 5 queries per mbid (seperate ones for artist_credit_name, artist_credit, meta, artist_credit_name and release_group(excluding redirects queries)) + 6 more for relationships, releases and tags). Thanks for suggesting that :)

if 'work-rels' in includes:
entity_relation_helper(db, 'work', 'release_group', release_group_ids, includes_data)

if 'tags' in includes:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will you have to do this tag query often? It might pay to have a helper function to do it. Do you need tags for Places? I see you have it here but not for Place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggeted, added a get_tags function for fetching entity tags (as it may be used for other entities like artists)

data['join_phrase'] = artist_credit_name.join_phrase
return data


def to_dict_release_groups(release_group, includes={}):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should have tests for these conversion methods to make sure that we correctly create a dict from a particular releasegroup id.
I think that this will be a bit difficult - we can't really do the query against a real musicbrainz database because the data might change in the database. I'm not sure that it's worth testing only this function, I'd like to see it tested in conjunction with fetch_multiple_release_groups
Maybe @gentlecat has a suggestion here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd definitely like to see some tests for all of this.

Copy link
Member Author

@ferbncode ferbncode Jul 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Regarding the test database, we can make use ot mbdata.sample_data.create_sample_data(). I've started some minimal example work here > ferbncode@866538a (works, but still WIP). I can make a list of testable entities (entities in sample_data.py). Is this a desirable way?

Also, these test (may) fail in case of schema_changes (in musicbrainz database). But I think that would be good for us?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, read the backlogs later (on irc) about the script in musicbrainz-server (that way might be better for testing different type of entities)

Copy link
Contributor

@gentlecat gentlecat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like testing needs to be set up with an MB database to make them work now.

@gentlecat
Copy link
Contributor

@ferbncode, I assume you are working on tests with MB database now. When you finish that part, put it into a separate PR please so that it's more manageable. ;)

@ferbncode ferbncode force-pushed the release_group_work branch 6 times, most recently from 9cc3c8c to d1de138 Compare July 26, 2017 03:49
Copy link
Contributor

@gentlecat gentlecat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critiquebrainz/frontend/views/test/test_release_group.py is removed but not replaced with anything else. Can you clarify this? It's good to have at least a test that checks that page renders correctly.

[mbid],
includes=['artists', 'releases', 'release-group-rels', 'url-rels', 'work-rels', 'tags']
)[mbid]
cache.set(key=key, val=release_group, time=DEFAULT_CACHE_EXPIRATION)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be inside the if not release_group: block, otherwise it's being unnecessarily reset each time.

release_group = fetch_multiple_release_groups(
[mbid],
includes=['artists', 'releases', 'release-group-rels', 'url-rels', 'work-rels', 'tags']
)[mbid]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to create a wrapper function for fetch_multiple_release_groups that retrieves only one release group at a time.

includes_data = defaultdict(dict)
check_includes('release_group', includes)
with mb_session() as db:
query = db.query(models.ReleaseGroup).options(joinedload("meta"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does .options(joinedload("meta")) part do? It might be worth making a comment clarifying parts like these that might be more difficult to understand for someone unfamiliar with how mbdata works.

options(joinedload("artist_credit.artists")).\
options(joinedload("artist_credit.artists.artist"))

release_groups = get_entities_by_gids(query, models.ReleaseGroup, models.ReleaseGroupGIDRedirect, mbids)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a case like this it would be better to specify names of the arguments explicitly. When I look at get_entities_by_gids(query, models.ReleaseGroup, models.ReleaseGroupGIDRedirect, mbids) I have no idea what these models are unless I look up the function definition and its docstring/implementation.

@@ -96,7 +96,7 @@
{% macro show_tags(tags) %}
{% if tags %}
{% for tag in tags %}
<a href="{{ 'https://musicbrainz.org/tag/%s'|format(tag['name']) }}">{{ tag['name'] }}</a>
<a href="{{ 'https://musicbrainz.org/tag/%s'|format(tag) }}">{{ tag }}</a>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to use format instead of just outputting the value directly into the template?

@@ -18,7 +18,8 @@
TAG_INCLUDES = ["tags", "user-tags"]
RATING_INCLUDES = ["ratings", "user-ratings"]
VALID_INCLUDES = {
'place': ["aliases", "annotation"] + RELATION_INCLUDES + TAG_INCLUDES,
'place' : ["aliases", "annotation"] + RELATION_INCLUDES + TAG_INCLUDES,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks unnecessary.

[mbid],
includes=['artist-rels', 'place-rels', 'release-group-rels', 'url-rels'],
)[mbid]
return place
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can simply return fetch_multiple_places(... here.

self.assert200(response)
self.assertIn("Days Are Gone", str(response.data))
self.assertIn("No reviews found", str(response.data))
# TODO(roman): Try to add review and check it's displayed there!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this fixed now?

Copy link
Contributor

@gentlecat gentlecat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are changes in critiquebrainz/frontend/external/musicbrainz_db/place.py related to release groups?

@@ -96,7 +96,7 @@
{% macro show_tags(tags) %}
{% if tags %}
{% for tag in tags %}
<a href="{{ 'https://musicbrainz.org/tag/%s'|format(tag['name']) }}">{{ tag['name'] }}</a>
<a href="{{ 'https://musicbrainz.org/tag/%s' % tag }}">{{ tag }}</a>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix indentation here please?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you are still using formatting. I meant this:

<a href="https://musicbrainz.org/tag/{{ tag }}">{{ tag }}</a>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed 😅

album = spotify_api.get_album(spotify_id)
except ExternalServiceException:
flash.error(gettext("You need to specify existing album from Spotify!"))
return redirect(url_for('.spotify_list', release_group_id=release_group_id))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this change. Can you explain?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting https://gist.github.com/ferbncode/307bc3f3e13f5d65428aad6180ace1ed error when running pylint. I have undone this change now as it may be done separately.

@ferbncode
Copy link
Member Author

ferbncode commented Aug 6, 2017

No, changes in place.py adding the wrapper _get_place_by_id for fetch_multiple_places is not a part for release group stuff. I removed that part and will make a separate PR for the same. Other changes in place.py fix the block for the cache.set statement and imports the DEFAULT_CACHE_EXPIRATION variable (as per #122 (comment)). Should they be done in a separate PR too?

@gentlecat
Copy link
Contributor

Should they be done in a separate PR too?

That's fine.

@gentlecat gentlecat changed the title CB-231: Use direct database access for entity: release_group CB-231: Use direct database access for the Release Group entity Aug 7, 2017
@gentlecat gentlecat merged commit 154782d into metabrainz:master Aug 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants