JSON dumps #540

Merged
merged 63 commits into metabrainz:master from json-dump on Oct 9, 2017
Conversation

mwiencek
Member

This has been running in production since...April? So the actual dumps work, but this contains a lot of other changes too, mostly to speed up the web service significantly.

@mwiencek
Member Author

I'm still going through and restructuring the commits to be smaller and more logical, but the general changes will remain the same.

Makes all of the *_toplevel methods accept multiple entities rather than
just one, so the number of calls to the "load" methods no longer grows
in proportion to the size of the result.
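A minimal sketch of the batching idea, assuming hypothetical sub and field names (this is not the actual MusicBrainz code): the toplevel helper takes an array ref of entities and runs each "load" step once for the whole batch instead of once per entity.

```perl
use strict;
use warnings;

sub load_aliases {
    my ($entities) = @_;
    # Imagine a single query covering every entity ID in the batch here.
    printf "loading aliases for %d entities in one call\n", scalar @$entities;
    $_->{aliases} = [] for @$entities;
}

sub artist_toplevel {
    my ($entities) = @_;      # array ref, not a single entity
    load_aliases($entities);  # one call regardless of result size
}

artist_toplevel([ map { +{ id => $_ } } 1 .. 25 ]);
```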
Don't Repeat Yourself
It's roughly the same amount of code, easier to follow (I believe), and
around 6-7% faster on my machine (2016 MacBook Pro, 2.6 GHz Intel Core
i7).
We don't actually need them to be mutable anywhere except the tests, and
this way we can have lazy builder methods for is_empty, format, and
defined_run which only calculate those values once on-demand.
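A sketch of the lazy-builder pattern only, with an illustrative class and attribute rather than the real ones: the value is computed on first access, cached, and the attribute no longer needs to be writable.

```perl
use strict;
use warnings;

package Run {
    use Moose;

    has items => (is => 'ro', isa => 'ArrayRef', required => 1);

    has is_empty => (
        is      => 'ro',
        lazy    => 1,
        builder => '_build_is_empty',
    );

    sub _build_is_empty {
        my ($self) = @_;
        return @{ $self->items } ? 0 : 1;
    }

    __PACKAGE__->meta->make_immutable;
}

my $run = Run->new(items => [1, 2, 3]);
print $run->is_empty ? "empty\n" : "not empty\n";   # computed only on this first access
```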
The practical goal of this commit is to reduce the number of calls to
`entities_with` in WebServiceInc, which according to NYTProf is the
top subroutine with the highest exclusive time per request.

Calling `entities_with` with `['mbid', 'relatable']` seems to be common
enough that a constant would be useful anyway.
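A sketch of the constant idea, with a hypothetical constant name and a stub in place of the real `entities_with`: call sites share one pre-built argument list instead of constructing the same anonymous array ref on every request, and the result of the expensive call can be reused.

```perl
use strict;
use warnings;

use constant ENTITIES_WITH_MBID_RELATIONSHIPS => [qw( mbid relatable )];

sub entities_with {
    my ($properties) = @_;
    # Stand-in for the real lookup, which scans the entity-type registry.
    return qw( artist label recording release work );
}

my @types = entities_with(ENTITIES_WITH_MBID_RELATIONSHIPS);
print "@types\n";
```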
 * find_attribute_by_name is slow, so don't call it if we don't need to.
 * get_by_ids calls uniq on the IDs for us.
 * %attr_ids isn't needed.
 * Don't return $data values unless being called in list context.
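A sketch of the last point above, using an illustrative sub: `wantarray` lets the method skip building its return value when the caller is not in list context.

```perl
use strict;
use warnings;

sub load_attribute_data {
    my (@ids) = @_;
    # Side effects (caching, attaching data to entities) would happen here.
    return unless wantarray;                      # scalar/void context: skip the copy
    return map { $_ => "attribute $_" } @ids;     # only built when the caller wants it
}

my %data = load_attribute_data(1, 2, 3);
print scalar(keys %data), "\n";                   # 3
load_attribute_data(4, 5);                        # void context: no list is assembled
```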
Use lazy builder methods in places where `_source_target_prop` is called
over and over again. `target_type` in particular was a trouble spot
according to NYTProf.

Note: These will now die loudly if the entities or link type aren't
loaded, which is expected.
Hash::Merge is slow, and we can very easily set the items we need
directly without any reduction in clarity.
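The change in spirit, with illustrative keys only: rather than a general-purpose deep merge via Hash::Merge, copy the defaults and set the few known keys directly.

```perl
use strict;
use warnings;

my %defaults = (inc => {}, linked_entities => {});
my %request  = (inc => { aliases => 1 });

# Before (conceptually): a Hash::Merge call combining the two hashes.
# After: copy the defaults and assign exactly what we need.
my %stash = %defaults;
$stash{inc} = $request{inc} if exists $request{inc};

print join(', ', sort keys %{ $stash{inc} }), "\n";   # aliases
```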
For use in place of ref_to_type, which is slow and can't handle URL
subclasses.

Renames the 'entity_type' attribute on SeriesType and CollectionType to
'item_entity_type' to avoid conflicts.
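A sketch of why a per-class method beats a ref()-based lookup, using hypothetical packages rather than the actual MusicBrainz classes: ref() returns the subclass name, so a mapping keyed on class names misses URL subclasses, while an inherited entity_type method keeps working.

```perl
use strict;
use warnings;

package My::URL {
    sub new         { bless {}, shift }
    sub entity_type { 'url' }
}
package My::URL::Wikipedia {
    our @ISA = ('My::URL');
}

my $url = My::URL::Wikipedia->new;
print ref($url), "\n";           # My::URL::Wikipedia (not what a ref-keyed map expects)
print $url->entity_type, "\n";   # url
```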
Look up the serializer class directly by entity type, rather than
looping over all classes and doing an expensive 'isa' check on each
class during every call.
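A sketch of the direct lookup, with hypothetical class names: a hash keyed on entity type replaces a loop that called `->isa` on every serializer class for every object being serialized.

```perl
use strict;
use warnings;

my %serializer_by_type = (
    artist  => 'My::Serializer::Artist',
    release => 'My::Serializer::Release',
);

sub serializer_for {
    my ($entity_type) = @_;
    return $serializer_by_type{$entity_type}
        // die "no serializer registered for '$entity_type'";
}

print serializer_for('artist'), "\n";
```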
ISRCs were being passed to the serializer as an array ref, even though
we only needed the first one (the codes were all the same, only the
recordings differed). This required a workaround in JSON::2::Utils.

We now just store the list of recordings associated with the ISRC code
on the first ISRC entity (in fact, this was already being done, but not
being used), then pass that to the serializer.

The weirdness here is due to the fact that an ISRC *code* can be
associated with many recordings, but an ISRC *entity* (instance) can
only be associated with one. Fixing that would require larger changes to
the data models.
If the first page didn't change, that doesn't tell us anything about
whether the other pages did, so I'm not sure what this was trying to do.
JSON dumps will have nullable last_modified columns: unlike the
sitemaps, we aren't required to output these anywhere, so we may as well
support indicating that we don't know when the document changed.

Note: These are not to be confused with last_updated columns, which
indicate when the row changed and are not present in these dump schemas.
No need for this to be inside the loop.
Make the variable name non-sitemap-specific.
JSON dumps must only process one packet at a time, since an incremental
dump will be published for each replication sequence. By contrast, it
makes no difference whether the incremental sitemaps process multiple
packets in a single build.
If last_processed_replication_sequence is NULL, then
run_incremental_dump would start from current_replication_sequence.

However, for the JSON dumps we'd want to start from the last full
JSON dump replication sequence plus 1, because that's the only way for
them to detect incremental changes in each packet.

By contrast, the full sitemaps dump currently has little to do with the
incremental sitemaps, which are based on changes in the pages' embedded
JSON-LD; the full dump doesn't look at JSON-LD at all, so there would be
no reason to start from the last full dump sequence there.
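A sketch of the sequence selection described above; the helper and argument names are illustrative, not the script's real interface.

```perl
use strict;
use warnings;

sub starting_sequence {
    my (%args) = @_;

    # Resume after the last packet an earlier incremental run processed.
    return $args{last_processed_replication_sequence} + 1
        if defined $args{last_processed_replication_sequence};

    # JSON dumps: fall back to the sequence of the last full dump, so
    # every packet published since then is examined.
    return $args{full_json_dump_replication_sequence} + 1
        if defined $args{full_json_dump_replication_sequence};

    # Nothing else to go on (the sitemaps case): start from the current
    # replication sequence.
    return $args{current_replication_sequence};
}

print starting_sequence(
    full_json_dump_replication_sequence => 100,
    current_replication_sequence        => 112,
), "\n";   # 101
```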
Requires a full import to generate useful output.

Will be useful for generating sample data dumps.
The JSON dumps which are to be implemented will require a way to fetch
web service output without making actual requests, so that they can
finish in a reasonable amount of time.
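An entirely illustrative sketch of the idea (the sub and data here are made up): produce the same JSON a web service request would return by calling the serializer in-process, skipping the HTTP layer so millions of documents can be written quickly.

```perl
use strict;
use warnings;
use JSON::PP qw( encode_json );

sub serialize_artist {
    my ($artist) = @_;
    return { id => $artist->{gid}, name => $artist->{name} };
}

my @artists = (
    { gid => '00000000-0000-0000-0000-000000000001', name => 'Example Artist' },
    { gid => '00000000-0000-0000-0000-000000000002', name => 'Another Artist' },
);

print encode_json(serialize_artist($_)), "\n" for @artists;
```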
Implements JSON dumps of the webservice, as described in admin/DumpJSON.

Also implements incremental JSON dumps, which provide only the entities
that have changed in the last hour.

The full dumps are available at:
http://ftp.musicbrainz.org/pub/musicbrainz/data/json-dumps/

The incremental dumps are accessible through metabrainz.org using an
access token:
https://metabrainz.org/api/musicbrainz/json-dumps/json-dump-$replication_sequence/$entity.tar.xz
Two known issues remain:

 1. The incremental sitemaps constantly crash with the following error:
    "The storable module was unable to store the child's data structure
    to the temp file "[...]": can't create [...]: No such file or
    directory." This error comes from ForkManager, and I am not sure how
    to resolve it.

 2. In the various call sites of MusicBrainz::Script::Utils::retry, some
    transient errors are mentioned. I have a feeling these are related
    to the forking, but I'd have a hard time proving it because these
    errors rarely happen and have spurious messages.
base image update removed it
@@ -18,9 +18,7 @@ use MusicBrainz::Server::Replication::Packet qw(
decompress_packet
retrieve_remote_file
);
use Parallel::ForkManager 0.7.6;
Contributor

ForkManager 0.7.6 dates from 2010; any chance that the forking errors would vanish using the latest 1.19?

Member Author

We've definitely been using 1.19 in production. The 0.7.6 just indicates the minimum version the code will work with; the `use` line croaks otherwise (IIRC there were API changes).

I'm just not going to attempt writing any concurrent code in Perl anymore. Even JavaScript is a far better language for that nowadays. :(
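For context on those version semantics, a small self-contained illustration (the print line is just for demonstration): the version on a `use` line is a minimum, not a pin.

```perl
use strict;
use warnings;

# Perl calls Parallel::ForkManager->VERSION(0.7.6) at compile time and
# croaks if the installed module is older, so 1.19 in production
# satisfies this requirement.
use Parallel::ForkManager 0.7.6;

print "loaded Parallel::ForkManager $Parallel::ForkManager::VERSION\n";
```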

mwiencek merged commit 0179aa1 into metabrainz:master on Oct 9, 2017
mwiencek deleted the json-dump branch on Oct 9, 2017 at 18:09