JSON dumps #540

Merged
merged 63 commits into metabrainz:master from json-dump on Oct 9, 2017
Conversation

mwiencek
Member

This has been running in production since...April? So the actual dumps work, but this contains a lot of other changes too, mostly to speed up the web service significantly.

@mwiencek
Member Author

I'm still going through and restructuring the commits to be smaller and more logical, but the general changes will remain the same.

Makes all of the *_toplevel methods accept multiple entities rather than
just one, so the number of calls to the "load" methods no longer grows
in proportion to the size of the result.
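A minimal sketch of the batching idea, assuming hypothetical sub and field names (this is not the actual MusicBrainz code): the toplevel helper takes an array ref of entities and runs each "load" step once for the whole batch instead of once per entity.

```perl
use strict;
use warnings;

sub load_aliases {
    my ($entities) = @_;
    # Imagine a single query covering every entity ID in the batch here.
    printf "loading aliases for %d entities in one call\n", scalar @$entities;
    $_->{aliases} = [] for @$entities;
}

sub artist_toplevel {
    my ($entities) = @_;      # array ref, not a single entity
    load_aliases($entities);  # one call regardless of result size
}

artist_toplevel([ map { +{ id => $_ } } 1 .. 25 ]);
```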
Don't Repeat Yourself
It's roughly the same amount of code, easier to follow (I believe), and
around 6-7% faster on my machine (2016 MacBook Pro, 2.6 GHz Intel Core
i7).
We don't actually need them to be mutable anywhere except the tests, and
this way we can have lazy builder methods for is_empty, format, and
defined_run which only calculate those values once on-demand.
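A sketch of the lazy-builder pattern only, with an illustrative class and attribute rather than the real ones: the value is computed on first access, cached, and the attribute no longer needs to be writable.

```perl
use strict;
use warnings;

package Run {
    use Moose;

    has items => (is => 'ro', isa => 'ArrayRef', required => 1);

    has is_empty => (
        is      => 'ro',
        lazy    => 1,
        builder => '_build_is_empty',
    );

    sub _build_is_empty {
        my ($self) = @_;
        return @{ $self->items } ? 0 : 1;
    }

    __PACKAGE__->meta->make_immutable;
}

my $run = Run->new(items => [1, 2, 3]);
print $run->is_empty ? "empty\n" : "not empty\n";   # computed only on this first access
```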
The practical goal of this commit is to reduce the number of calls to
`entities_with` in WebServiceInc, which according to NYTProf is the
top subroutine with the highest exclusive time per request.

Calling `entities_with` with `['mbid', 'relatable']` seems to be common
enough that a constant would be useful anyway.
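A sketch of the constant idea, with a hypothetical constant name and a stub in place of the real `entities_with`: call sites share one pre-built argument list instead of constructing the same anonymous array ref on every request, and the result of the expensive call can be reused.

```perl
use strict;
use warnings;

use constant ENTITIES_WITH_MBID_RELATIONSHIPS => [qw( mbid relatable )];

sub entities_with {
    my ($properties) = @_;
    # Stand-in for the real lookup, which scans the entity-type registry.
    return qw( artist label recording release work );
}

my @types = entities_with(ENTITIES_WITH_MBID_RELATIONSHIPS);
print "@types\n";
```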
 * find_attribute_by_name is slow, so don't call it if we don't need to.
 * get_by_ids calls uniq on the IDs for us.
 * %attr_ids isn't needed.
 * Don't return $data values unless being called in list context.
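A sketch of the last point above, using an illustrative sub: `wantarray` lets the method skip building its return value when the caller is not in list context.

```perl
use strict;
use warnings;

sub load_attribute_data {
    my (@ids) = @_;
    # Side effects (caching, attaching data to entities) would happen here.
    return unless wantarray;                      # scalar/void context: skip the copy
    return map { $_ => "attribute $_" } @ids;     # only built when the caller wants it
}

my %data = load_attribute_data(1, 2, 3);
print scalar(keys %data), "\n";                   # 3
load_attribute_data(4, 5);                        # void context: no list is assembled
```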
Use lazy builder methods in places where `_source_target_prop` is called
over and over again. `target_type` in particular was a trouble spot
according to NYTProf.

Note: These will now die loudly if the entities or link type aren't
loaded, which is expected.
Hash::Merge is slow, and we can very easily set the items we need
directly without any reduction in clarity.
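The change in spirit, with illustrative keys only: rather than a general-purpose deep merge via Hash::Merge, copy the defaults and set the few known keys directly.

```perl
use strict;
use warnings;

my %defaults = (inc => {}, linked_entities => {});
my %request  = (inc => { aliases => 1 });

# Before (conceptually): a Hash::Merge call combining the two hashes.
# After: copy the defaults and assign exactly what we need.
my %stash = %defaults;
$stash{inc} = $request{inc} if exists $request{inc};

print join(', ', sort keys %{ $stash{inc} }), "\n";   # aliases
```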
For use in place of ref_to_type, which is slow and can't handle URL
subclasses.

Renames the 'entity_type' attribute on SeriesType and CollectionType to
'item_entity_type' to avoid conflicts.
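A sketch of why a per-class method beats a ref()-based lookup, using hypothetical packages rather than the actual MusicBrainz classes: ref() returns the subclass name, so a mapping keyed on class names misses URL subclasses, while an inherited entity_type method keeps working.

```perl
use strict;
use warnings;

package My::URL {
    sub new         { bless {}, shift }
    sub entity_type { 'url' }
}
package My::URL::Wikipedia {
    our @ISA = ('My::URL');
}

my $url = My::URL::Wikipedia->new;
print ref($url), "\n";           # My::URL::Wikipedia (not what a ref-keyed map expects)
print $url->entity_type, "\n";   # url
```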
Look up the serializer class directly by entity type, rather than
looping over all classes and doing an expensive 'isa' check on each
class during every call.
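A sketch of the direct lookup, with hypothetical class names: a hash keyed on entity type replaces a loop that called `->isa` on every serializer class for every object being serialized.

```perl
use strict;
use warnings;

my %serializer_by_type = (
    artist  => 'My::Serializer::Artist',
    release => 'My::Serializer::Release',
);

sub serializer_for {
    my ($entity_type) = @_;
    return $serializer_by_type{$entity_type}
        // die "no serializer registered for '$entity_type'";
}

print serializer_for('artist'), "\n";
```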
ISRCs were being passed to the serializer as an array ref, even though
we only needed the first one (the codes were all the same, only the
recordings differed). This required a workaround in JSON::2::Utils.

We now just store the list of recordings associated with the ISRC code
on the first ISRC entity (in fact, this was already being done, but not
being used), then pass that to the serializer.

The weirdness here is due to the fact that an ISRC *code* can be
associated with many recordings, but an ISRC *entity* (instance) can
only be associated with one. Fixing that would require larger changes to
the data models.
If the first page didn't change, that doesn't tell us anything about
whether the other pages did, so I'm not sure what this was trying to do.
JSON dumps will have nullable last_modified columns: unlike the
sitemaps, we aren't required to output these anywhere, so we may as well
support indicating that we don't know when the document changed.

Note: These are not to be confused with last_updated columns, which
indicate when the row changed and are not present in these dump schemas.
No need for this to be inside the loop.
Make the variable name non-sitemap-specific.
JSON dumps must only process one packet at a time, since an incremental
dump will be published for each replication sequence. By contrast, it
makes no difference whether the incremental sitemaps process multiple
packets in a single build.
If last_processed_replication_sequence is NULL, then
run_incremental_dump would start from current_replication_sequence.

However, for the JSON dumps we'd want to start from the last full
JSON dump replication sequence plus 1, because that's the only way for
them to detect incremental changes in each packet.

By contrast, the full sitemaps dump currently has little to do with the
incremental sitemaps, which are based on changes in the pages' embedded
JSON-LD; the full dump doesn't look at JSON-LD at all, so there would be
no reason to start from the last full dump sequence there.
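A sketch of the sequence selection described above; the helper and argument names are illustrative, not the script's real interface.

```perl
use strict;
use warnings;

sub starting_sequence {
    my (%args) = @_;

    # Resume after the last packet an earlier incremental run processed.
    return $args{last_processed_replication_sequence} + 1
        if defined $args{last_processed_replication_sequence};

    # JSON dumps: fall back to the sequence of the last full dump, so
    # every packet published since then is examined.
    return $args{full_json_dump_replication_sequence} + 1
        if defined $args{full_json_dump_replication_sequence};

    # Nothing else to go on (the sitemaps case): start from the current
    # replication sequence.
    return $args{current_replication_sequence};
}

print starting_sequence(
    full_json_dump_replication_sequence => 100,
    current_replication_sequence        => 112,
), "\n";   # 101
```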
Requires a full import to generate useful output.

Will be useful for generating sample data dumps.
The JSON dumps which are to be implemented will require a way to fetch
web service output without making actual requests, so that they can
finish in a reasonable amount of time.
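An entirely illustrative sketch of the idea (the sub and data here are made up): produce the same JSON a web service request would return by calling the serializer in-process, skipping the HTTP layer so millions of documents can be written quickly.

```perl
use strict;
use warnings;
use JSON::PP qw( encode_json );

sub serialize_artist {
    my ($artist) = @_;
    return { id => $artist->{gid}, name => $artist->{name} };
}

my @artists = (
    { gid => '00000000-0000-0000-0000-000000000001', name => 'Example Artist' },
    { gid => '00000000-0000-0000-0000-000000000002', name => 'Another Artist' },
);

print encode_json(serialize_artist($_)), "\n" for @artists;
```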
Implements JSON dumps of the webservice, as described in admin/DumpJSON.

Also implements incremental JSON dumps, which provide only the entities
that have changed in the last hour.

The full dumps are available at:
http://ftp.musicbrainz.org/pub/musicbrainz/data/json-dumps/

The incremental dumps are accessible through metabrainz.org using an
access token:
https://metabrainz.org/api/musicbrainz/json-dumps/json-dump-$replication_sequence/$entity.tar.xz
Two known issues remain:

 1. The incremental sitemaps constantly crash with the following error:
    "The storable module was unable to store the child's data structure
    to the temp file "[...]": can't create [...]: No such file or
    directory." This error comes from ForkManager, and I am not sure how
    to resolve it.

 2. In the various call sites of MusicBrainz::Script::Utils::retry, some
    transient errors are mentioned. I have a feeling these are related
    to the forking, but I'd have a hard time proving it because these
    errors rarely happen and have spurious messages.
base image update removed it
@@ -18,9 +18,7 @@ use MusicBrainz::Server::Replication::Packet qw(
decompress_packet
retrieve_remote_file
);
use Parallel::ForkManager 0.7.6;
Contributor

ForkManager 0.7.6 dates from 2010; any chance that the forking errors would vanish using the latest 1.19?

Member Author

We've definitely been using 1.19 in production. The 0.7.6 just indicates the minimum version the code will work with; the `use` line croaks otherwise (IIRC there were API changes).

I'm just not going to attempt writing any concurrent code in Perl anymore. Even JavaScript is a far better language for that nowadays. :(
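For context on those version semantics, a small self-contained illustration (the print line is just for demonstration): the version on a `use` line is a minimum, not a pin.

```perl
use strict;
use warnings;

# Perl calls Parallel::ForkManager->VERSION(0.7.6) at compile time and
# croaks if the installed module is older, so 1.19 in production
# satisfies this requirement.
use Parallel::ForkManager 0.7.6;

print "loaded Parallel::ForkManager $Parallel::ForkManager::VERSION\n";
```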

mwiencek merged commit 0179aa1 into metabrainz:master on Oct 9, 2017
mwiencek deleted the json-dump branch on Oct 9, 2017 at 18:09