JSON dumps #540
Conversation
I'm still going through and restructuring the commits to be smaller and more logical, but the general changes will remain the same.
Makes all of the *_toplevel methods accept multiple entities rather than just one, reducing the number of calls to "load" methods proportional to the size of the result.
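The batching idea above can be sketched in a few lines of plain Perl. This is illustrative only (the names and structure are not the actual MusicBrainz::Server code): the load method receives every entity from the result set at once, so one call can service the whole batch instead of one call per entity.

```perl
use strict;
use warnings;

my $load_calls = 0;

# Hypothetical batched loader: one call handles every entity,
# where a single query for all IDs would replace N per-entity queries.
sub load_aliases {
    my (@entities) = @_;
    $load_calls++;                      # one call for the whole batch
    my @ids = map { $_->{id} } @entities;
    $_->{aliases_loaded} = 1 for @entities;
    return;
}

my @result = map { +{ id => $_ } } 1 .. 50;
load_aliases(@result);                  # previously: 50 calls, now 1

print "load calls: $load_calls\n";      # load calls: 1
```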
Don't Repeat Yourself
It's unused.
The expanded explanation was written by ianmcorvidae and taken from https://bitbucket.org/metabrainz/musicbrainz-server/pull-requests/1201/dont-ever-bundle-the-last-current-batch/diff
It's roughly the same amount of code, I believe easier to follow, and around 6-7% faster on my machine (2016 MacBook Pro, 2.6 GHz Intel Core i7).
We don't actually need them to be mutable anywhere except the tests, and this way we can have lazy builder methods for is_empty, format, and defined_run which only calculate those values once on-demand.
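A minimal sketch of the lazy-builder pattern in plain Perl (the real code uses Moose lazy attributes, and the attribute names above are the real ones): the value is computed on first access and cached on the object, so making the attributes immutable costs nothing.

```perl
use strict;
use warnings;

my $builds = 0;

sub new_run { return {} }

# Lazy accessor: the expensive calculation runs once, on demand,
# and the cached value is returned on every later call.
sub is_empty {
    my ($run) = @_;
    $run->{is_empty} //= do {
        $builds++;          # expensive calculation happens here, once
        1;                  # pretend this run turned out to be empty
    };
    return $run->{is_empty};
}

my $run = new_run();
is_empty($run) for 1 .. 3;  # only the first call computes the value
print "builds: $builds\n";  # builds: 1
```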
The practical goal of this commit is to reduce the number of calls to `entities_with` in WebServiceInc, which according to NYTProf is the top subroutine with the highest exclusive time per request. Calling `entities_with` with `['mbid', 'relatable']` seems to be common enough that a constant would be useful anyway.
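The hoisting described above can be sketched like this. Everything here is a stand-in (a toy `entities_with` and toy entity data, not the real `MusicBrainz::Server::Data::Utils` implementation); the point is that the filtered list is computed once at load time rather than on every request — the real commit stores it in a constant.

```perl
use strict;
use warnings;

# Toy stand-in for the entity property tables.
my %ENTITIES = (
    artist => { mbid => 1, relatable => 1 },
    url    => { mbid => 1 },
    rating => {},
);

# Toy stand-in for entities_with: return types that have every
# requested property.
sub entities_with {
    my ($props) = @_;
    return grep {
        my $e = $ENTITIES{$_};
        !grep { !$e->{$_} } @$props;
    } sort keys %ENTITIES;
}

# Computed once instead of on every request (a constant in the
# real code).
my @RELATABLE_MBID_ENTITIES = entities_with(['mbid', 'relatable']);

print "@RELATABLE_MBID_ENTITIES\n";   # artist
```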
* find_attribute_by_name is slow, so don't call it if we don't need to.
* get_by_ids calls uniq on the IDs for us.
* %attr_ids isn't needed.
* Don't return $data values unless being called in list context.
Use lazy builder methods in places where `_source_target_prop` is called over and over again. `target_type` in particular was a trouble spot according to NYTProf. Note: These will now die loudly if the entities or link type aren't loaded, which is expected.
Hash::Merge is slow, and we can very easily set the items we need directly without any reduction in clarity.
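The replacement pattern is just direct key assignment. A hedged sketch with illustrative keys (not the actual hashes from the commit):

```perl
use strict;
use warnings;

my %defaults = (inc => '', fmt => 'json');
my %args     = (inc => 'aliases');

# Before (slow):  my %merged = %{ Hash::Merge::merge(\%args, \%defaults) };
# After: copy the defaults and set the few keys we need directly.
my %merged = %defaults;
$merged{inc} = $args{inc} if exists $args{inc};

print "$merged{inc} $merged{fmt}\n";   # aliases json
```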
For use in place of ref_to_type, which is slow and can't handle URL subclasses. Renames the 'entity_type' attribute on SeriesType and CollectionType to 'item_entity_type' to avoid conflicts.
Look up the serializer class directly by entity type, rather than looping over all classes and doing an expensive 'isa' check on each class during every call.
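The lookup change amounts to swapping a per-call loop of `->isa` checks for a dispatch table keyed by entity type. A minimal sketch with illustrative package names:

```perl
use strict;
use warnings;

# Dispatch table built once; replaces looping over every serializer
# class and ->isa-checking each one on every call.
my %serializer_for = (
    artist  => 'Serializer::Artist',
    release => 'Serializer::Release',
);

sub serializer {
    my ($entity_type) = @_;
    return $serializer_for{$entity_type}
        // die "no serializer for $entity_type";
}

print serializer('artist'), "\n";   # Serializer::Artist
```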
ISRCs were being passed to the serializer as an array ref, even though we only needed the first one (the codes were all the same, only the recordings differed). This required a workaround in JSON::2::Utils. We now just store the list of recordings associated with the ISRC code on the first ISRC entity (in fact, this was already being done, but not being used), then pass that to the serializer. The weirdness here is due to the fact that an ISRC *code* can be associated with many recordings, but an ISRC *entity* (instance) can only be associated with one. Fixing that would require larger changes to the data models.
If the first page didn't change, that doesn't tell us anything about whether the other pages did, so I'm not sure what this was trying to do.
JSON dumps will have nullable last_modified columns, since unlike sitemaps, we aren't required to output these anywhere, so may as well support indicating that we don't know when the document changed. Note: These are not to be confused with last_updated columns, which indicate when the row changed and are not present in these dump schemas.
No need for this to be inside the loop.
Make the variable name non-sitemap-specific.
JSON dumps must only process one packet at a time, since an incremental dump will be published for each replication sequence. By contrast, it makes no difference whether the incremental sitemaps process multiple packets in a single build.
If last_processed_replication_sequence is NULL, then run_incremental_dump would start from current_replication_sequence. However, for the JSON dumps we'd want to start from the last full JSON dump replication sequence plus 1, because that's the only way for them to detect incremental changes in each packet. By contrast, the full sitemaps dump currently has little to do with the incremental sitemaps, which are based on changes in the pages' embedded JSON-LD; the full dump doesn't look at JSON-LD at all, so there would be no reason to start from the last full dump sequence there.
Requires a full import to generate useful output. Will be useful for generating sample data dumps.
The JSON dumps which are to be implemented will require a way to fetch web service output without making actual requests, so that they can finish in a reasonable amount of time.
Implements JSON dumps of the webservice, as described in admin/DumpJSON. Also implements incremental JSON dumps, which provide only the entities that have changed in the last hour. The full dumps are available at: http://ftp.musicbrainz.org/pub/musicbrainz/data/json-dumps/ The incremental dumps are accessible through metabrainz.org using an access token: https://metabrainz.org/api/musicbrainz/json-dumps/json-dump-$replication_sequence/$entity.tar.xz
1. The incremental sitemaps constantly crash with the following error: "The storable module was unable to store the child's data structure to the temp file "[...]": can't create [...]: No such file or directory." This error comes from ForkManager, and I am not sure how to resolve it.
2. In the various call sites of MusicBrainz::Script::Utils::retry, some transient errors are mentioned. I have a feeling these are related to the forking, but I'd have a hard time proving it because these errors rarely happen and have spurious messages.
base image update removed it
@@ -18,9 +18,7 @@ use MusicBrainz::Server::Replication::Packet qw(
    decompress_packet
    retrieve_remote_file
);
use Parallel::ForkManager 0.7.6;
ForkManager 0.7.6 was 2010; any chance that the forking errors would vanish using the latest, 1.19?
We've definitely been using 1.19 in production. The 0.7.6 just indicates the minimum version the code will work with and croaks otherwise (IIRC there were API changes).
I'm just not going to attempt writing any concurrent code in Perl anymore. Even JavaScript is a far better language for that nowadays. :(
This has been running in production since...April? So the actual dumps work, but this contains a lot of other changes too, mostly to speed up the web service significantly.