Extendable metadata extractors (#2830) #2861

Kwpolska · 2017-07-05T13:36:48Z

This is #2830 — extendable metadata extractors.

Every extractor plugin specifies its source of information (text and filename are currently supported) and its priority, which reflects how sure it can be that it applies: YAML and TOML use markers, so they’re specialized; Nikola is normal; and FileMetadataRegexp is the fallback. This tries to be extendable, although I probably missed a few things regarding openness.
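A minimal sketch of the source/priority idea described above. `MetaPriority` is a name that appears in this PR's diff; `MetaSource`, the numeric values, the `Extractor` tuple and `by_priority` are hypothetical and only illustrate the ordering, not Nikola's actual API.

```python
from enum import Enum
from typing import List, NamedTuple


class MetaSource(Enum):
    """Where an extractor reads metadata from (hypothetical name)."""
    text = 1
    filename = 2


class MetaPriority(Enum):
    """Lower value means tried earlier (the values are an assumption)."""
    override = 1
    specialized = 2
    normal = 3
    fallback = 4


class Extractor(NamedTuple):
    """Illustrative stand-in for an extractor plugin."""
    name: str
    source: MetaSource
    priority: MetaPriority


def by_priority(extractors: List[Extractor], source: MetaSource) -> List[Extractor]:
    """Return the extractors for one source, most specific first."""
    matching = [e for e in extractors if e.source == source]
    return sorted(matching, key=lambda e: e.priority.value)


EXTRACTORS = [
    Extractor('nikola', MetaSource.text, MetaPriority.normal),
    Extractor('yaml', MetaSource.text, MetaPriority.specialized),
    Extractor('filename_regex', MetaSource.filename, MetaPriority.fallback),
]
```

With this ordering, a marker-based extractor such as YAML is consulted before the generic Nikola one for the same source.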

Highlights:

YAML and TOML support in 2-file posts
FILE_METADATA_REGEXP still supported
Nikola-site and a lot of test stuff works properly, without changes
Posts know what metadata extractor was used, if they are 1-file posts
Compilers must advertise support for metadata, can use conditions, and have the highest priority
Metadata is split for parsing (and then again when compiling)
I’m doing modern Python 3: using enums and (basic) type annotations. This adds enum34 for Python 3.3 (do we need that version though?)
I had to effectively undo #2856 (“Generalize metadata functions and moving them to utils.py”) as part of the implementation, since the new extractors have different requirements
Some compiler APIs were changed as well, so old compilers might not work
If someone needs direct access to the Nikola extractor, it’s metadata_extractors.DEFAULT_EXTRACTOR — but most of the time, you should go through metadata_extractors_by or other convenience methods.
As I promised, the built-in extractors are not loaded by yapsy. This sort-of optimization could also be useful for a few other mandatory plugins, such as PostScanner or render_posts/render_pages

Questions:

What was match used for in re_meta? I didn’t see any uses, so I nuked it.
I removed support for \r\n when splitting metadata. The reason? As far as I can tell, this only matters when dealing with Windows-style newlines on Linux. (I may be wrong; we can restore it just in case.)
Right now, metadata_extractors_by is on the site object, but I’m not sure about it — perhaps it would work better in metadata_extractors.py as a global?
I kinda want to change create_post compiler API to def create_post(self, metadata, options): — opinions?

Future and caveats:

Compiler plugins may need changes for compatibility; I didn’t make them.
pkgindex needs a rewrite. I didn’t do that either.
I plan to extend annotations over to all plugin categories. PyCharm is oh-so-nice when you add type annotations.
We still parse reST/Markdown posts twice if using their respective meta formats. Memory usage would skyrocket if we tried to avoid that.

Comments and reviews welcome.

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

felixfontein

This looks really great! There are two things in plugin_categories.MetadataExtractor which you should definitely take a look at, but everything else looks fine!

felixfontein · 2017-07-06T19:45:53Z

nikola/metadata_extractors.py

+    """Check the conditions for a metadata extractor."""
+    for ct, arg in conditions:
+        if any((
+            ct == MetaCondition.config_bool and (arg not in config or (arg in config and not config[arg])),


Why not ct == MetaCondition.config_bool and not config.get(arg, False), or ct == MetaCondition.config_bool and (arg not in config or not config[arg])?

Brain fart. Fixed.
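For reference, the three spellings of the condition discussed here can be checked against each other. This is a small sketch; the function names and the `USE_FEATURE` key are made up for the comparison.

```python
def original(config: dict, arg: str) -> bool:
    """Condition as first written in the diff."""
    return arg not in config or (arg in config and not config[arg])


def with_get(config: dict, arg: str) -> bool:
    """First suggestion: a missing key is treated like a falsy value."""
    return not config.get(arg, False)


def simplified(config: dict, arg: str) -> bool:
    """Second suggestion: drop the redundant 'arg in config' test."""
    return arg not in config or not config[arg]


# Missing key, truthy values and several falsy values.
CASES = [
    {},
    {'USE_FEATURE': True},
    {'USE_FEATURE': False},
    {'USE_FEATURE': None},
    {'USE_FEATURE': 0},
    {'USE_FEATURE': 'yes'},
]

for case in CASES:
    results = {f(case, 'USE_FEATURE') for f in (original, with_get, simplified)}
    # All three functions agree for every case.
    assert len(results) == 1
```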

felixfontein · 2017-07-06T19:54:36Z

nikola/plugin_categories.py

+        """Extract metadata from filename."""
+        return {}
+
+    def write_metadata(self, metadata: dict, comment_wrap=False) -> str:


There should be a warning/note that such a method can empty the metadata dictionary. (The default Nikola extractor does that, for example.)

Better solution, change the behavior of Nikola’s extractor. A copy of a small dict shouldn’t hurt.

That's even better :)
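A sketch of the "copy first" approach agreed on above. The function name, the key ordering and the exact output layout are assumptions for illustration; only the `.. key: value` line shape is taken from the Nikola regex quoted in this review.

```python
def write_nikola_metadata(metadata: dict) -> str:
    """Serialize metadata as '.. key: value' lines without mutating the input."""
    remaining = dict(metadata)  # shallow copy; the pops below touch only the copy
    lines = []
    # Emit a few well-known keys first (this ordering is an assumption).
    for key in ('title', 'slug', 'date'):
        if key in remaining:
            lines.append('.. {0}: {1}'.format(key, remaining.pop(key)))
    # Then everything else, sorted for stable output.
    for key in sorted(remaining):
        lines.append('.. {0}: {1}'.format(key, remaining[key]))
    return '\n'.join(lines) + '\n'
```

Because only the copy is consumed, the caller's dictionary is still intact after the call.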

felixfontein · 2017-07-06T19:59:52Z

nikola/metadata_extractors.py

+    priority = MetaPriority.normal
+    supports_write = True
+    split_metadata_re = re.compile('\n\n')
+    nikola_re = re.compile('^\s*\.\. (.*?): (.*)')


Wouldn't it be better for this to be a raw string? I know it makes no difference, but for someone not that familiar with Python strings, it is not so easy to guess whether '\s' is an escape sequence or does something strange. (Also, you're writing '\\+' in the TOML metadata regex, which could also be written as '\+'.)

In Nikola, you’re right. In TOML, this is intended:

split_metadata_re = re.compile('\n\\+\\+\\+\n')

We need \n for a real newline. So the double backslash becomes a single backslash, and it ends up escaping the + from regexp’s point of view.

I know; I meant that '\+' == '\\+', which is true for the same reason as '\s' == '\\s' (because \s is not a valid escape sequence).

That’s deprecated behavior.

Makes sense; I was surprised that it works. I was just confused because here you seemed to use the (deprecated) behavior, while further down you were escaping, and I didn't know if that was intentional or not. But now that's resolved :)
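The string-escaping point in this thread can be checked directly. The two patterns below are the ones quoted in the review; the sample inputs are made up.

```python
import re

# A doubled backslash in a normal string equals a single one in a raw string.
assert '\\+' == r'\+'
assert '\\s' == r'\s'

# TOML separator: Python turns \n into a real newline and \\+ into \+,
# which the regex engine reads as a literal plus sign.
split_metadata_re = re.compile('\n\\+\\+\\+\n')

# Nikola pattern: nothing here needs Python-level escapes, so a raw string
# states the intent without relying on unrecognized-escape behavior.
nikola_re = re.compile(r'^\s*\.\. (.*?): (.*)')

# Sample inputs (invented for this example).
meta, content = split_metadata_re.split('title = "Hi"\n+++\nBody text\n', maxsplit=1)
match = nikola_re.match('  .. title: Hello')
```

A raw string would also work for the TOML pattern, since the `re` module interprets `\n` itself.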

felixfontein · 2017-07-06T20:03:43Z

nikola/nikola.py

@@ -728,6 +732,11 @@ def __init__(self, **config):
        if self.config['PRESERVE_EXIF_DATA'] and not self.config['EXIF_WHITELIST']:
            utils.LOGGER.warn('You are setting PRESERVE_EXIF_DATA and not EXIF_WHITELIST so EXIF data is not really kept.')

+        # TODO: should we keep backwards compat here?


It's nice to have a warning for removed or renamed features for at least one or two releases. If someone is still using this feature with an up-to-date Nikola, they will at least notice that something changed :)
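One way such a warning could look, assuming the TODO concerns a renamed setting like the UNSLUGIFY_TITLES rename listed among this PR's commits. The helper name, the logger and the "new name wins" rule are hypothetical, not what Nikola does.

```python
import logging

LOGGER = logging.getLogger('config_compat')

# Old name -> new name; this pair comes from the rename commit in this PR.
RENAMED_SETTINGS = {'UNSLUGIFY_TITLES': 'FILE_METADATA_UNSLUGIFY_TITLES'}


def apply_renamed_settings(config: dict) -> list:
    """Copy values from old setting names to new ones; return the old names seen."""
    migrated = []
    for old, new in RENAMED_SETTINGS.items():
        if old in config:
            LOGGER.warning('%s is deprecated; use %s instead.', old, new)
            # An explicitly set new name wins over the old one (an assumption).
            config.setdefault(new, config[old])
            migrated.append(old)
    return migrated
```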

felixfontein · 2017-07-06T20:06:19Z

nikola/plugin_categories.py

+    def split_metadata_from_text(self, source_text: str) -> (str, str):
+        """Split text into metadata and content (both strings)."""
+        if self.split_metadata_re is None:
+            return source_text


Shouldn't this be return '', source_text?

Not quite, see below.
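To illustrate why the return shape matters: a function annotated as returning a pair but returning a bare string can be unpacked wrongly or fail at the call site. The sketch below always returns a pair. Handing the whole text back as both halves when nothing can be split is an assumption made for this example and not necessarily what the PR ended up doing.

```python
import re
from typing import Optional, Tuple


def split_metadata_from_text(source_text: str,
                             split_re: Optional[re.Pattern]) -> Tuple[str, str]:
    """Always return a (metadata, content) pair, whatever the input."""
    if split_re is None:
        # No separator known: both consumers get the full text (assumption).
        return source_text, source_text
    parts = split_re.split(source_text.lstrip(), maxsplit=1)
    if len(parts) == 1:
        # Separator not found: same fallback as above.
        return parts[0], parts[0]
    return parts[0], parts[-1]


def unpack(value):
    """Show what a caller does with the result: unpack two items."""
    meta, content = value
    return meta, content
```

Unpacking a bare two-character string "works" by accident, and any other length raises ValueError, which is the kind of surprise a consistent tuple avoids.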

felixfontein · 2017-07-06T20:08:58Z

nikola/plugin_categories.py

+                return split_result[0], split_result[-1]
+
+    def extract_text(self, source_text: str) -> dict:
+        """Split file, return metadata and the content."""


The function is doing something else. It returns the parsed metadata, but not the content.

Leftover from a previous design. Fixed.

felixfontein · 2017-07-06T20:10:43Z

nikola/plugins/basic_import.py

@@ -166,7 +166,7 @@ def write_metadata(filename, title, slug, post_date, description, tags, **kwargs
        with io.open(filename, "w+", encoding="utf8") as fd:
            data = {'title': title, 'slug': slug, 'date': post_date, 'tags': ','.join(tags), 'description': description}
            data.update(kwargs)
-            fd.write(utils.write_metadata(data))
+            fd.write(utils.write_metadata(data, metadata_format='nikola', comment_wrap=False))


Even though it's technically not needed (because nikola is always provided by Nikola core), why not pass site as well?

There is no site in the importer plugins.

But you still have a site object (with all default options)?

Do we really? I can see references to that in the WordPress importer, but I have no idea where it comes from…

Every plugin has a set_site call which is called on plugin initialization. There is no documentation saying that this isn't the case for Command plugins, and in particular importer plugins. So I would assume it is a bug if the site object isn't there.

Ah, right. I misremembered needs_config — turns out a site always exists, but isn’t always required to be configured. I’ll fix that.

felixfontein · 2017-07-06T20:17:17Z

nikola/post.py

-    except AttributeError:
-        config = None
+    config = getattr(post, 'config', None)
+    metadata_extractors_by = getattr(post, 'metadata_extractors_by', metadata_extractors.default_metadata_extractors_by())


This always calls metadata_extractors.default_metadata_extractors_by(), even though it should never be used. Wouldn't it be better to add two more lines (if metadata_extractors_by is None: / metadata_extractors_by = metadata_extractors.default_metadata_extractors_by()) to avoid this call every time?
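The difference is easy to demonstrate with a counter. All names in this sketch are stand-ins; `default_extractors` plays the role of `default_metadata_extractors_by()`.

```python
CALLS = {'count': 0}


def default_extractors() -> dict:
    """Stand-in for an expensive default; counts how often it runs."""
    CALLS['count'] += 1
    return {'source': 'defaults'}


class Post:
    """Stand-in post that already has the attribute."""
    metadata_extractors_by = {'source': 'site'}


def eager(post):
    # The default argument is evaluated before getattr() even runs.
    return getattr(post, 'metadata_extractors_by', default_extractors())


def lazy(post):
    # The default is only built when the attribute is really missing.
    extractors = getattr(post, 'metadata_extractors_by', None)
    if extractors is None:
        extractors = default_extractors()
    return extractors
```

Calling `eager` on a post that has the attribute still bumps the counter; `lazy` does not.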

felixfontein · 2017-07-06T20:20:14Z

nikola/utils.py

+    """
+    # API compatibility
+    if metadata_format is None:
+        metadata_format = site.config.get('METADATA_FORMAT', 'nikola').lower()


In case site is None, this dies.
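A sketch of a guard that tolerates a missing site object. The helper name and the `FakeSite` class are hypothetical; the `METADATA_FORMAT` key and the `'nikola'` default are taken from the diff above.

```python
def resolve_metadata_format(metadata_format=None, site=None) -> str:
    """Pick a metadata format name, tolerating a missing site object."""
    if metadata_format is None:
        if site is not None:
            metadata_format = site.config.get('METADATA_FORMAT', 'nikola')
        else:
            # No site and no explicit format: fall back to the Nikola format.
            metadata_format = 'nikola'
    return metadata_format.lower()


class FakeSite:
    """Minimal stand-in exposing only a config dict."""
    config = {'METADATA_FORMAT': 'YAML'}
```

Resolving the format up front like this also means later code never has to handle a None format.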

felixfontein · 2017-07-06T20:22:03Z

nikola/utils.py

+        extractor.check_requirements()
+        return extractor.write_metadata(data, comment_wrap)
+    elif metadata_format is None:
+        pass  # Quiet fallback to Nikola


This won't be quiet, because the new if below will print out a warning.

I made sure metadata_format will never be None.

felixfontein · 2017-07-06T20:27:52Z

It's nice to see that Python also got type annotations. Also, I'm for enums; whether it will be enum34 or Python 3.4+ only, I don't mind.

tritium21 · 2017-07-07T04:16:11Z

What platforms ship 3.3 as the default python3? Are those platforms still supported? I'm pretty sure Ubuntu 12.04 was 3.2 and 14.04 was 3.4, but I could be wrong.

felixfontein · 2017-07-07T06:23:44Z

Debian 8 (jessie == oldstable) ships with Python 3.4, and Debian 9 probably with something newer.

Python 3.4 was released in March 2014, so Ubuntu 14.04 supports either 3.3 or 3.4, with 3.3 being more likely. I have no Ubuntu 14.04 boxes left to check, though...

Kwpolska · 2017-07-07T07:51:08Z

You don’t need boxen to check: 14.04 trusty ships 3.4.0, 9 stretch ships 3.5.3

Kwpolska · 2017-07-08T11:17:52Z

@felixfontein: thanks!

@ralsina: It’d be nice if you reviewed, too.

ralsina · 2017-07-08T12:29:23Z

@Kwpolska I'll give it a look this weekend.

ralsina

LGTM and is pretty awesome!

Kwpolska added 19 commits July 3, 2017 16:25

Rename UNSLUGIFY_TITLES → FILE_METADATA_UNSLUGIFY_TITLES

40c43dd

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Initial scaffolding for metadata extractor plugins

84c921a

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Use correct environment marker syntax

5aa1754

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Style fixes

73c1148

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Use metadata extractor APIs and plugins

4a1d0bb

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Fix galleries; add basic split support (temporary)

f7679bc

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Fix compatibility with .meta files

4a21369

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

#2856 was effectively undone

29e2719

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Make metadata splitting work

7c0f40a

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Code cleanups

9612fb9

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Fix tests, add new config_present condition

b562e61

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Allow whitespace before Nikola-style metadata

0cd3077

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Fix RSS tests

780bdf9

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Add MetadataExtractors documentation and override MetaPriority

653f3a0

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Style fixes

562533d

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Add tests for metadata extractors

a9fadb2

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Turns out we run flake8 on tests, too

72e5c54

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Make write_metadata work with metadata extractors

7e20b75

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Fix #2830 — add MetadataExtractor plugins

2bf7b61

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Kwpolska added the “breaks backwards compatibility” and “enhancement” labels Jul 5, 2017

Kwpolska added this to the v8.0.0 milestone Jul 5, 2017

Kwpolska requested review from ralsina and felixfontein July 5, 2017 13:36

Minor documentation fixes [ci skip]

e5fd904

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

getnikola deleted a comment Jul 6, 2017

felixfontein requested changes Jul 6, 2017

Kwpolska added 3 commits July 7, 2017 10:19

Address review by @felixfontein

932ffec

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Remove useless comment [ci skip]

2e05f03

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Slightly smarter return values for split_metadata_for_text

f8d1814

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

getnikola deleted a comment Jul 8, 2017

Pass site when writing metadata in importers

d482f68

Note this is a minor API change (removes @staticmethod) that needs to be reflected in some importers. cc @felixfontein

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

getnikola deleted a comment Jul 8, 2017

Document potentially confusing split_metadata_from_text behavior

67b20ee

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

getnikola deleted a comment Jul 8, 2017

felixfontein approved these changes Jul 8, 2017

ralsina approved these changes Jul 9, 2017

ralsina merged commit dc47ce9 into master Jul 9, 2017

ralsina deleted the metadata-extractors branch July 9, 2017 17:27

This was referenced Jul 23, 2017

[static_comments] Support more metadata formats via metadata extractors getnikola/plugins#241

Merged

[static_comments] Use json / yaml getnikola/plugins#232

Closed
