New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extendable metadata extractors (#2830) #2861
Conversation
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really great! There are two things in plugin_categories.MetadataExtractor
which you should definitely take a look at, but everything else looks fine!
nikola/metadata_extractors.py
Outdated
"""Check the conditions for a metadata extractor.""" | ||
for ct, arg in conditions: | ||
if any(( | ||
ct == MetaCondition.config_bool and (arg not in config or (arg in config and not config[arg])), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not ct == MetaCondition.config_bool and not config.get(arg, False),
or ct == MetaCondition.config_bool and (arg not in config or not config[arg])
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brain fart. Fixed.
"""Extract metadata from filename.""" | ||
return {} | ||
|
||
def write_metadata(self, metadata: dict, comment_wrap=False) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a warning/note that such a method can empty the metadata
dictionary. (The default Nikola extractor does that, for example.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better solution, change the behavior of Nikola’s extractor. A copy of a small dict shouldn’t hurt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's even better :)
nikola/metadata_extractors.py
Outdated
priority = MetaPriority.normal | ||
supports_write = True | ||
split_metadata_re = re.compile('\n\n') | ||
nikola_re = re.compile('^\s*\.\. (.*?): (.*)') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better for this to be a raw string? I know it makes no difference, but for someone not that familiar with Python strings, it is not so easy to guess whether '\s'
is an escape sequence or does something strange. (Also, you're writing '\\+'
in the TOML metadata regex, should could also be written as '\+'
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Nikola, you’re right. In TOML, this is intended:
split_metadata_re = re.compile('\n\\+\\+\\+\n')
We need \n
for a real newline. So the double backslash becomes a single backslash, and it ends up escaping the +
from regexp’s point of view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know; I meant that '\+' == '\\+'
, which is true for the same reason than '\s' == '\\s'
(because \s
is no valid escape sequence).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense; I was surprised that it works. I was just confused because here you seemed to use the (deprecated) behavior, while further down you were escaping, and I didn't know if that was intentional or not. But now that's resolved :)
nikola/nikola.py
Outdated
@@ -728,6 +732,11 @@ def __init__(self, **config): | |||
if self.config['PRESERVE_EXIF_DATA'] and not self.config['EXIF_WHITELIST']: | |||
utils.LOGGER.warn('You are setting PRESERVE_EXIF_DATA and not EXIF_WHITELIST so EXIF data is not really kept.') | |||
|
|||
# TODO: should we keep backwards compat here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice to have a warning for some removed/renamed features for at least one or two releases. If someone is still using this feature with an up-to-date Nikola, he at least will notice that something changed :)
nikola/plugin_categories.py
Outdated
def split_metadata_from_text(self, source_text: str) -> (str, str): | ||
"""Split text into metadata and content (both strings).""" | ||
if self.split_metadata_re is None: | ||
return source_text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be return '', source_text
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite, see below.
nikola/plugin_categories.py
Outdated
return split_result[0], split_result[-1] | ||
|
||
def extract_text(self, source_text: str) -> dict: | ||
"""Split file, return metadata and the content.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function is doing something else. It returns the parsed metadata, but not the content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover from a previous design. Fixed.
nikola/plugins/basic_import.py
Outdated
@@ -166,7 +166,7 @@ def write_metadata(filename, title, slug, post_date, description, tags, **kwargs | |||
with io.open(filename, "w+", encoding="utf8") as fd: | |||
data = {'title': title, 'slug': slug, 'date': post_date, 'tags': ','.join(tags), 'description': description} | |||
data.update(kwargs) | |||
fd.write(utils.write_metadata(data)) | |||
fd.write(utils.write_metadata(data, metadata_format='nikola', comment_wrap=False)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though it's technically not needed (because nikola
is always provided by Nikola core), why not pass site
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no site in the importer plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you still have a site
object (with all default options)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really? I can see references to that in the WordPress importer, but I have no idea where it comes from…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every plugin has a set_site
call which is called on plugin initialization. There is no documentation saying that this isn't the case for Command
plugins, and in particular importer plugins. So I would assume it is a bug if the site
object isn't there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. I misremembered needs_config
— turns out a site always exists, but isn’t always required to be configured. I’ll fix that.
nikola/post.py
Outdated
except AttributeError: | ||
config = None | ||
config = getattr(post, 'config', None) | ||
metadata_extractors_by = getattr(post, 'metadata_extractors_by', metadata_extractors.default_metadata_extractors_by()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This always calls metadata_extractors.default_metadata_extractors_by()
, even though it should never be used. Wouldn't it be better to add two more lines (if metadata_extractors_by is None:
/ metadata_extractors_by = metadata_extractors.default_metadata_extractors_by()
) to avoid this call every time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
nikola/utils.py
Outdated
""" | ||
# API compatibility | ||
if metadata_format is None: | ||
metadata_format = site.config.get('METADATA_FORMAT', 'nikola').lower() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case site
is done, this dies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
nikola/utils.py
Outdated
extractor.check_requirements() | ||
return extractor.write_metadata(data, comment_wrap) | ||
elif metadata_format is None: | ||
pass # Quiet fallback to Nikola |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't be quiet, because the new if
below will print out a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made sure metadata_format
will never be None.
It's nice to see that Python also got type annotations. Also, I'm for enums; whether it will be |
What platforms ship 3.3 as the default python3? Are those platforms still supported? I'm pretty sure ubuntu 12.04 was 3.2, but 14.04 was 3.4, but i could be wrong. |
Debian 8 (jessie == oldstable) ships with Python 3.4, and Debian 9 probably with something newer. Python 3.4 was released in March 2014, so Ubuntu 14.04 supports either 3.3 or 3.4, with 3.3 being more likely. I have no Ubuntu 14.04 boxes left to check, though... |
You don’t need boxen to check: 14.04 trusty ships 3.4.0, 9 stretch ships 3.5.3 |
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Note this is a minor API change (removes @staticmethod) that needs to be reflected in some importers. cc @felixfontein Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@felixfontein: thanks! @ralsina: It’d be nice if you reviewed, too. |
@Kwpolska I'll give it a look this weekend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and is pretty awesome!
This is #2830 — extendable metadata extractors.
Every extractor plugin specifies the source of information (currently supported are text and filename) and its priority (based on how well it knows it’s needed — YAML/TOML use markers, so they’re
specialized
; Nikola isnormal
, and FileMetadataRegexp isfallback
). This tries to be extendable, although I probably missed a few things regarding openness.Highlights:
enum34
for Python 3.3 (do we need that version though?)metadata_extractors_by
or other convenience methods.Questions:
match
used for inre_meta
? I didn’t see any uses, so I nuked it.\r\n
when splitting metadata. The reason? This happens only when dealing with Windows-style newlines on Linux. (unless I’m wrong. We can restore it just in case.)create_post
compiler API todef create_post(self, metadata, options):
— opinions?Future and caveats:
pkgindex
needs a rewrite. I didn’t do that either.Comments and reviews welcome.