New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use shortcodes in a more robust manner #2737
Conversation
nikola/shortcodes.py
Outdated
# Previous shortcode closes here | ||
buffer.append(chunk) | ||
sc_id = str(uuid.uuid4()) | ||
new_data.append(sc_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UUID might not be enough, perhaps add some special marker that it’s a shortcode. Also, I assume it’s guaranteed not to appear in the final cache/ file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also some compiler might mangle the -
in the UUID. I'd remove the dashes to be on the safe side.
(And add some marker, like sc_id = 'SHORTCODE{0}REPLACEMENT'.format(uuid.uuid4().replace('-', ''))
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UUID should be unique enough, considering there are more UUIDs than atoms in the universe. The stuff about dashes sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UUIDs are unique within each other, but not in the world. If you just use an UUID, especially without dashes, you might conflict with, say, a checksum that is part of the user content (or any other 128+-bit integer in hex format — which is not so rare on security– or low-level-coding–oriented blogs).
nikola/plugins/task/categories.py
Outdated
@@ -116,6 +116,9 @@ def slugify_tag_name(self, name, lang): | |||
|
|||
def slugify_category_name(self, path, lang): | |||
"""Slugify a category name.""" | |||
if not path: # Probably used link://category |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this wasn't intended to be part of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's not.
nikola/nikola.py
Outdated
if lang is None: | ||
lang = utils.LocaleBorg().current_lang | ||
return shortcodes.apply_shortcodes(data, self.shortcode_registry, self, filename, lang=lang, with_dependencies=with_dependencies, extra_context=extra_context) | ||
|
||
def apply_shortcodes2(self, data, shortcodes, filename=None, lang=None, with_dependencies=False, extra_context=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, even the best people have functions named like this. apply_shortcodes_uuid
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
nikola/shortcodes.py
Outdated
@@ -83,6 +80,9 @@ def _skip_whitespace(data, pos, must_be_nontrivial=False): | |||
|
|||
def _skip_nonwhitespace(data, pos): | |||
"""Return first position not before pos which contains a non-whitespace character.""" | |||
for i, x in enumerate(data[pos:]): | |||
if x.isspace(): | |||
return pos + i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now you are iterating over the rest of the string twice in case no space is found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I am not? Sorry, it's late and I don't see it :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The for
loop tries to find the first non-whitespace. If it doesn't find anything, the while
loop (which is still there below) does the same thing afterwards, and it also won't find anything, so finally len(data)
will be returned. You can replace the while
loop with return len(data)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I forgot to delete all that. Will fix.
nikola/shortcodes.py
Outdated
# Previous shortcode closes here | ||
buffer.append(chunk) | ||
sc_id = str(uuid.uuid4()) | ||
new_data.append(sc_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also some compiler might mangle the -
in the UUID. I'd remove the dashes to be on the safe side.
(And add some marker, like sc_id = 'SHORTCODE{0}REPLACEMENT'.format(uuid.uuid4().replace('-', ''))
.)
nikola/shortcodes.py
Outdated
@@ -209,14 +209,58 @@ def _parse_shortcode_args(data, start, shortcode_name, start_pos): | |||
raise ParsingError("Shortcode '{0}' starting at {1} is not terminated correctly with '%}}}}'!".format(shortcode_name, _format_position(data, start_pos))) | |||
|
|||
|
|||
def _extract_shortcodes(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use a private function name? Isn't this function supposed to be used by plugins? It shouldn't start with an underscore IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's only meant to be used right here.
3. (_SHORTCODE_END, text, start, name) | ||
1. ("TEXT", text) | ||
2. ("SHORTCODE_START", text, start, name, args) | ||
3. ("SHORTCODE_END", text, start, name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you do these replacements? Before, you'd get a runtime error in case you did a typo (because the symbol with the typo doesn't exist). Now you won't notice anything until something behaves strange, and then you have to some debugging to find out that you did a typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because reading the parsed code made my head hurt.
This has a bug when shortcodes are "nested". Specifically this is used in the manual, with the "raw" shortcode wrapping another shortcode that should not be processed. The extract_shortcodes algorithm is wrong. I know how to fix it, it's going to take a bit longer. |
@Kwpolska @felixfontein if you could give this code another look, this is sort-of-done I think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I assume you tested this more than the default tests do it :)
nikola/shortcodes.py
Outdated
tail = splitted | ||
while True: | ||
new_text, tail = extract_data_chunk(tail) | ||
text += new_text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to make text
an array of strings, and return ''.join(text)
at the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
nikola/shortcodes.py
Outdated
a dictionary of shortcodes, where the keys are UUIDs and the values | ||
are the shortcodes themselves ready to process. | ||
|
||
>>> c = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doctests are not executed anymore (#2703), it might be better to make those real tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, well. I'll do that then.
@@ -81,14 +81,17 @@ def compile_string(self, data, source_path=None, is_two_file=True, post=None, la | |||
'language_code': LEGAL_VALUES['DOCUTILS_LOCALES'].get(LocaleBorg().current_lang, 'en') | |||
} | |||
|
|||
from nikola import shortcodes as sc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other compilers? Also, I think it might be a slightly better-looking API to put extract_shortcodes
in the Nikola object as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kwpolska Since the old API still works, later branches can port each one to the new API.
I did not put extract_shortcodes in the Nikola object mostly because there's nothing site-specific in it, and I am half convinced to use a real tokenizer in shortcodes.py which would make things awkward if things need refactoring.
@felixfontein the demo site has a bunch of shortcodes usage, and this passes the invariance test, so at least that much works :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
tests/test_shortcodes.py
Outdated
def test_extract_shortcodes(input, expected, monkeypatch): | ||
|
||
i = iter('SC%d' % i for i in range(1, 100)) | ||
if sys.version[0] < "3": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if sys.version_info[0] < 3:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, that's what it was and it failed :-P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you did sys.version[0] < 3
.
According to the documentation of sys.version
: "Do not extract version information out of it, rather, use version_info
[...]" (https://docs.python.org/2/library/sys.html#sys.version_info, https://docs.python.org/3/library/sys.html#sys.version)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhhh right.
]) | ||
def test_extract_shortcodes(input, expected, monkeypatch): | ||
|
||
i = iter('SC%d' % i for i in range(1, 100)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should change this to use itertools.counter anyway
Pull Request Checklist
Description
This changes how shortcodes are applied:
This avoids problems like rst mangling URLs and such.
Still needs work:
The older API can be kept for backwards compatibility since it has not been touched.