Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use shortcodes in a more robust manner #2737

Merged
merged 24 commits into from May 10, 2017
Merged

Use shortcodes in a more robust manner #2737

merged 24 commits into from May 10, 2017

Conversation

ralsina
Copy link
Member

@ralsina ralsina commented Apr 28, 2017

Pull Request Checklist

  • I’ve read the guidelines for contributing.
  • I updated AUTHORS.txt and CHANGES.txt (if the change is non-trivial) and documentation (if applicable).
  • I tested my changes.

Description

This changes how shortcodes are applied:

  • Before compiling the file, shortcodes are extracted and replaced by UUIDs
  • After compiling, shortcodes are processed and re-inserted where the UUID markers are

This avoids problems like rst mangling URLs and such.

Still needs work:

  • apply_shortcodes_2 is not a proper function name ;-)
  • All compilers can be switched to this approach gradually, but I would prefer if they all switched at once.

The older API can be kept for backwards compatibility since it has not been touched.

# Previous shortcode closes here
buffer.append(chunk)
sc_id = str(uuid.uuid4())
new_data.append(sc_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UUID might not be enough, perhaps add some special marker that it’s a shortcode. Also, I assume it’s guaranteed not to appear in the final cache/ file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also some compiler might mangle the - in the UUID. I'd remove the dashes to be on the safe side.
(And add some marker, like sc_id = 'SHORTCODE{0}REPLACEMENT'.format(uuid.uuid4().replace('-', '')).)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UUID should be unique enough, considering there are more UUIDs than atoms in the universe. The stuff about dashes sounds good.

Copy link
Member

@Kwpolska Kwpolska May 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUIDs are unique within each other, but not in the world. If you just use an UUID, especially without dashes, you might conflict with, say, a checksum that is part of the user content (or any other 128+-bit integer in hex format — which is not so rare on security– or low-level-coding–oriented blogs).

@@ -116,6 +116,9 @@ def slugify_tag_name(self, name, lang):

def slugify_category_name(self, path, lang):
"""Slugify a category name."""
if not path: # Probably used link://category
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this wasn't intended to be part of this PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's not.

nikola/nikola.py Outdated
if lang is None:
lang = utils.LocaleBorg().current_lang
return shortcodes.apply_shortcodes(data, self.shortcode_registry, self, filename, lang=lang, with_dependencies=with_dependencies, extra_context=extra_context)

def apply_shortcodes2(self, data, shortcodes, filename=None, lang=None, with_dependencies=False, extra_context=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, even the best people have functions named like this. apply_shortcodes_uuid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@@ -83,6 +80,9 @@ def _skip_whitespace(data, pos, must_be_nontrivial=False):

def _skip_nonwhitespace(data, pos):
"""Return first position not before pos which contains a non-whitespace character."""
for i, x in enumerate(data[pos:]):
if x.isspace():
return pos + i
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now you are iterating over the rest of the string twice in case no space is found.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I am not? Sorry, it's late and I don't see it :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The for loop tries to find the first non-whitespace. If it doesn't find anything, the while loop (which is still there below) does the same thing afterwards, and it also won't find anything, so finally len(data) will be returned. You can replace the while loop with return len(data).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I forgot to delete all that. Will fix.

# Previous shortcode closes here
buffer.append(chunk)
sc_id = str(uuid.uuid4())
new_data.append(sc_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also some compiler might mangle the - in the UUID. I'd remove the dashes to be on the safe side.
(And add some marker, like sc_id = 'SHORTCODE{0}REPLACEMENT'.format(uuid.uuid4().replace('-', '')).)

@@ -209,14 +209,58 @@ def _parse_shortcode_args(data, start, shortcode_name, start_pos):
raise ParsingError("Shortcode '{0}' starting at {1} is not terminated correctly with '%}}}}'!".format(shortcode_name, _format_position(data, start_pos)))


def _extract_shortcodes(data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use a private function name? Isn't this function supposed to be used by plugins? It shouldn't start with an underscore IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's only meant to be used right here.

3. (_SHORTCODE_END, text, start, name)
1. ("TEXT", text)
2. ("SHORTCODE_START", text, start, name, args)
3. ("SHORTCODE_END", text, start, name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you do these replacements? Before, you'd get a runtime error in case you did a typo (because the symbol with the typo doesn't exist). Now you won't notice anything until something behaves strange, and then you have to some debugging to find out that you did a typo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because reading the parsed code made my head hurt.

@ralsina ralsina changed the title WIP: use shortcodes in a more robust manner Use shortcodes in a more robust manner May 4, 2017
@ralsina ralsina changed the title Use shortcodes in a more robust manner WIP: Use shortcodes in a more robust manner May 4, 2017
@ralsina ralsina changed the title WIP: Use shortcodes in a more robust manner Use shortcodes in a more robust manner May 4, 2017
@ralsina
Copy link
Member Author

ralsina commented May 5, 2017

This has a bug when shortcodes are "nested". Specifically this is used in the manual, with the "raw" shortcode wrapping another shortcode that should not be processed.

The extract_shortcodes algorithm is wrong. I know how to fix it, it's going to take a bit longer.

@ralsina
Copy link
Member Author

ralsina commented May 7, 2017

@Kwpolska @felixfontein if you could give this code another look, this is sort-of-done I think.

Copy link
Contributor

@felixfontein felixfontein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I assume you tested this more than the default tests do it :)

tail = splitted
while True:
new_text, tail = extract_data_chunk(tail)
text += new_text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to make text an array of strings, and return ''.join(text) at the end?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

a dictionary of shortcodes, where the keys are UUIDs and the values
are the shortcodes themselves ready to process.

>>> c = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doctests are not executed anymore (#2703), it might be better to make those real tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, well. I'll do that then.

@@ -81,14 +81,17 @@ def compile_string(self, data, source_path=None, is_two_file=True, post=None, la
'language_code': LEGAL_VALUES['DOCUTILS_LOCALES'].get(LocaleBorg().current_lang, 'en')
}

from nikola import shortcodes as sc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other compilers? Also, I think it might be a slightly better-looking API to put extract_shortcodes in the Nikola object as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kwpolska Since the old API still works, later branches can port each one to the new API.

I did not put extract_shortcodes in the Nikola object mostly because there's nothing site-specific in it, and I am half convinced to use a real tokenizer in shortcodes.py which would make things awkward if things need refactoring.

@ralsina
Copy link
Member Author

ralsina commented May 7, 2017

@felixfontein the demo site has a bunch of shortcodes usage, and this passes the invariance test, so at least that much works :-)

Copy link
Member

@Kwpolska Kwpolska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

def test_extract_shortcodes(input, expected, monkeypatch):

i = iter('SC%d' % i for i in range(1, 100))
if sys.version[0] < "3":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if sys.version_info[0] < 3:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, that's what it was and it failed :-P

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you did sys.version[0] < 3.

According to the documentation of sys.version: "Do not extract version information out of it, rather, use version_info [...]" (https://docs.python.org/2/library/sys.html#sys.version_info, https://docs.python.org/3/library/sys.html#sys.version)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhhh right.

])
def test_extract_shortcodes(input, expected, monkeypatch):

i = iter('SC%d' % i for i in range(1, 100))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should change this to use itertools.counter anyway

@ralsina ralsina merged commit 77ddb6f into master May 10, 2017
@ralsina ralsina deleted the shortcodes2 branch May 10, 2017 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants