Improve markdown compiler performance #2661

living180 · 2017-02-14T15:09:09Z

This PR fixes issue #2660.

The first commit eliminates pathological performance when compiling a large number of markdown files at one time but preventing the markdown extensions list from growing each time a markdown file was compiled. On my machine this reduced the time required to compile 1000 markdown files from over 11 minutes to under 55 seconds.

The second commit provides a further incremental improvement by creating and reusing a single Markdown object instead of creating a separate Markdown object for each file that is compiled. This further reduces the compilation time for 1000 markdown files to under 34 seconds.

The CompileMarkdown.compile_string() method was always adding the configured markdown extensions (from MARKDOWN_EXTENSIONS) to self.extensions. This meant that each time it was called, self.extensions grew. As a result, the markdown conversion grew slower each time due to the increasing number of extensions. Fix by adding the configured extensions once in the set_site() method instead.

Kwpolska

Any benchmarks?

Also, add this change to CHANGES.txt and your name to AUTHORS.txt.

Kwpolska · 2017-02-14T15:11:20Z

nikola/plugins/compile/markdown/__init__.py

+        site_extensions = self.site.config.get("MARKDOWN_EXTENSIONS")
+        self.config_dependencies.append(str(sorted(site_extensions)))
+        extensions.extend(site_extensions)
+        self.markdown = Markdown(extensions=extensions, output_format="html5")


This will crash if Markdown does not exist.

Good catch, thanks.

felixfontein · 2017-02-14T15:34:21Z

nikola/plugins/compile/markdown/__init__.py

            plugin_info.plugin_object.short_help = plugin_info.description

        site_extensions = self.site.config.get("MARKDOWN_EXTENSIONS")
        self.config_dependencies.append(str(sorted(site_extensions)))
-        self.extensions.extend(site_extensions)
+        extensions.extend(site_extensions)
+        self.markdown = Markdown(extensions=extensions, output_format="html5")


Are Markdown objects thread-safe? (You can tell doit to process tasks in parallel using threads, so it could be that two Markdown posts are compiled simultaneously.) I somehow doubt this as you call self.markdown.reset() after self.markdown.convert(data). If this is the case, you should better use thread-local storage for the Markdown object.
(That would also indirectly solve Kwpolska's remark.)

ralsina · 2017-02-14T16:04:24Z

I agree, either make it thread local or protect it with a mutex.

felixfontein · 2017-02-14T16:11:50Z

Since this is about performance, making it thread-local sounds like a much better choice than a mutex :)

living180 · 2017-02-15T14:03:02Z

@Kwpolska wrote

Any benchmarks?

In the PR description, I included some timings I measured (basically more than 11 minutes before my changes, then under 55 seconds with just the first commit, and finally under 34 seconds with both commits). Are you looking for something more than that? If so, could you please describe what you would like to see? I'm happy to try to accomodate

Kwpolska · 2017-02-15T14:17:14Z

@living180 Ah, that’s good enough. But please address all our comments so we can proceed with this.

Stop using the markdown() convenience function in the CompileMarkdown.compile_string() method, which creates a new Markdown object each time it is called. Instead, create a new class, ThreadLocalMarkdown, which is a subclass of threading.local. This class has a single attribute, markdown, which is a per-thread Markdown instance. It also has a single method, convert(), which converts the data using the Markdown instance's convert() method, and then resets the Markdown instance's internal state. Then in the CompileMarkdown.set_site() method, instantiate a new ThreadLocalMarkdown instance (unless the markdown module is unavailable) and assign it to the plugin's converter attribute. In the compile_string() method, use that ThreadLocalMarkdown instance to perform the conversion. The end result is one Markdown object created per thread, instead of one for each markdown file compiled.

living180 · 2017-02-15T14:28:02Z

Updated version - should not fail ungracefully if the markdown module is not installed, and is now thread-safe for parallel builds. (Testing with nikola build -n 4 reduces the runtime to 14s on my machine.)

felixfontein · 2017-02-15T14:48:43Z

nikola/plugins/compile/markdown/__init__.py

+    """Convert to markdown using per-thread Markdown objects."""
+
+    def __init__(self, extensions):
+        self.markdown = Markdown(extensions=extensions, output_format="html5")


I'm not sure whether this is the correct way to use threading.local. Every thread creates a copy of this initial value, but depending on how this copy is created (maybe even none at all, by just copying the reference to the Markdown object) the resulting object is not thread-safe.

It might be better to set self.markdown to None, and to check self.markdown in convert() and initialize it there when its current value is None.

Ok, after reading some more about this, I realize I'm wrong and your approach is correct. threading.local is a really strange beast, it seems. (For anyone interested, it seems the only "real" documentation can be found in the docstring here: https://svn.python.org/projects/python/trunk/Lib/_threading_local.py.)

Here’s some test code:

import threading from markdown import Markdown class ThreadLocalMarkdown(threading.local): """Convert to markdown using per-thread Markdown objects.""" def __init__(self, extensions): self.markdown = Markdown(extensions=extensions, output_format="html5") self.counter = 0 def convert(self, data): """Convert data to HTML and reset internal state.""" result = self.markdown.convert(data) self.markdown.reset() self.counter += 1 print(self.counter) return result def show(self): print(self.counter) converter = ThreadLocalMarkdown([]) def run_convert(): converter.convert("Hello, **world!**") threads = [threading.Thread(None, run_convert) for i in range(10)] [t.start() for t in threads] converter.show()

says a bunch of ones and a zero at the end, as expected.

With an integer it wouldn't be a problem anyway, because integers are immutable. If what I first thought would happen would really happen, you'd need a list or some other mutable object to test for it.

After reading the docstring for _threading_local, I'm happy with the code. I still think that threading.local viloates the principle of least astonishment.

I’m taking this as an approval and merging.

Yep, I'm happy with this PR. I'm just not happy how threading.local was designed ;)

Kwpolska · 2017-02-16T11:05:25Z

Thanks for contributing to Nikola! 🎉

Kwpolska requested changes Feb 14, 2017

View reviewed changes

felixfontein reviewed Feb 14, 2017

View reviewed changes

living180 added 2 commits February 15, 2017 16:19

update CHANGES.txt/AUTHORS.txt

f5f9248

felixfontein reviewed Feb 15, 2017

View reviewed changes

Kwpolska approved these changes Feb 15, 2017

View reviewed changes

Add missing docstrings

5d3cc76

Kwpolska requested a review from felixfontein February 15, 2017 17:49

fix docstring quotes

6cbe4bb

Kwpolska removed the request for review from felixfontein February 16, 2017 11:04

Kwpolska merged commit a50f8eb into getnikola:master Feb 16, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve markdown compiler performance #2661

Improve markdown compiler performance #2661

living180 commented Feb 14, 2017

Kwpolska left a comment

Kwpolska Feb 14, 2017

living180 Feb 15, 2017

felixfontein Feb 14, 2017

ralsina commented Feb 14, 2017

felixfontein commented Feb 14, 2017

living180 commented Feb 15, 2017

Kwpolska commented Feb 15, 2017

living180 commented Feb 15, 2017

felixfontein Feb 15, 2017

felixfontein Feb 15, 2017

Kwpolska Feb 15, 2017

felixfontein Feb 15, 2017

Kwpolska Feb 16, 2017

felixfontein Feb 16, 2017

Kwpolska commented Feb 16, 2017

Improve markdown compiler performance #2661

Improve markdown compiler performance #2661

Conversation

living180 commented Feb 14, 2017

Kwpolska left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ralsina commented Feb 14, 2017

felixfontein commented Feb 14, 2017

living180 commented Feb 15, 2017

Kwpolska commented Feb 15, 2017

living180 commented Feb 15, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Kwpolska commented Feb 16, 2017