Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve markdown compiler performance #2661

Merged
merged 5 commits into from Feb 16, 2017
Merged

Improve markdown compiler performance #2661

merged 5 commits into from Feb 16, 2017

Conversation

living180
Copy link
Contributor

This PR fixes issue #2660.

The first commit eliminates pathological performance when compiling a large number of markdown files at one time but preventing the markdown extensions list from growing each time a markdown file was compiled. On my machine this reduced the time required to compile 1000 markdown files from over 11 minutes to under 55 seconds.

The second commit provides a further incremental improvement by creating and reusing a single Markdown object instead of creating a separate Markdown object for each file that is compiled. This further reduces the compilation time for 1000 markdown files to under 34 seconds.

The CompileMarkdown.compile_string() method was always adding the
configured markdown extensions (from MARKDOWN_EXTENSIONS) to
self.extensions.  This meant that each time it was called,
self.extensions grew.  As a result, the markdown conversion grew slower
each time due to the increasing number of extensions.  Fix by adding the
configured extensions once in the set_site() method instead.
Copy link
Member

@Kwpolska Kwpolska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any benchmarks?

Also, add this change to CHANGES.txt and your name to AUTHORS.txt.

site_extensions = self.site.config.get("MARKDOWN_EXTENSIONS")
self.config_dependencies.append(str(sorted(site_extensions)))
extensions.extend(site_extensions)
self.markdown = Markdown(extensions=extensions, output_format="html5")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will crash if Markdown does not exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks.

plugin_info.plugin_object.short_help = plugin_info.description

site_extensions = self.site.config.get("MARKDOWN_EXTENSIONS")
self.config_dependencies.append(str(sorted(site_extensions)))
self.extensions.extend(site_extensions)
extensions.extend(site_extensions)
self.markdown = Markdown(extensions=extensions, output_format="html5")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are Markdown objects thread-safe? (You can tell doit to process tasks in parallel using threads, so it could be that two Markdown posts are compiled simultaneously.) I somehow doubt this as you call self.markdown.reset() after self.markdown.convert(data). If this is the case, you should better use thread-local storage for the Markdown object.
(That would also indirectly solve Kwpolska's remark.)

@ralsina
Copy link
Member

ralsina commented Feb 14, 2017

I agree, either make it thread local or protect it with a mutex.

@felixfontein
Copy link
Contributor

Since this is about performance, making it thread-local sounds like a much better choice than a mutex :)

@living180
Copy link
Contributor Author

@Kwpolska wrote

Any benchmarks?

In the PR description, I included some timings I measured (basically more than 11 minutes before my changes, then under 55 seconds with just the first commit, and finally under 34 seconds with both commits). Are you looking for something more than that? If so, could you please describe what you would like to see? I'm happy to try to accomodate

@Kwpolska
Copy link
Member

@living180 Ah, that’s good enough. But please address all our comments so we can proceed with this.

Stop using the markdown() convenience function in the
CompileMarkdown.compile_string() method, which creates a new Markdown
object each time it is called.

Instead, create a new class, ThreadLocalMarkdown, which is a subclass of
threading.local.  This class has a single attribute, markdown, which is
a per-thread Markdown instance.  It also has a single method, convert(),
which converts the data using the Markdown instance's convert() method,
and then resets the Markdown instance's internal state.

Then in the CompileMarkdown.set_site() method, instantiate a new
ThreadLocalMarkdown instance (unless the markdown module is unavailable)
and assign it to the plugin's converter attribute.  In the
compile_string() method, use that ThreadLocalMarkdown instance to
perform the conversion.

The end result is one Markdown object created per thread, instead of one
for each markdown file compiled.
@living180
Copy link
Contributor Author

Updated version - should not fail ungracefully if the markdown module is not installed, and is now thread-safe for parallel builds. (Testing with nikola build -n 4 reduces the runtime to 14s on my machine.)

"""Convert to markdown using per-thread Markdown objects."""

def __init__(self, extensions):
self.markdown = Markdown(extensions=extensions, output_format="html5")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether this is the correct way to use threading.local. Every thread creates a copy of this initial value, but depending on how this copy is created (maybe even none at all, by just copying the reference to the Markdown object) the resulting object is not thread-safe.

It might be better to set self.markdown to None, and to check self.markdown in convert() and initialize it there when its current value is None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, after reading some more about this, I realize I'm wrong and your approach is correct. threading.local is a really strange beast, it seems. (For anyone interested, it seems the only "real" documentation can be found in the docstring here: https://svn.python.org/projects/python/trunk/Lib/_threading_local.py.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here’s some test code:

import threading
from markdown import Markdown

class ThreadLocalMarkdown(threading.local):
    """Convert to markdown using per-thread Markdown objects."""

    def __init__(self, extensions):
        self.markdown = Markdown(extensions=extensions, output_format="html5")
        self.counter = 0

    def convert(self, data):
        """Convert data to HTML and reset internal state."""
        result = self.markdown.convert(data)
        self.markdown.reset()
        self.counter += 1
        print(self.counter)
        return result

    def show(self):
        print(self.counter)

converter = ThreadLocalMarkdown([])

def run_convert():
    converter.convert("Hello, **world!**")

threads = [threading.Thread(None, run_convert) for i in range(10)]
[t.start() for t in threads]
converter.show()

says a bunch of ones and a zero at the end, as expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With an integer it wouldn't be a problem anyway, because integers are immutable. If what I first thought would happen would really happen, you'd need a list or some other mutable object to test for it.

After reading the docstring for _threading_local, I'm happy with the code. I still think that threading.local viloates the principle of least astonishment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m taking this as an approval and merging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'm happy with this PR. I'm just not happy how threading.local was designed ;)

@Kwpolska Kwpolska removed the request for review from felixfontein February 16, 2017 11:04
@Kwpolska Kwpolska merged commit a50f8eb into getnikola:master Feb 16, 2017
@Kwpolska
Copy link
Member

Thanks for contributing to Nikola! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants