New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve markdown compiler performance #2661
Improve markdown compiler performance #2661
Conversation
The CompileMarkdown.compile_string() method was always adding the configured markdown extensions (from MARKDOWN_EXTENSIONS) to self.extensions. This meant that each time it was called, self.extensions grew. As a result, the markdown conversion grew slower each time due to the increasing number of extensions. Fix by adding the configured extensions once in the set_site() method instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any benchmarks?
Also, add this change to CHANGES.txt and your name to AUTHORS.txt.
site_extensions = self.site.config.get("MARKDOWN_EXTENSIONS") | ||
self.config_dependencies.append(str(sorted(site_extensions))) | ||
extensions.extend(site_extensions) | ||
self.markdown = Markdown(extensions=extensions, output_format="html5") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will crash if Markdown
does not exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks.
plugin_info.plugin_object.short_help = plugin_info.description | ||
|
||
site_extensions = self.site.config.get("MARKDOWN_EXTENSIONS") | ||
self.config_dependencies.append(str(sorted(site_extensions))) | ||
self.extensions.extend(site_extensions) | ||
extensions.extend(site_extensions) | ||
self.markdown = Markdown(extensions=extensions, output_format="html5") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are Markdown
objects thread-safe? (You can tell doit to process tasks in parallel using threads, so it could be that two Markdown posts are compiled simultaneously.) I somehow doubt this as you call self.markdown.reset()
after self.markdown.convert(data)
. If this is the case, you should better use thread-local storage for the Markdown
object.
(That would also indirectly solve Kwpolska's remark.)
I agree, either make it thread local or protect it with a mutex. |
Since this is about performance, making it thread-local sounds like a much better choice than a mutex :) |
@Kwpolska wrote
In the PR description, I included some timings I measured (basically more than 11 minutes before my changes, then under 55 seconds with just the first commit, and finally under 34 seconds with both commits). Are you looking for something more than that? If so, could you please describe what you would like to see? I'm happy to try to accomodate |
@living180 Ah, that’s good enough. But please address all our comments so we can proceed with this. |
Stop using the markdown() convenience function in the CompileMarkdown.compile_string() method, which creates a new Markdown object each time it is called. Instead, create a new class, ThreadLocalMarkdown, which is a subclass of threading.local. This class has a single attribute, markdown, which is a per-thread Markdown instance. It also has a single method, convert(), which converts the data using the Markdown instance's convert() method, and then resets the Markdown instance's internal state. Then in the CompileMarkdown.set_site() method, instantiate a new ThreadLocalMarkdown instance (unless the markdown module is unavailable) and assign it to the plugin's converter attribute. In the compile_string() method, use that ThreadLocalMarkdown instance to perform the conversion. The end result is one Markdown object created per thread, instead of one for each markdown file compiled.
Updated version - should not fail ungracefully if the |
"""Convert to markdown using per-thread Markdown objects.""" | ||
|
||
def __init__(self, extensions): | ||
self.markdown = Markdown(extensions=extensions, output_format="html5") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether this is the correct way to use threading.local
. Every thread creates a copy of this initial value, but depending on how this copy is created (maybe even none at all, by just copying the reference to the Markdown
object) the resulting object is not thread-safe.
It might be better to set self.markdown
to None
, and to check self.markdown
in convert()
and initialize it there when its current value is None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, after reading some more about this, I realize I'm wrong and your approach is correct. threading.local
is a really strange beast, it seems. (For anyone interested, it seems the only "real" documentation can be found in the docstring here: https://svn.python.org/projects/python/trunk/Lib/_threading_local.py.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here’s some test code:
import threading
from markdown import Markdown
class ThreadLocalMarkdown(threading.local):
"""Convert to markdown using per-thread Markdown objects."""
def __init__(self, extensions):
self.markdown = Markdown(extensions=extensions, output_format="html5")
self.counter = 0
def convert(self, data):
"""Convert data to HTML and reset internal state."""
result = self.markdown.convert(data)
self.markdown.reset()
self.counter += 1
print(self.counter)
return result
def show(self):
print(self.counter)
converter = ThreadLocalMarkdown([])
def run_convert():
converter.convert("Hello, **world!**")
threads = [threading.Thread(None, run_convert) for i in range(10)]
[t.start() for t in threads]
converter.show()
says a bunch of ones and a zero at the end, as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With an integer it wouldn't be a problem anyway, because integers are immutable. If what I first thought would happen would really happen, you'd need a list or some other mutable object to test for it.
After reading the docstring for _threading_local
, I'm happy with the code. I still think that threading.local
viloates the principle of least astonishment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m taking this as an approval and merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'm happy with this PR. I'm just not happy how threading.local
was designed ;)
Thanks for contributing to Nikola! 🎉 |
This PR fixes issue #2660.
The first commit eliminates pathological performance when compiling a large number of markdown files at one time but preventing the markdown extensions list from growing each time a markdown file was compiled. On my machine this reduced the time required to compile 1000 markdown files from over 11 minutes to under 55 seconds.
The second commit provides a further incremental improvement by creating and reusing a single Markdown object instead of creating a separate Markdown object for each file that is compiled. This further reduces the compilation time for 1000 markdown files to under 34 seconds.