Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowing to write HTML fragments instead of only whole documents. #2546

Merged
merged 6 commits into from Oct 26, 2016

Conversation

felixfontein
Copy link
Contributor

Test implementation for the dicussion at the bottom of #2544.

@@ -16,6 +16,8 @@ Features
translated (Issue #2116)
* Pass ``post`` object and ``lang`` to post compilers (Issue #2531)
* Pass ``url_type`` into template's context.
* ``render_template`` and ``generic_renderer`` can now create HTML
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS. new changes go at the top.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I also move the url_type part, then?

doc = lxml.html.document_fromstring(data, parser)
self.rewrite_links(doc, src, context['lang'], url_type, is_fragment=is_fragment)
if is_fragment:
data = (doc.text or '').encode('utf-8') + ''.encode('utf-8').join([lxml.html.tostring(child, encoding='utf-8', method='html') for child in doc.iterchildren()])
Copy link
Member

@Kwpolska Kwpolska Oct 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL lxml has awful trees.

How would the solution that adds the extra div tag look? I don’t think we would really mind one little <div> there. Or worst case scenario, if we know the <div> is guaranteed to always be there, we could slice it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extra <div></div> is produced by lxml.html.tostring(doc, encoding='utf-8', method='html').

I don't really like cutting stuff out via text replacements.

"""Replace links in document to point to the right places."""
# First let lxml replace most of them
doc.rewrite_links(lambda dst: self.url_replacer(src, dst, lang, url_type), resolve_base_href=False)

# lxml ignores srcset in img and source elements, so do that by hand
objs = list(doc.xpath('(*//img|*//source)'))
objs = list(doc.xpath('({0}//img|{0}//source)'.format('' if is_fragment else '*')))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rewrite this as a 4-line if/else tree instead of abusing str.format. We should also find out what that asterisk changes (can non-fragment code work without it? Does it reach all <img> and <source> elements if the asterisk is not present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't reach top-level <img> tags with the asterisk there. I think (I don't know the xpath syntax well enough to be sure) that the asterisk eats up a top level element, and then it looks for children (or children-of-children etc.) of that top element.

Copy link
Member

@Kwpolska Kwpolska Oct 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about no asterisk and full document? (Please rewrite in 4 lines anyway) done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that should work as well, since there are no top-level <img> or <source> elements which have to be handled differently. I've changed it accordingly, which removes the whole if/else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The original syntax was introduced in 24a45e1.)

self.rewrite_links(doc, src, context['lang'], url_type)
data = b'<!DOCTYPE html>\n' + lxml.html.tostring(doc, encoding='utf8', method='html', pretty_print=True)
if is_fragment:
data = (doc.text or '').encode('utf-8') + ''.encode('utf-8').join([lxml.html.tostring(child, encoding='utf-8', method='html') for child in doc.iterchildren()])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this does. Doesn't it end up producing 2 copies of the doc's text?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc.text is all the text in doc that appears before a HTML element. Text that appears after HTML elements is taken care of by tostring(). (lxml is weird.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments with a similar explanation, hope it makes it clearer.

@ralsina
Copy link
Member

ralsina commented Oct 25, 2016

Ok then!

El lun., 24 oct. 2016 16:17, Chris Warrick notifications@github.com
escribió:

@Kwpolska commented on this pull request.

In nikola/nikola.py #2546:

     self.rewrite_links(doc, src, context['lang'], url_type)
  •    data = b'<!DOCTYPE html>\n' + lxml.html.tostring(doc, encoding='utf8', method='html', pretty_print=True)
    
  •    if is_fragment:
    
  •        data = (doc.text or '').encode('utf-8') + ''.encode('utf-8').join([lxml.html.tostring(child, encoding='utf-8', method='html') for child in doc.iterchildren()])
    

doc.text is all the text in doc that appears before a HTML element. Text
that appears after HTML elements is taken care of by tostring(). (lxml is
weird.)


You are receiving this because you commented.

Reply to this email directly, view it on GitHub
#2546, or mute the thread
https://github.com/notifications/unsubscribe-auth/AAAGKxTwjY0Lt6_yVDjuoJ4sIwRpFP0Tks5q3L3egaJpZM4KeM0s
.

@felixfontein
Copy link
Contributor Author

One more thing I had to try out after seeing that lxml.html.tostring has a doctype argument. :)

@felixfontein
Copy link
Contributor Author

The one test fails because there were some connection problems to pypi while installing doit. All other tests had no problems.

@Kwpolska
Copy link
Member

Those tests are pretty worthless anyways.

@felixfontein
Copy link
Contributor Author

Ok. And thanks for the approval. Anyone mind if I merge?

@Kwpolska
Copy link
Member

Go ahead.

@felixfontein felixfontein merged commit 9a09d5d into master Oct 26, 2016
@felixfontein felixfontein deleted the allow-html-fragment-output branch October 26, 2016 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants