Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure link path to be URL encoded #2005

Closed
wants to merge 2 commits into from

Conversation

msnoigrs
Copy link
Contributor

@msnoigrs msnoigrs commented Sep 3, 2015

This PR allow to use characters not permitted in URL. For example, in the case of tag, category and author not sluged appear in a generated path.

@@ -41,7 +41,7 @@

<%def name="open_graph_metadata(post)">
%if use_open_graph:
<meta property="og:site_name" content="${blog_title|striphtml}">
<meta property="og:site_name" content="${blog_title|striphtml,h}">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ,h really necessary alongside striphtml and is it valid syntax?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It is valid syntax. http://docs.makotemplates.org/en/latest/filtering.html
striphtml call markupsafe.Markup.striptags().
https://github.com/mitsuhiko/markupsafe/blob/master/markupsafe/__init__.py#L148
This method only remove |<[^>]*> blocks and unescape.

@Kwpolska
Copy link
Member

Kwpolska commented Sep 3, 2015

Is URL encoding really needed? In 2015, where UTF-8 is the only encoding that exists, and we explicitly disallow " etc. in URLs (even with slugify disabled)?

@msnoigrs
Copy link
Contributor Author

msnoigrs commented Sep 3, 2015

At least as far as I know, In Japanese environment with UTF-8, URL encoding surely has been used.
And I'd like to use slugify disabled name for author. ex. /output/authors/日本語 名前/. In this case, URL encoding is needed.

@Kwpolska
Copy link
Member

Kwpolska commented Sep 3, 2015

It should just work without encoding (or it should be handled by someone else), are you sure this is necessary? Please test using a Japanese link in Chrome and Firefox, with slugify disabled and without your patch.

@msnoigrs
Copy link
Contributor Author

msnoigrs commented Sep 3, 2015

I know modern browsers can treat Japanese link correctly.
If I enter "日本語.html" in the address bar, browser will request to server by URL encoding.
urlencodeja
Search engines treat URLs with utf-8 and URLs with URL encoding separately. This is bad for SEO.
URL encoding has been used to prevent this.

Is it should configurable in the conf.py?

@msnoigrs msnoigrs force-pushed the for-upstream branch 2 times, most recently from 016ca61 to c63bdef Compare September 6, 2015 09:47
@msnoigrs
Copy link
Contributor Author

msnoigrs commented Sep 6, 2015

I made it configurable. Please check this again.
I need this feature for japanese site.

@Kwpolska
Copy link
Member

Kwpolska commented Sep 6, 2015

I really don’t know if we should merge this, especially with the switch…

@Aeyoun: what’s your opinion? Should this go in the core, or should this be something users would just do themselves (in custom templates)?

@da2x
Copy link
Contributor

da2x commented Sep 6, 2015

Adding |h in more places is probably good and will avoid issues like titles like Get your <body> on a diet!. Are we consistently either |h or |striphtml everywhere?

All URLs in Nikola are already compiling with “RFC-3986 standard for URIs, the RFC-3987 standard for IRIs, and the XML standard.” (sitemap FAQ).

Nikola uses the same URL fromat in the sitemap as well as for link canonicalization so there is no confusion in the terms of search engines and duplicate content. So the SEO argument is something I wouldn’t worry about at all in this case. (And I’m usually the SEO worrying type.) We’re already following best-practices. I’m not sure if any particular servers would do better with one format over the other. This is the only possible argument for changing this. This would have ether be a problem everyone or no one, so there should be no need for an option to optionally break URLs.

From the attached screenshot, everything part of the browser (and Nikola) seem to be doing the correct thing. In summary, besides |h-ing all the things, I don’t see where this branch accomplishes anything.

@da2x
Copy link
Contributor

da2x commented Sep 6, 2015

TL;DR; unless we need to change URLs because of either RFC-3986 or RFC-3987, we shouldn’t change a thing with the URLs.

@msnoigrs
Copy link
Contributor Author

msnoigrs commented Sep 6, 2015

@da2x
Copy link
Contributor

da2x commented Sep 6, 2015

That article covers the same RFCs as I referred to.

@da2x
Copy link
Contributor

da2x commented Sep 6, 2015

Maybe I’m misunderstand the issues here. I believe we already do use RFC-3986 / RFC-3987 (plus XML reserved-characters escaping) everywhere. If this is not the case, than things are broken and should be fixed and made non-optional. Updating these links should be done internally and not require any changes to any templates.

@masayuko, is Nikola not meeting RFC-3986 / RFC-3987 everywhere? nowhere? somewhere? what about sitemaps? atom/rss? If not, the link utilities that produced the links should be considered broken and need fixing. The link issue should not require template changes at all as Nikola should already be producing standard links.

We could add a new method utils.urlencode() and push every link through it before sending links be used in templates, feeds, sitemaps, etc.

@msnoigrs
Copy link
Contributor Author

msnoigrs commented Sep 6, 2015

The attached screenshot means the follow process is occured:
Request with UTF-8 encoded URL (user) -> URL encoding (browser) -> decode from URL encoding to UTF-8 (server) -> UTF-8 (filesystem) -> Response
I know Nikola is already compatible with these RFC.

I hope the below part:

Below is that same URL, UTF-8 encoded (for hosting on a server that uses that encoding) and URL escaped:

http://www.example.com/%C3%BCmlat.html&q=name

@msnoigrs
Copy link
Contributor Author

msnoigrs commented Sep 6, 2015

@Aeyoun At least, I hope sitemap, RSS/Atom, link tag for canonical is encoded with URL encoding. They have something to do with search engines.

@da2x
Copy link
Contributor

da2x commented Sep 6, 2015

The following method should produce correctly encoded links for anything that Nikola produces. (Limited to netloc and paths, because that’s all that Nikola controls.) I’ll make a new branch, stick this in utils.py, and make sure anything that produces a URL/IRI calls this method before returning it to a template, feed, sitemap, etc. It can be called multiple times, so I’ll add it to all existing abs_link() and other generic purpose link methods.

Opinions? @Kwpolska? @masayuko?

import urllib.parse
from collections import OrderedDict
from unicodedata import normalize as unicodenormalize

def encodelink(iri):
    iri = unicodenormalize('NFC', iri)
    link = OrderedDict(urllib.parse.urlparse(iri).__dict__)
    link['path'] = urllib.parse.quote(urllib.parse.unquote(link['path']))
    try:
        link['netloc'] = link['netloc'].encode('utf-8').decode('idna').encode('idna').decode('utf-8')
    except UnicodeDecodeError:
        link['netloc'] = link['netloc'].encode('idna').decode('utf-8')
    encoded_link = urllib.parse.urlunparse(link.values())
    return encoded_link.encode('utf-8')

Some examples, all producing the expected result. Why a method fulfilling this purpose isn’t already provided by urllib.parse is beyond me.

>>> encodelink("/test")
b'/test'
>>> encodelink("/æøå")
b'/%C3%A6%C3%B8%C3%A5'
>>> encodelink("/test-æøå")
b'/test-%C3%A6%C3%B8%C3%A5'
>>> encodelink("/日本語.html?ninja")
b'/%E6%97%A5%E6%9C%AC%E8%AA%9E.html?ninja'
>>> encodelink("/test-%C3%A6%C3%B8%C3%A5")
b'/test-%C3%A6%C3%B8%C3%A5'
>>> encodelink("/æøå?test")
b'/%C3%A6%C3%B8%C3%A5?test'
>>> encodelink("http://æøå.no/a & b?1&2")
b'http://xn--5cab8c.no/a%20%26%20b?1&2'
>>> encodelink("http://æøå.no/æøå")
b'http://xn--5cab8c.no/%C3%A6%C3%B8%C3%A5'
>>> encodelink("http://elg.no/stein")
b'http://elg.no/stein'
>>> encodelink("http://xn--5cab8c.no/%C3%A6%C3%B8%C3%A5")
b'http://xn--5cab8c.no/%C3%A6%C3%B8%C3%A5'

@msnoigrs
Copy link
Contributor Author

msnoigrs commented Sep 6, 2015

It's great and thank you for understanding. I agree.

@da2x da2x mentioned this pull request Sep 8, 2015
@msnoigrs msnoigrs closed this Sep 12, 2015
@msnoigrs msnoigrs deleted the for-upstream branch September 12, 2015 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants