Ensure link path to be URL encoded #2005

msnoigrs · 2015-09-03T04:03:14Z

This PR allows the use of characters that are not permitted in URLs. For example, tag, category, and author names that are not slugified appear in generated paths.

Kwpolska · 2015-09-03T15:06:49Z

nikola/data/themes/base/templates/post_helper.tmpl

@@ -41,7 +41,7 @@

 <%def name="open_graph_metadata(post)">
 %if use_open_graph:
-    <meta property="og:site_name" content="${blog_title|striphtml}">
+    <meta property="og:site_name" content="${blog_title|striphtml,h}">


Is ,h really necessary alongside striphtml and is it valid syntax?

Yes, it is valid syntax: http://docs.makotemplates.org/en/latest/filtering.html
striphtml calls markupsafe.Markup.striptags():
https://github.com/mitsuhiko/markupsafe/blob/master/markupsafe/__init__.py#L148
This method only removes |<[^>]*> blocks and unescapes entities, so its result still has to be escaped with h.
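
To illustrate why h is still applied after striphtml, here is a minimal sketch using only the Python standard library. `strip_tags` is a hypothetical, simplified stand-in for the filter (it is not Mako's or MarkupSafe's actual code), and the sample title is made up.

```python
import html
import re

# Hypothetical, simplified stand-in for striphtml: remove tag-like blocks,
# then turn entities such as &amp; back into plain characters.
_TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(text):
    return html.unescape(_TAG_RE.sub("", text))


def strip_then_escape(text):
    # Same order as ${value|striphtml,h}: strip first, escape last.
    return html.escape(strip_tags(text), quote=True)


title = 'Tom &amp; Jerry <b>"Best"</b>'
stripped = strip_tags(title)        # plain text, not safe inside an attribute
escaped = strip_then_escape(title)  # safe inside content="..."
print(stripped)
print(escaped)
```

After stripping alone, the bare `&` and `"` characters are back in the string, which is why the second filter matters inside an attribute value.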

Kwpolska · 2015-09-03T15:10:35Z

Is URL encoding really needed? In 2015, where UTF-8 is the only encoding that exists, and we explicitly disallow " etc. in URLs (even with slugify disabled)?

msnoigrs · 2015-09-03T17:12:48Z

As far as I know, URL encoding is still in use in Japanese environments, even with UTF-8.
I would also like to use non-slugified names for authors, e.g. /output/authors/日本語名前/. In that case, URL encoding is needed.

Kwpolska · 2015-09-03T17:48:05Z

It should just work without encoding (or it should be handled by someone else), are you sure this is necessary? Please test using a Japanese link in Chrome and Firefox, with slugify disabled and without your patch.

msnoigrs · 2015-09-03T19:15:18Z

I know modern browsers handle Japanese links correctly.
If I enter "日本語.html" in the address bar, the browser URL-encodes it before sending the request to the server.

Search engines treat raw UTF-8 URLs and URL-encoded URLs as separate URLs, which is bad for SEO.
URL encoding has been used to prevent this.

Should it be configurable in conf.py?
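
As a rough illustration of the address-bar case above, the sketch below uses `urllib.parse.quote` to approximate what a browser sends. The variable names are only for illustration, and real browsers may differ in which characters they leave unescaped.

```python
from urllib.parse import quote, unquote

typed = "/日本語.html"

# Percent-encode the UTF-8 bytes of the path, roughly as a browser does
# before putting the request on the wire.
on_the_wire = quote(typed)
print(on_the_wire)

# The two spellings are different strings, although both name the same file.
print(typed == on_the_wire)
print(unquote(on_the_wire) == typed)
```

Two distinct strings for one resource is the situation the SEO concern is about.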

msnoigrs · 2015-09-06T10:50:04Z

I made it configurable. Please check it again.
I need this feature for a Japanese site.

Kwpolska · 2015-09-06T11:05:58Z

I really don’t know if we should merge this, especially with the switch…

@Aeyoun: what’s your opinion? Should this go in the core, or should this be something users would just do themselves (in custom templates)?

da2x · 2015-09-06T14:04:00Z

Adding |h in more places is probably good and will avoid issues with titles like “Get your <body> on a diet!”. Are we consistently using either |h or |striphtml everywhere?
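
A small sketch of the difference, assuming the standard library's `html.escape` as a rough equivalent of |h and a simplified, hypothetical regex in place of |striphtml:

```python
import html
import re

title = "Get your <body> on a diet!"

# |h-style handling: keep the text, neutralise the markup characters.
escaped = html.escape(title)

# Tag stripping (simplified stand-in): the word in angle brackets disappears.
stripped = re.sub(r"<[^>]*>", "", title)

print(escaped)
print(stripped)
```

Escaping keeps the title readable, while stripping silently drops part of it.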

All URLs in Nikola already comply with the “RFC-3986 standard for URIs, the RFC-3987 standard for IRIs, and the XML standard.” (sitemap FAQ).

Nikola uses the same URL format in the sitemap as well as for link canonicalization, so there is no confusion in terms of search engines and duplicate content. So the SEO argument is something I wouldn’t worry about at all in this case. (And I’m usually the SEO worrying type.) We’re already following best practices. I’m not sure if any particular servers would do better with one format over the other; that is the only possible argument for changing this. It would have to be a problem for either everyone or no one, so there should be no need for an option to optionally break URLs.

From the attached screenshot, every part of the browser (and Nikola) seems to be doing the correct thing. In summary, besides |h-ing all the things, I don’t see what this branch accomplishes.

da2x · 2015-09-06T14:05:22Z

TL;DR: unless we need to change URLs because of either RFC-3986 or RFC-3987, we shouldn’t change a thing with the URLs.

msnoigrs · 2015-09-06T17:17:13Z

Please refer to https://support.google.com/webmasters/answer/35653?hl=en

da2x · 2015-09-06T17:26:56Z

That article covers the same RFCs as I referred to.

da2x · 2015-09-06T17:37:18Z

Maybe I’m misunderstanding the issue here. I believe we already use RFC-3986 / RFC-3987 (plus XML reserved-character escaping) everywhere. If this is not the case, then things are broken and should be fixed, and the fix should be non-optional. Updating these links should be done internally and not require any changes to any templates.

@masayuko, is Nikola not meeting RFC-3986 / RFC-3987 everywhere? nowhere? somewhere? what about sitemaps? atom/rss? If not, the link utilities that produced the links should be considered broken and need fixing. The link issue should not require template changes at all as Nikola should already be producing standard links.

We could add a new method utils.urlencode() and push every link through it before the links are used in templates, feeds, sitemaps, etc.

msnoigrs · 2015-09-06T17:40:46Z

The attached screenshot shows that the following process occurred:
Request with UTF-8 encoded URL (user) -> URL encoding (browser) -> decode from URL encoding to UTF-8 (server) -> UTF-8 (filesystem) -> Response
I know Nikola is already compatible with these RFCs.

What I am hoping for is the part below:

Below is that same URL, UTF-8 encoded (for hosting on a server that uses that encoding) and URL escaped:

http://www.example.com/%C3%BCmlat.html&q=name
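
The two steps named in that quote, and the server-side decode from the process above, can be sketched with the standard library. Only the file name part of the URL is used here, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote

name = "ümlat.html"

utf8_bytes = name.encode("utf-8")  # step 1: UTF-8 encode
escaped = quote(utf8_bytes)        # step 2: URL-escape the bytes
print(escaped)

# What a server does on receipt: undo the escaping and decode as UTF-8.
decoded = unquote(escaped, encoding="utf-8")
print(decoded == name)
```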

msnoigrs · 2015-09-06T17:49:18Z

@Aeyoun At the very least, I hope the sitemap, the RSS/Atom feeds, and the canonical link tag are URL-encoded. These are the parts that search engines read.

da2x · 2015-09-06T19:20:58Z

The following method should produce correctly encoded links for anything that Nikola produces. (Limited to netloc and paths, because that’s all that Nikola controls.) I’ll make a new branch, stick this in utils.py, and make sure anything that produces a URL/IRI calls this method before returning it to a template, feed, sitemap, etc. It can be called multiple times, so I’ll add it to all existing abs_link() and other general-purpose link methods.

Opinions? @Kwpolska? @masayuko?

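import urllib.parse
from collections import OrderedDict
from unicodedata import normalize as unicodenormalize

def encodelink(iri):
    # Normalize to NFC so equivalent Unicode sequences encode identically.
    iri = unicodenormalize('NFC', iri)
    # _asdict() keeps the field order; relying on __dict__ fails on current Python 3.
    link = OrderedDict(urllib.parse.urlparse(iri)._asdict())
    # Unquote first so an already-encoded path is not encoded twice.
    link['path'] = urllib.parse.quote(urllib.parse.unquote(link['path']))
    try:
        # Host is already ASCII (possibly punycode): round-trip it through IDNA.
        link['netloc'] = link['netloc'].encode('utf-8').decode('idna').encode('idna').decode('utf-8')
    except UnicodeDecodeError:
        # Host contains non-ASCII characters: encode it to punycode.
        link['netloc'] = link['netloc'].encode('idna').decode('utf-8')
    encoded_link = urllib.parse.urlunparse(link.values())
    return encoded_link.encode('utf-8')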

Some examples, all producing the expected result. Why a method fulfilling this purpose isn’t already provided by urllib.parse is beyond me.

>>> encodelink("/test")
b'/test'
>>> encodelink("/æøå")
b'/%C3%A6%C3%B8%C3%A5'
>>> encodelink("/test-æøå")
b'/test-%C3%A6%C3%B8%C3%A5'
>>> encodelink("/日本語.html?ninja")
b'/%E6%97%A5%E6%9C%AC%E8%AA%9E.html?ninja'
>>> encodelink("/test-%C3%A6%C3%B8%C3%A5")
b'/test-%C3%A6%C3%B8%C3%A5'
>>> encodelink("/æøå?test")
b'/%C3%A6%C3%B8%C3%A5?test'
>>> encodelink("http://æøå.no/a & b?1&2")
b'http://xn--5cab8c.no/a%20%26%20b?1&2'
>>> encodelink("http://æøå.no/æøå")
b'http://xn--5cab8c.no/%C3%A6%C3%B8%C3%A5'
>>> encodelink("http://elg.no/stein")
b'http://elg.no/stein'
>>> encodelink("http://xn--5cab8c.no/%C3%A6%C3%B8%C3%A5")
b'http://xn--5cab8c.no/%C3%A6%C3%B8%C3%A5'
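
The "can be called multiple times" property comes from unquoting before quoting. Below is a minimal, path-only sketch of that idea; `encode_path` is a hypothetical helper for illustration, not part of Nikola.

```python
from urllib.parse import quote, unquote


def encode_path(path):
    # Hypothetical helper: decode any existing %XX escapes, then encode again.
    return quote(unquote(path))


once = encode_path("/test-æøå")
twice = encode_path(once)
print(once, once == twice)

# Caveat: a path that literally contains "%41" is read as an escape for "A".
print(encode_path("/%41.html"))
```

The caveat is the trade-off of this design: a file whose name really contains a percent sequence cannot be told apart from an already-encoded path.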

msnoigrs · 2015-09-06T19:29:26Z

It's great and thank you for understanding. I agree.

Kwpolska reviewed Sep 3, 2015

msnoigrs force-pushed the for-upstream branch from ea1b36f to e5da7ae Compare September 3, 2015 18:20

msnoigrs force-pushed the for-upstream branch 2 times, most recently from 016ca61 to c63bdef Compare September 6, 2015 09:47

msnoigrs added 2 commits September 6, 2015 18:56

Ensure link path to be URL encoded

b1a16bd

URL encode and HTML escape

019896d

msnoigrs force-pushed the for-upstream branch from c63bdef to 019896d Compare September 6, 2015 09:57

da2x mentioned this pull request Sep 8, 2015

Use encodelink() everywhere #2037

Merged

msnoigrs closed this Sep 12, 2015

msnoigrs deleted the for-upstream branch September 12, 2015 12:36
