Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use encodelink() everywhere #2037

Merged
merged 5 commits into from Sep 10, 2015
Merged

Use encodelink() everywhere #2037

merged 5 commits into from Sep 10, 2015

Conversation

da2x
Copy link
Contributor

@da2x da2x commented Sep 8, 2015

Replaces the pull request suggested in #2005.

utils.encodelink(str) takes a string and NFC normalizes it, encodes the netloc to IDN punycode, and encodes the path to percent-encoding (RFC-3986 / RFC-3987). Schema, query, and fragment components of the address are also NFC normalized but are otherwise untouched. The function can be called multiple times non-destructively.

@masayuko, would you please test this? Set USE_SLUGIFY = False to support tags and categories. No other option required. Please also test nikola check -l (Nikola’s link checker.)

@da2x
Copy link
Contributor Author

da2x commented Sep 8, 2015

OK, after 3 hours I’ve concluded that py27 support for check is either a) unfixable or b) beyond my ability to fix. Too many Python internals are hardcoded to be ascii only. Anything that traverses the filesystem is a nightmare. I can cobble together something that runs on py27 excluding py3 and everything that isn’t Linux. py3 works fine as-is now.

Would it be okay ti throw an error and die if a user tries using check with non-ascii characters in anything older than py32? "It’s cool that you want to use UTF-8 paths, but you can’t have nice things in py2. Please upgrade to Python 3 to check your files and links."

@da2x
Copy link
Contributor Author

da2x commented Sep 9, 2015

Everything is awesome. Should work*. Does it work for you, @masayuko?

* OSX may still explode because of it’s unique NFD preference, seems to work fine, but really it shouldn’t have.

@Kwpolska
Copy link
Member

Kwpolska commented Sep 9, 2015

We wouldn’t mind not supporting Python 2 here.

@da2x
Copy link
Contributor Author

da2x commented Sep 9, 2015

This is ready to merge with py27 and py3 support. Someone want to sanity check it? @Kwpolska? @masayuko?

@msnoigrs
Copy link
Contributor

Sorry for late reply. I am checking it now. I'll report back later.
It seems that NFKC is better than NFD.
(refer to nameprep rfc3491 stringprep rfc3454)

@msnoigrs
Copy link
Contributor

OK. I have just tested in several ways. It works for me.
As mentioned above, it should use NFKC to normalize in util.encodelink().
I have encountered a issue that fail to create sitemap. When the words of author, category, tag, etc end with period(.), It will be occured due to follow line:

path = (path.replace(os.sep, '/') + '/').replace('./', '')

in nikola/plugins/task/sitemap/init.py.
It was fixed in my PR code #2005.

@da2x
Copy link
Contributor Author

da2x commented Sep 10, 2015

NFKC is destructive where as NFC is not. NFKC offers no advantages here as far as I know. Emailing discussions on W3C’s list recommended NFC over NFKC, so I’m sticking with that. (I couldn’t be bothered to dig up the link again.) NFD and NFKD is out of the question because they can’t be used reliably to produce the same output. Besides, everyone but Apple is using NFC everywhere for everything.

Regarding the issue with periods in the end of paths, could you make a separate issue about that?

da2x added a commit that referenced this pull request Sep 10, 2015
@da2x da2x merged commit ef6af2b into master Sep 10, 2015
da2x added a commit that referenced this pull request Sep 10, 2015
@msnoigrs
Copy link
Contributor

The netloc should follow nameprep RFC3491, RFC3454. so prefer NFKC.
Examples of japanese idn include converting half-width kana between full-width kana.
In actual operation, half-width kana cannot be used as idn because half-width kana convert to full-width kana by NFKC.
Kana is the major character in Japanese and It has several expressions in Unicode.

ハ (U+FF8A) half-width
ハ (U+30CF) full-width
ハ (U+FF8A) -> NFC -> ハ (U+FF8A) <-- disallow in idn
ハ (U+FF8A) -> NFKC -> ハ (U+30CF)

パ (U+FF8A U+FF9F) half-width and two code points
パ (U+30D1) full-width
パ (U+FF8A U+FF9F) -> NFC -> パ (U+FF8A U+FF9F) <-- disallow in idn
パ (U+FF8A U+FF9F) -> NFKC -> パ (U+30D1)

On the other hand, the path get used as the location name on a filesystem. so prefer NFC.

@msnoigrs
Copy link
Contributor

I have just checked myself.

print('パ'.encode('idna'))
b'xn--odk'
print('パ'.encode('idna'))
b'xn--odk'

NFC is no problem. Please forget the above said.

@Kwpolska Kwpolska deleted the encodelinks branch December 24, 2015 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants