Header/id deduplication #2763

Kwpolska · 2017-05-14T18:02:11Z

Fixes #2570. Also contains a doc fix for #2743, because I didn’t notice which branch I was committing to.

Should I bother with <a name=""> as well?

cc @wichtounet

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

felixfontein · 2017-05-14T18:18:59Z

nikola/filters.py

+            # Results are ordered the same way they are ordered in document
+            offending_elements = doc.xpath('//*[@id="{}"]'.format(i))
+            counter = 2
+            for e in offending_elements[1::-1]:


Shouldn't this be offending_elements[::-1] without the first 1?

Notice counter = 2? There will be foo, foo-2, foo-3…

But this only enumerates elements 1 and 0 of the list, not any other element. ['a', 'b', 'c', 'd', 'e'][1::-1] == ['b', 'a'].

You probably want offending_elements[-2::-1].
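
The three slices discussed in this thread can be compared directly. This is a minimal sketch on a plain list standing in for `offending_elements`; the variable names are illustrative and not taken from the patch.

```python
# A plain list stands in for the XPath result (document order).
elements = ['a', 'b', 'c', 'd', 'e']

# Starts at index 1 and walks backwards: only the first two items.
first_two_reversed = elements[1::-1]

# Full reversal, including the last item.
all_reversed = elements[::-1]

# Starts at the second-to-last item and walks backwards:
# every item except the last one, bottom to top.
all_but_last_reversed = elements[-2::-1]

# With a single element there is nothing left to rename.
single = ['a'][-2::-1]

print(first_two_reversed)
print(all_reversed)
print(all_but_last_reversed)
print(single)
```

With `[-2::-1]`, the last element in document order keeps the plain ID and all earlier ones are visited from the bottom up.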

felixfontein · 2017-05-14T18:20:39Z

nikola/filters.py

+            offending_elements = doc.xpath('//*[@id="{}"]'.format(i))
+            counter = 2
+            for e in offending_elements[1::-1]:
+                new_id = '{0}-{1}'.format(i, counter)


What happens if this ID is in use as well?

But trying until you find a free one will be a problem if you want to make permalinks (as for the Sphinx permalinks filter). Assume the two oldest entries have headers with IDs a and a. If there's nothing else around, one will get a and the other a-2. But now assume a new post is added which uses a-2 (for whatever reason). Then the old a-2 will end up as a-3, because a-2 is already taken, and links pointing to a in the second oldest post are suddenly pointing to the wrong place.

This is one of those cases which we can’t reliably fix, at least as a filter. To make it work in every scenario, we’d need to add post names to the IDs, or otherwise do unusual stuff. The thing is, most people are unlikely to use those permalinks on indexes, and this filter’s aims are (a) to fix HTML validation issues on indexes, and (b) to fix clashing IDs on a single page, for Sphinx permalinks and other uses. You can’t protect against changing permalinks if those are maintained by code, not humans, whilst allowing said humans to edit the page contents (an edit to a post/page could trigger a deduplication).

True. You'd need state to track the permalinks. We should mention that somewhere in the documentation, though. Otherwise we'll sooner or later get bug reports for that :)
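
The permalink shift described in this thread can be reproduced with a small simulation. This is a sketch under assumptions: `deduplicate` is a hypothetical helper that works on a plain list of IDs in document order, keeps the first occurrence, and probes for the first free suffix. It is not the filter's code.

```python
def deduplicate(ids):
    """Rename repeated IDs, probing suffixes until a free one is found."""
    taken = set(ids)   # every ID present in the document, before and after renames
    seen = set()       # IDs already assigned to an element
    result = []
    for current in ids:
        if current not in seen:
            # First occurrence keeps its ID.
            seen.add(current)
            result.append(current)
            continue
        counter = 2
        candidate = f"{current}-{counter}"
        while candidate in taken:
            counter += 1
            candidate = f"{current}-{counter}"
        taken.add(candidate)
        seen.add(candidate)
        result.append(candidate)
    return result


# Two old posts whose headers both have the ID "a".
before = deduplicate(['a', 'a'])

# A new post containing a literal "a-2" is added above them.
after = deduplicate(['a-2', 'a', 'a'])

print(before)
print(after)
```

The second old header moves from `a-2` to `a-3` once the new post claims `a-2`, which is why stable permalinks would need stored state rather than a stateless filter.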

Kwpolska · 2017-05-15T07:45:52Z

I updated the documentation and used a different approach: going bottom→top in indexes, and top→bottom in posts/pages. This way, we can get a sensible ordering for both use cases.
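
One way to picture the two orderings is the sketch below. It assumes that an index keeps the plain ID on the last (oldest) element and renames the others bottom-up with `[-2::-1]`, while a post or page keeps the plain ID on the first element and renames the rest top-down with `[1:]`. The `[1:]` branch, the `is_index` flag, and both helper names are assumptions for illustration; collisions with already existing IDs are ignored here.

```python
def rename_order(positions, is_index):
    """Positions to rename, in processing order (hypothetical helper)."""
    if is_index:
        # Bottom to top, leaving the last element untouched.
        return positions[-2::-1]
    # Top to bottom, leaving the first element untouched.
    return positions[1:]


def assign_ids(base, count, is_index):
    """Return the resulting IDs, in document order, for `count` duplicates."""
    positions = list(range(count))
    new_ids = {p: base for p in positions}
    counter = 2
    for p in rename_order(positions, is_index):
        new_ids[p] = f"{base}-{counter}"
        counter += 1
    return [new_ids[p] for p in positions]


page = assign_ids('foo', 3, is_index=False)
index_old = assign_ids('foo', 3, is_index=True)
index_new = assign_ids('foo', 4, is_index=True)  # one newer post on top

print(page)
print(index_old)
print(index_new)
```

In this sketch, adding a newer post at the top of an index leaves the IDs of the older entries unchanged, and appending a section at the bottom of a page does the same for the sections above it.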

ralsina · 2017-05-17T15:38:38Z

@Kwpolska I resolved the conflicts; I hope I did not mess it up. Feel free to push and merge.

felixfontein

LGTM except... :)

felixfontein · 2017-05-17T17:16:21Z

nikola/filters.py

+                off = offending_elements[-2::-1]
+            for e in off:
+                new_id = i
+                while doc.xpath('//*[@id="{}"]'.format(new_id)):


Why not use (and later update) seen_ids instead of hoping that the query runs as fast as possible?

Makes sense, fixed.
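
The suggestion amounts to replacing a document-wide XPath query per candidate with a set lookup. A minimal sketch, assuming `seen_ids` is a set of every ID already present in the document; `free_id` is a hypothetical helper name.

```python
def free_id(base, seen_ids):
    """Return the first unused ID derived from `base` and record it."""
    new_id = base
    counter = 2
    # Set membership is a constant-time check on average,
    # unlike re-running an XPath query over the whole tree.
    while new_id in seen_ids:
        new_id = f"{base}-{counter}"
        counter += 1
    seen_ids.add(new_id)  # keep the set current for later lookups
    return new_id


seen_ids = {'foo', 'foo-2'}
first = free_id('foo', seen_ids)
second = free_id('foo', seen_ids)

print(first, second)
```

Updating the set after each rename matters: without it, two elements could be given the same new ID.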

felixfontein · 2017-05-17T17:18:06Z

nikola/filters.py

+                # Find headerlinks that we can fix.
+                headerlinks = e.find_class('headerlink')
+                for hl in headerlinks:
+                    # We might get headerlinks of child elements


If one of the header links belongs to a child element with the same ID, you change the link to something wrong.

How do you suggest to fix that? A break?

Kwpolska · 2017-05-20T11:55:01Z

@felixfontein, please explain, and perhaps suggest a solution to:

If one of the header links belongs to a child element with the same ID, you change the link to something wrong.

felixfontein · 2017-05-20T12:33:45Z

Maybe something like:

<div id="the-content">
  [...]
  <h1 id="the-content">The Content</h1><a class="headerlink" href="#the-content">¶</a>
</div>

(And there's another the-content ID somewhere so that both these end up in off.) If the code processes them from top to bottom, it will change it to the following:

<div id="the-content-2">
  [...]
  <h1 id="the-content-3">The Content</h1><a class="headerlink" href="#the-content-2">¶</a>
</div>

because the <div> is processed first.

To avoid this, you probably have to check if there's another the-content ID between the headerlink element and the element whose ID we're currently processing.

Kwpolska · 2017-05-20T12:49:45Z

It sounds as if doing a break when we find the first headerlink will work, because those are supposed to be the first headerlink in a given header. (Using headerlinks[0] would work in 99.9% of cases, I believe). @felixfontein, please review again.

felixfontein · 2017-05-20T12:52:09Z

In the case I sketched above it won't work, because the <div> has no headerlink (it is not a header), so the wrong headerlink will still be "fixed". The problem here is that the header in the post happens to end up with the same ID (because of its title) as something generated by the theme (the <div id="the-content">).

Kwpolska · 2017-05-20T12:55:25Z

Please propose a better solution then? I’m inclined to say “edge case not worth fixing”.

felixfontein · 2017-05-20T13:24:44Z

Is there an efficient way to find the first parent object of hl whose ID is id? If yes, you could check whether it is e.

Kwpolska · 2017-05-20T16:27:30Z

I’m afraid this might not be efficient enough, and that would be needed only for an edge case. This happening in real input is very unlikely. (reST won’t ever generate it, for example)

felixfontein · 2017-05-20T20:17:11Z

Just manually walking up the DOM tree shouldn't be very slow. DOM trees usually aren't deep enough for that to cause a real performance hit. And since this search only has to be done for very few elements, it shouldn't really matter.

And yes, while this is an edge case, it is a bug and really shitty for a user if this happens. And if we ever fix this, it might break some existing links where someone might have worked around this bug.

In my opinion, we should get this filter into a form that should never require any changes, because every change might break existing links, which are supposed to be permalinks. But then, I probably won't be using it (in particular the headerlink feature), so I won't really mind if you decide to ignore this edge case :)

Edit: I just noticed that this only screws up the header links, but not the IDs themselves. So this won't destroy permalinks later on, it will just generate incorrect header links. So I'm really fine with leaving this as-is. On the other hand, walking up the DOM tree isn't that hard or inefficient either.
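
The ancestor check suggested above can be sketched as follows. This uses the standard library's `xml.etree.ElementTree` with a parent map as a stand-in for the HTML tree used by the filter, and it assumes the headerlink sits inside the header element. `nearest_ancestor_with_id` is a hypothetical helper, not part of the patch.

```python
import xml.etree.ElementTree as ET

HTML = """
<div id="the-content">
  <p>intro</p>
  <h1 id="the-content">The Content<a class="headerlink" href="#the-content">¶</a></h1>
</div>
"""

root = ET.fromstring(HTML)

# ElementTree has no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def nearest_ancestor_with_id(node, wanted_id, parents):
    """Walk up from `node` and return the first ancestor whose id matches."""
    current = parents.get(node)
    while current is not None:
        if current.get("id") == wanted_id:
            return current
        current = parents.get(current)
    return None


h1 = root.find("h1")
link = h1.find("a")
owner = nearest_ancestor_with_id(link, "the-content", parent_of)

# Only rewrite the link's href when `owner` is the element being renamed.
print(owner.tag)
```

When the outer `<div>` is being renamed, the check `owner is div` fails for this link, so its `href` would be left for the `<h1>` pass instead.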

felixfontein · 2017-05-20T20:19:54Z

Ok, I just thought of one case where this can happen more realistically: someone names a subsection the same as a section. Some people tend to do that. And if you don't use a compiler which prevents colliding IDs, you have a problem.

Edit: ah, I forgot about the break, that'll of course prevent this.

Kwpolska · 2017-05-21T09:56:24Z

Thanks, merged. Will rebuild getnikola.com to use this in a second.

Kwpolska added 3 commits May 14, 2017 18:21

Fix #2570 -- new deduplicate_ids filter

0c1dd74

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Deduplicate headers bottom-up

15ffedb

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Fix braces in docs

81c337c

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Kwpolska added the enhancement label May 14, 2017

Kwpolska requested review from ralsina and felixfontein May 14, 2017 18:02

Kwpolska added this to the v7.8.6 milestone May 14, 2017

felixfontein requested changes May 14, 2017

Try new IDs until a free one is found

79a3914

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

ralsina approved these changes May 14, 2017

Kwpolska added 2 commits May 14, 2017 20:50

Testing slices with 3 elements isn’t always right

18c76b1

h/t @felixfontein Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Smarter ID rewrite ordering

91204c8

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Merge branch 'master' into header-deduplication

83354b7

felixfontein requested changes May 17, 2017

Use seen_ids instead of XPath to find existing IDs

7c68969

h/t @felixfontein Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Kwpolska added 2 commits May 20, 2017 14:46

Merge branch 'master' into header-deduplication

e3eaca0

Update at most 1 headerlink

0439676

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Undo glitch in add_header_permalinks

5237119

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

felixfontein approved these changes May 20, 2017

Kwpolska merged commit 26af25c into master May 21, 2017

Kwpolska deleted the header-deduplication branch May 21, 2017 09:56
