Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Header/id deduplication #2763

Merged
merged 11 commits into from May 21, 2017
Merged

Header/id deduplication #2763

merged 11 commits into from May 21, 2017

Conversation

Kwpolska
Copy link
Member

Fixes #2570. Also contains a doc fix for #2743, because I didn’t notice which branch I was committing to.

Should I bother with <a name=""> as well?

cc @wichtounet

Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@Kwpolska Kwpolska added this to the v7.8.6 milestone May 14, 2017
# Results are ordered the same way they are ordered in document
offending_elements = doc.xpath('//*[@id="{}"]'.format(i))
counter = 2
for e in offending_elements[1::-1]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be offending_elements[::-1] without the first 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice counter = 2? There will be foo, foo-2, foo-3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this only enumerates elements 1 and 0 of the list, not any other element. ['a', 'b', 'c', 'd', 'e'][1::-1] == ['b', 'a'].

You probably want offending_elements[-2::-1].

offending_elements = doc.xpath('//*[@id="{}"]'.format(i))
counter = 2
for e in offending_elements[1::-1]:
new_id = '{0}-{1}'.format(i, counter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this is ID is in use as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But trying until you find a free one will be a problem if you want to make permalinks (as for the Sphinx permalinks filter). Assume the two oldest entries have headers with IDs a and a. If there's nothing else around, one will get a and the other a-2. But now assume a new post is added which uses a-2 (for whatever reason). Then the old a-2 will end up as a-3, because a-2 is already taken, and links pointing to a in the second oldest post are suddenly pointing to the wrong place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of those cases which we can’t reliably fix, at least as a filter. To make it work in every scenario, we’d need to add post names to the IDs, or otherwise do unusual stuff. The thing is, most people are unlikely to use those permalinks on indexes, and this plugin’s aims are (a) to fix HTML validation issues on indexes, (b) to fix IDs for Sphinx permalinks and other uses clashing, on a single page. You can’t protect against changing permalinks if those are maintained by code, not humans, whilst allowing said humans to edit the page contents (an edit to a post/page could trigger a deduplication)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. You'd need state to track the permalinks. We should mention that somewhere in the documenation, though. Otherwise we'll sooner or later get bug reports for that :)

Signed-off-by: Chris Warrick <kwpolska@gmail.com>
h/t @felixfontein

Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@Kwpolska
Copy link
Member Author

I updated the documentation and used a different approach: going bottom→top in indexes, and top→bottom in posts/pages. This way, we can get a sensible ordering for both use cases.

@ralsina
Copy link
Member

ralsina commented May 17, 2017

@Kwpolska I resolved the conflicts, hope I did not mess it, feel free to push merge

Copy link
Contributor

@felixfontein felixfontein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except... :)

off = offending_elements[-2::-1]
for e in off:
new_id = i
while doc.xpath('//*[@id="{}"]'.format(new_id)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use (and later update) seen_ids instead of hoping that the query runs as fast as possible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, fixed.

# Find headerlinks that we can fix.
headerlinks = e.find_class('headerlink')
for hl in headerlinks:
# We might get headerlinks of child elements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If one of the header links belongs to a child element with the same ID, you change the link to something wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you suggest to fix that? A break?

h/t @felixfontein

Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@Kwpolska
Copy link
Member Author

@felixfontein, please explain, and perhaps suggest a solution to:

If one of the header links belongs to a child element with the same ID, you change the link to something wrong.

@felixfontein
Copy link
Contributor

Maybe something like:

<div id="the-content">
  [...]
  <h1 id="the-content">The Content</h1><a class="headerlink" href="#the-content">¶</a>
</div>

(And there's another the-content ID somewhere so that both these end up in off.) If the code processes them from top to bottom, it will change it to the following:

<div id="the-content-2">
  [...]
  <h1 id="the-content-3">The Content</h1><a class="headerlink" href="#the-content-2">¶</a>
</div>

because the <div> is processed first.

To avoid this, you probably have to check if there's another the-content ID between the headerlink element and the element whose ID we're currently processing.

@Kwpolska
Copy link
Member Author

It sounds as if doing a break when we find the first headerlink will work, because those are supposed to be the first headerlink in a given header. (Using headerlinks[0] would work in 99.9% of cases, I believe). @felixfontein, please review again.

@felixfontein
Copy link
Contributor

In the case I sketched above it won't work, because the <div> has no headerlink (because it is no header), so still the wrong headerlink will be "fixed". The problem here is that the header in the post happens end up with the same ID (because of it's title) than something generated by the theme (the <div id="the-content">).

@Kwpolska
Copy link
Member Author

Please propose a better solution then? I’m inclined to say “edge case not worth fixing”.

@felixfontein
Copy link
Contributor

felixfontein commented May 20, 2017

Is there an efficient way to find the first parent object of hl whose ID is id? If yes, you could check whether it is e.

Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@Kwpolska
Copy link
Member Author

I’m afraid this might not be efficient enough, and that would be needed only for an edge case. This happening in real input is very unlikely. (reST won’t ever generate it, for example)

@felixfontein
Copy link
Contributor

felixfontein commented May 20, 2017

Just manually walking up the DOM tree shouldn't be very slow. Usually DOM trees aren't excessively deep that you really get a performance hit. And since this search has only to be done for very few elements, it shouldn't really matter.

And yes, while this is an edge case, it is a bug and really shitty for a user if this happens. And if we ever fix this, it might break some existing links where someone might have worked around this bug.

In my opinion, we should get this filter into a form that should never require any changes, because every change might break existing links, which are supposed to be permalinks. But then, I probably won't be using it (in particular the headerlink feature), so I won't really mind if you decide to ignore this edge case :)

Edit: I just noticed that this only screws up the header links, but not the IDs themselves. So this won't destroy permalinks later on, it will just generate incorrect header links. So I'm really fine with leaving this as-is. On the other hand, walking up the DOM tree isn't that hard and inefficient either.

@felixfontein
Copy link
Contributor

felixfontein commented May 20, 2017

Ok, just thought of one reason where this can happen more realistically in real life: if someone decides to name a subsection the same as a section. Some people tend to do that. And if you don't use a compiler which prevents colliding IDs, you have a problem.

Edit: ah, I forgot about the break, that'll of course prevent this.

@Kwpolska Kwpolska merged commit 26af25c into master May 21, 2017
@Kwpolska Kwpolska deleted the header-deduplication branch May 21, 2017 09:56
@Kwpolska
Copy link
Member Author

Thanks, merged. Will rebuild getnikola.com to use this in a second.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How to avoid duplicated ids and h1 on the blog index ?
3 participants