
Commit cb316c3
Don’t recheck the same remote address more than once
Fixes #1732
da2x committed May 19, 2015
1 parent 7553b5a commit cb316c3
Showing 2 changed files with 8 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGES.txt
@@ -10,6 +10,7 @@ Features
Bugfixes
--------

* Don’t check the same remote URL more than once (Issue #1732)
* All remotely checked links resulted in incorrect warnings (`nikola check -lr`)
* Exclude `<meta content="noindex" name="robots">` from sitemaps
* new_post paths are now relative to CWD (Issue #1325)
8 changes: 7 additions & 1 deletion nikola/plugins/command/check.py
@@ -173,6 +173,7 @@ def _execute(self, options, args):
        sys.exit(1)

    existing_targets = set([])
    checked_remote_targets = []

felixfontein (Contributor) commented on May 20, 2015:

Why use a list and not a set? If there's a huge amount of distinct links, this is unnecessarily slow.

da2x replied; this comment has been minimized.

felixfontein (Contributor) commented on May 20, 2015:

Ah, thanks :)
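
For reference, the point of the review: `target in checked_remote_targets` on a list scans every entry (O(n) per lookup), while a set answers membership tests in O(1) on average. A minimal standalone sketch of the set-based variant (only the name checked_remote_targets comes from the diff; the helper function is illustrative):

checked_remote_targets = set()  # set() instead of []

def already_checked(target):
    """Record target; return True if it was seen before."""
    if target in checked_remote_targets:  # O(1) average on a set
        return True
    checked_remote_targets.add(target)  # set.add() replaces list.append()
    return False

print(already_checked("https://example.com/"))  # False: first sighting
print(already_checked("https://example.com/"))  # True: deduplicated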


    def analyze(self, task, find_sources=False, check_remote=False):
        rv = False
@@ -215,11 +216,16 @@ def analyze(self, task, find_sources=False, check_remote=False):
            ((parsed.scheme or target.startswith('//')) and url_type in ('rel_path', 'full_path')):
                if not check_remote or parsed.scheme not in ["http", "https"]:
                    continue
-               if parsed.netloc == base_url.netloc:
+               if parsed.netloc == base_url.netloc:  # absolute URL to self.site
                    continue
                if target in self.checked_remote_targets:  # already checked this exact target
                    continue
                self.checked_remote_targets.append(target)

                # Check the remote link works
                req_headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0 (Nikola)'}  # I’m a real boy!

Kwpolska (Member) commented on May 19, 2015:

Why not make this user-agent the most nonsense user-agent string in the known universe by including something from every browser?

da2x (Author, Contributor) commented on May 19, 2015:

Future Firefox version number isn’t nonsense enough for you? It’s included because a bunch of sites I’ve linked to returned nonsense errors or dropped the connection entirely when they encountered requests’ default user-agent. I opted for something honest (it names the Nikola component) but browser-looking rather than robot-like. We’re only scraping status codes (no actual data), so this slightly dishonest behavior is okay. We don’t try to do anything bad.

                resp = requests.head(target, headers=req_headers)

                if resp.status_code > 399:  # Error
                    self.logger.warn("Broken link in {0}: {1} [Error {2}]".format(filename, target, resp.status_code))
                    continue
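
Taken together, the check is a deduplicated HEAD request with a browser-like User-Agent, and only the status code is ever inspected. A standalone sketch of that flow (the function name and example URL are illustrative; `requests` is the library the plugin already uses):

import requests

# Browser-looking but honest UA, as in the diff; some servers reject or
# drop connections for requests' default 'python-requests/x.y' agent.
REQ_HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) '
                             'Gecko/20100101 Firefox/45.0 (Nikola)'}
checked_remote_targets = set()

def check_remote(target):
    """HEAD-probe target once; report 4xx/5xx status codes as broken."""
    if target in checked_remote_targets:
        return
    checked_remote_targets.add(target)
    resp = requests.head(target, headers=REQ_HEADERS)
    if resp.status_code > 399:  # same threshold as the plugin
        print("Broken link: {0} [Error {1}]".format(target, resp.status_code))

check_remote("https://example.com/")  # illustrative URL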

0 comments on commit cb316c3