Enhancements for generating sitemaps #2090

Open

msnoigrs wants to merge 2 commits into master

Conversation

msnoigrs
Contributor

This PR contains three enhancements for generating sitemaps.

  1. Avoid creating the same robotparser over and over again while generating sitemaps. This leads to a performance improvement (see the sketch after this list).
  2. ROBOTS_EXCLUSIONS can contain non-ASCII strings without crashing on Python 2.
  3. If a robots.txt file already exists, read that file and prefer it to ROBOTS_EXCLUSIONS. This is the same behavior as the robots task.
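A minimal sketch of points 1 and 3 (not the actual plugin code): build one robotparser up front and reuse it for every sitemap entry, preferring an existing robots.txt over ROBOTS_EXCLUSIONS. The file path, example values and helper name are assumptions for illustration.

    # Minimal sketch, not the Nikola plugin code; paths and values are made up.
    try:
        from urllib import robotparser  # Python 3
    except ImportError:
        import robotparser  # Python 2

    import io
    import os

    def build_robot_parser(robots_path, exclusions):
        """Build a single RobotFileParser to reuse for all sitemap entries."""
        rp = robotparser.RobotFileParser()
        if os.path.isfile(robots_path):
            # An existing robots.txt wins over ROBOTS_EXCLUSIONS.
            with io.open(robots_path, 'r', encoding='utf-8') as fh:
                rp.parse(fh.read().splitlines())
        elif exclusions:
            lines = ['User-Agent: *']
            lines += ['Disallow: {0}'.format(rule) for rule in exclusions]
            rp.parse(lines)
        return rp

    robot = build_robot_parser('files/robots.txt', ['/tags/', '/archive.html'])
    for path in ['/tags/', '/posts/hello.html']:
        if robot.can_fetch('*', path):
            print('include in sitemap: ' + path)

Building the parser once and reusing it for every URL is what avoids the repeated robotparser construction mentioned in point 1.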

@da2x
Contributor

da2x commented Sep 16, 2015

This needs to output a warning when using a copied robots.txt resource and 1) it doesn't contain "Sitemap: /sitemapindex.xml" or 2) it doesn't contain every entry in ROBOTS_EXCLUSIONS.
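A sketch of the suggested check (not part of this PR); the logger setup and message wording are assumptions:

    # Hypothetical check: warn if a copied robots.txt misses the sitemap index
    # line or any ROBOTS_EXCLUSIONS rule. Logger and messages are assumptions.
    import io
    import logging

    LOGGER = logging.getLogger('robots')

    def check_copied_robots(robots_path, exclusions):
        with io.open(robots_path, 'r', encoding='utf-8') as fh:
            content = fh.read()
        if 'Sitemap:' not in content:
            LOGGER.warning('%s has no "Sitemap:" line.', robots_path)
        for rule in exclusions:
            if 'Disallow: {0}'.format(rule) not in content:
                LOGGER.warning('%s does not disallow %s (from ROBOTS_EXCLUSIONS).',
                               robots_path, rule)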

@msnoigrs
Contributor Author

A warning is already output in nikola/plugins/task/robots.py.

  1. It seems that generating sitemaps has nothing to do with "Sitemap:" lines. It's the same as the current code.
  2. No, it surely contains every entry:
    rules = []
    for rule in exclusions:
        rules.append('Disallow: {0}'.format(rule))

@msnoigrs
Contributor Author

It works for me on both Python 2 and Python 3.

If the robots.txt file already exists, read it and ignore ROBOTS_EXCLUSIONS.
@msnoigrs
Contributor Author

I wrote some simple test code.
Once a robotparser has been created, it can parse only once; a second parse call is ignored. The parse method can accept multiple lines at once, as follows.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.robotparser

def main():
    r = urllib.robotparser.RobotFileParser()
    r.parse(["User-Agent: *", "Disallow: /a.html", "Disallow: /b.html"])
    print(r.can_fetch("*", '/a.html')) # False
    print(r.can_fetch("*", '/b.html')) # False
    print(r.can_fetch("*", '/c.html')) # True

    r = urllib.robotparser.RobotFileParser()
    r.parse(["User-Agent: *", "Disallow: /a.html"])
    print(r.can_fetch("*", '/a.html')) # False
    print(r.can_fetch("*", '/b.html')) # True
    print(r.can_fetch("*", '/c.html')) # True
    r.parse(["User-Agent: *", "Disallow: /b.html"]) # Ignored
    print(r.can_fetch("*", '/a.html')) # False
    print(r.can_fetch("*", '/b.html')) # True
    print(r.can_fetch("*", '/c.html')) # True

if __name__ == '__main__':
    main()

@ralsina
Member

ralsina commented May 14, 2017

I am not familiar with this code, but this PR looks OK to me and it has sat here for way too long... any opinions @Kwpolska ?
