Enhancements for generating sitemaps #2090

Open

msnoigrs wants to merge 2 commits into master

Conversation

msnoigrs
Contributor

This PR contains three enhancements for generating sitemaps.

  1. Avoid creating the same robotparser over and over again while generating sitemaps. This leads to a performance improvement (see the sketch after this list).
  2. ROBOTS_EXCLUSIONS can contain non-ASCII strings without crashing on Python 2.
  3. If a robots.txt file already exists, read that file and prefer it to ROBOTS_EXCLUSIONS. This is the same behavior as the robots task.
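A minimal sketch of points 1 and 3 (not the actual plugin code): build one robotparser up front and reuse it for every sitemap entry, preferring an existing robots.txt over ROBOTS_EXCLUSIONS. The file path, example values and helper name are assumptions for illustration.

    # Minimal sketch, not the Nikola plugin code; paths and values are made up.
    try:
        from urllib import robotparser  # Python 3
    except ImportError:
        import robotparser  # Python 2

    import io
    import os

    def build_robot_parser(robots_path, exclusions):
        """Build a single RobotFileParser to reuse for all sitemap entries."""
        rp = robotparser.RobotFileParser()
        if os.path.isfile(robots_path):
            # An existing robots.txt wins over ROBOTS_EXCLUSIONS.
            with io.open(robots_path, 'r', encoding='utf-8') as fh:
                rp.parse(fh.read().splitlines())
        elif exclusions:
            lines = ['User-Agent: *']
            lines += ['Disallow: {0}'.format(rule) for rule in exclusions]
            rp.parse(lines)
        return rp

    robot = build_robot_parser('files/robots.txt', ['/tags/', '/archive.html'])
    for path in ['/tags/', '/posts/hello.html']:
        if robot.can_fetch('*', path):
            print('include in sitemap: ' + path)

Building the parser once and reusing it for every URL is what avoids the repeated robotparser construction mentioned in point 1.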

@da2x
Contributor

da2x commented Sep 16, 2015

This needs to output a warning when using a copied robots.txt resource and 1) it doesn't contain "Sitemap: /sitemapindex.xml" or 2) it doesn't contain every entry in ROBOTS_EXCLUSIONS.
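A sketch of the suggested check (not part of this PR); the logger setup and message wording are assumptions:

    # Hypothetical check: warn if a copied robots.txt misses the sitemap index
    # line or any ROBOTS_EXCLUSIONS rule. Logger and messages are assumptions.
    import io
    import logging

    LOGGER = logging.getLogger('robots')

    def check_copied_robots(robots_path, exclusions):
        with io.open(robots_path, 'r', encoding='utf-8') as fh:
            content = fh.read()
        if 'Sitemap:' not in content:
            LOGGER.warning('%s has no "Sitemap:" line.', robots_path)
        for rule in exclusions:
            if 'Disallow: {0}'.format(rule) not in content:
                LOGGER.warning('%s does not disallow %s (from ROBOTS_EXCLUSIONS).',
                               robots_path, rule)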

@msnoigrs
Contributor Author

A warning is already output in nikola/plugins/task/robots.py.

  1. It seems that generating sitemaps has nothing to do with "Sitemap:" lines. It's the same as the current code.
  2. No, it surely contains every entry:
    rules = []
    for rule in exclusions:
        rules.append('Disallow: {0}'.format(rule))

@msnoigrs
Contributor Author

It works for me on both Python 2 and Python 3.

If the robots.txt file already exists, read it and ignore ROBOTS_EXCLUSIONS.
@msnoigrs
Contributor Author

I wrote some simple test code.
Once a robotparser has been created, it can parse only once; a second parse call is ignored. The parse method can accept multiple lines at once, as follows.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.robotparser

def main():
    r = urllib.robotparser.RobotFileParser()
    r.parse(["User-Agent: *", "Disallow: /a.html", "Disallow: /b.html"])
    print(r.can_fetch("*", '/a.html')) # False
    print(r.can_fetch("*", '/b.html')) # False
    print(r.can_fetch("*", '/c.html')) # True

    r = urllib.robotparser.RobotFileParser()
    r.parse(["User-Agent: *", "Disallow: /a.html"])
    print(r.can_fetch("*", '/a.html')) # False
    print(r.can_fetch("*", '/b.html')) # True
    print(r.can_fetch("*", '/c.html')) # True
    r.parse(["User-Agent: *", "Disallow: /b.html"]) # Ignored
    print(r.can_fetch("*", '/a.html')) # False
    print(r.can_fetch("*", '/b.html')) # True
    print(r.can_fetch("*", '/c.html')) # True

if __name__ == '__main__':
    main()

@ralsina
Member

ralsina commented May 14, 2017

I am not familiar with this code, but this PR looks OK to me and it has sat here for way too long... any opinions @Kwpolska ?
