Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create directory-specific sitemaps for directories with 200+ files #2030

Closed
wants to merge 3 commits into from

Conversation

da2x
Copy link
Contributor

@da2x da2x commented Sep 6, 2015

Folders with 200+ HTML files get their own sitemaps that are included in the main sitemapindex.xml. These are not included in the main sitemap.xml file as they already exists in the folder-sitemaps. As all URLs for a folder reside within the same folder, they’re valid sitemaps in terms of search engines. Works nicely for sites with many tags (TAG_PATH fills up quickly) and should work for sectioned sites as well. The number 200 was entirely arbitrarily chosen in an attempt to keep the sitemaps somewhat small without having to create one for every folder with just a few files.

The motivation here was trying to make sitemaps smaller so sitemap consumers don’t need to download the entire growing main sitemap.xml file every time I add one page. It eats a lot of bandwidth. Plus it mitigates #1683 somewhat.

New sitemapindex follows best practices and passes sitemap tests at Bing, Yandex, Google, and Baidu. Well, at least I think it passes at Baidu. Big green checkmark and Chinese text means it’s okay, right?

Mitigates issue #1683 by making the main index smaller. Also
reduces bandwidth usage by making smaller sitemaps overall.
@da2x
Copy link
Contributor Author

da2x commented Sep 6, 2015

@ralsina @Kwpolska, line 301 of nikola/plugins/task/sitemap/init.py is probably a performance regression. Not sure how to generate the tasks before having scanned all of the directories?

@ralsina
Copy link
Member

ralsina commented Sep 7, 2015

@Aeyoun you can make your task depend scan_locs_task like write_sitemap does:

"calc_dep": ["_scan_locs:sitemap"],

@ralsina
Copy link
Member

ralsina commented Sep 8, 2015

If I understand correctly, here is what we need to do.

  1. Run a bunch of tasks that generate files
  2. Run scan_locs that creates a list of all files in a folder, including those generated by those tasks
  3. From that list, find folders with > 200 files, and for each of those create a task that will generate
    a local sitemap
  4. Run those tasks, generate all the small local sitemaps
  5. Done

I have looked at the code, and I don't know how to make this work correctly:

  1. Just running scan_locs before generating the tasks is not good enough, because it runs before the other tasks run, so you will miss files.

  2. Using scan_locs as a calc_dep is not good enough, because you won't know how many or what tasks to create.

So, I can see 2 ways to make this work:

  1. A "task that generates more tasks", so that we could make taskA have a calc_dep on scan_locs, which returns a list of sitemaps to generate, then taskA takes that list and yields more tasks, one for each sitemap

  2. Using doit's getargs in targets, and have one task that uses scan_locs data to decide what sitemaps we need, and then generates exactly those sitemaps.

@schettino72 any of those things possible?

@felixfontein
Copy link
Contributor

With the earlytask_impl branch, this is a piece of cake. Just use a task with a high enough stage number, its tasks will only be created when all other tasks with lower stage numbers are already done.

@schettino72
Copy link
Member

The easiest solution is to handle this logic yourself inside the task.

On step 4 "Run those tasks, generate all the small local sitemaps".
Instead of trying to dynamically create tasks only for folders that has more than 200 files,
create tasks for all folders. In these tasks' action check the logic (200 files or more) to see
if a local sitemap should really be created or not.

Remember that inside your action you can manipulate the task metadata.
http://pydoit.org/tasks.html#keywords-on-actions passing task.
You might need to change targets at run time in the action
(and maybe more stuff to keep the folders with less than 200 files from running all the time).

And as @felixfontein said, early_task might be another way to solve this.

@ralsina
Copy link
Member

ralsina commented Sep 9, 2015

Ahhh changing targets inside one task that generates all the sitemaps looks like the easy way out.

Yes, early tasks would fix this but that's a backwards-incompatible change for us so it has to wait for v8 :-(

@da2x
Copy link
Contributor Author

da2x commented Sep 9, 2015

Well, we actually don’t know what all the folders will be either.

Will investigate these new ideas. Thanks guys!

@ralsina
Copy link
Member

ralsina commented Sep 9, 2015

@Aeyoun just have one task, calc_dep on scan_locs, decide there what folders need sitemaps, and have it set its own targets list accordingly.

@da2x
Copy link
Contributor Author

da2x commented Sep 9, 2015

Hm. updating targets after running actions doesn’t really work. nikola.gen_tasks() as used by check.py is unaware of metadata that is updated from within a task’s action. So check.py will break. Other than that, it seem to work to update the metadata as part of running the task. Though it also introduced some build instability as scan_locs() and tasks that depend on it is unaware of any sitemap files by design.

@felixfontein
Copy link
Contributor

Changing the metadata later probably also break nikola run <name of target> for the later inserted targets.

@ralsina
Copy link
Member

ralsina commented Nov 2, 2015

So, @Aeyoun is this stalled? Impossible?

@da2x
Copy link
Contributor Author

da2x commented Nov 2, 2015

I can’t figure out how to do it, at least. The task is always created too early to know what other files will exist to accurately say which sitemaps will be generated when. I’ve tried some variations from the original idea but they all run into the same problem at some time or another.

@da2x
Copy link
Contributor Author

da2x commented Nov 2, 2015

This would have to generate it’s tasks and run only after everything else have already finished.

@ralsina
Copy link
Member

ralsina commented Aug 4, 2016

Ok, so basically this never worked, and it's for a minor corner-feature (if that's a thing). I say we close it.

@ralsina ralsina closed this Aug 4, 2016
@Kwpolska Kwpolska deleted the smaller-sitemaps branch August 4, 2016 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants