Create directory-specific sitemaps for directories with 200+ files #2030

da2x · 2015-09-06T14:20:03Z

Folders with 200+ HTML files get their own sitemaps, which are included in the main sitemapindex.xml. Their URLs are not included in the main sitemap.xml file as they already exist in the folder sitemaps. As all URLs for a folder reside within the same folder, they’re valid sitemaps as far as search engines are concerned. Works nicely for sites with many tags (TAG_PATH fills up quickly) and should work for sectioned sites as well. The number 200 was chosen arbitrarily in an attempt to keep the sitemaps somewhat small without having to create one for every folder with just a few files.
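
To make the split described above concrete, here is a minimal sketch in plain Python. It is not the code from this PR: `split_locations` and `THRESHOLD` are hypothetical names, only files directly inside a folder are counted, and keeping the site root in the main sitemap is an assumption.

```python
from collections import defaultdict
import posixpath

THRESHOLD = 200  # the arbitrary cut-off from the description


def split_locations(paths, threshold=THRESHOLD):
    """Split site-relative HTML paths into folder sitemaps and a main sitemap."""
    by_folder = defaultdict(list)
    for path in paths:
        # Group each file under the folder that directly contains it.
        by_folder[posixpath.dirname(path)].append(path)

    folder_sitemaps = {}
    main = []
    for folder, files in by_folder.items():
        # Assumption: the site root ("") always stays in the main sitemap.
        if folder and len(files) >= threshold:
            folder_sitemaps[folder] = sorted(files)
        else:
            main.extend(files)
    return folder_sitemaps, sorted(main)
```

Every path ends up in exactly one place: either in its folder's sitemap or in the main one, which is what keeps the main sitemap.xml from growing with every new tag page.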

The motivation here was trying to make sitemaps smaller so sitemap consumers don’t need to download the entire growing main sitemap.xml file every time I add one page. It eats a lot of bandwidth. Plus it mitigates #1683 somewhat.

New sitemapindex follows best practices and passes sitemap tests at Bing, Yandex, Google, and Baidu. Well, at least I think it passes at Baidu. Big green checkmark and Chinese text means it’s okay, right?

Mitigates issue #1683 by making the main index smaller. Also reduces bandwidth usage by making smaller sitemaps overall.

da2x · 2015-09-06T14:21:20Z

@ralsina @Kwpolska, line 301 of nikola/plugins/task/sitemap/__init__.py is probably a performance regression. Not sure how to generate the tasks before having scanned all of the directories?

ralsina · 2015-09-07T14:24:40Z

@Aeyoun you can make your task depend on scan_locs_task like write_sitemap does:

"calc_dep": ["_scan_locs:sitemap"],

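For readers unfamiliar with doit's calc_dep, a rough sketch of the pattern follows. The task names mirror the snippet above, but the action bodies, the file paths, and the `gen_tasks` generator are placeholders, not Nikola's actual code. The point is that the action of the task named in calc_dep returns a dict, and doit adds the returned values (for example file_dep) to the dependent task before running it.

```python
def scan_locs_action():
    """Stand-in for the real scan; the returned dict is merged by doit."""
    found = ["output/index.html", "output/posts/a.html"]  # hypothetical result
    return {"file_dep": found}


def write_sitemap_action():
    """Placeholder for the code that actually writes the sitemap."""
    return True


def gen_tasks():
    # Internal task whose return value feeds the task below.
    yield {
        "basename": "_scan_locs",
        "name": "sitemap",
        "actions": [scan_locs_action],
    }
    yield {
        "basename": "sitemap",
        "name": "output/sitemap.xml",
        "targets": ["output/sitemap.xml"],
        "actions": [write_sitemap_action],
        # doit runs _scan_locs:sitemap first and applies its result here.
        "calc_dep": ["_scan_locs:sitemap"],
    }
```

Note that calc_dep can add dependencies to an existing task, but it does not create new tasks, which is the limitation discussed in the rest of this thread.
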
ralsina · 2015-09-08T20:13:04Z

If I understand correctly, here is what we need to do.

1. Run a bunch of tasks that generate files.
2. Run scan_locs, which creates a list of all files in a folder, including those generated by those tasks.
3. From that list, find folders with > 200 files, and for each of those create a task that will generate a local sitemap.
4. Run those tasks, generating all the small local sitemaps.
5. Done.

I have looked at the code, and I don't know how to make this work correctly:

Just running scan_locs before generating the tasks is not good enough, because it runs before the other tasks run, so you will miss files.
Using scan_locs as a calc_dep is not good enough, because you won't know how many or what tasks to create.

So, I can see 2 ways to make this work:

1. A "task that generates more tasks": taskA has a calc_dep on scan_locs, which returns a list of sitemaps to generate; taskA then takes that list and yields more tasks, one for each sitemap.
2. Using doit's getargs in targets: one task uses the scan_locs data to decide which sitemaps we need, and then generates exactly those sitemaps.

@schettino72 any of those things possible?

felixfontein · 2015-09-08T20:55:29Z

With the earlytask_impl branch, this is a piece of cake. Just use a task with a high enough stage number; its tasks will only be created when all other tasks with lower stage numbers are already done.

schettino72 · 2015-09-09T00:36:28Z

The easiest solution is to handle this logic yourself inside the task.

On step 4 "Run those tasks, generate all the small local sitemaps".
Instead of trying to dynamically create tasks only for folders that have more than 200 files,
create tasks for all folders. In these tasks' action check the logic (200 files or more) to see
if a local sitemap should really be created or not.

Remember that inside your action you can manipulate the task metadata
by passing task to it (see http://pydoit.org/tasks.html#keywords-on-actions).
You might need to change targets at run time in the action
(and maybe more stuff to keep the folders with fewer than 200 files from running all the time).
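
A minimal sketch of that suggestion, under stated assumptions: doit passes the running task object to any action that declares a parameter named `task`, and the action appends to `task.targets` only when the folder is large enough. The function name, the `folder`/`urls` parameters, the output path layout, and the simplified XML are all hypothetical and not taken from this PR.

```python
import os
from xml.sax.saxutils import escape


def write_folder_sitemap(task, folder, urls, threshold=200):
    """doit-style action: write a sitemap for `folder` only if it is big enough.

    `task` is supplied by doit because of the parameter name; the other
    arguments would be passed through the action tuple.
    """
    if len(urls) < threshold:
        return True  # small folder: succeed without producing a target

    path = os.path.join(folder, "sitemap.xml")
    os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        fh.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            # Escape &, < and > so the output stays well-formed XML.
            fh.write("  <url><loc>%s</loc></url>\n" % escape(url))
        fh.write("</urlset>\n")

    task.targets.append(path)  # record the target at run time
    return True
```

In a task dict this could be wired up as `"actions": [(write_folder_sitemap, [], {"folder": folder, "urls": urls})]`, with one such task created per folder regardless of its size.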

And as @felixfontein said, early_task might be another way to solve this.

ralsina · 2015-09-09T00:58:00Z

Ahhh changing targets inside one task that generates all the sitemaps looks like the easy way out.

Yes, early tasks would fix this but that's a backwards-incompatible change for us so it has to wait for v8 :-(

da2x · 2015-09-09T09:47:14Z

Well, we actually don’t know what all the folders will be either.

Will investigate these new ideas. Thanks guys!

ralsina · 2015-09-09T10:42:25Z

@Aeyoun just have one task, calc_dep on scan_locs, decide there what folders need sitemaps, and have it set its own targets list accordingly.

da2x · 2015-09-09T11:07:48Z

Hm. Updating targets after running actions doesn’t really work. nikola.gen_tasks() as used by check.py is unaware of metadata that is updated from within a task’s action, so check.py will break. Other than that, updating the metadata as part of running the task seems to work, though it also introduced some build instability, as scan_locs() and the tasks that depend on it are unaware of any sitemap files by design.

felixfontein · 2015-09-09T17:40:38Z

Changing the metadata later probably also breaks nikola run <name of target> for the targets inserted later.

ralsina · 2015-11-02T16:09:56Z

So, @Aeyoun is this stalled? Impossible?

da2x · 2015-11-02T17:31:58Z

I can’t figure out how to do it, at least. The task is always created too early to know what other files will exist to accurately say which sitemaps will be generated when. I’ve tried some variations from the original idea but they all run into the same problem at some time or another.

da2x · 2015-11-02T17:33:53Z

This would have to generate its tasks and run only after everything else has already finished.

ralsina · 2016-08-04T13:41:27Z

Ok, so basically this never worked, and it's for a minor corner-feature (if that's a thing). I say we close it.

Create directory-specific sitemaps for directories with 200+ files.

f0b94d3

Mitigates issue #1683 by making the main index smaller. Also reduces bandwidth usage by making smaller sitemaps overall.

flake8

f425710

Prevent running scan_locs() more than once

32e93e2

ralsina closed this Aug 4, 2016

Kwpolska deleted the smaller-sitemaps branch August 4, 2016 17:58
