
Monitor Taskcluster PR and master runs #14210

Closed · foolip opened this issue Nov 23, 2018 · 18 comments

Comments

foolip (Member) commented Nov 23, 2018

Our Taskcluster setup is amazing, but not so reliable that we never have to worry about it, and also it's possible to break it with our own code changes. Issues so far:

The earliest ones were just part of getting the setup stable, but we will likely continue to have issues.

We should monitor at least:

For humans to notice and act, we need to either send email or have status show up in something like https://foolip.github.io/ecosystem-infra-rotation/ (in use) or https://foolip.github.io/wpt-rotation/ (failed experiment).

@lukebjerring given the wpt.fyi connection, do you have thoughts on how we should do this?

Also cc @web-platform-tests/admins for ideas.

(“Verify correctness of Taskcluster PR checks” is about verifying that the stability jobs on Taskcluster work as well as they did on Travis.)

foolip (Member, Author) commented Nov 23, 2018

In #13818 I've done some work to monitor Azure Pipelines in an ad-hoc manner. Turning that into something real and sharing it across CI systems might be one path.

Relevant technical detail: Taskcluster uses the Statuses API and Azure Pipelines the Checks API.
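
A minimal sketch of what querying both APIs for one commit looks like with @octokit/rest (hypothetical helper: `octokit` is assumed to be an authenticated client, and `owner`/`repo`/`ref` are placeholders):

'use strict';

// Sketch: fetch both kinds of CI results for a single commit.
async function getCiResults(octokit, owner, repo, ref) {
  // Statuses API: what Taskcluster reports ("Taskcluster (push)" etc.).
  const {data: statuses} =
      await octokit.repos.listStatusesForRef({owner, repo, ref});

  // Checks API: what Azure Pipelines reports.
  const {data: {check_runs: checkRuns}} =
      await octokit.checks.listForRef({owner, repo, ref});

  return {statuses, checkRuns};
}

Any shared monitoring would have to look at both of these for every commit it cares about.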

lukebjerring (Contributor) commented:

Eventually, wpt.fyi intends to respond to Taskcluster kicking off by becoming pending until it hears further. If we set up some kind of timeout, that will cover the last item in your monitoring list.

foolip (Member, Author) commented Nov 23, 2018

Sounds good. If we also have a custom status check that's triggered by the same things as any of our runs (including master), perhaps that could implicitly monitor the other CIs in a way that avoids building a separate monitoring dashboard for this? Kind of a stretch?
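
For concreteness, a rough sketch of what posting such a custom status could look like (hypothetical: "wpt-monitoring" is a made-up context name, and `octokit` is assumed to be an authenticated @octokit/rest client):

'use strict';

// Sketch: let whatever does the checking report its verdict as a commit
// status, so humans only have to watch a single context.
async function reportMonitoringStatus(octokit, sha, ok, detailsUrl) {
  await octokit.repos.createStatus({
    owner: 'web-platform-tests',
    repo: 'wpt',
    sha,
    state: ok ? 'success' : 'failure',
    context: 'wpt-monitoring',
    description: ok ? 'All CI runs look healthy' : 'Some CI runs are missing or failed',
    target_url: detailsUrl,
  });
}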

foolip (Member, Author) commented Nov 27, 2018

In #13274 (comment) I reported a failure (https://tools.taskcluster.net/groups/ZklAzb_fTueVrBAWR0kGBA) that should have come up in monitoring. Had that been the daily commit (and had we no longer been running Buildbot), we would have had no aligned runs for that day.

@lukebjerring, is that a case that would be caught by the kind of wpt.fyi monitoring you're considering?

Also explicitly poking @jgraham for ideas.

foolip (Member, Author) commented Dec 13, 2018

I've written a script that lists non-success statuses in the past week.

Node.js script:
'use strict';

const octokit = require('@octokit/rest')();

// https://github.com/octokit/rest.js/#pagination
function paginate(method, parameters) {
  const options = method.endpoint.merge(parameters);
  return octokit.paginate(options);
}

async function main() {
  octokit.authenticate({
    type: 'token',
    token: process.env.GH_TOKEN
  });

  const WEEK_MS = 7*24*3600*1000;
  const weekAgo = new Date(Date.now() - WEEK_MS).toISOString();

  const commits = await paginate(octokit.repos.listCommits, {
    owner: 'web-platform-tests',
    repo: 'wpt',
    since: weekAgo,
    per_page: 100
  });

  console.log(`Found ${commits.length} commits since ${weekAgo}`);

  for (const commit of commits) {
    // no checks on master yet
    /*
    const checks = await paginate(octokit.checks.listForRef, {
      owner: 'web-platform-tests',
      repo: 'wpt',
      ref: commit.sha,
    });
    */

    const statuses = await paginate(octokit.repos.listStatusesForRef, {
      owner: 'web-platform-tests',
      repo: 'wpt',
      ref: commit.sha,
    });

    // Statuses are in reverse chronological order, so only consider the first
    // (latest).
    const status = statuses.find(s => s.context === 'Taskcluster (push)');
    if (!status) {
      continue;
    }

    if (status.state !== 'success') {
      console.log(`${status.state}: ${status.target_url} (for ${commit.sha})`);
    }
  }
}

main();

The current output is:

failure: https://tools.taskcluster.net/task-group-inspector/#/ZNAaG-moRCG_2CPh3zNmGw (for 69d00d5f81ac1b29673878368055d24390819cc8)
failure: https://tools.taskcluster.net/task-group-inspector/#/cBwRwlUfQy-lpIr4lRMN_g (for e4a221e6c4aead59a34a234a9a3b1d7d161b98af)
failure: https://tools.taskcluster.net/task-group-inspector/#/UulPfqZwR7aaYmX4Bv8PHw (for ed68c260a90dae67b3a0781875e9e82473af95f8)
failure: https://tools.taskcluster.net/task-group-inspector/#/L-umDIXXQNWiWGAOpEt0mg (for 008bfc93426ab6973a843cc201f347b90ae3028f)

Three of them have a single failed task that might pass on rerun. https://tools.taskcluster.net/groups/cBwRwlUfQy-lpIr4lRMN_g failed all the firefox tasks, but also in a way that might pass on rerun.

Hexcles (Member) commented Dec 13, 2018

Here's one weird example of flaky tests reported by Taskcluster:

#14469
https://tools.taskcluster.net/groups/V7chQRdcQ6iqnYZI-Q0MLw/tasks/YtWQy41TT6SraDjsPmbFtg/runs/0/logs/public%2Flogs%2Flive.log#L261

I retried the test a couple of times by force-pushing to the branch, and I observed the same result: the first run of the second affected test always fails. I can't reproduce the flakiness locally.

jgraham (Contributor) commented Dec 13, 2018

Did you run it locally in the taskcluster docker image, using broadly the same command as in automation?

foolip (Member, Author) commented Dec 17, 2018

Filed https://github.com/octokit/rest.js/issues/1161 for an issue I've run into with the monitoring.

foolip (Member, Author) commented Dec 20, 2018

I've now found two types of errors during clone of the repo:
#14617
#14618

Monitoring should be able to distinguish infra failures from real failures in stability jobs.
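
One way to approximate that split (just a sketch: it assumes the taskcluster-client npm package and treats any run that didn't resolve as a plain 'failed' as an infra problem, which is a heuristic we'd have to validate):

'use strict';

const taskcluster = require('taskcluster-client');

// Sketch: split the non-successful tasks in a task group into "real" failures
// (the command exited non-zero) and likely infra failures (deadlines,
// disappearing workers, internal errors). First page of results only.
async function classifyTaskGroup(taskGroupId) {
  const queue = new taskcluster.Queue();
  const {tasks} = await queue.listTaskGroup(taskGroupId);

  const infra = [];
  const real = [];
  for (const {task, status} of tasks) {
    const lastRun = status.runs && status.runs[status.runs.length - 1];
    if (!lastRun || lastRun.state === 'completed') {
      continue;
    }
    if (lastRun.reasonResolved === 'failed') {
      real.push(task.metadata.name);
    } else {
      infra.push(task.metadata.name);
    }
  }
  return {infra, real};
}

Clone failures like the ones above would probably still resolve as plain 'failed', so this alone isn't enough; it's only meant to show where that information lives.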

foolip (Member, Author) commented Jan 25, 2019

In #15040 Taskcluster didn't start, which I haven't seen since #14165. That's another failure mode that monitoring should flag.

foolip (Member, Author) commented Mar 15, 2019

In #15820 Taskcluster again didn't start, so this is an error condition we really should monitor for.

foolip (Member, Author) commented Apr 25, 2019

I refreshed my script a bit today to look for PRs with no Taskcluster status:
'use strict';

const octokit = require('@octokit/rest')({
  auth: `token ${process.env.GH_TOKEN}`,
});

// https://github.com/octokit/rest.js/#pagination
function paginate(method, parameters) {
  const options = method.endpoint.merge(parameters);
  return octokit.paginate(options);
}

function getChecksForRef(ref) {
  return paginate(octokit.checks.listForRef, {
    owner: 'web-platform-tests',
    repo: 'wpt',
    ref,
    per_page: 100,
  });
}

async function getStatusesForRef(ref) {
  const statuses = await paginate(octokit.repos.listStatusesForRef, {
    owner: 'web-platform-tests',
    repo: 'wpt',
    ref,
    per_page: 100,
  });

  // Statuses are in reverse chronological order, so filter out all but the
  // first for each unique context string.
  const seenContexts = new Set;
  return statuses.filter(status => {
    const context = status.context;
    if (seenContexts.has(context)) {
      return false;
    }
    seenContexts.add(context);
    return true;
  });
}

function isRecentlyPendingCheck(check, maxAge = 6*3600) {
  if (check.conclusion === null && check.updated_at) {
    const updated = Date.parse(check.updated_at);
    const age = (Date.now() - updated) / 1000;
    if (age < maxAge) {
      return true;
    }
  }
  return false;
}

function isRecentlyPendingStatus(status, maxAge = 2*3600) {
  if (status.state === 'pending' && status.updated_at) {
    const updated = Date.parse(status.updated_at);
    const age = (Date.now() - updated) / 1000;
    if (age < maxAge) {
      return true;
    }
  }
  return false;
}

async function checkMaster(since) {
  const commits = await paginate(octokit.repos.listCommits, {
    owner: 'web-platform-tests',
    repo: 'wpt',
    since,
    per_page: 100
  });

  console.log(`Found ${commits.length} commits since ${since}`);

  for (const commit of commits) {
    const checks = await getChecksForRef(commit.sha);
    for (const check of checks) {
      if (check.conclusion === 'success' || check.conclusion === 'neutral') {
        continue;
      }

      if (isRecentlyPendingCheck(check)) {
        continue;
      }

      console.log(`${check.conclusion}: ${check.details_url} (for ${commit.sha})`);
    }

    const statuses = await getStatusesForRef(commit.sha);
    const status = statuses.find(s => s.context === 'Taskcluster (push)');
    if (!status) {
      continue;
    }

    if (isRecentlyPendingStatus(status)) {
      continue;
    }

    if (status.state !== 'success') {
      console.log(`${status.state}: ${status.target_url} (for ${commit.sha})`);
    }
  }
}

async function checkPRs(since) {
  const prs = await paginate(octokit.search.issuesAndPullRequests, {
    q: `repo:web-platform-tests/wpt is:pr is:open updated:>${since}`,
    per_page: 100,
  });

  console.log(`Found ${prs.length} PRs since ${since}`);

  for (const pr of prs) {
    const commits = await paginate(octokit.pulls.listCommits, {
      owner: 'web-platform-tests',
      repo: 'wpt',
      pull_number: pr.number,
      per_page: 100,
    });

    if (commits.length >= 100) {
      console.warn(`Ignoring PR #${pr.number} because it has too many (>=100) commits`);
      continue;
    }

    // only look at the final commit
    const commit = commits[commits.length - 1];

    const statuses = await getStatusesForRef(commit.sha);
    const status = statuses.find(s => s.context === 'Taskcluster (pull_request)');
    if (!status) {
      console.log(`No Taskcluster status for ${pr.html_url}`);
      continue;
    }

    if (isRecentlyPendingStatus(status)) {
      continue;
    }

    if (status.state !== 'success' && status.state !== 'failure') {
      console.log(`${status.state}: ${status.target_url} (for ${pr.html_url})`);
    }
  }
}

async function main() {
  const WEEK_MS = 7*24*3600*1000;
  const weekAgo = new Date(Date.now() - WEEK_MS).toISOString();
  // Get rid of milliseconds, GitHub doesn't support it.
  const since = weekAgo.replace(/\.[0-9]+Z/, 'Z');

  //await checkMaster(since);
  await checkPRs(since);
}

main();

Result is:

Found 109 PRs since 2019-04-18T12:55:01Z
No Taskcluster status for https://github.com/web-platform-tests/wpt/pull/12805
No Taskcluster status for https://github.com/web-platform-tests/wpt/pull/12803
No Taskcluster status for https://github.com/web-platform-tests/wpt/pull/10420

The first two are IDL sync PRs that don't matter, and I've commented on #10420 to explain what's wrong.

So that's good news: it doesn't look like Taskcluster is missing PRs at a high rate, at least. I haven't looked beyond a week back.

foolip (Member, Author) commented Apr 25, 2019

@Hexcles do you think any of this monitoring could be incorporated into your work on results receiver validation? There's a big difference between a CI system failing to start at all and the results being bad, but we're interested in the overall system working to deliver results, so it'd be great not to have to monitor the different parts separately. I don't know how practical that is, though.

foolip (Member, Author) commented Sep 23, 2019

@jgraham what do you think is the right layer for finding out which Taskcluster runs should have been triggered?

If the problem described in #19207 is very rare then we could look at all tagged commits and all PRs, but that wouldn't catch when Taskcluster is failing to even start. Would one have to monitor webhooks to work out what was requested?
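
To make the webhook idea concrete, a minimal sketch (hypothetical: a small Express app registered as a repo webhook that records which commits GitHub asked CI to run on, so something else can later check whether a Taskcluster status ever showed up for them; signature verification is omitted):

'use strict';

const express = require('express');

const app = express();
app.use(express.json());

// sha -> {event, receivedAt}; in practice this would need to be persisted.
const requested = new Map();

app.post('/webhook', (req, res) => {
  const event = req.get('X-GitHub-Event');
  let sha = null;
  if (event === 'push') {
    sha = req.body.after;
  } else if (event === 'pull_request') {
    sha = req.body.pull_request && req.body.pull_request.head.sha;
  }
  if (sha) {
    requested.set(sha, {event, receivedAt: Date.now()});
  }
  res.sendStatus(204);
});

app.listen(8080);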

@Hexcles this is similar to the question of how we'd record pending runs for wpt.fyi before they're finished, do you have thoughts?

foolip assigned stephenmcgruer and unassigned foolip on Oct 25, 2019
stephenmcgruer (Contributor) commented:

This falls under @LukeZielinski's monitoring/metrics efforts, but I'm not sure what the priority is for Taskcluster runs. Luke, feel free to downgrade to backlog.

stephenmcgruer (Contributor) commented:

We started to look at this in Q2, and did land some monitoring setup in https://github.com/ecosystem-Infra/wptmon that monitored Taskcluster runs. However:

  1. It never progressed past an experiment,
  2. it hasn't been updated for the switch to the Checks API, and
  3. we aren't scheduling any more time on this currently.

As such, dropping to backlog and removing Luke.

foolip (Member, Author) commented Apr 30, 2021

Back in 2019 I wrote some scripts in https://github.com/foolip/wpt-stats which I seemingly never mentioned here. They still exist and do some checking of both master and PR runs, but don't really summarize it into anything useful.

foolip (Member, Author) commented May 6, 2021

I'll go ahead and close this; it's not concrete enough to act on.
