Chrome crashes on Community-TC #20133

Closed
Hexcles opened this issue Nov 6, 2019 · 5 comments · Fixed by #20152

Comments

@Hexcles
Member

Hexcles commented Nov 6, 2019

We started Taskcluster migration (https://bugzilla.mozilla.org/show_bug.cgi?id=1574668) and now we are seeing widespread Chrome crashes on the new Community-TC instance.

Example: https://community-tc.services.mozilla.com/tasks/aMcaEBP4RIKa_EbrE30xJw/runs/0/logs/https%3A%2F%2Fcommunity-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FaMcaEBP4RIKa_EbrE30xJw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log#L550

This PR has both Taskcluster and Community-TC jobs. Note that the crash only happens on Community-TC. After checking a few other jobs, it seems the crash is widespread on Community-TC.

Hexcles added a commit that referenced this issue Nov 6, 2019
@Hexcles Hexcles self-assigned this Nov 6, 2019
@Hexcles
Member Author

Hexcles commented Nov 6, 2019

cc @jgraham

@Hexcles
Member Author

Hexcles commented Nov 6, 2019

We are certain that Chrome is failing to start in Docker:

Failed to move to new namespace: PID namespaces supported, Network namespace supported, but failed: errno = Operation not permitted

which is a rather common problem; cf. the Puppeteer troubleshooting doc.

We have a few workarounds on the Docker side as seen here:

Run the container with: --privileged
Run with: --cap-add SYS_ADMIN
Run with: --security-opt seccomp:unconfined
Run with: --security-opt seccomp:chrome.json

Our understanding is that these are listed in order of increasing security. The first one is easy to do on Community-TC, but it is particularly bad security-wise: it gives Docker containers RW access to /sys and /dev on the host, so a container can easily pollute the host (e.g. using sysctl or dd to /dev), and the hosts are NOT reset between tests on Taskcluster. A good long-term solution is to use a custom seccomp profile. That may require building our own worker image, because the profile is not configurable otherwise, and there is currently no self-service way to build a worker image.
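To make the four Docker-side options concrete, here is a minimal Python sketch that builds the corresponding `docker run` argument lists. The flags are copied verbatim from the list above; the image name `example/chrome-tests` and the helper `docker_run_argv` are hypothetical, and nothing is actually executed.

```python
# Flags copied from the workaround list, in the order given there.
WORKAROUND_FLAGS = {
    "privileged": ["--privileged"],
    "cap_sys_admin": ["--cap-add", "SYS_ADMIN"],
    "seccomp_unconfined": ["--security-opt", "seccomp:unconfined"],
    "seccomp_custom": ["--security-opt", "seccomp:chrome.json"],
}


def docker_run_argv(workaround, image="example/chrome-tests"):
    """Return the argv for `docker run` with one workaround applied.

    `image` is a hypothetical placeholder, not the image used by the project.
    """
    return ["docker", "run", "--rm"] + WORKAROUND_FLAGS[workaround] + [image]


if __name__ == "__main__":
    # Print each candidate command line for comparison; nothing is run.
    for name in WORKAROUND_FLAGS:
        print(" ".join(docker_run_argv(name)))
```

Keeping the options in one table makes it easy to see that only the last one needs an extra file (`chrome.json`) to be present where Docker is invoked, which is why it is the hardest to deploy on a worker image we do not control.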

Alternatively, we could disable Chrome sandboxing. This is in fact better security-wise: we have password-less sudo inside Docker, so anyone can get root in a PR already. By disabling Chrome sandboxing, we are not loosening the container isolation at all, so we are not opening up new vulnerabilities. However, it is unclear whether it'd affect web-observable behaviours (I'm checking with the Chrome team). If not, I think this might be the best tradeoff.

@Hexcles
Member Author

Hexcles commented Nov 6, 2019

None of the aforementioned workarounds were used previously. We are not sure why it used to work, but our theory is that a different sandboxing mechanism was used because of the old kernel version.

On the old taskcluster.net instance, github-worker runs a 14.04-based image. According to Chromium docs, user namespace sandbox is only used on kernel >= 3.8. Yet 14.04 shipped 3.13, so we are not sure...
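The version comparison behind that doubt is easy to get wrong if 3.13 and 3.8 are read as decimals, so here is a small sketch of the check as a component-wise comparison. The 3.8 threshold is the one quoted above; the function names and the sample release string are illustrative assumptions, and the check says nothing about whether user namespaces are actually enabled on a given host.

```python
import re

# Threshold quoted from the Chromium docs in the comment above.
USER_NS_SANDBOX_MIN = (3, 8)


def kernel_version_tuple(release):
    """Parse the leading 'major.minor' of a kernel release string into ints."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_user_ns_threshold(release):
    """True if the kernel version is at or above the quoted 3.8 threshold."""
    return kernel_version_tuple(release) >= USER_NS_SANDBOX_MIN


if __name__ == "__main__":
    # Hypothetical release string for a 3.13 kernel such as the one 14.04 shipped.
    print(meets_user_ns_threshold("3.13.0-generic"))
```

Compared as tuples, (3, 13) is greater than (3, 8), so a 3.13 kernel already meets the threshold. That is why the old-kernel theory does not fully explain the difference.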

@Hexcles
Member Author

Hexcles commented Nov 6, 2019

We have two immediate fix candidates:

Hexcles added a commit that referenced this issue Nov 7, 2019
Whenever we see TASKCLUSTER_ROOT_URL in the entry point (runl.py), we
disable Chrome sandboxing with --no-sandbox.

Fix #20133
stephenmcgruer pushed a commit that referenced this issue Nov 7, 2019
Whenever we see TASKCLUSTER_ROOT_URL in the entry point, we
disable Chrome sandboxing with --no-sandbox. This is required to work
around the lack of capabilities in the new community TaskCluster instances.

Fix #20133
@stephenmcgruer
Contributor

We believe that bc83451 should resolve this, by disabling the Chrome sandbox when running under the new taskcluster instances. Long term, we should investigate adding support to TaskCluster for custom seccomp profiles, but hopefully this puts out the fire for now.

🔥 <-- 🚒
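For readers who want to see the shape of the workaround, the following is a minimal sketch of the logic described in the commit messages: if `TASKCLUSTER_ROOT_URL` is present in the environment, append `--no-sandbox` to the Chrome arguments. The function name `chrome_args` and the way arguments are passed around are assumptions made for illustration; the actual entry-point code is not shown in this thread.

```python
import os


def chrome_args(environ=None, base_args=()):
    """Return Chrome arguments, adding --no-sandbox on the new Taskcluster.

    `environ` defaults to the process environment; passing a dict makes the
    function easy to exercise without touching real environment variables.
    """
    env = os.environ if environ is None else environ
    args = list(base_args)
    # TASKCLUSTER_ROOT_URL is the marker named in the commit messages.
    if "TASKCLUSTER_ROOT_URL" in env and "--no-sandbox" not in args:
        args.append("--no-sandbox")
    return args


if __name__ == "__main__":
    print(chrome_args())
```

Keying the flag off the environment variable keeps the sandbox enabled everywhere else, so local runs and other CI systems are unaffected by the workaround.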
