feat(provisioner): NixOS based image #662

Closed
wants to merge 38 commits into from

@mvgijssel mvgijssel commented Aug 8, 2024

ref #661

TODO

  • Create NixOS image builder docker container
  • Create NixOS image using https://github.com/nix-community/nixos-generators
  • Load created image using lima
  • Start lima
  • Enable SSH access in NixOS
  • Enable SSH access through lima
  • Validate VM works
  • Create shell_command build action inside docker container to build image
  • Create shell script to start image (just using limactl start with config and qcow) and call from pants
  • Figure out a way to regularly apply the NixOS configuration to the machine
  • Combine build and deploy flake into single file
  • Update build script to use new combined flake
  • Create provision script with new flake
  • Consolidate configuration from configuration.nix and flake.nix
  • Check if /run/current-system/sw/bin contains vim
  • Add caching to provision task in the docker image
  • Add goss into the server and setup healthcheck endpoint
  • Refactor nixos-rebuild and nix into binstub calling into temporary docker container
    • Create delegator pex_binary
    • Create docker container if it does not exist
    • Add logging
    • Create server pex which terminates after certain timeout
    • Link server pex into client pex (https://pantsbuild.slack.com/archives/C046T6T9U/p1724752064431699)
    • Boot container with server pex
    • Server polls pids: any new pid detected will extend the timeout
    • Add server command to settings class
    • Write settings somewhere
    • Compare new settings with stored settings and terminate the container if they don't match
    • Delegate host environment variables into container
    • Figure out relative include in bin/nix and bin/nixos-rebuild and bin/nix-exec
    • Figure out how to call delegator from bin/nixos-rebuild and bin/nix
    • Add validation for timeout argument
    • Share code between client and server like logging
    • Fix running pants delegator inside of shell_command using PATH trick
    • $TMPDIR is not available within the pants sandbox!
    • Fix dealing with $HOME variable for delegator
    • Try a global "/" mount with bin/nix-exec, so we don't have to worry about which paths are mounted.
    • Inject proper docker image tag
    • Exceptions not raised within the sandbox, missing bash variables
    • How to copy the results from nix after nix build? Can we make an impure flake which puts the result in the local directory? Or can we deal with the symlink somehow?
    • Replace build.sh and provision.sh scripts
    • Implement tests
    • Follow-up: if a command runs less than 1 second, it's not picked up by the server. Is there an alternative (easy) client/server architecture that fixes that limitation?
  • Ability to call healthcheck endpoint from host (ssh forwarding? other networking type? enable all features? limactl shell bastion-vm curl localhost:8080)
  • Use regular docker invocation for building image (not using docker_environment)
  • Write test which starts machine, runs healthcheck, runs provisioning and runs healthcheck again
  • Use pants in CI using EngFlow?
  • Setup pants with Trunk
  • Setup lima with Trunk
  • Prevent installing nerdctl using lima
  • Re-enable SSH wrapper Warp to see if envfs works
  • Use limactl validate to validate yaml files using trunk?
  • nixos-rebuild leaves a bunch of ssh connections open in the nix-exec container. Will these expire on their own? And therefore the container as well eventually?
  • Bazel ignore BUILD files with .bazel extension (or add .pants to pants ones?)
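
The "server polls pids: any new pid detected will extend the timeout" item can be sketched as below. This is only an illustration of the idea, not the code in this PR: the class name IdleTracker, its methods, and the injectable clock are all hypothetical. It also shows where the sub-second limitation from the follow-up item comes from: a process that starts and exits between two polls never appears in any observed pid set.

```python
import time
from typing import Callable, Iterable, Set


class IdleTracker:
    """Tracks a shutdown deadline that is pushed back when a new pid appears.

    Hypothetical helper: the names here are illustrative only.
    """

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._seen: Set[int] = set()
        self._deadline = clock() + timeout_s

    def observe(self, pids: Iterable[int]) -> bool:
        """Record one poll result; return True if it extended the deadline."""
        current = set(pids)
        new_pids = current - self._seen
        # Remember only the pids from the latest poll, so a pid that went
        # away and is later reused counts as new activity again.
        self._seen = current
        if new_pids:
            self._deadline = self._clock() + self._timeout_s
            return True
        return False

    def expired(self) -> bool:
        """True once the idle timeout has passed without new pids."""
        return self._clock() >= self._deadline
```

A server loop would call observe() with the container's current pid list on every poll and stop the container once expired() returns True.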

@mvgijssel mvgijssel changed the title from "Improve pdm usage" to "feat(provisioner): NixOS based image" on Aug 21, 2024
@mvgijssel
Member Author

Able to boot the NixOS VM, but the cloud-init doesn't work properly leaving the machine in a broken state. There's a discussion going on to support NixOS as a proper Lima guest lima-vm/lima#430.

Working examples:

Running lima in plain mode: limactl start --plain bastion/bastion-vm.yaml starts the VM, but it still gets stuck on the SSH step.

@mvgijssel

Seems the ssh login is failing because /bin/bash is missing in NixOS but is set as a shell in the cloud-init https://github.com/lima-vm/lima/blob/d7669be1f18617a17131da72097cafe296fd4067/pkg/cidata/cidata.TEMPLATE.d/user-data#L34

@mvgijssel

mvgijssel commented Aug 23, 2024

It's possible to apply a nixos configuration into the machine using

  1. Create configuration.nix file at /etc/nixos/configuration.nix
  2. Run sudo nixos-rebuild -I nixos-config=/etc/nixos/configuration.nix switch

The problem is that after a single apply (or reboot) the current user can no longer sudo. Lima creates the user through cloud-init (https://github.com/lima-vm/lima/blob/d7669be1f18617a17131da72097cafe296fd4067/pkg/cidata/cidata.TEMPLATE.d/user-data#L34); maybe the settings are not persisted because the user isn't part of the NixOS configuration?

Maybe it's easier to add a generic user, ops, to the image, which always works regardless of the lima-created user. If it's possible to copy the authorized keys from the lima user to the ops user, then it's possible to log in to the VM as the ops user:

ssh -F ~/.lima/bastion-vm/ssh.config ops@lima-bastion-vm

@mvgijssel

Both deploy-rs and nixos-rebuild seem to work

NIX_SSHOPTS="-p 62704" nixos-rebuild --target-host [email protected] --build-host [email protected] -I nixos-config=./configuration.nix switch --use-remote-sudo

Now it's a matter of glueing everything together.
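
As a starting point for the glue, here is a minimal sketch of a wrapper that assembles the nixos-rebuild invocation above. Only the flags and the NIX_SSHOPTS variable come from the command in this comment; the function name nixos_rebuild_command and its parameters are hypothetical, and the host values are left to the caller.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def nixos_rebuild_command(
    target_host: str,
    build_host: str,
    ssh_port: int,
    config_path: str = "./configuration.nix",
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a remote `nixos-rebuild switch`.

    Hypothetical helper; it only mirrors the flags used in the command above.
    """
    # Start from the caller's environment (or the current one) and add the
    # SSH port that lima forwards to the VM.
    env = dict(os.environ if base_env is None else base_env)
    env["NIX_SSHOPTS"] = f"-p {ssh_port}"
    argv = [
        "nixos-rebuild",
        "--target-host", target_host,
        "--build-host", build_host,
        "-I", f"nixos-config={config_path}",
        "switch",
        "--use-remote-sudo",
    ]
    return argv, env
```

The result can be passed to subprocess.run(argv, env=env, check=True) from a provision script.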

@mvgijssel

Potential delegator api

delegator \
    --name nixos-rebuild \
    --volume "$HOME" \
    --volume "$(realpath "$TMPDIR")" \
    --timeout 10m \
    --image image-builder:dev \
    nixos-rebuild "$@"

This will create a docker container named nixos-rebuild, based on the image-builder:dev image, if it does not already exist. It mounts $HOME and $TMPDIR into /opt/delegator and stops the container after 10 minutes of no activity.
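
A rough sketch of how the client side of this API could validate the timeout and derive the docker invocation. Everything here is an assumption for illustration: the names parse_timeout, Settings and docker_run_args, the accepted units (s, m, h), and the server command passed in by the caller. The mount layout (host path appended to /opt/delegator) follows the description above.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

MOUNT_ROOT = "/opt/delegator"  # mount prefix from the description above
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}  # assumed set of units


def parse_timeout(value: str) -> int:
    """Convert a timeout such as '10m' into seconds, rejecting bad input."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if match is None:
        raise ValueError(
            f"invalid timeout {value!r}, expected e.g. '30s', '10m' or '1h'")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    if seconds <= 0:
        raise ValueError("timeout must be greater than zero")
    return seconds


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings record for one delegator container."""
    name: str
    image: str
    volumes: Tuple[str, ...]
    timeout_s: int


def docker_run_args(settings: Settings, server_command: List[str]) -> List[str]:
    """Build the `docker run` argv that boots the container in the background."""
    args = ["docker", "run", "--detach", "--name", settings.name]
    for host_path in settings.volumes:
        # Each absolute host path is mounted below MOUNT_ROOT.
        args += ["--volume", f"{host_path}:{MOUNT_ROOT}{host_path}"]
    return args + [settings.image] + list(server_command)
```

Because Settings is a frozen dataclass, two instances compare equal field by field, which is one simple way to detect that stored settings no longer match the requested ones.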

@mvgijssel

For the result symlink some thoughts:

  • store the nix store in an alternative location? Accessible in the host on the same path?
  • Make flake impure, let it modify current directory somehow?
  • …?

@mvgijssel

mvgijssel commented Aug 31, 2024

Trying to run ../bin/nix build --store /tmp/root .#docker using an alternative store path results in error:

warning: Git tree '/opt/delegator/Users/maarten/Development/setup' is dirty
error: builder for '/nix/store/z3scfb0yl5c6y6h2w1q82ngvx2mrwzxb-extra-commands.sh.drv' failed with exit code 1;
       last 1 log lines:
       > mv: cannot move '/tmp/nix-build-extra-commands.sh.drv-0/.attr-1lf969yddshzhld7sr1vbagr07bnygg99lgl6gk5kxcnp4z9wbcq' to '/nix/store/vrpm0y1j3f3m25zqcywlb9kc4wz07zwf-extra-commands.sh': Operation not permitted
       For full logs, run 'nix log /nix/store/z3scfb0yl5c6y6h2w1q82ngvx2mrwzxb-extra-commands.sh.drv'.
error: 1 dependencies of derivation '/nix/store/h29ama6563i83sjpa21avq50swcmraf9-tarball.drv' failed to build

The following does work:

../bin/nix build --store /tmp/root nixpkgs#hello

But the results symlink is still pointing to /nix/...

@mvgijssel

mvgijssel commented Aug 31, 2024

Use nix copy to copy the generated store link to a local file! See if that works.

...

Unfortunately this does not work as expected. Ideally we would just use the --store option, which does exactly what we want: it puts the store in an alternative location. I validated that the built content lives at the new store location; only the created "result" symlink, which still points at /nix/store, is wrong.
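
One possible way to deal with the symlink, sketched under the assumption that the built content sits below the alternative store root at the same /nix/store/... path (as observed with --store /tmp/root): read the link target and re-root it. The function names are hypothetical.

```python
import os
from pathlib import Path


def remap_store_target(target: str, store_root: str) -> Path:
    """Map a logical /nix/store/... path to its location under store_root."""
    if not target.startswith("/nix/store/"):
        raise ValueError(f"not a Nix store path: {target!r}")
    # "/nix/store/<hash>-name" under root "/tmp/root" becomes
    # "/tmp/root/nix/store/<hash>-name".
    return Path(store_root) / target.lstrip("/")


def physical_result_path(link: str, store_root: str) -> Path:
    """Resolve a `result` symlink against an alternative store root."""
    return remap_store_target(os.readlink(link), store_root)
```

With the build above this would be called as physical_result_path("result", "/tmp/root").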

...

Seems to work when setting the --privileged flag on the docker container

...

Trying to nix build with a store location in the delegator volume mount gives this error:

warning: Git tree '/opt/delegator/Users/maarten/Development/setup' is dirty
error: creating file '"/opt/delegator/private/tmp/nix-exec/nix/store/m2ca09nkxdy2g6x0ch2mmslwn2slhgyg-linux-pam-1.6.1-man/share/man/man8/pam.8.gz"': File exists
error: some substitutes for the outputs of derivation '/nix/store/wk2vy3jscn0c0i2q1dx0k3k5nr5jzk1h-linux-pam-1.6.1.drv' failed (usually happens due to networking issues); try '--fallback' to build derivation from source
error: 1 dependencies of derivation '/nix/store/z4fvv7bifmnc22873i65p2309wbxpz3d-ensure-all-wrappers-paths-exist.drv' failed to build
error: 1 dependencies of derivation '/nix/store/jd4274vyylwhzfbksxwfszgcg2cbaw8s-login.pam.drv' failed to build
error: 1 dependencies of derivation '/nix/store/w85b4djv8amgx2khk7wkcrsq86d0x3fh-system-path.drv' failed to build
error: 1 dependencies of derivation '/nix/store/nbscj62f2an4vnkd3fqd77brpq9ma6lq-nixos-system-nixos-24.05.20240822.797f7dc.drv' failed to build
error: 1 dependencies of derivation '/nix/store/h29ama6563i83sjpa21avq50swcmraf9-tarball.drv' failed to build

Definitely feels like we're trying to do the wrong thing here 🤔

@mvgijssel

mvgijssel commented Sep 1, 2024

Optimisations for the delegator:

  1. Use chroot on the mounted volumes inside the container, so that paths are identical inside and outside the container.
  2. Don't poll for processes inside the container: it turns out to be unreliable and keeps the container alive longer than it should.
  3. We currently mount the entire disk (/) while docker executes as root, which allows bypassing permissions. Use a non-root user instead.
  4. Files created by nix are indeed owned by root on macOS, so removing the Nix store requires sudo rm -rf /tmp/nix-exec.
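
To illustrate what item 1 would remove: without a chroot, every host path has to be translated to its location under the /opt/delegator mount and back (for example /opt/delegator/Users/maarten/Development/setup in the errors above). A minimal sketch of that translation, with hypothetical function names:

```python
from pathlib import PurePosixPath

MOUNT_ROOT = PurePosixPath("/opt/delegator")  # mount prefix used by the delegator


def to_container_path(host_path: str) -> str:
    """Translate an absolute host path to its location inside the container."""
    path = PurePosixPath(host_path)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {host_path!r}")
    # Drop the leading "/" component and re-anchor the rest under MOUNT_ROOT.
    return str(MOUNT_ROOT.joinpath(*path.parts[1:]))


def to_host_path(container_path: str) -> str:
    """Translate a path below MOUNT_ROOT back to the host path."""
    # relative_to raises ValueError when the path is outside MOUNT_ROOT.
    relative = PurePosixPath(container_path).relative_to(MOUNT_ROOT)
    return str(PurePosixPath("/") / relative)
```

With a chroot into the mount root both functions become the identity, which is why paths in warnings and error messages would then match the host.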

Are there alternatives, like bob build, that we can use to build files into a local directory? Or just use nix with its default store and deal with the symlink / store path: use nix-exec instead of nix and run the command we previously used.

@mvgijssel

See if we can make the setup more standard:

  • Use HCP Packer to do the image build in the cloud
  • Run goss during the packer pipeline for testing
  • Use renovate to bump the flake lock (https://docs.renovatebot.com/modules/manager/nix/)
  • Apply the changes to the provisioner using nixos-rebuild:
    • Use Argo to pull repo and apply changes OR
    • Use Omada VPN to give access to provisioner OR
    • Use Teleport to give access to provisioner OR
    • Use PiKVM to deploy the golden image into the machine

@mvgijssel mvgijssel closed this Nov 8, 2024