📦 ♃ Debian packaging of JupyterHub, a multi-user server for Jupyter notebooks

"jupyterhub" Debian Packaging

STATUS ☑️ Package building works, and you can launch the server using systemctl start jupyterhub, or via the provided Dockerfile.run. It uses PAM authentication, i.e. it starts the notebook servers in a local user's context. The package is tested on Ubuntu Bionic, on Ubuntu Xenial (with Python 3.8.1 from the Deadsnakes PPA), and on Debian Stretch. You can also use a Docker container (see Dockerfile.run; user: admin, password: test1234).

BSD 3-clause licensed


What is this?

This project provides packaging of the core JupyterHub components, so they can be easily installed on Debian-like target hosts. This makes life-cycle management on production hosts a lot easier, and avoids common drawbacks of ‘from source’ installs, like needing build tools and direct internet access in production environments.

The Debian packaging metadata in the debian directory puts the jupyterhub Python package and its dependencies, as released on PyPI, into a DEB package using dh-virtualenv. The resulting omnibus package is thus easily installed to and removed from a machine, but is not a ‘normal’ Debian python-* package. If you want that, look elsewhere.

Since the dynamic router of JupyterHub is a Node.js application, the package also has a dependency on nodejs, limited to the current LTS version range (that is 12.x or 10.x as of this writing). In practice, that means you should use the NodeSource packages to get Node.js, because the native Debian ones are typically dated (Stretch comes with 4.8.2~dfsg-1). Adapt the debian/control file if your requirements are different.

Customizing the Package Contents

To add any plugins or other optional Python dependencies, list them in install_requires in setup.py as usual – but only use versioned dependencies so package builds are reproducible. These packages are then visible in the default Python3 kernel. Or add a requirements.txt file, which has the advantage that you don't need to change any git-controlled files.
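As a minimal sketch of the "only use versioned dependencies" rule, the following check flags requirement strings that lack an exact `==` pin. The helper name `unpinned` and the sample requirement strings (including their version numbers) are illustrative assumptions, not taken from this project's setup.py.

```python
def unpinned(requirements):
    """Return the requirement strings that lack an exact '==' version pin."""
    loose = []
    for req in requirements:
        # Ignore environment markers (everything after the first ';').
        spec = req.split(";", 1)[0]
        if "==" not in spec:
            loose.append(req)
    return loose


# Hypothetical entries; the names and versions are placeholders.
sample = ["example-plugin==1.2.3", "another-plugin>=2.0", "bare-plugin"]
print(unpinned(sample))  # → ['another-plugin>=2.0', 'bare-plugin']
```

Running such a check over install_requires (or a requirements.txt) before a build is one way to catch entries that would make the package contents vary between builds.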

Some standard extensions are already contained in setup.py as setuptools extras. The viz extra installs seaborn and holoviews, which in turn pulls large parts of the usual data science stack, including numpy, scipy, pandas, and matplotlib. The related vizjs extra adds several JavaScript-based frameworks like bokeh, and image rendering support for SVG/PNG writing. Activating extras increases the package size by tens or even hundreds of MiB, so keep an eye on it.

Activate the spark extra to get PySpark and related utilities. The systemd unit already includes support for auto-detection or explicit configuration of an installed JVM.

To activate extras, you need dh-virtualenv v1.1 or later, which supports the --extras option. That option is used as part of the EXTRA_REQUIREMENTS variable in debian/rules; add or remove extras there as you see fit. There are two special extras named default and full: the DEFAULT_EXTRAS are listed in setup.py, and full is simply everything.
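To illustrate how the two special extras relate to the named ones, here is a sketch of how such a mapping could be assembled. This is not the project's actual setup.py: the package lists are reduced to the names mentioned above, version pins are omitted for brevity, and treating DEFAULT_EXTRAS as a list of extra names is an assumption.

```python
# Illustrative extras only; the real lists (with pinned versions) live in setup.py.
extras_require = {
    "viz": ["seaborn", "holoviews"],
    "vizjs": ["bokeh"],
    "spark": ["pyspark"],
}

# Assumption: DEFAULT_EXTRAS names the extras that make up 'default'.
DEFAULT_EXTRAS = ["viz"]

# 'default' is the union of the requirements of all default extras.
extras_require["default"] = sorted(
    {req for name in DEFAULT_EXTRAS for req in extras_require[name]}
)

# 'full' is simply everything: the union of all named extras.
extras_require["full"] = sorted(
    {req for name, reqs in extras_require.items() if name != "default" for req in reqs}
)

print(extras_require["full"])  # → ['bokeh', 'holoviews', 'pyspark', 'seaborn']
```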

“Devops Intelligence” showcase

Here is an example of what you can do using this package, without any great investment of effort or capital. Within a simple setup adding a single JupyterHub host, you can use the built-in Python3 kernel to access existing internal data sources (see figure below).

Such a setup supports risk analysis and decision making within development and operations processes. Typical business intelligence / data science procedures can be applied to the ‘business of making and running software’. The idea is to create feedback loops, and facilitate human decision making by automatically providing reliable input in the form of up-to-date facts. After all, development is our business, so let's have KPIs for developing, releasing, and operating software.

[Figure: Architecture Overview]

See this notebook or this blog post for more details and a concrete example of how to use such a setup.

How to build and install the package

Packages are built in Docker using the Dockerfile.build file. That way you do not need to install tooling and build dependencies on your machine, and the package always gets built in a pristine environment. The only thing you need on your workstation is a docker-ce installation of version 17.06 or higher (either on Debian or on Ubuntu).

After initializing your work environment with command . .env --yes, call ./build.sh debian:buster to build the package for Debian Buster. Building for Ubuntu Bionic with ./build.sh ubuntu:bionic is also supported, as are the old-stable releases, but those aren't regularly tested and might fail. See Building Debian Packages in Docker for more details.

Generated package files are placed in the dist/ directory. You can upload them to a Debian package repository via e.g. dput, see here for a hassle-free solution that works with Artifactory and Bintray.

To test the resulting package, read the comments at the start of Dockerfile.run. Or install the package locally into /opt/venvs/jupyterhub/, using dpkg -i ….

sudo dpkg -i $PWD/dist/jupyterhub_*.deb
/usr/sbin/jupyterhub --version  # ensure it basically works

To list the installed version of jupyterhub and all its dependencies (around 150 in the default configuration), call this:

/opt/venvs/jupyterhub/bin/pip freeze | column

Trouble-Shooting

'npm' errors while building the package

While installing the configurable-http-proxy Javascript module, you might get errors like npm ERR! code E403. That specific error means you have to provide authorization with your Node.js registry.

npm uses a configuration file which can provide both a local registry URL and the credentials for it. Create a .npmrc file in the root of your git working directory, otherwise ~/.npmrc is used.

Example ‘.npmrc’ file:

_auth = xyzb64…E=
always-auth = true
email = [email protected]
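The `_auth` value is conventionally the Base64 encoding of `username:password`. As a small sketch, it can be generated as shown below. The helper name `npm_auth_token` and the credentials are placeholders, and your registry may expect a different scheme (for example a token), so check its documentation.

```python
import base64


def npm_auth_token(username, password):
    """Return the Base64 form of 'username:password' for use as '_auth'."""
    raw = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder credentials, for illustration only.
print("_auth = " + npm_auth_token("user", "secret"))  # → _auth = dXNlcjpzZWNyZXQ=
```

Since the encoded value is trivially reversible, treat the resulting .npmrc like a password file and keep it out of version control.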

'pkg-resources not found' or similar during virtualenv creation

See the related section in the dh-virtualenv manual.

'no such option: --no-binary' during package builds

This package needs a reasonably recent pip for building. To upgrade pip (which makes sense anyway if your system is still on the ancient version 1.5.6), call sudo python3 -m pip install -U pip.

When using dh-virtualenv 1.1 or later releases, this problem should not appear anymore.

“Unknown lvalue 'ProtectControlGroups' in section 'Service'” at runtime

These messages appear in the service logs (journalctl) when you use the provided systemd unit files on older systems (e.g. Xenial). They are just warnings and can be safely ignored.

Updating requirements

As previously mentioned, additional packages are listed in setup.py. General dependencies can be found in install_requires, while groups of optional extensions are part of extras_require.

To assist upgrading to newer versions, call these commands in the project workdir:

./setup.py egg_info
pip-upgrade --skip-package-installation --skip-virtualenv-check debianized_jupyterhub.egg-info/requires.txt <<<"q"

This lists any newer versions that are available, which you can then edit into setup.py.

How to set up a simple service instance

After installing the package, JupyterHub is launched by default and available at http://127.0.0.1:8000/.
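A quick way to confirm that the hub answers on that address is to query the root of its REST API, which reports the running JupyterHub version. The following is a minimal sketch using only the standard library; the function names are illustrative, and the default URL assumes the unchanged bind address mentioned above.

```python
import json
from urllib.request import urlopen


def parse_version(body):
    """Extract the version string from the JSON body returned by /hub/api/."""
    return json.loads(body)["version"]


def hub_version(base_url="http://127.0.0.1:8000", timeout=5.0):
    """Ask a running hub for its version; raises URLError if it is unreachable."""
    url = base_url.rstrip("/") + "/hub/api/"
    with urlopen(url, timeout=timeout) as response:
        return parse_version(response.read())
```

On the host running the service, `print(hub_version())` should then print the installed version, or raise an error if nothing is listening on port 8000.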

The same is true when you used the docker run command as mentioned in Dockerfile.run. The commands as found in Dockerfile.run also give you a detailed recipe for a manual install, when you cannot use Docker for any reason – the only difference is process control, read on for that.

The package contains a systemd unit for the service, and starting it is done via systemctl:

sudo systemctl enable jupyterhub
sudo systemctl start jupyterhub

# This should show the service in state "active (running)"
systemctl status 'jupyterhub' | grep -B2 Active:

The service runs as jupyterhub.daemon. Note that the jupyterhub user is not removed when purging the package, but the /var/{log,opt,run}/jupyterhub directories and the configuration are.

By default, the sudospawner is used to start a user's notebook process – for that purpose, the included /etc/sudoers.d/jupyterhub configuration allows the jupyterhub system user to create these on behalf of any user listed in the JUPYTER_USERS alias. Unless you change it, that means all accounts in the users group.

In case you want to enable a specific user group for the sudo spawner, change the sudoers file like this:

sed -i.orig~ -e s/%users/%jhub-users/ /etc/sudoers.d/jupyterhub

If you want certain users to have admin access, add them to the set named c.Authenticator.admin_users in /etc/jupyterhub/jupyterhub_config.py.

After an upgrade, the service restarts automatically by default – you can change that using the JUPYTERHUB_AUTO_RESTART variable in /etc/default/jupyterhub.

In case of errors or other trouble, look into the service's journal with…

journalctl -eu jupyterhub

To identify your instance, and help users use the right login credentials, add something similar to this to your /etc/jupyterhub/jupyterhub_config.py (see this issue for details):

c.JupyterHub.template_vars = dict(
    announcement=
        '<a href="https://confluence.example.com/x/123456" target="_blank">'
        "<h1>DevOps Intelligence Platform</h1></a>",
    announcement_login=
        '<a href="https://confluence.example.com/x/123456" target="_blank">'
        "<h1>DevOps Intelligence Platform</h1></a>"
        "<big>&#128274; <b>Use your company LDAP credentials!</b></big>",
)

If you add a PNG image at /etc/jupyterhub/banner.png, it is used instead of the original banner image (sized 208 × 56 px). Note that this is done via a postinst script, so you must call dpkg-reconfigure jupyterhub if you change or add such an image after the package installation.
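If you want your replacement to match the dimensions of the original banner, the size of a PNG can be read from its header without any imaging library. This is a sketch under the assumption that matching 208 × 56 px is desirable for your layout; the function name `png_size` is illustrative.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", width, height.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])
```

For example, `png_size(open("/etc/jupyterhub/banner.png", "rb").read(24))` returns the width and height of an installed banner, which you can compare against `(208, 56)`.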

Securing your JupyterHub web service with an SSL off-loader

Note that JupyterHub can directly offer an SSL endpoint, but there are a few reasons to do that via a local proxy:

  • JupyterHub needs no special configuration to open a low port (remember, we do not run it as root).
  • Often there are already configuration management systems in place that, for commodity web servers and proxies, seamlessly handle certificate management and other complexities.
  • You can protect sensitive endpoints (e.g. metrics) against unauthorized access using the built-in mechanisms of the chosen SSL off-loader.

To hide the HTTP endpoint from the outside world, change the bind URL in /etc/default/jupyterhub as follows:

# Bind to 127.0.0.1 only
sed -i.orig~ -e s~//:8000~//127.0.0.1:8000~ /etc/default/jupyterhub
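To make the effect of that substitution explicit: a bind URL without a host part listens on all interfaces, while one naming a loopback address is reachable only from the local machine. The sketch below classifies a bind URL accordingly; the function name `is_local_only` is illustrative, and the sample URLs mirror the before and after of the sed expression.

```python
from ipaddress import ip_address
from urllib.parse import urlsplit


def is_local_only(bind_url):
    """Return True if the URL's host is a loopback address or 'localhost'."""
    host = urlsplit(bind_url).hostname
    if not host:
        return False  # empty host: binds to all interfaces
    if host == "localhost":
        return True
    try:
        return ip_address(host).is_loopback
    except ValueError:
        return False  # some other host name; treat as not local-only


print(is_local_only("http://:8000"))           # → False
print(is_local_only("http://127.0.0.1:8000"))  # → True
```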

Restart the service and check that port 8000 is bound to localhost only:

systemctl restart jupyterhub.service
netstat -tulpn | grep :8000

Then install your chosen web server / proxy for SSL off-loading, listening on port 443 and forwarding to port 8000. Typical candidates are nginx, Apache httpd, or Envoy. For an internet-facing service, consider https-portal, which is an nginx Docker image with easy configuration and built-in Let's Encrypt support.

Otherwise, install the Debian nginx-full package and copy docs/examples/nginx-jhub.conf to the /etc/nginx/sites-enabled/default file (or another path depending on your server setup). Make sure to read through the file; most likely you have to adapt the certificate paths in ssl_certificate and ssl_certificate_key (and create a certificate, e.g. a self-signed one).

You also need to create Diffie-Hellman parameters using the following command, which can take several minutes to finish:

openssl dhparam -out /etc/ssl/private/dhparam.pem 4096

Then (re-)start the nginx service and try to log in.

‼️ Note that this does not protect against any local users and their notebook servers and terminals, at least as long as you use the default spawner that launches local processes.

Changing the Service Unit Configuration

The best way to change or augment the configuration of a systemd service is to use a ‘drop-in’ file. For example, to increase the limit for open file handles above the default of 8192, use this in a root shell:

unit='jupyterhub'

# Change max. number of open files for ‘$unit’…
mkdir -p /etc/systemd/system/$unit.service.d
cat >/etc/systemd/system/$unit.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=16384
EOF

systemctl daemon-reload
systemctl restart $unit

# Check that the changes are effective…
systemctl cat $unit
let $(systemctl show $unit -p MainPID)
grep -E 'Limit|files' "/proc/$MainPID/limits"
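If you prefer to check the effective limit from a script, for instance as part of a monitoring job, the same information can be parsed from the contents of /proc/&lt;pid&gt;/limits. This is a minimal sketch; the function name `max_open_files` is illustrative, and the sample text only imitates the column layout of that file.

```python
def max_open_files(limits_text):
    """Return the (soft, hard) 'Max open files' limits as strings."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns: soft limit, hard limit, units.
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    raise ValueError("no 'Max open files' line found")


sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            16384                16384                files\n"
)
print(max_open_files(sample))  # → ('16384', '16384')
```

The values are returned as strings because a limit can also read `unlimited`.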

Configuration Files

  • /etc/default/jupyterhub – Operational parameters like log levels and port bindings.
  • /etc/jupyterhub/jupyterhub_config.py – The service's configuration.

A few configuration parameters are set in the /usr/sbin/jupyterhub-launcher script and thus override any values provided by jupyterhub_config.py.

ℹ️ Please note that the files in /etc/jupyterhub are not world-readable, since they might contain passwords.

Data Directories

  • /var/log/jupyterhub – Extra log files.
  • /var/opt/jupyterhub – Data files created during runtime (jupyterhub_cookie_secret, jupyterhub.sqlite, …).
  • /run/jupyterhub – PID file.

You should stick to these locations, because the maintainer scripts have special handling for them. If you need to relocate, consider using symbolic links to point to the physical location.

