Ceph-alopod

This article describes the switch of an existing Ceph deployment to a containerized deployment on Ubuntu 18.04.

Target audience: Ceph administrators planning to make the switch to containers using the official ceph-ansible playbook.

By Gauvain Pocentek, Cloud Consultant

Migrating to containerized Ceph on Ubuntu 18.04: a field report

Ceph @ Objectif Libre

We have been running Ceph internally for a couple of reasons:

  • our backups are stored using the RadosGW
  • we often test infrastructure tools that can be integrated with a Ceph cluster: it’s easier to test if Ceph is already up and running

The Ceph cluster is shared with the compute nodes of our internal OpenStack, and this setup works just fine. The only problem: upgrades. When we upgrade Ceph or OpenStack, things can become really messy, mainly because of packaging and repository issues.

So we decided to move to a containerized Ceph. Containerized deployments, and migrations to them, are both supported by the official ceph-ansible project, which we have been using since the initial deployment. The project has worked very well for us, so we trusted its migration playbook to handle the switch without too much trouble. Apart from the small problems detailed in this article, everything went fine.

How to switch to containers

The process boils down to three steps:

  1. update the configuration to enable the containerized deployment
  2. run the migration playbook
  3. enjoy

The configuration change consists of a few additional lines in group_vars/all.yml:

# enable containerized deployment
containerized_deployment: true

# explicitly set the version (by default the 'latest' tag is used)
ceph_docker_image_tag: latest-mimic

You can find the list of available tags on Docker Hub.
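For a quick look from the command line, you can also query the Docker Hub API directly. The snippet below is only a sketch: it assumes the default ceph/daemon image used by ceph-ansible and relies on jq to extract the tag names.

$ curl -s 'https://hub.docker.com/v2/repositories/ceph/daemon/tags/?page_size=100' \
    | jq -r '.results[].name' | grep mimic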

After changing the configuration, run the migration playbook:

$ ansible-playbook infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml

Simple enough, but we hit a few problems along the way. If you’re running the migration on Ubuntu, you will likely face the same issues, so here is how we fixed them.

The problems

Docker packaging

The ceph-ansible playbook installs the docker package available in the Ubuntu repositories. This means that if you already have Docker installed from another source, for instance docker-ce from Docker’s own repository, the package installation will conflict and the playbook will fail.

We fixed this by simply removing the already installed Docker packages, since we didn’t have any critical containers running. If that is not an option for you, you can remove the installation tasks from the playbook instead.
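For reference, the cleanup boils down to something like this (the exact package names depend on how Docker was installed in the first place; docker-ce and its companion packages are an assumption here):

# dpkg -l | grep -i docker
# apt-get purge docker-ce docker-ce-cli containerd.io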

System ID conflicts

The biggest problem was the mismatch between the host IDs and the container IDs for the ceph user and group. During the OSD migration on the first node, the OSD containers never came up because of permission problems: some files in the /var/lib/ceph directories were not writable by the ceph user.

So we changed the owner and group of the files to 167, the ID used for the ceph user and group inside the containers. That was not enough: Ceph was still unable to write to /var/lib/ceph/osd/ceph-$ID/block, a symlink to a block device. Write access to this device is handled by udev, with rules defined in /lib/udev/rules.d/95-ceph-osd.rules that grant the ceph user write access to the OSD-managed devices. The remaining problem was getting udev to re-apply these rules with the new ID; a reboot turned out to be the most efficient method.
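You can check the mismatch for yourself before changing anything by comparing the ceph IDs on the host with the ones defined in the image. The image name and tag below are the ones from our configuration, and the output lines are only illustrative:

# id ceph
uid=64045(ceph) gid=64045(ceph) groups=64045(ceph)
# docker run --rm --entrypoint getent ceph/daemon:latest-mimic passwd ceph
ceph:x:167:167:Ceph daemons:/var/lib/ceph:/sbin/nologin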

So, on each node, we ran the following to bring Ceph into a state the playbook could work with (64045 is the UID and GID created by the Ubuntu ceph packages):

# sed -i 's/64045/167/g' /etc/group /etc/passwd
# chown 167:167 /etc/ceph
# chown -R 167:167 /var/lib/ceph
# reboot

After we ran these steps on all the nodes, the playbook went through without a hitch, and we are now the proud sysadmins of a containerized Ceph.
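If you want to double-check that the daemons really run in containers, something along these lines does the job on a monitor node (the container names follow the ceph-ansible convention of daemon type plus hostname or OSD ID, so adapt them to your nodes):

# docker ps --format '{{.Names}}'
# docker exec ceph-mon-$(hostname -s) ceph -s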

Conclusion

Although the process to migrate to a containerized Ceph deployment is easy and completely automated, it requires a bit of preparation if you are using Ubuntu as the host operating system. Red Hat- and CentOS-based deployments will not have the ID problem, as ID 167 is already used for the ceph user and group in a standard deployment.

Ceph once again proved itself very robust: the complete failure of the container startup on the first OSD node didn’t disturb operations running on the cluster. And the switch to containers has been completely transparent!