(This part takes around 10 minutes to complete on CloudLab c220g5.)
Let us first go through some warm-up examples to get familiar with the Ceph storage layer as well as the super and collapse APIs. In this section, we will learn the following commands (you do not need to know how to use these APIs by heart; the remaining parts of this artifact all provide one-click scripts to run the experiments):
ceph -s # display the ceph cluster status
rbd create # creates a disk image
rbd ls # lists all disk images
rbd map # maps a disk image as a block device in parent mode
rbd super # creates a disk clone and maps it as a block device in child mode
rbd collapse # deallocates one version of the disk (parent or child)
To do so, we will use a simple example to go through the workflow in speculative recovery. We will first start a Ceph cluster and create a disk image. Then, we will map the parent version of this disk image as a block device and do some file operations (pretending to be the primary application instance). Next, we will switch to the child version of the disk by using super and do some more file operations (pretending to be the backup instance). Finally, we will use collapse to deallocate the parent version and promote the child (pretending a failover happened).
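In other words, the whole exercise boils down roughly to the sequence below (each step is explained in detail in the rest of this part; the intermediate mount, unmount, and unmap steps are omitted here for brevity):
sudo bin/rbd map foo             # act as the primary (parent mode)
# ... file operations as the parent ...
sudo bin/rbd super foo           # act as the backup (child mode)
# ... file operations as the child ...
bin/rbd collapse foo --promote   # failover: promote the child, deallocate the parent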
(This part does not have a one-click script since it is meant to be interactive)
Now let us begin. First, change to the Ceph build directory:
cd /mnt/specreds/ceph/build
Ceph provides a script to start up a test cluster locally:
./start-new.sh
This starts up a test cluster with three OSDs (the storage servers), each of which manages a 100GB device (a 100GB disk image file residing in the filesystem). This means that the total storage capacity of the test cluster is 300GB (the effective capacity is still 100GB given three-way replication). This will be enough for our experiments. Then, set up the environment variables:
source vstart_environment.sh
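Optionally, you can confirm that the three OSDs mentioned above are up using a standard Ceph query (the exact output layout may vary across Ceph versions):
bin/ceph osd tree # should list three OSDs, all up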
Now, you can check the cluster status by typing:
bin/ceph -s
If the health field shows HEALTH_ERR with Module 'dashboard' has failed: No module named 'routes' as the only error, this is expected and it does not impact the cluster's normal operations. If you see other errors, then the cluster may be truly unhealthy (this could happen at cluster start-up time). Please wait a bit and the cluster should clean up just fine.
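If you want to see exactly which errors are being reported, the standard health query below gives a per-error breakdown (this is just an optional check):
bin/ceph health detail # list each health warning/error with details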
After you are done with the experiments, you can shut down the cluster with
./reset.sh
The next time you wish to resume, use the startup script again to restart the cluster, but remember to use start-keep.sh so that the cluster will reuse the previous setup (including the data). Otherwise, a brand new cluster will be created and all previous data will be lost.
./start-keep.sh
(If you see that the cluster is in an unhealthy state during normal operation, you can start up a new cluster by running reset.sh and then start-new.sh.)
Ceph's block device interface is called rbd. We will be using this interface to create a disk as well as to call super and collapse.
Now, let us create a disk image named foo, 1GB in size
bin/rbd create foo --size=1G # create disk foo, 1GB in size
(You will see some WARNING messages printed; please ignore them.)
Then, you can check the created disk image with
bin/rbd ls -l # list all disk images
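Optionally, you can also inspect the image metadata with the standard rbd info command (the exact output fields may vary across Ceph versions):
bin/rbd info foo # show the image size and other metadata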
Now let's mount a filesystem on foo and do some file operations.
First, map the disk image as a block device. By default, this maps the disk image in parent mode, meaning that all access to the disk is treated as the parent. In speculative recovery, this is the mode the primary instance uses.
sudo bin/rbd map foo # map a block device
It should be mapped as /dev/rbd0 on the system (you can use lsblk to check).
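For example, a quick sanity check could look like this (the device number may differ if other RBD devices are already mapped):
lsblk /dev/rbd0 # should show a 1G block device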
Next, make a filesystem
sudo mkfs.ext4 /dev/rbd0 # make the ext4 fs
Next, mount the filesystem
sudo mount /dev/rbd0 /srv # mount to /srv
sudo chown $(id -u):$(id -g) /srv # change the ownership
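To double-check that the filesystem is mounted and owned by you (optional):
df -h /srv # should show /dev/rbd0 mounted on /srv
ls -ld /srv # should show your user as the owner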
Now, write a file (in parent mode)
echo "I am the parent" > /srv/parent
Finally, unmount the filesystem and unmap the disk
sudo umount /srv
sudo bin/rbd unmap /dev/rbd0 # can pass foo here as well
Now, we will use super to clone a child disk and do some file operations on the child as well.
First, call super
sudo bin/rbd super foo # creates and maps the child disk
This command creates a child disk and maps it as a block device /dev/rbd0. Now all access to this disk will be treated as the child. In speculative recovery, this is the mode the backup instance uses.
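Optionally, you can confirm the mapping with rbd's standard listing of mapped devices:
bin/rbd showmapped # should show foo mapped at /dev/rbd0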
Next, mount the filesystem
sudo mount /dev/rbd0 /srv
We should see the parent file written while in parent mode, as well as its content "I am the parent"
cat /srv/parent # I am the parent
Now, create another file as the child
echo "I am the child" > /srv/child
Finally, unmount the filesystem and unmap the disk
sudo umount /srv
sudo bin/rbd unmap /dev/rbd0 # can pass foo here as well
Now we have two disk versions, one with the parent file and one with an additional child file. Let us switch back to parent mode and check whether the content of the parent is still there. This tests the basic isolation requirement for disk clones (note that super only provides one-way isolation, in that the child may be interfered with by the parent. For more details, please refer to the paper).
Again, map the disk and mount the filesystem
sudo bin/rbd map foo # map as the parent
sudo mount /dev/rbd0 /srv
Inside /srv, we should only see the parent file with content "I am the parent" (please use ls and cat to verify)!
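For example:
ls /srv # should list only parent (plus lost+found created by mkfs)
cat /srv/parent # I am the parent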
Finally, unmount the filesystem and unmap the disk
sudo umount /srv
sudo bin/rbd unmap /dev/rbd0
Now we have two independent versions of the disk. To clarify, when doing rbd ls -l, we will only see one disk, foo. The parent and child versions of foo do not exist "explicitly" but only show up when the image is mapped as a block device, depending on your access mode. If you map the disk image using rbd map, you are in parent mode; if you map the image using rbd super, you are in child mode.
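To summarize the two access modes on the same disk image:
sudo bin/rbd map foo # parent mode: access the parent version
sudo bin/rbd super foo # child mode: create and map the child version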
Next, assume that we decide to promote the child version and deallocate the parent by calling collapse:
bin/rbd collapse foo --promote # collapse by promoting the child
A progress bar will show up, indicating the progress of the asynchronous garbage collection that is deallocating the parent. The call to collapse is a synchronous operation on the terminal, but the disk is not blocked from normal I/O operations. To deallocate the child instead, use the --abort option. If no option is specified, collapse simply returns the dirty bit value of the parent disk.
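To recap, the three ways to invoke collapse are:
bin/rbd collapse foo --promote # deallocate the parent and promote the child
bin/rbd collapse foo --abort # deallocate the child instead
bin/rbd collapse foo # no option: return the parent's dirty bit value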
After the call to collapse returns, the disk image has only one version, which is the newly promoted parent (previously the child). Now let us see its content:
sudo bin/rbd map foo # map as the parent
sudo mount /dev/rbd0 /srv
Inside /srv, we should see both files parent and child with their corresponding content (please use ls and cat to verify)!
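For example:
ls /srv # should list both parent and child
cat /srv/parent # I am the parent
cat /srv/child # I am the child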
Finally, clean up:
sudo umount /srv
sudo bin/rbd unmap /dev/rbd0
bin/rbd rm foo # remove the disk image
(Optionally, you can shut down the cluster)
./reset.sh # shut down the cluster
Now, please proceed to Part 3.