
Immigrants: "The greatest Trojan horse of all time"

Telepolis - Sat, 2016-08-06 22:01
Donald Trump continues to cater to the ethnic (völkisch) nationalism that the Islamists ultimately exploit as well
Categories: Politics + Culture

A mistake, or part of a psychological warfare campaign?

Telepolis - Sat, 2016-08-06 22:00
On the tweet claiming that Erdogan sought asylum in Germany on the night of the attempted coup
Categories: Politics + Culture

Aleppo: The perfumed death

Telepolis - Fri, 2016-08-05 18:00
"Their spit sweetens the sea" - how the propaganda of the "rebels" works, using the example of al-Muhaysini, the rock star of the jihadists
Categories: Politics + Culture

Iran: Euphoria has given way to a conflict-laden reality

Telepolis - Fri, 2016-08-05 15:00
The government appears powerless against the Basij militia; Ayatollah Khamenei calls the nuclear deal a "conspiracy of the enemy"
Categories: Politics + Culture

Kern and Juncker at odds over how the EU should deal with Turkey

Telepolis - Fri, 2016-08-05 13:00
While the Austrian chancellor regards the accession negotiations as having failed, the EU Commission president wants to break them off only after a reintroduction of the death penalty
Categories: Politics + Culture

Golden times for blatherers and bullshit bingo

Telepolis - Fri, 2016-08-05 11:13
Propaganda in the FAZ against open Wi-Fi reaches a new low
Categories: Politics + Culture

Sven Plöger, Tatjana Festerling and the Punjabi Elvis

Telepolis - Fri, 2016-08-05 11:00
YouTube and co. - our weekly Telepolis video roundup
Categories: Politics + Culture

Chancellor Merkel loses ground, Seehofer again touted as a candidate

Telepolis - Fri, 2016-08-05 10:15
ARD-DeutschlandTrend: Refugee policy remains the yardstick for assessment. Two thirds are not at all satisfied with Merkel
Categories: Politics + Culture

Turkish foreign minister: Austria is the "center of radical racism"

Telepolis - Fri, 2016-08-05 10:00
The conflict between Turkey and "the West" is escalating
Categories: Politics + Culture

What game is Wolfgang Drexler playing?

Telepolis - Fri, 2016-08-05 09:00
The chairman of the Baden-Württemberg NSU committee is fighting for the authorities' official version
Categories: Politics + Culture

Panama Papers: "Making sure that as many cases as possible come to light"

Telepolis - Fri, 2016-08-05 07:00
Frederik Obermaier of the SZ on the Panama Papers, priorities in the analysis, and handling 2.6 terabytes of data
Categories: Politics + Culture

Trump - too unconventional for the party establishment

Telepolis - Fri, 2016-08-05 05:00
The worst nightmare of some Republican party politicians: their unloved candidate could win the election
Categories: Politics + Culture

Percona XtraDB Cluster on Ceph

MySQL High Performance - Thu, 2016-08-04 22:31

This post discusses how XtraDB Cluster and Ceph are a good match, and how their combination allows for faster SST and a smaller disk footprint.

My last post was an introduction to Red Hat’s Ceph. As interesting and useful as it was, it wasn’t a practical example. Like most of the readers, I learn about and see the possibilities of technologies by burning my fingers on them. This post dives into a real and novel Ceph use case: handling of the Percona XtraDB Cluster SST operation using Ceph snapshots.

If you are familiar with Percona XtraDB Cluster, you know that a full state snapshot transfer (SST) is required to provision a new cluster node. Similarly, SST can also be triggered when a cluster node happens to have a corrupted dataset. Those SST operations consist essentially of a full copy of the dataset sent over the network. The most common SST methods are Xtrabackup and rsync. Both of these methods imply a significant impact and load on the donor while the SST operation is in progress.

For example, the whole dataset needs to be read from storage and sent over the network, an operation that requires a lot of IO operations and CPU time. Furthermore, with the rsync SST method, the donor is under a read lock for the whole duration of the SST. Consequently, it can take no write operations. Such constraints on SST operations are often the main reason behind the reluctance to use Percona XtraDB Cluster with large datasets.

So, what could we do to speed up SST? In this post, I will describe a method of performing SST operations when the data is not local to the nodes. You could easily modify the solution I am proposing for any non-local data source technology that supports snapshots/clones, and has an accessible management API. Off the top of my head (other than Ceph) I see AWS EBS and many SAN-based storage solutions as good fits.

The challenges of clone-based SST

If we could use snapshots and clones, what would be the logical steps for an SST? Let’s have a look at the following list:

  1. New node starts (joiner) and unmounts its current MySQL datadir
  2. The joiner asks for an SST
  3. The donor creates a consistent snapshot of its MySQL datadir with the Galera position
  4. The donor sends to the joiner the name of the snapshot to use
  5. The joiner creates a clone of the snapshot name provided by the donor
  6. The joiner mounts the snapshot clone as the MySQL datadir and adjusts ownership
  7. The joiner initializes MySQL on the mounted clone

As we can see, all these steps are fairly simple, but they hide some challenges for an SST method based on cloning. The first challenge is the need to mount the snapshot clone. Mounting a block device requires root privileges, and SST scripts normally run under the MySQL user. The second challenge I encountered wasn’t expected. MySQL opens the datadir and some files in it before the SST happens. Consequently, those files are then kept open in the underlying mount point, a situation that is far from ideal. Fortunately, there are solutions to both of these challenges as we will see below.
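The steps listed above can be sketched as a dry run in shell. This is only an illustration of the flow, not the actual wsrep_sst_ceph script: the function names and the snapshot/clone naming are assumptions, and nothing is executed against Ceph (each function only prints the commands it would run). It also assumes that a snapshot has to be protected before it can be cloned, which is how layered format 2 images behaved at the time of writing.

```shell
#!/usr/bin/env bash
# Dry-run sketch of a clone-based SST. Nothing is executed against Ceph:
# each function only prints the commands it would run.
# POOL, IMAGE and the naming scheme are assumptions for illustration.

POOL="mysqlpool"
IMAGE="PXC"

# Donor side: snapshot the image (while writes are briefly blocked) and
# protect the snapshot so that it can be cloned.
donor_snapshot() {
    local snap="$1"
    echo "rbd snap create ${POOL}/${IMAGE}@${snap}"
    echo "rbd snap protect ${POOL}/${IMAGE}@${snap}"
}

# Joiner side: clone the snapshot, map the clone to a block device,
# mount it as the datadir and adjust ownership.
joiner_clone() {
    local snap="$1" clone="$2" datadir="$3"
    echo "rbd clone ${POOL}/${IMAGE}@${snap} ${POOL}/${clone}"
    echo "dev=\$(rbd map ${POOL}/${clone})"
    echo "mount -o rw,noatime,nouuid \$dev ${datadir}"
    echo "chown mysql:mysql ${datadir}"
}

donor_snapshot "1463776423"
joiner_clone "1463776423" "PXC2-1463776424" "/var/lib/mysql"
```

Printing the commands instead of running them keeps the sketch safe to try on a machine without Ceph, and makes the privileged steps (map, mount, chown) easy to spot.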

SST script

So, let’s start with the SST script. The script is available on my GitHub at:

You should install the script in the /usr/bin directory, along with the other user scripts. Once installed, I recommend:

chown root.root /usr/bin/wsrep_sst_ceph
chmod 755 /usr/bin/wsrep_sst_ceph

The script has a few parameters that can be defined in the [sst] section of the my.cnf file.

cephlocalpool: The Ceph pool where this node should create the clone. It can be a different pool from the one of the original dataset. For example, it could have a replication factor of 1 (no replication) for a read scaling node. The default value is: mysqlpool
cephmountpoint: The mount point to use. It defaults to the MySQL datadir as provided to the SST script.
cephmountoptions: The options used to mount the filesystem. The default value is: rw,noatime
cephkeyring: The Ceph keyring file used to authenticate against the Ceph cluster with cephx. The user under which MySQL is running must be able to read the file. The default value is: /etc/ceph/ceph.client.admin.keyring
cephcleanup: Whether or not the script should clean up the snapshots and clones that are no longer in use. Enable = 1, Disable = 0. The default value is: 0

Root privileges

In order to allow the SST script to perform privileged operations, I added an extra SST role: “mount”. The SST script on the joiner will call itself back with sudo and will pass “mount” for the role parameter. To allow the elevation of privileges, the following line must be added to the /etc/sudoers file:

mysql ALL=NOPASSWD: /usr/bin/wsrep_sst_ceph
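To make the call-back idea concrete, here is a minimal sketch of how a script could build the command that re-invokes itself through sudo for the “mount” role. The function name and the argument layout are assumptions for illustration and are not taken from the actual wsrep_sst_ceph script.

```shell
#!/usr/bin/env bash
# Sketch of a script that re-invokes itself through sudo for one privileged role.
# The function name and the argument layout are assumptions.

SELF="/usr/bin/wsrep_sst_ceph"

# Print the command used to run the privileged "mount" role.
# First argument: the current numeric user id; the rest is passed through.
mount_role_command() {
    local uid="$1"
    shift
    if [ "$uid" -eq 0 ]; then
        # Already root: no elevation needed.
        echo "$SELF --role mount $*"
    else
        # -n makes sudo fail instead of prompting, since an SST has no terminal.
        echo "sudo -n $SELF --role mount $*"
    fi
}

mount_role_command "$(id -u)" --datadir /var/lib/mysql
```

With the NOPASSWD entry above in place, the non-interactive sudo call succeeds for the mysql user. Without it, the call fails immediately rather than hanging on a password prompt.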

Files opened by MySQL before the SST

Upon startup, MySQL opens files at two places in the code before the SST completes. The first one is in the function mysqld_main, which sets the current working directory to the datadir (an empty directory at that point). After the SST, a block device is mounted on the datadir. The issue is that the working directory of MySQL still refers to the empty directory underneath the mount, so MySQL tries to find its files there. I wrote a simple patch, presented below, and issued a pull request:

diff --git a/sql/ b/sql/
index 90760ba..bd9fa38 100644
--- a/sql/
+++ b/sql/
@@ -5362,6 +5362,13 @@ a file name for --log-bin-index option", opt_binlog_index_name);
       }
     }
   }
+
+  /*
+   * Forcing a new setwd in case the SST mounted the datadir
+   */
+  if (my_setwd(mysql_real_data_home,MYF(MY_WME)) && !opt_help)
+    unireg_abort(1);        /* purecov: inspected */
+
   if (opt_bin_log)
   {
     /*

With this patch, I added a new my_setwd call right after the SST completed. The Percona engineering team approved the patch, and it should be added to the upcoming release of Percona XtraDB Cluster.

The Galera library is the other source of opened files before the SST. Here, the fix is just in the configuration. You must define the base_dir Galera provider option outside of the datadir. For example, if you use /var/lib/mysql as datadir and cephmountpoint, then you should use:

wsrep_provider_options="base_dir=/var/lib/galera"

Of course, if you have other provider options, don’t forget to add them there.


So, what are the steps required to use Ceph with Percona XtraDB Cluster? (I assume that you have a working Ceph cluster.)

1. Join the Ceph cluster

The first thing you need is a working Ceph cluster with the needed CephX credentials. While the setup of a Ceph cluster is beyond the scope of this post, we will address it in a subsequent post. For now, we’ll focus on the client side.

You need to install the Ceph client packages on each node. On my test servers using Ubuntu 14.04, I did:

wget -q -O- '' | sudo apt-key add -
sudo apt-add-repository 'deb trusty main'
apt-get update
apt-get install ceph

These commands also installed all the dependencies. Next, I copied the Ceph cluster configuration file /etc/ceph/ceph.conf:

[global]
fsid = 87671417-61e4-442b-8511-12659278700f
mon_initial_members = odroid1, odroid2
mon_host =,,
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_journal = /var/lib/ceph/osd/journal
osd_journal_size = 128
osd_pool_default_size = 2

and the authentication file /etc/ceph/ceph.client.admin.keyring from another node. I made sure these files were readable by all. You can define more refined privileges for a production system with CephX, the security layer of Ceph.

Once everything is in place, you can test if it is working with this command:

root@PXC3:~# ceph -s
    cluster 87671417-61e4-442b-8511-12659278700f
     health HEALTH_OK
     monmap e2: 3 mons at {odroid1=,odroid2=,serveur-famille=}
            election epoch 474, quorum 0,1,2 odroid1,odroid2,serveur-famille
     mdsmap e204: 1/1/1 up {0=odroid3=up:active}
     osdmap e995: 4 osds: 4 up, 4 in
      pgmap v275501: 1352 pgs, 5 pools, 321 GB data, 165 kobjects
            643 GB used, 6318 GB / 7334 GB avail
                1352 active+clean
  client io 16491 B/s rd, 2425 B/s wr, 1 op/s

This command reports the current state of the Ceph cluster.

2. Create the Ceph pool

Before we can use Ceph, we need to create a first RBD image, put a filesystem on it and mount it for MySQL on the bootstrap node. We need at least one Ceph pool since the RBD images are stored in a Ceph pool.  We create a Ceph pool with the command:

ceph osd pool create mysqlpool 512 512 replicated

Here, we have defined the pool mysqlpool with 512 placement groups. On a larger Ceph cluster, you might need to use more placement groups (again, a topic beyond the scope of this post). The pool we just created is replicated. Each object in the pool will have two copies as defined by the osd_pool_default_size parameter in the ceph.conf file. If needed, you can modify the size of a pool and its replication factor at any moment after the pool is created.
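As a hedged illustration of changing the replication factor after the fact, the sketch below wraps the standard `ceph osd pool get` and `ceph osd pool set` commands in a dry-run helper. The `run` and `set_pool_size` helpers and the DRY_RUN variable are assumptions made for this example. By default the commands are only printed.

```shell
#!/usr/bin/env bash
# Dry-run helper: prints the ceph commands instead of running them.
# Set DRY_RUN=0 on a host with Ceph admin credentials to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show the current replication factor of a pool, then change it.
set_pool_size() {
    local pool="$1" size="$2"
    run ceph osd pool get "$pool" size
    run ceph osd pool set "$pool" size "$size"
}

# Example: keep three copies of every object in mysqlpool.
set_pool_size mysqlpool 3
```

Raising the size triggers re-replication of existing objects, so on a busy cluster it is worth checking `ceph -s` before and after the change.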

3. Create the first RBD image

Now that we have a pool, we can create a first RBD image:

root@PXC1:~# rbd -p mysqlpool create PXC --size 10240 --image-format 2

and “map” the RBD image to a host block device:

root@PXC1:~# rbd -p mysqlpool map PXC
/dev/rbd1

The command returns the local RBD block device that corresponds to the RBD image.

The rest of the steps are not specific to RBD images. We need to create a filesystem and prepare the mount points:

mkfs.xfs /dev/rbd1
mount /dev/rbd1 /var/lib/mysql -o rw,noatime,nouuid
chown mysql.mysql /var/lib/mysql
mysql_install_db --datadir=/var/lib/mysql --user=mysql
mkdir /var/lib/galera
chown mysql.mysql /var/lib/galera

You need to mount the RBD device and run the mysql_install_db tool only on the bootstrap node. You need to create the directories /var/lib/mysql and /var/lib/galera on the other nodes and adjust the permissions similarly.

4. Modify the my.cnf files

You will need to set or adjust the specific wsrep_sst_ceph settings in the my.cnf file of all the servers. Here are the relevant lines from the my.cnf file of one of my cluster nodes:

[mysqld]
wsrep_provider=/usr/lib/
wsrep_provider_options="base_dir=/var/lib/galera"
wsrep_cluster_address=gcomm://,,
wsrep_node_address=
wsrep_sst_method=ceph
wsrep_cluster_name=ceph_cluster

[sst]
cephlocalpool=mysqlpool
cephmountoptions=rw,noatime,nodiratime,nouuid
cephkeyring=/etc/ceph/ceph.client.admin.keyring
cephcleanup=1

At this point, we can bootstrap the cluster on the node where we mounted the initial RBD image:

/etc/init.d/mysql bootstrap-pxc

5. Start the other XtraDB Cluster nodes

The first node does not perform an SST, so nothing exciting so far. With the patched version of MySQL (the above patch), starting MySQL on a second node triggers a Ceph SST operation. In my test environment, the SST takes about five seconds to complete on low-powered VMs. Interestingly, the duration is not directly related to the dataset size. Because of this, a much larger dataset, on a quiet database, should take about the same time. A very busy database may need more time, since an SST requires a “flush tables with read lock” at some point.

So, after their respective Ceph SST, the other two nodes have:

root@PXC2:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC2:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC2-1463776424 -    /dev/rbd1

root@PXC3:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC3:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC3-1464118729 -    /dev/rbd1

The original RBD image now has two snapshots, each of which is the parent of a clone mounted by one of the other two nodes:

root@PXC3:~# rbd -p mysqlpool ls
PXC
PXC2-1463776424
PXC3-1464118729
root@PXC3:~# rbd -p mysqlpool info PXC2-1463776424
rbd image 'PXC2-1463776424':
        size 10240 MB in 2560 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.108b4246146651
        format: 2
        features: layering
        flags:
        parent: mysqlpool/PXC@1463776423
        overlap: 10240 MB
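The parent line in the `rbd info` output above is what ties a clone to its snapshot. As a small sketch, the hypothetical helper below extracts that value from plain-text `rbd info` output. It assumes the output layout shown above and is fed a trimmed sample rather than a live cluster.

```shell
#!/usr/bin/env bash
# Print the parent snapshot (pool/image@snap) found in `rbd info` output on stdin.
# Prints nothing when the image is not a clone.
clone_parent() {
    awk '$1 == "parent:" { print $2 }'
}

# Trimmed sample taken from the output shown above.
printf '%s\n' \
    "rbd image 'PXC2-1463776424':" \
    "        format: 2" \
    "        parent: mysqlpool/PXC@1463776423" \
    "        overlap: 10240 MB" | clone_parent
```

On a live system the same helper can be used as `rbd -p mysqlpool info PXC2-1463776424 | clone_parent`. The reverse lookup, listing the clones of a snapshot, is available through `rbd children`.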


Apart from allowing faster SST, what other benefits do we get from using Ceph with Percona XtraDB Cluster?

The first benefit is that the inherent data duplication over the network removes the need for local data replication. Thus, instead of using RAID-10 or RAID-5 with an array of disks, we could use a simple RAID-0 stripe set if the data is already replicated to more than one server.

The second benefit is a bit less obvious: you don’t need as much storage. Why? A Ceph clone only stores the delta from its original snapshot. So, for large, read-intensive datasets, the disk space savings can be very significant. Of course, over time, the clone will drift away from its parent snapshot and will use more and more space. When we determine that a Ceph clone uses too much disk space, we can simply refresh the clone by restarting MySQL and forcing a full SST. The SST script will automatically drop the old clone and snapshot when the cephcleanup option is set, and it will create a new fresh clone. You can easily evaluate how much space is consumed by the clone using the following command:

root@PXC2:~# rbd -p mysqlpool du PXC2-1463776424
warning: fast-diff map is not enabled for PXC2-1463776424. operation may be slow.
NAME            PROVISIONED USED
PXC2-1463776424      10240M 164M
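To decide when a clone has drifted far enough to justify a refresh, the used space can be expressed as a share of the provisioned size. The helper below is a sketch that parses the plain-text `rbd du` output shown above. It assumes that both sizes carry an "M" suffix as in that sample, and the function name is made up for this example.

```shell
#!/usr/bin/env bash
# Report how much of each image's provisioned size is actually used,
# reading plain-text `rbd du` output on stdin.
# Assumes both sizes carry an "M" suffix, as in the sample output above;
# other lines (header, warnings) are ignored.
used_pct() {
    awk '$1 != "NAME" && $2 ~ /^[0-9]+M$/ && $3 ~ /^[0-9]+M$/ {
        prov = $2; used = $3
        sub(/M$/, "", prov); sub(/M$/, "", used)
        printf "%s %.1f%%\n", $1, 100 * used / prov
    }'
}

# Sample taken from the output shown above.
printf '%s\n' \
    'warning: fast-diff map is not enabled for PXC2-1463776424. operation may be slow.' \
    'NAME            PROVISIONED USED' \
    'PXC2-1463776424      10240M 164M' | used_pct
```

For the sample, 164 MB out of 10240 MB comes to roughly 1.6% of the provisioned size. On a live node the helper would be used as `rbd -p mysqlpool du PXC2-1463776424 | used_pct`.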

Also, nothing prevents you from using a different configuration of Ceph pools in the same XtraDB cluster. Therefore a Ceph clone can use a different pool than its parent snapshot. That’s the whole purpose of the cephlocalpool parameter. Strictly speaking, you only need one node to use a replicated pool, as the other nodes could run on clones that store their data in a non-replicated pool (saving a lot of storage space). Furthermore, we can define the OSD affinity of the non-replicated pool in a way that it stores data on the host where it is used, reducing the cross-node network latency.
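As a rough sketch of the non-replicated pool idea, the commands below create a second pool and reduce its replication factor to 1. The pool name mysqllocal, the placement group count and the helper functions are assumptions, the commands are only printed by default, and the OSD affinity part (a CRUSH rule) is not covered. Recent Ceph releases may ask for additional confirmation before accepting a size of 1.

```shell
#!/usr/bin/env bash
# Dry-run sketch: create a pool without replication for read-scaling nodes.
# Set DRY_RUN=0 on a host with Ceph admin credentials to execute the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create the pool, then keep a single copy of each object.
create_local_pool() {
    local pool="$1" pgs="$2"
    run ceph osd pool create "$pool" "$pgs" "$pgs" replicated
    run ceph osd pool set "$pool" size 1
    run ceph osd pool set "$pool" min_size 1
}

# "mysqllocal" and 128 placement groups are placeholder values.
create_local_pool mysqllocal 128
```

A node would then point at that pool with `cephlocalpool=mysqllocal` in the [sst] section of its my.cnf, while the bootstrap node keeps its data in the replicated pool.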

Using Ceph for XtraDB Cluster SST operation demonstrates one of the array of possibilities offered to MySQL by Ceph. We continue to work with the Red Hat team and Red Hat Ceph Storage architects to find new and useful ways of addressing database issues in the Ceph environment. There are many more posts to come, so stay tuned!

DISCLAIMER: The wsrep_sst_ceph script isn’t officially supported by Percona.

ANC loses ground in South Africa's local elections

Telepolis - Thu, 2016-08-04 22:00
The winners are the liberal DA and the extremist EFF
Categories: Politics + Culture

Asylum: Long waiting times reduce refugees' prospects of finding a job

Telepolis - Thu, 2016-08-04 22:00
Using data from Switzerland, researchers have established a causal link; even small reductions in the waiting time for a decision substantially improve integration
Categories: Politics + Culture

Terror threat in Austria: Negative

Telepolis - Thu, 2016-08-04 17:00
The threat situation is currently hard to assess, as the case of a Muslim woman with "suspicious" reading material on a British plane also demonstrates
Categories: Politics + Culture

Violence and Islam: The past-future

Telepolis - Thu, 2016-08-04 16:00
The dialogue between the poet Adonis and the psychoanalyst Houria Abdelouahed stirred debate in France and is now available in German
Categories: Politics + Culture

Iran: Economic war between the USA and France

Telepolis - Thu, 2016-08-04 14:00
The think tank United Against Nuclear Iran is said to serve as a "secret weapon" in it
Categories: Politics + Culture

Clint Eastwood votes for Donald Trump

Telepolis - Thu, 2016-08-04 12:00
That parts of the Republican party establishment are distancing themselves from their candidate could not only harm the billionaire but also benefit him
Categories: Politics + Culture

The museum in the age of its virtualizability

Telepolis - Thu, 2016-08-04 10:00
The form of the virtual - On life between two worlds
Categories: Politics + Culture