News-Feeds

Kartoffelernte würde ohne Pflanzenschutzmittel weitgehend ausfallen

Telepolis - So, 2016-08-21 22:00
Massive Ernteeinbußen durch schlechtes Wetter
Kategorien: Politik + Kultur

Die türkische Regierung fordert Solidarität nach dem Selbstmordanschlag in Gaziantep

Telepolis - So, 2016-08-21 22:00
Zwar wird der IS von Erdogan verantwortlich gemacht, aber untergemischt werden auch PKK, Gülen und ausländische Geheimdienste
Kategorien: Politik + Kultur

EU: Umverteilung der Flüchtlinge kommt nicht voran

Telepolis - So, 2016-08-21 08:00
Griechenland: Neues von den Flüchtlingen und Immigranten
Kategorien: Politik + Kultur

Amok in der Männerdämmerung

Telepolis - Sa, 2016-08-20 22:00
Männlichkeitswahn ist sicher eine Reaktion auf die Ahnung, dass das Patriarchat weltweit in die Krise gekommen ist - aber das ist nicht alles
Kategorien: Politik + Kultur

Türkei: Umstrukturierung in den kurdischen Gebieten

Telepolis - Sa, 2016-08-20 22:00
Das türkische Parlament will nächste Woche die gesetzliche Grundlage dafür schaffen, oppositionelle Bürgermeister und Bezirksräte ihres Amtes zu entheben
Kategorien: Politik + Kultur

Keine Tiere essen - stattdessen Bohnen, Erbsen, Linsen

Telepolis - Sa, 2016-08-20 08:00
Hülsenfrüchte in der Ernährung: Leguminosen erweisen sich als bedeutende Eiweißquellen für Mensch und Tier
Kategorien: Politik + Kultur

Top Most Overlooked MySQL Performance Optimizations: Q & A

MySQL High Performance - Fr, 2016-08-19 23:34

Thank you for attending my 22nd July 2016 webinar titled “Top Most Overlooked MySQL Performance Optimizations“. In this blog, I will provide answers to the Q & A for that webinar.

For hardware, which disk raid level do you suggest? Is raid5 suggested performance-wise and data-integrity-wise?
RAID 5 comes with high overhead, as each write turns into a sequence of four physical I/O operations, two reads and two writes. We know that RAID 5s have some write penalty, and it could affect the performance on spindle disks. In most cases, we advise using alternative RAID levels. Use RAID 5 when disk capacity is more important than performance (e.g., archive databases that aren’t used often). Since write performance isn’t a problem in the case of SSD, but capacity is expensive, RAID 5 can help by wasting less disk space.

Regarding collecting table statistics, do you have any suggestions for analyzing large tables (over 300GB) since we had issues with MySQL detecting the wrong cardinality?
MySQL optimizer makes decisions about the execution plan (EXPLAIN), and statistics are re-estimated automatically and can be re-estimated explicitly when one calls the ANALYZE TABLE statement for the table, or the OPTIMIZE TABLE statement for InnoDB tables (which rebuilds the table and then performs an ANALYZE for the table).

When MySQL optimizer is not picking up the right index in EXPLAIN, it could be caused by outdated or wrong statistics (optimizer bugs aside). So, when you optimize the table you rebuild it so data are stored in a more compact way (assuming they changed a lot in the past) and then you re-estimate statistics based on some random sample pages checked in the table. As a result, you come up with statistics that are more correct for the data you have at the moment. This allows optimizer to get a better plan. When an explicit hint is added, you reduce possible choices for optimizer and it can use a good enough plan even with wrong statistics.

If you use versions 5.6.x and 5.7.x, and InnoDB tables, there is a way to store/fix statistics when the plans are good.  Using Persistent Optimizer Statistics prevents it from changing automatically. It’s recommended you run ANALYZE TABLE to calculate statistics (if really needed) during off peak time and make sure the table in question is not in use. Check this blogpost too.

Regarding the buffer pool, when due you think using multiple buffer pool instances make sense?
Multiple InnoDB buffer pools were introduced in MySQL 5.5, and the default value for it was 1. Now, the default value in MySQL 5.6 is 8. Enabling innodb_buffer_pool_instances is useful in highly concurrent workloads as it may reduce contention of the global mutexes. innodb_buffer_pool_instances helps to improve scalability in multi-core machines and having multiple buffer pools means that access to the buffer pool splits across all instances. Therefore, no single mutex controls the access pattern.

innodb_buffer_pool_instances only takes effect when set to 1GB (at minimum), and the total specified size for innodb_buffer_pool  is divided among all the buffer pool instances. Further, setting the innodb_buffer_pool_instances parameter is not a dynamic option, so it requires a server restart to take effect.

What do you mean “PK is appended to secondary index”
In InnoDB, secondary indexes are stored along with their corresponding primary key values. InnoDB uses this primary key value to search for the row in the clustered index. So, primary keys are implicitly added with secondary keys.

About Duplicate Keys, if I have a UNIQUE KEY on two columns, is it ok then to set a key for each of these columns also? Or should I only keep the unique key on the columns and get rid of regular key on each column also?
As I mentioned during the talk, for composite index the leftmost prefix is used. For example, If you have a UNIQUE INDEX on columns A,B as (A,B), then this index is not used for lookup for the query below:

SELECT * FROM test WHERE B='xxx';

For that query, you need a separate index on B column.

On myisam settings, doesn’t the MySQL and information_schema schemas require myisam? If so, are any settings more than default needing to be changed?
performance_schema uses the PERFORMANCE_SCHEMA storage engine, so only MySQL system database tables use the MyISAM engine. The MySQL system database is not used much and usually default settings for MyISAM engine are fine.

Will functions make my app slow compare than query?
I’m not sure how you’re comparing “queries” versus “stored functions.” Functions also need to transform, similar to the query execution plan. But it might be slower compare to well-coded SQL, even with the overhead of copying the resulting data set back to the client. Typically, functions have many SQL queries. The trade-off is that this does increase the load on the database server because more of the work is done on the server side and less is done on the client (application) side.

Foreign key will make my fetches slower?
MySQL enforces referential integrity (which ensures data consistency between related tables) via foreign keys for the InnoDB storage engine. There could be some overhead of this for the INSERT/UPDATE/DELETE foreign key column, which has to check if the value exists in a related column of other tables. There could be some overhead for this, but again it’s an index lookup so the cost shouldn’t be high. However, locking overhead comes into play as well. This blogpost from our CEO is informative on this topic. This especially affect writes, but I don’t think FK fetches slower for SELECT as it’s an index lookup.

Large pool size can have a negative impact to performance? About 62GB of pool size?
The InnoDB buffer pool is by far the most important option for InnoDB Performance, as it’s the main cache for data and indexes and it must be set correctly. Setting it large enough (i.e., larger than your dataset) shouldn’t cause any problems as long as you leave enough memory for OS needs and for MySQL buffers (e.g., sort buffer, join buffer, temporary tables, etc.).

62GB doesn’t necessarily mean a big InnoDB buffer pool. It depends on how much memory your MySQL server contains, and what the size of your total InnoDB dataset is. A good rule of thumb is to set the InnoDB buffer pool size as large as possible, while still leaving enough memory for MySQL buffers and for OS.

You find duplicate, redundant indexes by looking at information_schema.key_column_usage directly?
The key_column_usage view provides information about key columns constraints. It doesn’t provide information about duplicate or redundant indexes.

Can you find candidate missing indexes by looking at the slow query log?
Yes, as I mentioned you can find unused indexes by enabling log_queries_not_using_indexes. It writes to slow_query_log. You can also enable the user_statistics feature which adds several information_schema tables, and you can find un-used indexes with the help of user_statistics. pt-index-usage is yet another tool from Percona toolkit for this purpose. Also, check this blogpost on this topic.

How to find the unused indexes? They also have an impact on performance.
Unused indexes can be found with the help of the pt-index-usage tool from Percona toolkit as mentioned above. If you are using Percona Server, you can also use User Statistics feature. Check this blogpost from my colleague, which shows another technique to find unused indexes.

As far as I understand, MIXED will automatically use ROW for non-deterministic and STATEMENT for deterministic queries. I’ve been using it for years now without any problems. So why this recommendation of ROW?

In Mixed Mode, MySQL uses statement-based replication for most queries, switching to row-based replication only when statement-based replication would cause an inconsistency. We recommend ROW-based logging because it’s efficient and performs better as it requires less row locks. However, RBR can generate more data if a DML query affects many rows and a significant amount of data needs to be written to the binary log (and you can configure binlog_row_image parameter to control the amount of logging). Also, make sure you have good network bandwidth between master/slave(s) for RBR, as it needs to send more data to slaves. Another important thing to get best of the performance with ROW-based replication is to make sure all your database tables contain a Primary Key or Unique Key (because of this bug http://bugs.mysql.com/bug.php?id=53375).

Can you give a brief overview of sharding…The pros and cons also.
With Sharding, database data is split into multiple databases with each shard storing a subset of data. Sharding is useful to scale writes if you have huge dataset and a single server can’t handle amount of writes.

Performance and throughput could be better with sharding. On the other had, it requires lots of development and administration efforts. The application needs to be aware of the shards and keep track of which data is stored in which shard. You can use MySQL Fabric framework to manage farms of MySQL Servers. Check for details in the manual.

Why not mixed replication mode instead of row-based replication ?
As I mentioned above, MIXED uses a STATEMENT-based format by default, and converts to ROW-based replication format for non-deterministic queries. But ROW-based format is recommended as there could still be cases where MySQL fails to detect non-deterministic query behavior and replicates in a STATEMENT-based format.

Can you specify a few variables which could reduce slave lag?
Because of the single-threaded nature of MySQL (until MySQL 5.6), there is always a chance that a MySQL slave can lag from the master. I would suggest considering the below parameters to avoid slave lag:

  • innodb_flush_log_at_trx_commit <> 1, Either set it t or 0 however, it could cause you 1 second of data loss in case of crash.
  • innodb_flush_method = O_DIRECT, for unix like operating system O_DIRECT is recommended to avoid double buffering. If your InnoDB data and log files are located on SAN then O_DIRECT is probably not good choice.
  • log_bin = 0, Disable binary logging (if enabled) to minimize extra Disk IO.
  • sync_binlog = 0, Disable sync_binlog.

Those above parameters would definitely help to minimize slave lag. However, along with that make sure your slave(s) hardware is as strong as the master. Make sure your read queries are fast enough. Don’t overload slave to much, and distribute read traffic evenly between slave(s). Also, you should have the same table definitions on slave(s) as the master (e.g., master server indexes must exists on slave(s) tables too). Last but not least, I wrote a blogpost on how to diagnose and cure replication lag. It might be useful for further reading.

Ende der privaten Gefängnisse in den USA

Telepolis - Fr, 2016-08-19 22:00
Justizministerium erkennt, dass staatliche Einrichtungen besser für die Gefangenen sind
Kategorien: Politik + Kultur

Der Putsch nach dem Putsch

Telepolis - Fr, 2016-08-19 22:00
Was wissen wir über die türkische Putschnacht, über Akteure und Hintergründe? Und welche Bedeutung haben die Säuberungsaktionen der AKP für die Zukunft des Landes?
Kategorien: Politik + Kultur

Percona Server 5.5.51-38.1 is now available

MySQL High Performance - Fr, 2016-08-19 15:58

Percona announces the release of Percona Server 5.5.51-38.1 on August 19, 2016. Based on MySQL 5.5.51, including all the bug fixes in it, Percona Server 5.5.51-38.1 is now the current stable release in the 5.5 series.

Percona Server is open-source and free. You can find release details of the release in the 5.5.51-38.1 milestone on Launchpad. Downloads are available here and from the Percona Software Repositories.

Bugs Fixed:
  • PAM Authentication Plugin would abort authentication while checking UNIX user group membership if there were more than a thousand members. Bug fixed #1608902.
  • PAM Authentication Plugin didn’t support spaces in the UNIX user group names. Bug fixed #1544443.
  • If DROP DATABASE would fail to delete some of the tables in the database, the partially-executed command is logged in the binlog as DROP TABLE t1, t2, ... for the tables for which drop succeeded. A slave might fail to replicate such DROP TABLE statement if there exist foreign key relationships to any of the dropped tables and the slave has a different schema from the master. Fixed by checking, on the master, whether any of the database to be dropped tables participate in a Foreign Key relationship, and fail the DROP DATABASE statement immediately. Bug fixed #1525407 (upstream #79610).
  • Percona Server 5.5 could not be built with the -DMYSQL_MAINTAINER_MODE=ON option. Bug fixed #1590454.
  • In the client library, any EINTR received during network I/O was not handled correctly. Bug fixed #1591202 (upstream #82019).
  • The included .gitignore in the percona-server source distribution had a line *.spec, which means someone trying to check in a copy of the percona-server source would be missing the spec file required to build the RPM packages. Bug fixed #1600051.
  • The fix for bug #1341067 added a call to free some of the heap memory allocated by OpenSSL. This was not safe for repeated calls if OpenSSL is linked twice through different libraries and each is trying to free the same. Bug fixed #1604676.
  • If the changed page bitmap redo log tracking thread stops due to any reason, then shutdown will wait for a long time for the log tracker thread to quit, which it never does. Bug fixed #1606821.
  • Performing slow InnoDB shutdown (innodb_fast_shutdown set to 0) could result in an incomplete purge, if a separate purge thread is running (which is a default in Percona Server). Bug fixed #1609364.
  • Due to security reasons ld_preload libraries can now only be loaded from the system directories (/usr/lib64, /usr/lib) and the MySQL installation base directory.
Other bugs fixed:

#1515591 (upstream #79249), #1612551, #1609523, #756387, #1097870, #1603073, #1606478, #1606572, #1606782, #1607224, #1607359, #1607606, #1607607, #1607671, #1608385, #1608424, #1608437, #1608515, #1608845, #1609422, #1610858, #1612084, #1612118, and #1613641.

Find the release notes for Percona Server 5.5.51-38.1 in our online documentation. Report bugs on the launchpad bug tracker.

Flüchtlinge: So konservativ wie die deutsche Gesellschaft

Telepolis - Fr, 2016-08-19 08:00
Studien zu Migranten: Vielleicht sollte die Blickrichtung mal umgedreht und die Integrationsfähigkeit der deutschen Gesellschaft erforscht werden
Kategorien: Politik + Kultur

Gerüchteküche: Hat das Pentagon Atomwaffen aus Incirlik nach Rumänien verlegt?

Telepolis - Fr, 2016-08-19 06:00
Russland verstärkt Truppen an der Grenze zur Ukraine, melden Westmedien aufgeregt
Kategorien: Politik + Kultur

Trump holt Breitbart-Chef ins Team

Telepolis - Do, 2016-08-18 22:00
Entscheidung könnte erneuten Strategiewechsel bedeuten
Kategorien: Politik + Kultur

Bommi Baumanns finale Pointe

Telepolis - Do, 2016-08-18 22:00
Ein talentierter Geschichtenerzähler hat den Löffel abgegeben. Es gibt nichts mehr zu lachen. Nicht nur ich bedaure das. Die Leber nimmt es eben übel, wenn man sie zu sehr malträtiert
Kategorien: Politik + Kultur

ProxySQL 1.2.1 GA Release

MySQL High Performance - Do, 2016-08-18 17:55

The GA release of ProxySQL 1.2.1 is available. You can get it from https://github.com/sysown/proxysql/releases. There are also Docker images for Release 1.2.1: https://hub.docker.com/r/percona/proxysql/.

ProxySQL is a high-performance proxy, currently for MySQL and its forks (like Percona Server and MariaDB). It acts as an intermediary for client requests seeking resources from the database. ProxySQL was created for DBAs by René Cannaò, as a means of solving complex replication topology issues.

This post is published with René’s approval. René is busy implementing more new ProxySQL features, so I decided to make this announcement!

Release highlights:
  • Support for backend SSL connections
  • Support for encrypted password  (mysql_users table now supports both plain text password and hashed password, in the same format of mysql.user.password)
  • Improved monitoring module
  • Better integration with Percona XtraDB Cluster
    • New feature: the Scheduler, that allows the extension of ProxySQL with external scripts

The last point is especially important in conjunction with our recent Percona XtraDB Cluster 5.7 RC1 release. When we ship Percona XtraDB Cluster 5.7 GA, we plan to make ProxySQL the default proxy solution choice for Percona XtraDB Cluster. ProxySQL is aware of the cluster and node status, and can direct traffic appropriately.

ProxySQL 1.2.1 comes with additional scripts to support this:

ProxySQL 1.2.1 and these scripts are compatible with existing Percona XtraDB Cluster 5.6 GA releases.

ProxySQL 1.2.1 is a solid release, currently used by many demanding high-performance production workloads – it is already battle tested! Give it a try if you are looking for a proxy solution.

ProxySQL is available under OpenSource license GPLv3, which allows you unlimited usage in production. ProxySQL has no plans to change the license!

Jemen: Ignorierte Zerstörung

Telepolis - Do, 2016-08-18 06:00
Der Krieg im Jemen ist hauptsächlich eine einseitige und Zerstörung unter der Federführung Saudi-Arabiens, die im Schatten der Weltöffentlichkeit weiterhin fortgeführt wird
Kategorien: Politik + Kultur

Spaniens dreckige Klimapolitik

Telepolis - Mi, 2016-08-17 22:16
Während die Kohle weltweit an Bedeutung verliert, nahm ihr Einsatz in Spanien um fast 24% zu, das sich von der versprochenen Dekarbonisierung entfernt
Kategorien: Politik + Kultur

Endet die Migrationsroute jetzt in Südbaden?

Telepolis - Mi, 2016-08-17 22:00
Seit das Wetter im Mittelmeer nördlich von Libyen weniger stürmisch ist, nehmen die Zahlen der über das Mittelmeer und Italien nach Norden Wandernden wieder deutlich zu
Kategorien: Politik + Kultur

Gespräche über grenzüberschreitenden Terror in Kaschmir?

Telepolis - Mi, 2016-08-17 22:00
Staatsführer Indiens und Pakistans provozieren sich gegenseitig
Kategorien: Politik + Kultur

TokuDB/PerconaFT fragmented data file performance improvements

MySQL High Performance - Mi, 2016-08-17 17:05

In this blog post, we’ll discuss how we’ve improved TokuDB and PerconaFT fragmented data file performance.

Through our internal benchmarking and some user reports, we have found that with long term heavy write use TokuDB/PerconaFT performance can degrade significantly on large data files. Using smaller node sizes makes the problem worse (which is one of our performance tuning recommendations when you have faster storage). The problem manifests as low CPU utilization, a drop in overall TPS and high client response times during prolonged checkpointing.

This post explains a little about how PerconaFT structures dictionary files and where the current implementation breaks down. Hopefully, it explains the nature of the issue, and how our solution helps addresses it. It also provides some contrived benchmarks that prove the solution.

PerconaFT map file disk format

NOTE. This post uses the terms index, data file, and dictionary are somewhat interchangeable. We will use the PerconaFT term “dictionary” to refer specifically to a PerconaFT key/value data file.

PerconaFT stores every dictionary in its own data file on disk. TokuDB stores each index in a PerconaFT dictionary, plus one additional dictionary per table for some metadata. For example, if you have one TokuDB table with two secondary indices, you would have four data files or dictionaries: one small metadata dictionary for the table, one dictionary for the primary key/index, and one for each secondary index.

Each dictionary file has three major parts:

  • Two headers (yes, two) made up of various bits of metadata, file versions, a checkpoint logical sequence number (CLSN), the offset of this headers block translation table, etc…
  • Two (yes, two, one per header) block translation tables (BTT) that maps block numbers (BNs) to the physical offsets and sizes of the data blocks within the file.
  • Data blocks and holes (unused space). Unlike InnoDB, PerconaFT data blocks (nodes) are variable sizes and can be any size from a minimum of a few bytes for an empty internal node all the way up to the block size defined when the tree created (4MB by default if we don’t use compression) and anywhere in between, depending on the amount of data within that node.

Each dictionary file contains two versions of the header stored on disk, and only one is valid at any given point in time. Since we fix the size of the header structure, we always know their locations. The first at offset zero, the other is immediately after the first. The header that is currently valid is the header with the later/larger CLSN.

We write the header and the BTT to disk during a checkpoint or when a dictionary is closed (the only time we do so). The header overwrites the older header (the one with the older CLSN) on disk. From that moment onward, the disk space used by the previous version of the dictionary (the whole thing, not just the header) that is not also used by the latest version, is considered immediately free.

There is much more magic to how the PerconaFT does checkpoint and consistency, but that is really out of the scope of this post. Maybe a later post that addresses the sharp checkpoint of the PerconaFT can dive into this.

The block allocator

The block allocator is the algorithm and container that manages the list of known used blocks and unused holes within an open dictionary file. When a node gets written, it is the responsibility of the block allocator to find a suitable location in the file for the nodes data. It is always placed into a new block, never overwrites an existing block (except for reclaimed block space from blocks that are removed or moved and recorded during the last checkpoint). Conversely, when a node gets destroyed it is the responsibility of the block allocator to release that used space and create a hole out of the old block. That hole also must be merged with any other holes that are adjacent to it to have a record of just one large hole rather than a series of consecutive smaller holes.

Fragmentation and large files

The current implementation of the PerconaFT block allocator maintains a simple array of used blocks in memory for each open dictionary. The used blocks are ordered ascending by their offset in the file. The holes between the blocks are calculated by knowing the offset and size of the two bounding blocks. For example, one can calculate the hole offset and size between two adjacent blocks as: b[n].offset + b[n].size and b[n+1].offset – (b[n].offset + b[n].size), respectively.

To find a suitable hole to place node data, the current block allocator starts at the first block in the array. It iterates through the blocks looking for a hole between blocks that is large enough to hold the nodes data. Once we find a hole, we cut the space needed for the node out of the hole and the remainder is left as a hole for another block to possibly use later.

Note. Forcing alignment to 512 offsets for direct I/O has overhead, regardless if direct I/O is used or not.

This linear search severely degrades the PerconaFT performance for very large and fragmented dictionary files. We have some solid evidence from the field that this does occur. We can see it via various profiling tools as a lot of time spent within block_allocator_strategy::first_fit. It is also quite easy to create a case by using very small node (block) sizes and small fanouts (forces the existence of more nodes, and thus more small holes). This fragmentation can and does cause all sorts of side effects as the search operation locks the entire structure within memory. It blocks nodes from translating their node/block IDs into file locations.

Let’s fix it…

In this block storage paradigm, fragmentation is inevitable. We can try to dance around and propose different ways to prevent fragmentation (at the expense of higher CPU costs, online/offline operations, etc…). Or, we can look at the way the block allocator works and try to make it more efficient. Attacking the latter of the two options is a better strategy (not to say we aren’t still actively looking into the former).

Tree-based “Max Hole Size” (MHS) lookup

The linear search block allocator has no idea where bigger and smaller holes might be located within the set (a core limitation). It must use brute force to find a hole big enough for the data it needs to store. To address this, we implemented a new in-memory, tree-based algorithm (red-black tree). This replaces the current in-memory linear array and integrates the hole size search into the tree structure itself.

In this new block allocator implementation, we store the set of known in-use blocks within the node structure of a binary tree instead of a linear array. We order the  tree by the file offset of the blocks. We then added a little extra data to each node of this new tree structure. This data tells us the maximum hole we can expect to find in each child subtree. So now when searching for a hole, we can quickly drill down the tree to find an available hole of the correct size without needing to perform a fully linear scan. The trade off is that merging holes together and updating the parental max hole sizes is slightly more intricate and time-consuming than in a linear structure. The huge improvement in search efficiency makes this extra overhead pure noise.

We can see in this overly simplified diagram, we have five blocks:

  • offset 0 : 1 byte
  • offset 3 : 2 bytes
  • offset 6 : 3 bytes
  • offset 10 : 5 bytes
  • offset 20 : 8 bytes

We can calculate four holes in between those blocks:

  • offset 1 : 2 bytes
  • offset 5 : 1 byte
  • offset 9 : 1 byte
  • offset 15 : 5 bytes

We see that the search for a 4-byte hole traverses down the right side of the tree. It discovers a hole at offset 15. This hole is a big enough for our 4 bytes. It does this without needing to visit the nodes at offsets 0 and 3. For you algorithmic folks out there, we have gone from an O(n) to O(log n) search. This is tremendously more efficient when we get into severe fragmentation states. A side effect is that we tend to allocate blocks from holes closer to the needed size rather than from the first one big enough to fit. The small hole fragmentation issue may actually increase over time, but that has yet to be seen in our testing.

Benchmarks

As our CTO Vadim Tkachenko asserts, there are “Lies, Damned Lies and Benchmarks.” We’re going to show a simple test case where we thought, “What is the worst possible scenario that I can come up with in a small-ish benchmark to show the differences?”. So, rather than try and convince you using some pseudo-real-world benchmark that uses sleight of hand, I’m telling you up front that this example is slightly absurd, but pushes the issue to the foreground.

That scenario is actually pretty simple. We shape the tree to have as many nodes as possible, and intentionally use settings that reduce concurrency. We will use a standard sysbench OLTP test, and run it for about three hours after the prepare stage has completed:

  • Hardware:
    • Intel i7, 4 core hyperthread (8 virtual cores) @ 2.8 GHz
    • 16 GB of memory
    • Samsung 850 Pro SSD
  • Sysbench OLTP:
    • 1 table of 160M rows or about 30GB of primary key data and 4GB secondary key data
    • 24 threads
    • We started each test server instance with no data. Then we ran the sysbench prepare, then the sysbench run with no shutdown in between the prepare and run.
    • prepare command : /data/percona/sysbench/sysbench/sysbench –test=/data/percona/sysbench/sysbench/tests/db/parallel_prepare.lua –mysql-table-engine=tokudb –oltp-tables-count=1 –oltp-table-size=160000000 –mysql-socket=$(PWD)/var/mysql.sock –mysql-user=root –num_threads=1 run
    • run command : /data/percona/sysbench/sysbench/sysbench –test=/data/percona/sysbench/sysbench/tests/db/oltp.lua –mysql-table-engine=tokudb –oltp-tables-count=1 –oltp-table-size=160000000 –rand-init=on –rand-type=uniform –num_threads=24 –report-interval=30 –max-requests=0 –max-time=10800 –percentile=99 –mysql-socket=$(PWD)/var/mysql.sock –mysql-user=root run
  • mysqld/TokuDB configuration
    • innodb_buffer_pool_size=5242880
    • tokudb_directio=on
    • tokudb_empty_scan=disabled
    • tokudb_commit_sync=off
    • tokudb_cache_size=8G
    • tokudb_checkpointing_period=300
    • tokudb_checkpoint_pool_threads=1
    • tokudb_enable_partial_eviction=off
    • tokudb_fsync_log_period=1000
    • tokudb_fanout=8
    • tokudb_block_size=8K
    • tokudb_read_block_size=1K
    • tokudb_row_format=tokudb_uncompressed
    • tokudb_cleaner_period=1
    • tokudb_cleaner_iterations=10000

So as you can see: amazing results, right? Sustained throughput, immensely better response time and better utilization of available CPU resources. Of course, this is all fake with a tree shape that no sane user would implement. It illustrates what happens when the linear list contains small holes: exactly what we set out to fix!

Closing

Look for this improvement to appear in Percona Server 5.6.32-78.0 and 5.7.14-7. It’s a good one for you if you have huge TokuDB data files with lots and lots of nodes.

Credits!

Throughout this post, I referred to “we” numerous times. That “we” encompasses a great many people that have looked into this in the past and implemented the current solution. Some are current and former Percona and Tokutek employees that you may already know by name. Some are newer at Percona. I got to take their work and research, incorporate it into the current codebase, test and benchmark it, and report it here for all to see. Many thanks go out to Jun Yuan, Leif Walsh, John Esmet, Rich Prohaska, Bradley Kuszmaul, Alexey Stroganov, Laurynas Biveinis, Vlad Lesin, Christian Rober and others for all of the effort in diagnosing this issue, inventing a solution, and testing and reviewing this change to the PerconaFT library.