News-Feeds

How Apache Spark makes your slow MySQL queries 10x faster (or more)

MySQL High Performance - Wed, 2016-08-17 15:26

In this blog post, we’ll discuss how to improve the performance of slow MySQL queries using Apache Spark.

Introduction

In my previous blog post, I wrote about using Apache Spark with MySQL for data analysis and showed how to transform and analyze a large volume of data (text files) with Apache Spark. Vadim also performed a benchmark comparing the performance of MySQL and Spark with the Parquet columnar format (using Air traffic performance data). That works great, but what if we don’t want to move our data from MySQL to another storage (i.e., a columnar format), and instead want to run “ad hoc” queries on top of an existing MySQL server? Apache Spark can help here as well.

TL;DR version:

Using Apache Spark on top of the existing MySQL server(s) (without the need to export or even stream data to Spark or Hadoop), we can increase query performance more than ten times. Using multiple MySQL servers (replication or Percona XtraDB Cluster) gives us an additional performance increase for some queries. You can also use the Spark cache function to cache the whole MySQL query results table.

The idea is simple: Spark can read MySQL data via JDBC and can also execute SQL queries, so we can connect it directly to MySQL and run the queries. Why is this faster? For long running (i.e., reporting or BI) queries, it can be much faster as Spark is a massively parallel system. MySQL can only use one CPU core per query, whereas Spark can use all cores on all cluster nodes. In my examples below, MySQL queries are executed inside Spark and run 5-10 times faster (on top of the same MySQL data).

In addition, Spark can add “cluster” level parallelism. In the case of MySQL replication or Percona XtraDB Cluster, Spark can split the query into a set of smaller queries (in the case of a partitioned table, for example, it will run one query per partition) and run those in parallel across multiple slave servers or multiple Percona XtraDB Cluster nodes. Finally, it will use map/reduce-style processing to aggregate the results.

I’ve used the same “Airlines On-Time Performance” database as in previous posts. Vadim created some scripts to download data and upload it to MySQL. You can find the scripts here: https://github.com/Percona-Lab/ontime-airline-performance. I’ve also used Apache Spark 2.0, which was released July 26, 2016.

Apache Spark Setup

Starting Apache Spark in standalone mode is easy. To recap:

  1. Download Apache Spark 2.0 and place it somewhere.
  2. Start the master.
  3. Start the slave (worker) and attach it to the master.
  4. Start the app (in this case spark-shell or spark-sql).

Example:

root@thor:~/spark# ./sbin/start-master.sh
less ../logs/spark-root-org.apache.spark.deploy.master.Master-1-thor.out
15/08/25 11:21:21 INFO Master: Starting Spark master at spark://thor:7077
15/08/25 11:21:21 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/08/25 11:21:21 INFO MasterWebUI: Started MasterWebUI at http://10.60.23.188:8080
root@thor:~/spark# ./sbin/start-slave.sh spark://thor:7077

To connect to Spark we can use spark-shell (Scala), pyspark (Python) or spark-sql. Since spark-sql is similar to the MySQL CLI, using it would be the easiest option (even “show tables” works). I also wanted to work with Scala in interactive mode, so I used spark-shell as well. In all the examples I’m using the same SQL query in MySQL and Spark, so working with Spark is not that different.

To work with a MySQL server in Spark we need Connector/J for MySQL. Download the package and copy mysql-connector-java-5.1.39-bin.jar to the Spark directory, then add the class path to conf/spark-defaults.conf:

spark.driver.extraClassPath = /usr/local/spark/mysql-connector-java-5.1.39-bin.jar
spark.executor.extraClassPath = /usr/local/spark/mysql-connector-java-5.1.39-bin.jar
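
Alternatively (a hedged aside, not from the original post), you can pass the Connector/J JAR when launching the shell instead of editing spark-defaults.conf; depending on the deploy mode you may still need the extraClassPath settings above, so treat this as a sketch:

# pass the JDBC driver JAR at launch time (path assumed to match the copy step above)
./bin/spark-shell --jars /usr/local/spark/mysql-connector-java-5.1.39-bin.jar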

Running MySQL queries via Apache Spark

For this test I used one physical server with 12 CPU cores (an older Intel(R) Xeon(R) CPU L5639 @ 2.13GHz), 48GB of RAM and SSD disks. I installed MySQL and started the Spark master and a Spark slave on the same box.

Now we are ready to run MySQL queries inside Spark. First, start the shell (from the Spark directory, /usr/local/spark in my case):

$ ./bin/spark-shell --driver-memory 4G --master spark://server1:7077

Then we will need to connect to MySQL from Spark and register the temporary view:

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=",
      "dbtable" -> "ontime.ontime_part",
      "fetchSize" -> "10000",
      "partitionColumn" -> "yeard",
      "lowerBound" -> "1988",
      "upperBound" -> "2016",
      "numPartitions" -> "28"
  )).load()
jdbcDF.createOrReplaceTempView("ontime")

So we have created a “datasource” for Spark (in other words, a “link” from Spark to MySQL). The Spark table name is “ontime” (linked to the MySQL ontime.ontime_part table), and we can run SQL queries in Spark, which in turn are parsed and translated into MySQL queries.

“partitionColumn” is very important here. It tells Spark to run multiple queries in parallel, one query per partition.
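
A quick way to confirm how many parallel queries Spark will issue (my own sketch, not part of the original post) is to check the number of partitions of the DataFrame, since each JDBC partition becomes one MySQL query:

// each JDBC partition corresponds to one MySQL query run in parallel
jdbcDF.rdd.getNumPartitions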

Now we can run the query:

val sqlDF = sql("""select min(year), max(year) as max_year, Carrier, count(*) as cnt,
    sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
    round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
  FROM ontime
  WHERE DayOfWeek not in (6,7)
    and OriginState not in ('AK', 'HI', 'PR', 'VI')
    and DestState not in ('AK', 'HI', 'PR', 'VI')
    and (origin = 'RDU' or dest = 'RDU')
  GROUP by carrier
  HAVING cnt > 100000 and max_year > '1990'
  ORDER by rate DESC, cnt desc
  LIMIT 10""")
sqlDF.show()

MySQL Query Example

Let’s go back to MySQL for a second and look at the query example. I’ve chosen the following query (from my older blog post):

select min(year), max(year) as max_year, Carrier, count(*) as cnt,
       sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
       round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
FROM ontime
WHERE DayOfWeek not in (6,7)
  and OriginState not in ('AK', 'HI', 'PR', 'VI')
  and DestState not in ('AK', 'HI', 'PR', 'VI')
GROUP by carrier
HAVING cnt > 100000 and max_year > '1990'
ORDER by rate DESC, cnt desc
LIMIT 10

The query finds the total number of delayed flights per airline. In addition, the query calculates a smart “ontime” rating, taking into consideration the number of flights (we do not want to compare smaller air carriers with the large ones, and we want to exclude older airlines that are not in business anymore).

The main reason I’ve chosen this query is that it is hard to optimize in MySQL. All the conditions in the “where” clause still leave ~70% of the rows (they filter out only ~30%). I’ve done a basic calculation:

mysql> select count(*) FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI');
+-----------+
| count(*)  |
+-----------+
| 108776741 |
+-----------+

mysql> select count(*) FROM ontime;
+-----------+
| count(*)  |
+-----------+
| 152657276 |
+-----------+

mysql> select round((108776741/152657276)*100, 2);
+-------------------------------------+
| round((108776741/152657276)*100, 2) |
+-------------------------------------+
|                               71.26 |
+-------------------------------------+

Table structure:

CREATE TABLE `ontime_part` (
  `YearD` int(11) NOT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `MonthD` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date DEFAULT NULL,
  `UniqueCarrier` char(7) DEFAULT NULL,
  `AirlineID` int(11) DEFAULT NULL,
  `Carrier` char(2) DEFAULT NULL,
  `TailNum` varchar(50) DEFAULT NULL,
  ...
  `id` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`,`YearD`),
  KEY `covered` (`DayOfWeek`,`OriginState`,`DestState`,`Carrier`,`YearD`,`ArrDelayMinutes`)
) ENGINE=InnoDB AUTO_INCREMENT=162668935 DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (YearD)
(PARTITION p1987 VALUES LESS THAN (1988) ENGINE = InnoDB,
 PARTITION p1988 VALUES LESS THAN (1989) ENGINE = InnoDB,
 PARTITION p1989 VALUES LESS THAN (1990) ENGINE = InnoDB,
 PARTITION p1990 VALUES LESS THAN (1991) ENGINE = InnoDB,
 PARTITION p1991 VALUES LESS THAN (1992) ENGINE = InnoDB,
 PARTITION p1992 VALUES LESS THAN (1993) ENGINE = InnoDB,
 PARTITION p1993 VALUES LESS THAN (1994) ENGINE = InnoDB,
 PARTITION p1994 VALUES LESS THAN (1995) ENGINE = InnoDB,
 PARTITION p1995 VALUES LESS THAN (1996) ENGINE = InnoDB,
 PARTITION p1996 VALUES LESS THAN (1997) ENGINE = InnoDB,
 PARTITION p1997 VALUES LESS THAN (1998) ENGINE = InnoDB,
 PARTITION p1998 VALUES LESS THAN (1999) ENGINE = InnoDB,
 PARTITION p1999 VALUES LESS THAN (2000) ENGINE = InnoDB,
 PARTITION p2000 VALUES LESS THAN (2001) ENGINE = InnoDB,
 PARTITION p2001 VALUES LESS THAN (2002) ENGINE = InnoDB,
 PARTITION p2002 VALUES LESS THAN (2003) ENGINE = InnoDB,
 PARTITION p2003 VALUES LESS THAN (2004) ENGINE = InnoDB,
 PARTITION p2004 VALUES LESS THAN (2005) ENGINE = InnoDB,
 PARTITION p2005 VALUES LESS THAN (2006) ENGINE = InnoDB,
 PARTITION p2006 VALUES LESS THAN (2007) ENGINE = InnoDB,
 PARTITION p2007 VALUES LESS THAN (2008) ENGINE = InnoDB,
 PARTITION p2008 VALUES LESS THAN (2009) ENGINE = InnoDB,
 PARTITION p2009 VALUES LESS THAN (2010) ENGINE = InnoDB,
 PARTITION p2010 VALUES LESS THAN (2011) ENGINE = InnoDB,
 PARTITION p2011 VALUES LESS THAN (2012) ENGINE = InnoDB,
 PARTITION p2012 VALUES LESS THAN (2013) ENGINE = InnoDB,
 PARTITION p2013 VALUES LESS THAN (2014) ENGINE = InnoDB,
 PARTITION p2014 VALUES LESS THAN (2015) ENGINE = InnoDB,
 PARTITION p2015 VALUES LESS THAN (2016) ENGINE = InnoDB,
 PARTITION p_new VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

Even with a “covered” index, MySQL will have to scan ~70M-100M of rows and create a temporary table:

mysql> explain select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_part WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ontime_part
         type: range
possible_keys: covered
          key: covered
      key_len: 2
          ref: NULL
         rows: 70483364
        Extra: Using where; Using index; Using temporary; Using filesort
1 row in set (0.00 sec)

Here is the query response time in MySQL:

mysql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_part WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;
+------------+----------+---------+----------+-----------------+------+
| min(yearD) | max_year | Carrier | cnt      | flights_delayed | rate |
+------------+----------+---------+----------+-----------------+------+
|       2003 |     2013 | EV      |  2962008 |          464264 | 0.16 |
|       2003 |     2013 | B6      |  1237400 |          187863 | 0.15 |
|       2006 |     2011 | XE      |  1615266 |          230977 | 0.14 |
|       2003 |     2005 | DH      |   501056 |           69833 | 0.14 |
|       2001 |     2013 | MQ      |  4518106 |          605698 | 0.13 |
|       2003 |     2013 | FL      |  1692887 |          212069 | 0.13 |
|       2004 |     2010 | OH      |  1307404 |          175258 | 0.13 |
|       2006 |     2013 | YV      |  1121025 |          143597 | 0.13 |
|       2003 |     2006 | RU      |  1007248 |          126733 | 0.13 |
|       1988 |     2013 | UA      | 10717383 |         1327196 | 0.12 |
+------------+----------+---------+----------+-----------------+------+
10 rows in set (19 min 16.58 sec)

19 minutes is definitely not great.

SQL in Spark

Now we want to run the same query inside Spark and let Spark read data from MySQL. We will create a “datasource” and execute the query:

scala> val jdbcDF = spark.read.format("jdbc").options(
     |   Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
     |   "dbtable" -> "ontime.ontime_sm",
     |   "fetchSize" -> "10000",
     |   "partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2015", "numPartitions" -> "48"
     |   )).load()
16/08/02 23:24:12 WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 27; Input number of partitions: 48; Lower bound: 1988; Upper bound: 2015.
jdbcDF: org.apache.spark.sql.DataFrame = [id: int, YearD: date ... 19 more fields]

scala> jdbcDF.createOrReplaceTempView("ontime")

scala> val sqlDF = sql("select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10")
sqlDF: org.apache.spark.sql.DataFrame = [min(yearD): date, max_year: date ... 4 more fields]

scala> sqlDF.show()
+----------+--------+-------+--------+---------------+----+
|min(yearD)|max_year|Carrier|     cnt|flights_delayed|rate|
+----------+--------+-------+--------+---------------+----+
|      2003|    2013|     EV| 2962008|         464264|0.16|
|      2003|    2013|     B6| 1237400|         187863|0.15|
|      2006|    2011|     XE| 1615266|         230977|0.14|
|      2003|    2005|     DH|  501056|          69833|0.14|
|      2001|    2013|     MQ| 4518106|         605698|0.13|
|      2003|    2013|     FL| 1692887|         212069|0.13|
|      2004|    2010|     OH| 1307404|         175258|0.13|
|      2006|    2013|     YV| 1121025|         143597|0.13|
|      2003|    2006|     RU| 1007248|         126733|0.13|
|      1988|    2013|     UA|10717383|        1327196|0.12|
+----------+--------+-------+--------+---------------+----+

spark-shell does not show the query time. It can be retrieved from the Spark Web UI or from spark-sql. I re-ran the same query in spark-sql:

./bin/spark-sql --driver-memory 4G --master spark://thor:7077

spark-sql> CREATE TEMPORARY VIEW ontime
         > USING org.apache.spark.sql.jdbc
         > OPTIONS (
         >   url "jdbc:mysql://localhost:3306/ontime?user=root&password=",
         >   dbtable "ontime.ontime_part",
         >   fetchSize "1000",
         >   partitionColumn "yearD", lowerBound "1988", upperBound "2014", numPartitions "48"
         > );
16/08/04 01:44:27 WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 26; Input number of partitions: 48; Lower bound: 1988; Upper bound: 2014.
Time taken: 3.864 seconds
spark-sql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;
16/08/04 01:45:13 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
2003    2013    EV    2962008     464264    0.16
2003    2013    B6    1237400     187863    0.15
2006    2011    XE    1615266     230977    0.14
2003    2005    DH    501056      69833     0.14
2001    2013    MQ    4518106     605698    0.13
2003    2013    FL    1692887     212069    0.13
2004    2010    OH    1307404     175258    0.13
2006    2013    YV    1121025     143597    0.13
2003    2006    RU    1007248     126733    0.13
1988    2013    UA    10717383    1327196   0.12
Time taken: 139.628 seconds, Fetched 10 row(s)

So the response time for the same query is almost 10x faster (on the same server, just one box). But how was this query translated into MySQL queries, and why is it so much faster? Here is what is happening inside MySQL:

Inside MySQL

Spark:

scala> sqlDF.show()
[Stage 4:>                                                        (0 + 26) / 26]

MySQL:

mysql> select id, info from information_schema.processlist where info is not NULL and info not like '%information_schema%';
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id    | info                                                                                                                                                                                    |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10948 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2001 AND yearD < 2002) |
| 10965 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2007 AND yearD < 2008) |
| 10966 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1991 AND yearD < 1992) |
| 10967 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1994 AND yearD < 1995) |
| 10968 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1998 AND yearD < 1999) |
| 10969 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2010 AND yearD < 2011) |
| 10970 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2002 AND yearD < 2003) |
| 10971 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2006 AND yearD < 2007) |
| 10972 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1990 AND yearD < 1991) |
| 10953 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2009 AND yearD < 2010) |
| 10947 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1993 AND yearD < 1994) |
| 10956 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD < 1989 or yearD is null) |
| 10951 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2005 AND yearD < 2006) |
| 10954 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1996 AND yearD < 1997) |
| 10955 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2008 AND yearD < 2009) |
| 10961 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1999 AND yearD < 2000) |
| 10962 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2011 AND yearD < 2012) |
| 10963 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2003 AND yearD < 2004) |
| 10964 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1995 AND yearD < 1996) |
| 10957 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2004 AND yearD < 2005) |
| 10949 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1989 AND yearD < 1990) |
| 10950 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1997 AND yearD < 1998) |
| 10952 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2013) |
| 10958 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 1992 AND yearD < 1993) |
| 10960 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2000 AND yearD < 2001) |
| 10959 | SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2012 AND yearD < 2013) |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
26 rows in set (0.00 sec)

Spark is running 26 queries in parallel, which is great. Because the table is partitioned, each query uses only one partition, but it scans that whole partition:

mysql> explain partitions SELECT `YearD`,`ArrDelayMinutes`,`Carrier` FROM ontime.ontime_part WHERE (((NOT (DayOfWeek IN (6, 7)))) AND ((NOT (OriginState IN ('AK', 'HI', 'PR', 'VI')))) AND ((NOT (DestState IN ('AK', 'HI', 'PR', 'VI'))))) AND (yearD >= 2001 AND yearD < 2002)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ontime_part
   partitions: p2001
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5814106
        Extra: Using where
1 row in set (0.00 sec)

In this case, as the box has 12 CPU cores / 24 threads, it efficiently executes 26 queries in parallel, and the partitioned table helps to avoid contention issues (I wish MySQL could scan partitions in parallel, but it can’t at the time of writing).

Another interesting thing is that Spark can “push down” some of the conditions to MySQL, but only those inside the “where” clause. All group by/order by/aggregations are done inside Spark. Spark needs to retrieve the data needed for those operations from MySQL, and it will not push group by/order by/etc. down to MySQL.

That also means that queries without “where” conditions (for example “select count(*) as cnt, carrier from ontime group by carrier order by cnt desc limit 10”) will have to retrieve all data from MySQL and load it into Spark (as opposed to MySQL, which would do all the group by work internally). Running such a query in Spark might be slower or faster (depending on the amount of data and the use of indexes), but it also requires more resources and potentially more memory dedicated to Spark. The above query is translated into 26 queries, each of which does a “select carrier from ontime_part where (yearD >= N AND yearD < N+1)”.
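
If you want to verify what actually gets pushed down, one quick check (my own sketch, not from the original post) is to print the DataFrame’s physical plan; the JDBC scan node lists the pushed filters, while the group by/order by steps show up as Spark-side operators above it:

// print the physical plan: the JDBC scan line shows PushedFilters,
// and the aggregation/sort steps above it run inside Spark
sqlDF.explain()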

Pushing down the whole query into MySQL 

If we want to avoid sending all data from MySQL to Spark we have the option of creating a temporary table on top of a query (similar to MySQL’s create temporary table as select …). In Scala:

val tableQuery = "(select yeard, count(*) from ontime group by yeard) tmp"
val jdbcDFtmp = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=",
      "dbtable" -> tableQuery,
      "fetchSize" -> "10000"
  )).load()
jdbcDFtmp.createOrReplaceTempView("ontime_tmp")

In Spark SQL:

CREATE TEMPORARY VIEW ontime_tmp
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
  dbtable "(select yeard, count(*) from ontime_part group by yeard) tmp",
  fetchSize "1000"
);
select * from ontime_tmp;

Please note:

  1. We do not want to use “partitionColumn” here, otherwise we will see 26 queries like this in MySQL: “SELECT yeard, count(*) FROM (select yeard, count(*) from ontime_part group by yeard) tmp where (yearD >= N AND yearD < N+1)” (obviously not optimal)
  2. This is not a good use of Spark, more like a “hack.” The only good reason to do it is to be able to have the result of the query as a source of an additional query.

Query cache in Spark

Another option is to cache the result of the query (or even the whole table) and then use .filter in Scala for faster processing. This requires sufficient memory dedicated to Spark. The good news is we can add additional nodes to Spark and get more memory for the Spark cluster.

Spark SQL example:

CREATE TEMPORARY VIEW ontime_latest
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://localhost:3306/ontime?user=root&password=",
  dbtable "ontime.ontime_part partition (p2013, p2014)",
  fetchSize "1000",
  partitionColumn "yearD", lowerBound "1988", upperBound "2014", numPartitions "26"
);
cache table ontime_latest;

spark-sql> cache table ontime_latest;
Time taken: 465.076 seconds
spark-sql> select count(*) from ontime_latest;
5349447
Time taken: 0.526 seconds, Fetched 1 row(s)
spark-sql> select count(*), dayofweek from ontime_latest group by dayofweek;
790896    1
634664    6
795540    3
794667    5
808243    4
743282    7
782155    2
Time taken: 0.541 seconds, Fetched 7 row(s)
spark-sql> select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_latest WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and (origin='RDU' or dest = 'RDU') GROUP by carrier HAVING cnt > 1000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10;
2013    2013    MQ    9339     1734    0.19
2013    2013    B6    3302     516     0.16
2013    2013    EV    9225     1331    0.14
2013    2013    UA    1317     177     0.13
2013    2013    AA    5354     620     0.12
2013    2013    9E    5520     593     0.11
2013    2013    WN    10968    1130    0.1
2013    2013    US    5722     549     0.1
2013    2013    DL    6313     478     0.08
2013    2013    FL    2433     205     0.08
Time taken: 2.036 seconds, Fetched 10 row(s)

Here we cache partitions p2013 and p2014 in Spark. This retrieves the data from MySQL and loads it in Spark. After that all queries run on the cached data and will be much faster.

With Scala we can cache the result of a query and then use filters to only get the information we need:

val sqlDF = sql("SELECT flightdate, origin, dest, depdelayminutes, arrdelayminutes, carrier, TailNum, Cancelled, Diverted, Distance from ontime")
sqlDF.cache().show()

scala> sqlDF.filter("flightdate='1988-01-01'").count()
res5: Long = 862
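
Since the cached data occupies Spark executor memory, you may want to release it once you are done with it; a minimal sketch:

// release the cached DataFrame when it is no longer needed
sqlDF.unpersist()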

Using Spark with Percona XtraDB Cluster

As Spark can be used in cluster mode and scales with more and more nodes, reading data from a single MySQL server becomes a bottleneck. We can use MySQL replication slave servers or Percona XtraDB Cluster (PXC) nodes as a Spark datasource. To test it out, I provisioned a Percona XtraDB Cluster with three nodes on AWS (I used m4.2xlarge Ubuntu instances) and also started Apache Spark on each node:

  1. Node1 (pxc1): Percona Server + Spark Master + Spark worker node + Spark SQL running
  2. Node2 (pxc2): Percona Server + Spark worker node
  3. Node3 (pxc3): Percona Server + Spark worker node

All the Spark worker nodes use the memory configuration option:

cat conf/spark-env.sh
export SPARK_WORKER_MEMORY=24g

Then I can start spark-sql (the Connector/J JAR file also needs to be copied to all nodes):

$ ./bin/spark-sql --driver-memory 4G --master spark://pxc1:7077

When creating a table, I still use localhost to connect to MySQL (url “jdbc:mysql://localhost:3306/ontime?user=root&password=xxx”). As the Spark worker nodes are running on the same instances as the Percona XtraDB Cluster nodes, they use the local connection. Running a Spark SQL query then evenly distributes all 26 MySQL queries among the three MySQL nodes.

Alternatively, we can run the Spark cluster on a separate host and connect it to HAProxy, which in turn will load balance selects across multiple Percona XtraDB Cluster nodes.
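
As a hypothetical sketch (haproxy-host stands in for your load balancer address; this is not from the original post), only the JDBC URL in the datasource definition would change:

// point the JDBC URL at an HAProxy front end that balances selects
// across the Percona XtraDB Cluster nodes
val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://haproxy-host:3306/ontime?user=root&password=",
      "dbtable" -> "ontime.ontime_part",
      "fetchSize" -> "10000",
      "partitionColumn" -> "yeard", "lowerBound" -> "1988",
      "upperBound" -> "2016", "numPartitions" -> "28"
  )).load()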

Query Performance Benchmark

Finally, here is the query response time test on the three AWS Percona XtraDB Cluster nodes:

Query 1:

select min(yearD), max(yearD) as max_year, Carrier, count(*) as cnt,
       sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
       round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
FROM ontime_part
WHERE DayOfWeek not in (6,7)
  and OriginState not in ('AK', 'HI', 'PR', 'VI')
  and DestState not in ('AK', 'HI', 'PR', 'VI')
GROUP by carrier
HAVING cnt > 1000 and max_year > '1990'
ORDER by rate DESC, cnt desc
LIMIT 10;

Query / Index type               MySQL Time          Spark Time (3 nodes)   Times Improvement
No covered index (partitioned)   19 min 16.58 sec    192.17 sec             6.02
Covered index (partitioned)      2 min 10.81 sec     48.38 sec              2.7


Query 2: select dayofweek, count(*) from ontime_part group by dayofweek;

Query / Index type               MySQL Time          Spark Time (3 nodes)   Times Improvement
No covered index (partitioned)   19 min 15.21 sec    195.058 sec            5.92
Covered index (partitioned)      1 min 10.38 sec     27.323 sec             2.58


Now, this looks really good, but it can be better. With three m4.2xlarge nodes we have 8*3 = 24 cores total (although they are shared between Spark and MySQL). We could expect a ~10x improvement, especially without a covered index.

However, on m4.2xlarge the amount of RAM was not enough to serve MySQL entirely from memory, so all reads came from EBS non-provisioned IOPS, which only gave me ~120MB/sec. I redid the test on a set of three dedicated servers:

  • 28 cores E5-2683 v3 @ 2.00GHz
  • 240GB of RAM
  • Samsung 850 PRO

The test ran completely from RAM:

Query 1 (from the above)

Query / Index type               MySQL Time         Spark Time (3 nodes)   Times Improvement
No covered index (partitioned)   3 min 13.94 sec    14.255 sec             13.61
Covered index (partitioned)      2 min 2.11 sec     9.035 sec              13.52


Query 2: select dayofweek, count(*) from ontime_part group by dayofweek;

Query / Index type               MySQL Time        Spark Time (3 nodes)   Times Improvement
No covered index (partitioned)   2 min 0.36 sec    7.055 sec              17.06
Covered index (partitioned)      1 min 6.85 sec    4.514 sec              14.81


With this many cores, and running from RAM, we actually do not have enough concurrency, as the table only has 26 partitions. So I also tried the unpartitioned table with the id primary key, using 128 partitions.

Note about partitioning

I’ve used a partitioned table (partitioned by year) in my tests to help reduce MySQL-level contention. At the same time, the “partitionColumn” option in Spark does not require the MySQL table to be partitioned. For example, if a table has a primary key, we can use this CREATE VIEW in Spark:

CREATE OR REPLACE TEMPORARY VIEW ontime
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://127.0.0.1:3306/ontime?user=root&password=",
  dbtable "ontime.ontime",
  fetchSize "1000",
  partitionColumn "id", lowerBound "1", upperBound "162668934", numPartitions "128"
);

Assuming we have enough MySQL servers (i.e., nodes or slaves), we can increase the number of partitions and improve the parallelism (as opposed to only 26 partitions when partitioning by year). Actually, the above setup gives us an even better response time: 6.44 seconds for query 1.

Where Spark doesn’t work well

For fast queries (those that use indexes or can use an index efficiently) it does not make sense to use Spark. Retrieving data from MySQL and loading it into Spark is not free, and this overhead can be significant for fast queries. For example, a query like select count(*) from ontime_part where YearD = 2013 and DayOfWeek = 7 and OriginState = 'NC' and DestState = 'NC'; will only scan 1300 rows and return instantly (MySQL reports 0.00 seconds).

An even better example is this: select max(id) from ontime_part. In MySQL, the query will use the index and all calculations will be done inside MySQL. Spark, on the other hand, will have to retrieve all IDs (select id from ontime_part) from MySQL and calculate the maximum. That took 24.267 seconds.
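
If you really need such an aggregate through Spark, one workaround (my own sketch, reusing the query-pushdown trick from the section above) is to wrap the aggregate in a subquery so that MySQL computes it and Spark fetches a single row:

// push max(id) down to MySQL by making the subquery the "table";
// Spark then retrieves one pre-aggregated row instead of all IDs
val maxQuery = "(select max(id) as max_id from ontime_part) tmp"
val jdbcMax = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=",
      "dbtable" -> maxQuery)).load()
jdbcMax.show()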

Conclusion

Using Apache Spark as an additional engine level on top of MySQL can help speed up slow reporting queries and add much-needed scalability for long-running select queries. In addition, Spark can help with query caching for frequent queries.

PS: Visual explain plan with Spark

Spark Web GUI provides lots of ways of monitoring Spark jobs. For example, it shows the “job” progress:

And SQL visual explain details:

Revolt of the ride-sharers, or misplaced trust in the start-up economy

Telepolis - Wed, 2016-08-17 07:00
Once again a ride-sharing platform introduces fees, supporting the thesis that under capitalism there are no spaces beyond the law of value
Categories: Politics + Culture

Erdogan, the godfather of terror?

Telepolis - Wed, 2016-08-17 06:00
The German government accuses the Turkish government of supporting Islamists. Why now, of all times?
Categories: Politics + Culture

Turkey pushes forward after the capture of Manbij

Telepolis - Tue, 2016-08-16 22:00
While the offensive, carried mainly by Kurds, is set to continue, Ankara demands the withdrawal of the Kurds promised by the US government. Because of the unstable situation, the Stimson Center calls for the removal of US nuclear weapons from Incirlik
Categories: Politics + Culture

USA: Negative campaigning with Yanukovych and Merkel

Telepolis - Tue, 2016-08-16 22:00
While Donald Trump compares Hillary Clinton to the German chancellor, Democratic partisans try to link the billionaire to the ousted Ukrainian president
Categories: Politics + Culture

Webinar Thursday 8/18: Preventing and Resolving MySQL Downtime

MySQL High Performance - Tue, 2016-08-16 17:14

Join Percona’s Jervin Real for a webinar on Thursday August 18, 2016 at 10 am PDT (UTC-7) on Preventing and Resolving MySQL Downtime.

Preventing MySQL downtime and emergencies is difficult. These emergencies are often caused by complex combinations of several things going wrong. Without knowledge of the causes, proactive preventative measures often fail to prevent further problems, no matter how sincere the effort. This talk discusses some of the ways to prevent real production system emergencies, and suggests specific actions for:

  • Application stack configuration
  • MySQL server configuration
  • Operating system configuration
  • Troublesome server features
  • Special features of Percona Server
  • MySQL health checks
  • Percona Toolkit

Register for the webinar here.


Jervin Real, Technical Services Manager

As Technical Services Manager, Jervin partners with Percona’s customers on building reliable and highly performant MySQL infrastructures, while also doing other fun stuff like watching cat videos on the internet. Jervin joined Percona in April 2010. Starting as a PHP programmer, Jervin quickly got involved with the LAMP stack. He has worked on several high-traffic sites and a number of specialized web applications, such as mobile content distribution. Before joining Percona, Jervin also worked with several hosting companies, providing care for customer-hosted services and data on both Linux and Windows.

Percona Toolkit 2.2.19 is now available

MySQL High Performance - Tue, 2016-08-16 15:05

Percona is pleased to announce the availability of Percona Toolkit 2.2.19, released August 16, 2016. Percona Toolkit is a collection of advanced command-line tools that perform a variety of MySQL server and system tasks that DBAs find too difficult or complex to perform manually. Percona Toolkit, like all Percona software, is free and open source.

This release is the current GA (Generally Available) stable release in the 2.2 series. Downloads are available here and from the Percona Software Repositories.

New Features:
  • 1221372: pt-online-schema-change now aborts with an error if the server is a slave, because this can break data consistency in case of row-based replication. If you are sure that the slave will not use row-based replication, you can disable this check using the --force-slave-run option.
  • 1485195: pt-table-checksum now forces replica table character set to UTF-8.
  • 1517155: Introduced --create-table-engine option to pt-heartbeat, which sets a storage engine for the heartbeat table different from the database default engine.
  • 1595678 and 1595912: Introduced --slave-user and --slave-password options to pt-online-schema-change, pt-table-sync, and pt-table-checksum.
  • 1610385: pt-online-schema-change now re-checks the list of slaves in the DSN table. This enables changing the contents of the table while the tool is running.
Bugs Fixed:
  • 1581752: Fixed pt-query-digest date and time parsing from MySQL 5.7 slow query log.
  • 1592166: Fixed memory leak when pt-kill kills a query.
  • 1592608: Fixed overflow of CONCAT_WS when pt-table-checksum or pt-table-sync checksums large BLOB, TEXT, or BINARY columns.
  • 1593265: Fixed pt-archiver deleting rows that were not archived.
  • 1610386: Fixed pt-slave-restart handling of GTID ranges where the left-side integer is larger than 9.
  • 1610387: Removed extra word ‘default’ from the --verbose help for pt-slave-restart.
  • 1610388: Fixed pt-table-sync not quoting enum values properly. They are now recognized as CHAR fields.

Find release details in the release notes and the 2.2.19 milestone at Launchpad. Report bugs on the Percona Toolkit Launchpad bug tracker.

I’m Colin Charles, and I’m here to evangelize open source databases!

MySQL High Performance - Tue, 2016-08-16 14:02

Let me introduce myself: I’m Colin Charles.

Percona turns ten years old this year. To me, there is no better time to join the company as the Chief Evangelist in the CTO office.

I’ve been in the MySQL world a tad longer than Percona has, and have had the pleasure of working on MySQL at MySQL AB and Sun Microsystems. Most recently I was one of the founding team members for MariaDB Server in 2009. I watched that grow into the MariaDB Corporation (after the merger with SkySQL) and the MariaDB Foundation.

For me, it’s about the right server for the right job. Today they all support a myriad of different features and different storage engines. Each server has its own community that supports and discusses their pros and cons. This is now true for both the MySQL and MongoDB ecosystems.

I’ve always had a lot of respect for the work Percona does — pragmatic engineering, deeply technical consulting (and blog posts) and amazing conferences. A big deal for me, and a big reason why I’m now here, is that Percona truly believes in the spirit of open source software development. Their obvious support of the open source community is a great pull factor for users as well.

I just spent time on the Percona Live Europe conference committee. (I’ve been involved in MySQL-related conferences since 2006, and was even Program Chair for a couple of years.) There, I got to see how the conference is evolving beyond just stock MySQL to also include MongoDB and other open source databases.

Recently I visited a customer who was not just interested in using a database, but also in offering a database-as-a-service to their internal customers. I discussed OpenStack with them, and knowing that Percona, the company I now represent, can support the architecture and deployment too? That’s kind of priceless.

We’re all crazy about databases and their position in the overall IT structure. They provide us with cool apps, internet functionality, and all sorts of short cuts to our daily lives. Percona’s role in providing solutions that address the issues that infrastructure faces is what really excites me about my new journey here.

Reshuffle of the Russian government: What will become of Putin's longtime companion Sergei Ivanov?

Telepolis - Tue, 2016-08-16 06:00
After his removal as head of the presidential administration, the hardliner Ivanov could take over a government post, observers in Moscow speculate
Categories: Politics + Culture

IMF: "A growing tendency toward hypocrisy"

Telepolis - Mon, 2016-08-15 22:00
Researchers have examined the International Monetary Fund's loan approvals and policy conditions before and after the crisis
Categories: Politics + Culture

Unrest in Milwaukee

Telepolis - Mon, 2016-08-15 22:00
Power struggle between police and gangs
Categories: Politics + Culture

Percona Live Europe 2016 Schedule Now Live

MySQL High Performance - Mon, 2016-08-15 17:01

This post reveals the full Percona Live Europe 2016 schedule for Amsterdam this October 3-5.

The official Percona Live Open Source Database Conference Europe 2016 schedule is now live, and you can find it here.

The schedule demonstrates that this conference has something for everyone! Whether your interest is in MySQL, MongoDB or other open source databases, there are talks that will interest you.

The Percona Live Open Source Database Conference is the premier event for the diverse and active open source database community, as well as businesses that develop and use open source database software. The conferences have a technical focus with an emphasis on the core topics of MySQL, MongoDB, and other open source databases. Tackling subjects such as analytics, architecture and design, security, operations, scalability and performance, Percona Live provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs. This conference is an opportunity to network with peers and technology professionals by bringing together accomplished DBAs, system architects and developers from around the world to share their knowledge and experience – all to help you learn how to tackle your open source database challenges in a whole new way.

Some of the talks for each area are:

MySQL

MongoDB

Open Source Databases

Check out the full schedule now!

Advanced Tickets

Purchase your passes now and get the advanced tickets discount. The earlier you buy, the better the value. You can register for Percona Live Europe here.

Sponsor Percona Live

Sponsor the Percona Live Open Source Database Performance Conference Europe 2016. Sponsorship gets you bigger visibility at the most important open source database conference in Europe. Benefits to sponsorship include:

  • Worldwide Audience: Made up of DBAs, developers, CTOs, CEOs, technology evangelists, entrepreneurs, and technology vendors.
  • Perfect Location: In Amsterdam City Centre, walking distance from Amsterdam Central Station.
  • Perfect Event: The showcase event for the rich and diverse MySQL, MongoDB and open source database markets in Europe.

Click here to sponsor now.

Billionaires for Clinton

Telepolis - Sun, 2016-08-14 22:00
In the upcoming presidential election, US citizens can "choose" between a conservative Hillary Clinton and a right-wing populist Donald Trump
Categories: Politics + Culture

Philippine government gives in to American pressure

Telepolis - Sun, 2016-08-14 22:00
Foreign minister promises an investigation of the deaths - President Duterte calls the US ambassador a "gay son of a whore"
Categories: Politics + Culture

Alternatives to Trump and Clinton

Telepolis - Sat, 2016-08-13 22:00
More than ten percent of Americans want to vote for other candidates in November
Categories: Politics + Culture

Climate protection: The grand coalition is asleep

Telepolis - Sat, 2016-08-13 22:00
The federal government is not doing its own homework
Categories: Politics + Culture

Getting rich with Amazon

Telepolis - Sat, 2016-08-13 22:00
Since the introduction of the Kindle Unlimited lending system, the scams have taken on a dimension that dwarfs everything that came before
Categories: Politics + Culture

Cases, stickers, or habituation?

Telepolis - Fri, 2016-08-12 22:00
The shock images on cigarette packs have reached consumers
Categories: Politics + Culture

Who owns the moon?

Telepolis - Fri, 2016-08-12 22:00
Laws attempt to regulate claims in outer space - but what rights humans actually have there is another question
Categories: Politics + Culture

Tuning Linux for MongoDB

MySQL High Performance - Fri, 2016-08-12 19:36

In this post, we’ll discuss tuning Linux for MongoDB deployments.

By far the most common operating system you’ll see MongoDB running on is Linux 2.6 and 3.x. Linux flavors such as CentOS and Debian do a fantastic job of being a stable, general-purpose operating system. Linux runs software on hardware ranging from tiny computers like the Raspberry Pi up to massive data center servers. To make this flexibility work, however, Linux defaults to some “lowest common denominator” tunings so that the OS will boot on anything.

Working with databases, we often focus on the queries, patterns and tunings that happen inside the database process itself. This means we sometimes forget that the operating system below it is the life support of the database, the air that it breathes so to speak. Of course, a highly scalable database such as MongoDB runs fine on these general-purpose defaults without complaint, but the efficiency can be like running in regular shoes instead of sleek runners. At small scale you might not notice the lost efficiency, but at large scale (especially when data exceeds RAM), improved tunings equate to fewer servers and lower operational costs. For all use cases and scales, good OS tunings also provide some improvement in response times and remove extra “what if…?” questions when troubleshooting.

Overall, memory, network and disk are the system resources important to MongoDB. This article covers how to optimize each of these areas. Of course, while we have successfully deployed these tunings to many live systems, it’s always best to test before applying changes to your servers.

If you plan on applying these changes, I suggest performing them with one full reboot of the host. Some of these changes don’t require a reboot, but test that they get re-applied if you reboot in the future. MongoDB’s clustered nature should make this relatively painless, plus it might be a good time to do that dreaded “yum upgrade” / “aptitude upgrade“, too.

Linux Ulimit

To prevent a single user from impacting the entire system, Linux has a facility to implement some system resource constraints on processes, file handles and other system resources on a per-user basis. For medium- to high-usage MongoDB deployments, the default limits are almost always too low. Considering MongoDB generally uses dedicated hardware, it makes sense to allow the Linux user running MongoDB (e.g., “mongod”) to use a majority of the available resources.

Now you might be thinking: “Why not disable the limit (or set it to unlimited)?” This is a common recommendation for database servers. I think you should avoid this for two reasons:

  • If you hit a problem, a lack of a limit on system resources can allow a relatively smaller problem to spiral out of control, often bringing down other services (such as SSH) crucial to solving the original problem.
  • All systems DO have an upper-limit, and understanding those limitations instead of masking them is an important exercise.

In most cases, a limit of 64,000 “max user processes” and 64,000 “open files” (both have defaults of 1024) will suffice. To be more exact you need to do some math on the number of applications/clients, the maximum size of their connection pools and some case-by-case tuning for the number of inter-node connections between replica set members and sharding processes. (We might address this in a future blog post.)
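
As a rough illustration (the numbers below are assumptions for the example, not measurements from this post), the math might look like this:

#   20 app servers x 300 connections per pool   = 6,000 client connections
# + replica set, monitoring and admin sessions  = a few hundred more
# => a 64,000 limit still leaves a comfortable safety margin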

You can deploy these limits by adding a file in “/etc/security/limits.d” (or appending to “/etc/security/limits.conf” if there is no “limits.d”). Below is an example file for the Linux user “mongod”, raising open-file and max-user-process limits to 64,000:

mongod       soft        nproc        64000
mongod       hard        nproc        64000
mongod       soft        nofile       64000
mongod       hard        nofile       64000

Note: this change only applies to new shells, meaning you must restart “mongod” or “mongos” to apply this change!

Virtual Memory Dirty Ratio

The “dirty_ratio” is the percentage of total system memory that can hold dirty pages. The default on most Linux hosts is between 20-30%. When you exceed the limit the dirty pages are committed to disk, creating a small pause. To avoid this hard pause there is a second ratio: “dirty_background_ratio” (default 10-15%) which tells the kernel to start flushing dirty pages to disk in the background without any pause.

20-30% is a good general default for “dirty_ratio”, but on large-memory database servers this can be a lot of memory! For example, on a 128GB-memory host this allows up to 38.4GB of dirty pages. The background ratio won’t kick in until 12.8GB! We recommend that you lower this setting and monitor the impact on query performance and disk IO. The goal is to reduce memory usage without negatively impacting query performance. Reducing cache sizes also ensures data gets written to disk in smaller, more frequent batches, which smooths out disk throughput (compared to huge bulk writes less often).

A recommended setting for the dirty ratios on large-memory (64GB+ perhaps) database servers is “vm.dirty_ratio = 15” and “vm.dirty_background_ratio = 5”, or possibly less. (Red Hat recommends lower ratios of 10 and 3 for high-performance/large-memory servers.)

You can set these by adding the following lines to “/etc/sysctl.conf”:

vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

To check these current running values:

$ sysctl -a | egrep "vm.dirty.*_ratio"
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15

Swappiness

“Swappiness” is a Linux kernel setting that influences the behavior of the Virtual Memory manager when it needs to allocate swap, ranging from 0-100. A setting of 0 tells the kernel to swap only to avoid out-of-memory problems. A setting of 100 tells it to swap aggressively to disk. The Linux default is usually 60, which is not ideal for database usage.

It is common to see a setting of “0” (or sometimes “10”) on database servers, telling the kernel to prefer keeping data in memory rather than swapping to disk, for better response times. However, Ovais Tariq details a known bug (or feature) when using a setting of 0 in this blog post: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/.

Due to this bug, we recommend using a setting of “1” (or “10” if you prefer some disk swapping) by adding the following to your “/etc/sysctl.conf”:

vm.swappiness = 1

To check the current swappiness:

$ sysctl vm.swappiness
vm.swappiness = 1

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply a dirty_ratio or swappiness change!

Transparent HugePages

*Does not apply to Debian/Ubuntu or CentOS/RedHat 5 and lower*

Transparent HugePages is an optimization introduced in CentOS/RedHat 6.0, with the goal of reducing overhead on systems with large amounts of memory. However, due to the way MongoDB uses memory, this feature actually does more harm than good, as memory accesses are rarely contiguous.

Disable THP entirely by adding the following flag to your Linux kernel boot options:

transparent_hugepage=never

Usually this requires changes to the GRUB boot-loader config in the directory “/boot/grub” or “/etc/grub.d” on newer systems. Red Hat covers this in more detail in this article (same method on CentOS): https://access.redhat.com/solutions/46111.
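
For example, on CentOS/RHEL 7 with GRUB2 this typically means appending the flag to GRUB_CMDLINE_LINUX and regenerating the config (a sketch only; exact file locations differ by distribution, so follow the article above for your system):

# append to the kernel command line in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
# then regenerate the GRUB config and reboot
sudo grub2-mkconfig -o /boot/grub2/grub.cfg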

Note: We recommend rebooting the system to clear out any previous huge pages and to validate that the setting will persist on reboot.

NUMA (Non-Uniform Memory Access) Architecture

Non-Uniform Memory Access is a recent memory architecture that takes into account the locality of caches and CPUs for lower latency. Unfortunately, MongoDB is not “NUMA-aware” and leaving the NUMA setup in its default behavior can cause severe memory imbalance.

There are two ways to disable NUMA: one is via an on/off switch in the system BIOS config, the second is using the “numactl” command to set NUMA interleaved mode (a similar effect to disabling NUMA) when starting MongoDB. Both methods achieve the same result. I lean towards using the “numactl” command because it future-proofs you for the (mostly inevitable) addition of NUMA awareness. On CentOS 7+ you may need to install the “numactl” yum/rpm package.

To make mongod start in interleaved mode, add “numactl --interleave=all” before your regular “mongod” command:

$ numactl --interleave=all mongod <options here>

To check mongod’s NUMA setting:

$ sudo numastat -p $(pidof mongod)

Per-node process memory usage (in MBs) for PID 7516 (mongod)
                           Node 0           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                        28.53           28.53
Stack                        0.20            0.20
Private                      7.55            7.55
----------------  --------------- ---------------
Total                       36.29           36.29

If you see only one NUMA-node column (“Node 0”), NUMA is disabled. If you see more than one NUMA node, make sure the metric numbers (“Heap”, etc.) are balanced between nodes. Otherwise, NUMA is NOT in “interleave” mode.

Note: some MongoDB packages already ship logic to disable NUMA in the init/startup script. Check for this using “grep” first. Your hardware or BIOS manual should cover disabling NUMA via the system BIOS.
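
For example (the paths here are assumptions and vary by package and distribution), a quick check might look like:

# check whether the startup scripts already handle NUMA interleaving
grep -i numa /etc/init.d/mongod /usr/lib/systemd/system/mongod.service 2>/dev/null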

Block Device IO Scheduler and Read-Ahead

For tuning flexibility, we recommend that MongoDB data sits on its own disk volume, preferably with its own dedicated disks/RAID array. While it may complicate backups, for the best performance you can also dedicate a separate volume for the MongoDB journal to separate its disk activity noise from the main data set. The journal does not yet have its own config/command-line setting, so you’ll need to mount a volume to the “journal” directory inside the dbPath. For example, “/var/lib/mongo/journal” would be the journal mount path if the dbPath was set to “/var/lib/mongo”.

Aside from good hardware, the block device MongoDB stores its data on can benefit from two major adjustments:

IO Scheduler

The IO scheduler is an algorithm the kernel uses to commit reads and writes to disk. By default most Linux installs use the CFQ (Completely-Fair Queue) scheduler. This is designed to work well for many general use cases, but it offers few latency guarantees. Two other popular schedulers are “deadline” and “noop”. Deadline excels at latency-sensitive use cases (like databases), and noop is closer to no scheduling at all.

We generally suggest using the “deadline” IO scheduler for cases where you have real, non-virtualised disks under MongoDB. (For example, a “bare metal” server.) In some cases I’ve seen “noop” perform better with certain hardware RAID controllers, however. The difference between “deadline” and “cfq” can be massive for disk-bound deployments.

If you are running MongoDB inside a VM (which has its own IO scheduler beneath it), it is best to use “noop” and let the virtualization layer take care of the IO scheduling itself.
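To experiment before making the change permanent, the scheduler can be switched at runtime through sysfs (assuming the data disk is “/dev/sda”; this change does not survive a reboot):

# show the available schedulers, [brackets] mark the active one
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

# switch to deadline at runtime
$ echo deadline | sudo tee /sys/block/sda/queue/scheduler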

Read-Ahead

Read-ahead is a per-block-device performance setting in Linux that causes data ahead of a requested block on disk to be read and then cached into the filesystem cache. Read-ahead assumes there is a sequential read pattern and that something will benefit from those extra blocks being cached. MongoDB tends to have very random disk patterns and often does not benefit from the default read-ahead setting, wasting memory that could be used for more hot data. Most Linux systems have a default setting of 128KB/256 sectors (128KB = 256 x 512-byte sectors). This means that if MongoDB fetches a 64KB document from disk, 128KB of filesystem cache is used, and the extra 64KB may never be accessed later, wasting memory.

For this setting, we suggest a starting point of 32 sectors (=16KB) for most MongoDB workloads. From there you can test increasing/reducing this setting and monitor a combination of query performance, cached memory usage and disk read activity to find a better balance. You should aim to use as little cached memory as possible without dropping query performance or causing significant disk activity.
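For quick testing, read-ahead can also be changed on the fly with “blockdev” (again assuming “/dev/sda”; like the runtime scheduler change, it is lost on reboot):

# set read-ahead to 32 sectors (16KB)
$ sudo blockdev --setra 32 /dev/sda

# confirm the new value
$ sudo blockdev --getra /dev/sda
32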

Both the IO scheduler and read-ahead can be changed by adding a file to the udev configuration in “/etc/udev/rules.d”. In this example I am assuming the block device serving mongo data is named “/dev/sda”, and I am setting “deadline” as the IO scheduler and 16KB/32 sectors as the read-ahead:

# set deadline scheduler and 16kb read-ahead for /dev/sda
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"

To check that the IO scheduler was applied ([square brackets] = the enabled scheduler):

$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq

To check the current read-ahead setting:

$ sudo blockdev --getra /dev/sda
32

Note: this change should be applied and tested with a full system reboot!

Filesystem and Options

It is recommended that MongoDB uses only the ext4 or XFS filesystems for on-disk database data. ext3 should be avoided due to its poor pre-allocation performance. If you’re using WiredTiger (MongoDB 3.0+) as a storage engine, it is strongly advised that you ONLY use XFS due to serious stability issues on ext4.
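A quick way to confirm which filesystem currently backs the MongoDB data volume, assuming a dbPath of “/var/lib/mongo” (output trimmed for readability):

$ df -T /var/lib/mongo
Filesystem               Type  ...  Mounted on
/dev/mapper/data-mongodb xfs   ...  /var/lib/mongo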

Each time you read a file, the filesystem performs an access-time metadata update by default. However, MongoDB (like most applications) does not use this access-time information. This means you can disable access-time updates on MongoDB’s data volume, which removes the small amount of disk IO activity that these updates cause.

You can disable access-time updates by adding the “noatime” flag to the filesystem options (4th) field in “/etc/fstab” for the disk serving MongoDB data:

/dev/mapper/data-mongodb /var/lib/mongo ext4 defaults,noatime 0 0

Use “grep” to verify the volume is currently mounted with the new options:

$ grep "/var/lib/mongo" /proc/mounts
/dev/mapper/data-mongodb /var/lib/mongo ext4 rw,seclabel,noatime,data=ordered 0 0

Note: to apply a filesystem-options change, you must remount (umount + mount) the volume after stopping MongoDB, or reboot.
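As a minimal example, assuming the data volume is mounted at “/var/lib/mongo” and MongoDB has already been stopped:

# remount the data volume so the new fstab options take effect
$ sudo umount /var/lib/mongo
$ sudo mount /var/lib/mongo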

Network Stack

Several of the Linux kernel’s default network tunings are either not optimal for MongoDB, limit a typical host with 1000mbps (or faster) network interfaces, or cause unpredictable behavior with routers and load balancers. We suggest increasing the relatively low throughput settings (net.core.somaxconn and net.ipv4.tcp_max_syn_backlog) and decreasing the keepalive settings, as seen below.

Make these changes permanent by adding the following to “/etc/sysctl.conf” (or a new file “/etc/sysctl.d/mongodb-sysctl.conf” if “/etc/sysctl.d” exists):

net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096

To check the current values of any of these settings:

$ sysctl net.core.somaxconn
net.core.somaxconn = 4096

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply this change!
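As an example of applying the settings without a full reboot, assuming “/etc/sysctl.d” exists on your system, you could write the file and load it explicitly:

# write the settings to their own file (the file name is just a convention)
$ sudo tee /etc/sysctl.d/mongodb-sysctl.conf <<'EOF'
net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096
EOF

# load the settings from that file immediately
$ sudo sysctl -p /etc/sysctl.d/mongodb-sysctl.conf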

NTP Daemon

All of these deeper tunings make it easy to forget about something as simple as your clock source. As MongoDB clusters rely on a consistent time across nodes, the NTP daemon should run permanently on all MongoDB hosts, mongos and arbiters included. Be sure that time syncing won’t fight with any guest-based virtualization tools like “VMware Tools” and “VirtualBox Guest Additions”.

This is installed on RedHat/CentOS with:

$ sudo yum install ntp

And on Debian/Ubuntu:

$ sudo apt-get install ntp

Note: Start and enable the NTP daemon (so it starts on reboot) after installation. The commands to do this vary by OS and OS version, so please consult your documentation.
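As a hedged example for a systemd-based distribution such as CentOS 7 (the service may be named “ntpd”, “ntp”, or replaced by chrony depending on the OS):

# enable on boot and start now
$ sudo systemctl enable ntpd
$ sudo systemctl start ntpd

# verify the daemon is syncing against its peers
$ ntpq -p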

Security-Enhanced Linux (SELinux)

Security-Enhanced Linux is a kernel-level security access-control module that has an unfortunate tendency to be disabled or set to warn-only on Linux deployments. As SELinux is a strict access-control system, it can sometimes cause unexpected errors (permission denied, etc.) with applications that were not configured properly for SELinux. Often people disable SELinux to resolve the issue and forget about it entirely. While implementing SELinux is not an end-all solution, it massively reduces the local attack surface of the server. We recommend deploying MongoDB with SELinux in “Enforcing” mode.

The modes of SELinux are:

  1. Enforcing – Block and log policy violations.
  2. Permissive – Log policy violations only.
  3. Disabled – Completely disabled.

As database servers are usually dedicated to one purpose, such as running MongoDB, the work of setting up SELinux is a lot simpler than on a multi-use server with many processes and users (such as an application/web server, etc.). The OS access pattern of a database server should be extremely predictable. Introducing “Enforcing” mode at the very beginning of your testing/installation, instead of after the fact, avoids a lot of gotchas with SELinux. Logging for SELinux is directed to “/var/log/audit/audit.log” and the configuration is at “/etc/selinux”.

Luckily, Percona Server for MongoDB RPM packages (CentOS/RedHat) are SELinux “Enforcing” mode compatible as they install/enable an SELinux policy at RPM install time! Debian/Ubuntu SELinux support is still in planning.

Here you can see the SELinux policy shipped in the Percona Server for MongoDB version 3.2 server package:

$ rpm -ql Percona-Server-MongoDB-32-server | grep selinux
/etc/selinux/targeted/modules/active/modules/mongod.pp

To change the SELinux mode to “Enforcing” at runtime:

$ sudo setenforce Enforcing

To check the running SELinux mode:

$ sudo getenforce
Enforcing
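Note that “setenforce” only changes the running mode. To make “Enforcing” the default across reboots, also set it in “/etc/selinux/config”; the relevant lines look like:

# /etc/selinux/config
SELINUX=enforcing
SELINUXTYPE=targeted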

Linux Kernel and Glibc Version

The versions of the Linux kernel and glibc may be more important than you think. Some community benchmarks show a significant improvement in OLTP throughput on recent Linux 3.x kernels versus the still widely deployed 2.6 series. To avoid serious bugs, MongoDB should run on at least Linux 2.6.36 and glibc 2.13 or newer.
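To check which versions a host is currently running:

# kernel version
$ uname -r

# glibc version (the first line of output includes the version number)
$ ldd --version | head -n1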

I hope to create a follow-up post on the specific differences seen under MongoDB workloads on Linux 3.2+ versus 2.6. Until then, I recommend you test the difference using your own workloads and any results/feedback are appreciated.

What’s Next?

What’s the next thing to tune? At this point, tuning becomes case-by-case and open-ended. I appreciate any comments on use-case/tuning pairings that worked for you. Also, look out for follow-ups to this article covering a few tunings I excluded due to lack of testing.

Not knowing the next step might mean you’re done tuning, or that you need more visibility into your stack to find the next bottleneck. Good monitoring and data visibility are invaluable for this type of investigation. Look out for future posts about monitoring your MongoDB (or MySQL) deployment, and consider using Percona Monitoring and Management as an all-in-one monitoring solution. You could also try Percona-Lab/prometheus_mongodb_exporter, prometheus/node_exporter and Percona-Lab/grafana_mongodb_dashboards for monitoring MongoDB/Linux with Prometheus and Grafana.

The road to an efficient database stack requires patience, analysis and iteration. Tomorrow a new hardware architecture or a change in kernel behavior could arrive; be the first to spot the next bottleneck! Happy hunting.