Percona Live Europe featured talk with Alexander Krasheninnikov — Processing 11 billion events a day with Spark in Badoo
Welcome to a new Percona Live Europe featured talk with Percona Live Europe 2016: Amsterdam speakers! In this series of blogs, we’ll highlight some of the speakers that will be at this year’s conference. We’ll also discuss the technologies and outlooks of the speakers themselves. Make sure to read to the end to get a special Percona Live Europe registration bonus!
In this Percona Live Europe featured talk, we’ll meet Alexander Krasheninnikov, Head of Data Team at Badoo. His talk will be on Processing 11 billion events a day with Spark in Badoo. Badoo is one of the world’s largest and fastest growing social networks for meeting new people. I had a chance to speak with Alexander and learn a bit more about the database environment at Badoo:
Percona: Give me a brief history of yourself: how you got into database development, where you work, what you love about it?
Alexander: Currently, I work at Badoo as Head of Data Team. Our team is responsible for providing internal APIs for collecting and processing statistics data.
I started as a developer at Badoo, but the project I am going to cover in my talk led to the creation of a separate department.
Percona: Your talk is called “Processing 11 billion events a day with Spark in Badoo.” What were the issues with your environment that led you to Spark? How did Spark solve these needs?
Alexander: When we designed the Unified Data Stream system at Badoo, we identified several requirements: scalability, fault tolerance and reliability. Together, these requirements moved us towards using Hadoop as deep data storage and as a data processing framework. Our initial implementation was built on top of Scribe + WebHDFS + Hive. But we realized that its processing speed was not good enough and that any lag in data delivery was unacceptable (we need near-realtime data processing). One of our BI team members mentioned Spark as being significantly faster than Hive in some cases (especially ones similar to ours). When we investigated Spark’s API, we found the Streaming submodule, which was ideal for our needs. Additionally, this framework allowed us to use third-party libraries and write our own code. We’ve actually created an aggregation framework that follows the “divide and conquer” principle. Without Spark, we would definitely have ended up re-inventing a lot of what it provides.
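To illustrate the “divide and conquer” idea Alexander mentions, here is a minimal sketch in plain Python. It is not Badoo's framework and it does not use Spark itself: the event fields (`type`, `ts`), the per-minute bucketing, and the function names are all assumptions made for illustration. Each micro-batch is aggregated on its own, and the partial results are then merged.

```python
from collections import Counter
from functools import reduce


def aggregate_batch(events):
    """Count events per (event type, minute) within one micro-batch.

    The 'type' and 'ts' fields are hypothetical; 'ts' is Unix seconds.
    """
    counts = Counter()
    for event in events:
        minute = event["ts"] // 60  # truncate seconds to a minute bucket
        counts[(event["type"], minute)] += 1
    return counts


def merge(left, right):
    """Combine two partial aggregates; Counter addition sums per-key counts."""
    return left + right


# Two small, made-up micro-batches standing in for slices of the stream.
batches = [
    [{"type": "vote", "ts": 0}, {"type": "message", "ts": 30}],
    [{"type": "vote", "ts": 45}, {"type": "vote", "ts": 61}],
]

# Divide: aggregate every batch independently. Conquer: merge the partials.
partials = [aggregate_batch(batch) for batch in batches]
totals = reduce(merge, partials, Counter())
```

Because the merge step is associative, partial results can be combined in any order, which is what makes this style of aggregation easy to spread across workers.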
Percona: Why is tracking the event stream important for your business model? How are you using the data Spark is providing you to reach business goals?
Alexander: The event stream always represents some important business or technical metrics: votes, messages, likes and so on. All this, brought together, forms the “health” of our product. The primary goal of our Spark-based system is to process a heterogeneous event stream in a uniform way, and draw charts automatically. We achieved this goal, and now we have hundreds of charts and dozens of developers, analysts and product team members using them. The system has also evolved, and now we perform automatic anomaly detection over the event stream. We report strange data behavior to everyone who is interested.
Percona: What is changing in data use in your business model that keeps you awake at night? What tools or features are you looking for to address these issues?
Alexander: As I’ve mentioned before, we have an anomaly detection process for our metrics. If one of our metrics falls outside its expected bounds, it is treated as an anomaly, and notifications are sent. We also have self-monitoring functionality for the whole system: a small stream of heartbeat events is generated and processed by two different systems. If those show a significant difference, that definitely keeps me awake at night!
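The two checks described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the interview does not say how the expected bounds are computed or what counts as a “significant difference”, so the fixed bounds, the 5% tolerance, and the function names here are all hypothetical.

```python
def is_anomaly(value, lower, upper):
    """Return True when a metric value falls outside its expected bounds.

    How the bounds are derived is not covered in the interview; here they
    are simply passed in by the caller.
    """
    return not (lower <= value <= upper)


def heartbeat_mismatch(count_a, count_b, tolerance=0.05):
    """Return True when two systems disagree on the heartbeat count.

    The relative difference is measured against the larger count; the 5%
    default tolerance is an assumed value, not one from the source.
    """
    baseline = max(count_a, count_b)
    if baseline == 0:
        return False  # neither system saw heartbeats, so nothing to compare
    return abs(count_a - count_b) / baseline > tolerance


# Made-up numbers: a metric expected to stay between 80 and 120.
metric_is_anomalous = is_anomaly(150, lower=80, upper=120)

# Made-up heartbeat counts reported by the two processing systems.
pipelines_disagree = heartbeat_mismatch(1000, 900)
```

In a setup like this, either flag being true would be the trigger for sending a notification.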