Debugging production performance issues in distributed systems is rarely easy. A large number of factors influence system behavior. Controlling all of them to get deterministic, comparable results is usually impossible. Also, any attempt at getting even a partially controlled environment is invasive for the customer: They need to spend time helping with experiment setup, and/or limit system access to control workloads. The “controlled study” approach is not very attractive from a customer’s point of view. It usually takes quite a long time to execute, as well.
At Platfora, we find ourselves more and more frequently using a post-hoc “observational study” approach – a highly cost-effective way of performance debugging. In modern distributed systems, a lot of operational data is collected. With the right tools, you can look at all that data instead of examining individual log excerpts. Instead of performing a controlled experiment, the various performance-affecting factors are controlled during analysis, by fine-tuning filters and aggregation levels. Essentially, you assume that your data already contains information about queries that are similar to the ones you would want to run in an experiment. You just need to find them.
Generic performance analysis tools usually are good at one of two extreme grains: Low-level analysis of specific runs (e.g. profilers, log file inspection with ad-hoc scripts and spreadsheets), or monitoring of predefined, coarsely aggregated metrics (for dashboarding). Even with tools, the former is typically slow and costly (in terms of developer time). The latter is only useful for detecting sweeping problems that affect the entire system, or for previously known types of issues that have predefined metrics. The type of debugging we propose here would require a different type of tool. Unexpectedly, we found we already had such a tool available.
A war story
Platfora consists of a family of integrated tools for big data exploration and discovery, including an in-browser data visualization tool backed by a custom, distributed data processing system. In one particular case, a customer opened a support ticket insisting that some Platfora “vizboards” (visualization workbooks) suddenly spent much longer in backend query processing than what their analysts were used to from earlier experience. Not all queries and users were affected. In addition, not every individual user who experienced a performance drop would have reported this, so we did not have a precise starting point of the problems. However, the customer said that the problems seemed to clear up for a while after every cluster restart.
Unfortunately, the phenomenon did not reproduce easily, and they had no “in flagrante delicto” proof – say, two runs of an identical query with two different runtimes. Platfora’s core use case is interactive data exploration, so queries change frequently, making “longitudinal” observations about the performance of the same query over time difficult (the same query may not re-occur often enough).
The usual suspects appeared innocent:
- no changes in overall workload
- independent benchmarks showed consistent hardware performance (CPU, network etc)
- the logs of this particular system showed that JVM garbage collection activity was low and stable over time
- the relevant data caches did not show increased miss ratios
Too much back and forth and experimentation would have been a waste of the customer’s time, so we decided to rely on pure data analysis until we could identify a good candidate explanation. Platfora servers produce fairly detailed instrumentation traces (at a rate of a few dozen GBs/day for the 18-node deployment under investigation here). This feature was originally intended to be able to look at the precise performance breakdown of individual queries (one example of a fine-grained analysis tool).
In this case, we did not have a single query we wanted to investigate – we needed a way to discover criteria for what “slow query” really means. Digging through the individual traces of tens of thousands of queries was not a cost-effective way of making progress. Using spreadsheets to aggregate and visualize the data would not scale to the volume of trace data we had (several Terabyte).
Drinking your own champagne (why would you eat dog food?)
This is when we recognized that we had an instance of a big data exploration problem at our hands – exactly the type of problem Platfora is designed to attack, only that we usually target business data and not debug logs.
We pointed our server at the trace log, and defined some data preparation rules about which trace entries with data volume information relate to which trace entries with CPU time information. The entire process took less than half an hour – less than what you’d need to write simple custom shell scripts for log parsing. Building a Platfora “lens” on this data resulted in a memory representation that our query engine and visualization layer could turn into interactive visualizations.
We now looked at the aggregate performance of the different operators in our query execution plans across many queries. One of the most computationally intensive operators is the merge operator that combines results from the different worker nodes into a final result. Below is a scatterplot of merge times, with each dot aggregating all merge events for one query. The number of processed rows is shown on the x-axis, and used CPU time the y-axis.
Visually, it is easy to make out that there is a clear distribution around at least two different processing rates (each is one diagonal cluster). We tried to explain the different rates by features of the query, but it turned out that only one feature was really important – query width. In the chart above this is represented by color – the darker the query dot in the scatterplot, the more group-by columns are used. For a given query width, there still are two diagonals. Note the logarithmic scale – one is more than an order of magnitude slower than the other. The customer was right, something fishy was going on! It only stood out for some of the longer running queries, but now we had a much clearer target.
To make results less fuzzy, we looked only at wide queries (more than 10 group-by columns), and found to our surprise that we needed no other variable to explain the different rates other than the time since server restart. In the following chart, we switched to showing the CPU time/merged row in the y-Axis, and show “time since server restart” for a few server lifetimes.
Merging slowed down without recovery after some uptime. We found a similar behavior when performing the same analysis for the sort operator. Interestingly, sorting performance became worse at a different point in time than merging performance. So different parts of the code were affected at different times, and only on certain nodes. Given the other bits of data we had collected, this pointed to the Java runtime becoming worse at executing our code. After memory management, the most important performance-influencing factor is the Java just-in-time (JIT) compiler. From there, we quickly determined that our Java JIT code cache capacity was exhausted after some fraction of the code had been compiled. Surprisingly, in Java 1.7 this effectively leads to a gradual shutdown of compiled execution and a fallback to interpretation – exactly the progressive slowdown that we observed. A simple configuration change to increase JIT code cache size fixed this behavior.
Let’s do that again!
We were able to resolve the customer problem with this investigation quickly, and with minimal customer involvement. This high-level, visual, and interactive style of performance analysis is very efficient, can be easily communicated to the team, and has with short and rewarding feedback cycles. It gives us a much better feeling of progress than the “staring at logs and static numbers” style. Last but not least, it is very rewarding for us as engineers to use our own product to answer our own real questions, instead of using the product just for testing.
Some call it dogfooding, but while you can safely eat dog food, would you really want to? For us, this style of analysis really has become a popular item in the toolbox. *pops cork*