July 8, 2016

CC Kanne



Drinking Your Own Champagne: Performance Debugging as Exploratory Visual Analysis


Debugging production performance issues in distributed systems is rarely easy. A large number of factors influence system behavior, and controlling all of them to get deterministic, comparable results is usually impossible. Any attempt at getting even a partially controlled environment is also invasive for the customer: they need to spend time helping with experiment setup and/or limit system access to control workloads. The “controlled study” approach is not very attractive from a customer’s point of view, and it usually takes quite a long time to execute as well.

At Platfora, we find ourselves more and more frequently using a post-hoc “observational study” approach – a highly cost-effective way of performance debugging. In modern distributed systems, a lot of operational data is collected. With the right tools, you can look at all that data instead of examining individual log excerpts. Instead of performing a controlled experiment, the various performance-affecting factors are controlled during analysis, by fine-tuning filters and aggregation levels. Essentially, you assume that your data already contains information about queries that are similar to the ones you would want to run in an experiment. You just need to find them.

Generic performance analysis tools are usually good at one of two extreme grains: low-level analysis of specific runs (e.g. profilers, or log file inspection with ad-hoc scripts and spreadsheets), or monitoring of predefined, coarsely aggregated metrics (for dashboarding). Even with tools, the former is typically slow and costly in terms of developer time. The latter is only useful for detecting sweeping problems that affect the entire system, or for previously known types of issues that have predefined metrics. The type of debugging we propose here requires a different type of tool. Unexpectedly, we found we already had such a tool available.

A war story

Platfora consists of a family of integrated tools for big data exploration and discovery, including an in-browser data visualization tool backed by a custom, distributed data processing system. In one particular case, a customer opened a support ticket insisting that some Platfora “vizboards” (visualization workbooks) suddenly spent much longer in backend query processing than their analysts were used to from earlier experience. Not all queries and users were affected, and not every individual user who experienced a performance drop would have reported it, so we did not have a precise starting point for the problem. However, the customer said that the problems seemed to clear up for a while after every cluster restart.

Unfortunately, the phenomenon did not reproduce easily, and they had no “in flagrante delicto” proof – say, two runs of an identical query with two different runtimes. Platfora’s core use case is interactive data exploration, so queries change frequently, making “longitudinal” observations about the performance of the same query over time difficult (the same query may not recur often enough).

The usual suspects appeared innocent:

  • no changes in overall workload
  • independent benchmarks showed consistent hardware performance (CPU, network, etc.)
  • the logs of this particular system showed that JVM garbage collection activity was low and stable over time
  • the relevant data caches did not show increased miss ratios

Too much back and forth and experimentation would have been a waste of the customer’s time, so we decided to rely on pure data analysis until we could identify a good candidate explanation. Platfora servers produce fairly detailed instrumentation traces (a few dozen GB per day for the 18-node deployment under investigation here). This feature was originally intended for looking at the precise performance breakdown of individual queries (one example of a fine-grained analysis tool).

In this case, we did not have a single query we wanted to investigate – we needed a way to discover criteria for what “slow query” really means. Digging through the individual traces of tens of thousands of queries was not a cost-effective way of making progress, and using spreadsheets to aggregate and visualize the data would not scale to the volume of trace data we had (several terabytes).

Drinking your own champagne (why would you eat dog food?)
This is when we recognized that we had an instance of a big data exploration problem on our hands – exactly the type of problem Platfora is designed to attack, except that we usually target business data rather than debug logs.

We pointed our server at the trace log and defined some data preparation rules about which trace entries with data volume information relate to which trace entries with CPU time information. The entire process took less than half an hour – less than what you’d need to write simple custom shell scripts for log parsing. Building a Platfora “lens” on this data produced an in-memory representation that our query engine and visualization layer could turn into interactive visualizations.

We now looked at the aggregate performance of the different operators in our query execution plans across many queries. One of the most computationally intensive operators is the merge operator, which combines results from the different worker nodes into a final result. Below is a scatterplot of merge times, with each dot aggregating all merge events for one query. The number of processed rows is shown on the x-axis, and the CPU time used on the y-axis.

[Figure: scatterplot of per-query merge CPU time vs. number of rows processed, colored by query width]

Visually, it is easy to make out that there are at least two distinct processing rates (each one forming a diagonal cluster). We tried to explain the different rates by features of the query, but it turned out that only one feature was really important – query width. In the chart above this is represented by color: the darker the query dot in the scatterplot, the more group-by columns are used. For a given query width, there are still two diagonals, and note the logarithmic scale – one is more than an order of magnitude slower than the other. The customer was right, something fishy was going on! It only stood out for some of the longer-running queries, but now we had a much clearer target.
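The roll-up behind such a chart was done with a Platfora lens, but the shape of the aggregation is simple. Purely as an illustration, assuming each trace line has already been parsed into a record with hypothetical field names, the per-query points could be computed like this:

// Illustration only: one parsed merge event per trace line (field names hypothetical).
case class MergeEvent(queryId: String, rowsMerged: Long, cpuMillis: Long, groupByColumns: Int)

// One point per query: total rows merged, total CPU time, and query width for the color.
def perQueryPoints(events: Seq[MergeEvent]): Seq[(String, Long, Long, Int)] =
  events.groupBy(_.queryId).toSeq.map { case (qid, evs) =>
    (qid, evs.map(_.rowsMerged).sum, evs.map(_.cpuMillis).sum, evs.map(_.groupByColumns).max)
  }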

To make results less fuzzy, we looked only at wide queries (more than 10 group-by columns), and found to our surprise that we needed no variable other than the time since server restart to explain the different rates. In the following chart, we switched to showing the CPU time per merged row on the y-axis, and show “time since server restart” for a few server lifetimes.

[Figure: CPU time per merged row vs. time since server restart, across several server lifetimes]

Merging slowed down, without recovery, after some uptime. We found similar behavior when performing the same analysis for the sort operator. Interestingly, sorting performance became worse at a different point in time than merging performance. So different parts of the code were affected at different times, and only on certain nodes. Given the other bits of data we had collected, this pointed to the Java runtime becoming worse at executing our code. After memory management, the most important performance-influencing factor is the Java just-in-time (JIT) compiler. From there, we quickly determined that our Java JIT code cache capacity was exhausted after some fraction of the code had been compiled. Surprisingly, in Java 1.7 this effectively leads to a gradual shutdown of compiled execution and a fallback to interpretation – exactly the progressive slowdown that we observed. A simple configuration change to increase the JIT code cache size fixed this behavior.
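The post does not name the exact setting, but on HotSpot JVMs the code cache ceiling is controlled by a flag of the following form, so the fix was presumably a change along these lines (the value shown is only illustrative):

-XX:ReservedCodeCacheSize=256m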

Let’s do that again!
We were able to resolve the customer problem quickly with this investigation, and with minimal customer involvement. This high-level, visual, and interactive style of performance analysis is very efficient, can be easily communicated to the team, and has short and rewarding feedback cycles. It gives us a much better feeling of progress than the “staring at logs and static numbers” style. Last but not least, it is very rewarding for us as engineers to use our own product to answer our own real questions, instead of using the product just for testing.

Some call it dogfooding, but while you can safely eat dog food, would you really want to? For us, this style of analysis really has become a popular item in the toolbox. *pops cork*

June 13, 2016


Tabulous: the fabulous table rendering engine

At Platfora we visualize and display data, and a table is one way we present it. When the volume of data is small, it is easy to render a table. However, Platfora is a big data discovery platform, and frequently we need to present a lot of data in a table. To be more specific, the requirement is to enable a user to interact with a table that has a million rows and a few hundred columns.

Full HTML: A straightforward approach (that will not work, of course)

As you might guess, using HTML elements to render all of the table cells at once is not the best solution. One million rows and a hundred columns result in a massive table of 100,000,000 cells, which in turn results in at least that many elements in your DOM. The result is a very poor user experience and crashed browser tabs.

Viewport: Better approach (not good enough)

We first looked for a way to efficiently render such a massive table and settled on a render-only-what-the-user-can-see approach. The area that shows the user just the visible cells is the viewport, so let’s render only the data that has to be in the viewport. In this case, the number of elements we need to create and insert into the DOM is significantly smaller than with the previous approach.

We discovered that the approach worked well for a small number of columns, but not for hundreds of columns. We noticed that some parts of the table were not rendered while scrolling vertically or horizontally. This is because DOM access is expensive, and scrolling forces the DOM to update so frequently that the browser has a hard time keeping up:


Viewport+Canvas=Tabulous: The solution that worked!

While brainstorming ideas to solve this problem, one of the engineers jokingly suggested that we should use canvas to render the table, taking DOM creation and manipulation out of the picture altogether. And that’s when the lightbulb went on in our heads: the problem was not about presenting the data, but about scrolling it! When a grid of data is statically displayed, the user needs to interact with its cells, and normal HTML is great at supporting those interactions. However, one does not need to interact with a table while it is scrolling.

We felt we were on the right track. When scrolling a table, we render it to canvas instead of creating new or updating existing DOM elements. This results in very smooth scrolling, because drawing to canvas is significantly faster than inserting or manipulating a large number of DOM elements: the canvas API draws directly to the screen, while changing the DOM triggers the browser to re-render it for you.

Here is how our Tabulous display works:

  1. Only render the cells for the viewport with normal HTML.
  2. When scrolling starts, switch to the canvas-based rendering shown on top of the table. Since we know the scroll position, we know which rows and columns should be rendered.
  3. After scrolling stops, re-render the HTML cells and hide the canvas.

This results in the smooth scrolling below:


The downside of the solution

This solution comes with a price: we need to implement the full table rendering twice, once in HTML and again in canvas. We have to be careful to ensure that the canvas and HTML rendering of the data and its formatting is consistent on each browser, since the styling of elements needs to be implemented both for HTML and for drawing in canvas. If the table displays only text, lines, and basic shapes, things aren’t so bad, but developing and maintaining this hybrid display gets trickier with cell-level functionality and formatting. A good example is input elements, because they exist outside of the table and they look different in different browsers.

A note on memory usage and rendering efficiencies

It’s obvious that the draw-everything-with-HTML approach uses a lot of memory, since it creates a proverbial metric ton of DOM elements.

However, when we compare the latter two approaches (viewport and viewport+canvas), we see similar memory usage that is consumed differently: the viewport+canvas approach uses less memory for DOM elements but more memory for JavaScript objects. This is not surprising, because we store all of the information about what we need to render in objects so that we can draw similar things together – changing the canvas context is expensive. For example, if you want to draw alternating red and blue lines, drawing all of the red lines and then all of the blue lines is more efficient than drawing them alternately.

Browser support

This solution will not work with older browsers that do not support canvas.


March 10, 2016



Why Platfora Uses Spark for Data Preparation

This article was originally posted on the Platfora blog, and was written in collaboration with Engineering Manager Mayank Pradhan and Staff Engineer Max Seiden.

Since Platfora 5.0, we have been leveraging Apache Spark to power data preparation in our Big Data Discovery platform, and a number of articles since then have mentioned our strategic decision to adopt Spark. In this blog post, we will approach the topic from another angle, by discussing the technical benefits that Spark brings to Platfora’s data processing pipeline.

Spark is a general, distributed processing engine born out of UC Berkeley that has gained tremendous momentum in the past few years. Its popularity is often attributed to better performance and ease of use compared to Hadoop MapReduce, an earlier big data processing system based on the Google paper of the same name. Unlike Hadoop MapReduce, which writes all output to disk, Spark can cache intermediate results in memory and is therefore able to serve small queries with low latency. Over time, a number of libraries have been built on top of Spark to provide richer functionality such as SQL access, machine learning, graph processing, and streaming.

For Platfora, however, the biggest reason we adopted Spark for data preparation is that it provides a single set of APIs applicable to both interactive and batch scenarios.

Data Preparation in Platfora

Platfora offers an end-to-end data discovery platform that allows users to go from raw data to visualization in one place. Part of this process involves a step called data preparation, also known as Extract, Transform, and Load (ETL), where data is fetched from different sources and is cleaned to a state suitable for analytical queries. To make ETL easier, Platfora provides a UI that first shows samples and statistics of the input data, and lets users interactively perform actions like changing parsing options or adding computed fields. Once the user is satisfied with his or her changes, Platfora then launches batch jobs to perform the actions over the entire dataset.

[Screenshot: the Platfora dataset workspace]

Platfora’s data preparation capabilities present a rich and interactive way to clean data.

Data Preparation Prior to Spark

Up until 5.0, Platfora used MapReduce to run batch data preparation. But since MapReduce is not designed for interactive workloads, we had to build our own engine to serve the interactive data preparation operations. This approach had many drawbacks. First and foremost, as hinted earlier, it required the same ETL logic to be implemented twice. Development overhead aside, maintenance also became expensive, as inconsistencies were introduced in very subtle ways. For example, how should each implementation detect and treat overflow during arithmetic operations? What does the NULL value denote in each system? Preventing these corner cases required extensive testing, which took time away from important feature development work. Moreover, the MapReduce interface made it difficult to implement complex transformations such as union and explode without bringing in other SQL-on-Hadoop solutions like Hive.
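As a flavor of how subtle such divergences can be, consider overflow: JVM integer arithmetic wraps silently, so two independent implementations only agree if both decide explicitly how to detect and handle it. A tiny, illustrative example:

// Silent wraparound: one engine might return this value, while another
// widens to a larger type or raises an error for the same expression.
val a: Int = Int.MaxValue
println(a + 1)                // prints -2147483648

// Detecting overflow is an explicit opt-in on the JVM, e.g.:
println(Math.addExact(a, 1))  // throws java.lang.ArithmeticException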

Data Preparation with Spark

Given all the disadvantages listed in the previous section, it became clear to us that Spark could alleviate them. By design, Spark can be used in both interactive and batch ways. This makes it possible to write a single version of the code and have it run on arbitrarily small or big data by simply changing the input and system configurations. Adding new features is much more straightforward, as is maintenance. In addition, we get many goodies from Spark for free, such as a simpler API and Spark SQL’s improvements to the logical and physical execution of queries. Sure enough, thanks to Spark, Platfora 5.0 is packed with new features: we introduced a new Dataset Inspector Panel and allowed users to create datasets using SQL. Since then, we have also added the ability to easily create union datasets.
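To illustrate the single-code-path idea (the paths, schema, and transformation below are hypothetical, not Platfora’s actual pipeline), the same Spark logic can be pointed at a small sample while a user is editing interactively, and at the full dataset for the batch build:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("data-prep-sketch").getOrCreate()

// One definition of the ETL logic, shared by both modes
def prepare(raw: DataFrame): DataFrame =
  raw
    .withColumn("signup_date", to_date(col("signup_ts")))
    .filter(col("country").isNotNull)

// Interactive: run the logic on a small sample for quick feedback
prepare(spark.read.parquet("hdfs:///data/users").limit(10000)).show()

// Batch: run the identical logic over the full dataset
prepare(spark.read.parquet("hdfs:///data/users"))
  .write.mode("overwrite").parquet("hdfs:///prepared/users")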

Spark has enabled us to deliver a feature-filled release, and made Platfora’s Big Data Discovery platform a more powerful tool to discover critical business insights. We are happy with what we have achieved, and are excited to explore more ways that we can use Spark to make our product even better. If you are interested in seeing what Platfora looks like with Spark, we have prepared a quick demo here.

February 9, 2016


Data driven Personas: Constructing Archetypal Users with Clickstreams and User Telemetry

Platfora’s User Experience team has been exploring new ways of understanding user behavior by harvesting and utilizing product telemetry and clickstream data. We recently presented a paper at the ACM CHI 2016 conference detailing a novel approach to defining and refining archetypal users, called personas in the Human-Computer Interaction community.

Personas are developed in industry to better understand a user’s workflow by examining that user’s behavior, goals, needs, wants, and frustrations. Generally, to create a persona, UX researchers and designers rely on survey data, user interviews, and user observation. Personas constructed with these approaches, though perceived as data-driven, somewhat lack generalizability and scalability, as it is costly to frequently rebuild personas based on new data points from additional users. Furthermore, surveys, self-reports, and user observations are often incomplete approaches to building personas, resulting in personas that poorly reflect the user’s evolving workflow.

In light of these shortcomings, we developed a quantitative, data-driven approach to creating personas that incorporates user behavior via clicks gathered from user telemetry data (over a span of two years) from actual Platfora product use in the field. Our method first aggregates 3.5 million clicks from a selection of 2,400 users into 39,000 clickstreams, and then structures them into 10 meaningful workflows via hierarchical clustering.

[Figure: clickstreams clustered into workflows]

Next, we use a mixture model, a statistical approach that incorporates these clustered workflows, to create 5 representative personas.

[Figure: representative personas derived from the mixture model]

We validate these personas with user behavior experts on our product (Product Managers, Field Engineers, Technical Documentation, and UX Design folks) to ensure that the workflows and the persona goals represent actual product use. To the best of our knowledge, our automatic, bottom-up, data-driven approach to creating personas is one of the first to leverage low-level user behaviors via raw user interface clicks to surface personas that reflect a product’s long-term use.

Come and see us at the ACM CHI 2016 conference and learn more about this novel approach harnessing the power of behavior analytics.

July 20, 2015

Roman Reykhel



Catching visual bugs in an ever-changing UI

Working at Platfora for the past 2 years as a QA Engineer, I have caught my share of bugs. More often than not they have been bugs related to styling: a button misaligned, text missing from a dropdown, icon mismatches, and the list goes on and on. I have always spotted these issues with the naked eye while testing for other things. However, catching all these issues and analyzing them across multiple browsers was a big manual effort. Recently, the QA team here at Platfora has begun using a third-party tool by the name of Applitools Eyes.

Verifying that the UI appears correctly across all browsers

The first step was to validate all our product’s web pages across different browsers. With the help of Sauce Labs (another third-party tool we use), which allows for cross-browser testing in VMs, we were able to run visual tests across all the browsers we needed. Every day we run the same set of UI tests on top of a fresh install of Platfora running the latest build. Applitools Eyes takes a snapshot of the page you want to test and creates a baseline image the first time you run the test. From then on, every time the test is run, the new image of the page is compared to the baseline.

Taking pressure off manual QA: increasing coverage, testing faster and more accurately

Here is an example of how Eyes has helped catch a real visual bug, which is a missing icon inside a button:

As you can see in the following screenshot, the page had an icon missing next to the save button at the bottom of the page. For whatever reason, QA didn’t spot this problem and file a bug. The Applitools Eyes test, which was run in Jenkins, caught the issue, and the log pointed to the Applitools validation page to view the visual error.

[Screenshot: the button with the missing icon]

[Screenshot: the Applitools validation page highlighting the anomaly]

The image inside the validation page shows a purple highlight where the visual anomaly resides on the page. In this case, it is highlighting the location of the missing arrow icon next to the save button. The fact that it came from automation is a step in the right direction: before, a bug like this would only have been caught by manual testing. It’s nice to know that I can sit back and have a bunch of tests run to check that the UI is not breaking across browsers, build to build. Applitools Eyes has attracted front-end developers as well. When they saw this tool in action, they wanted to start using it themselves. They are now in the process of running their own tests to quickly see if their code changes have broken anything on the page before they go ahead and check in their code.


The future of UI tests

Integrating Eyes into our automation framework has been seamless. With the ability to use frameworks such as Selenium, Robot Framework, and Sauce Labs alongside this tool, getting tests up and running has been quite fast and easy. We are growing our number of tests and with new releases upcoming, we are excited to see what bugs we can catch in early development.

February 23, 2015

Albert Wu



Docker, AWS, and Elastic CI at Platfora

Chances are you have heard of Docker over the last few months. The applications for Docker are far-reaching, ranging from linking portable, single-microservice containers to more complex multi-service containers. At Platfora, we are deploying Docker as a key component of our CI build system. We take advantage of Docker’s lightweight, portable containers and of AWS EC2 for our build system running Bamboo, Atlassian’s CI build system. If you are unfamiliar with Bamboo, it allows you to run remote build agents that connect to a master node. The master node can distribute jobs; aggregate artifacts and results; and even act as an agent. This allows you to distribute builds over a cluster of machines, significantly increasing compute power and resources. Pairing Bamboo and elastic agents with AWS EC2 is a natural fit and gives you access to a wealth of resources.

[Diagram: the old vs. new elastic CI setup]

The Old

You may be asking yourself how Docker fits into the picture and why we are excited about its application. Previously, running elastic Bamboo agents involved highly customized AMIs with varying degrees of pre- and post-initialization provisioning. Creating custom AMIs in Amazon is a relatively simple procedure, especially when paired with Packer: you would provision and snapshot an AMI running version XYZ of the dependencies, and use that AMI as the base image for the elastic agent. But changing dependencies and storing multiple AMI images can quickly become unwieldy if you have multiple builds with different dependencies, and spinning up an AMI and provisioning it can often take a non-trivial amount of time. Most importantly, since these build agents run in AWS and can only be run in AWS via Bamboo, a developer cannot easily run one locally.

The New

With the introduction of Docker, we gain flexibility, portability, and repeatability. We use a combination of lightweight custom AMIs (still built using Packer) and Docker images to create fast elastic build agents. The base AMI simply contains a few custom kernel settings, a pre-cached repository checkout, pre-cached Docker images, and the ability to run Docker. Because our custom AMIs pre-cache many of the needed build components, they can be online and available to the build system within a couple of minutes of initialization. We perform all of our compilation and run all of our tests inside Docker containers.


Shipping a Docker container from a laptop to the build environment takes a matter of minutes, and making dependency updates is simple and painless. Docker’s layering allows only updated layers to be rebuilt; subsequently, pulling new images only pulls the delta of changes instead of downloading the entire container every time. Starting and stopping a Docker container is nearly instantaneous and allows each build job to run in a completely isolated, clean container. Switching between multiple build environments is now handled automatically and quickly, without needing to restart or reconfigure an AMI. There is a lot to be excited about with Docker, and we are just scratching the surface.

October 21, 2014

Vignesh Sukumar



Handling Skew in a Map-Reduce data processing world

Distributed data processing frameworks built around the Map-Reduce programming model have made large-scale data processing parallelizable across huge datasets. One of the key elements of this model is the redistribution (shuffle) of data to different nodes so that all data belonging to one key can be processed on the same node. A popular algorithm that translates naturally to the Map-Reduce programming model is the join operation on two tables, A and B, which produces a table whose records are joined on a key common to both tables.

In this blog, I will describe an interesting scalability problem that arises when this join operation is executed with values that are highly skewed. For the fun of it, I will not propose any solutions here, but I hope the problem I describe below interests you enough that many of you scalability junkies are already conjuring up solutions!

The business use case

A very common use case for the join algorithm is web analytics, where web server logs have to be joined against dimension tables (for example, a user dimension) to answer valuable questions like “what’s the average age of my users?” and “what geographies are they accessing my resources from?”. Dataset joins are uber-popular in data analytics, and a true big data analytics platform should be able to scale linearly to any dataset size.
Let’s consider the simple case of web server logs, with a userID in each log line, joined against the user dimension table on the userID primary key. Here is a simple diagram describing how the join happens through the Map and Reduce phases if the join had to separate odd- and even-valued userIDs:


[Diagram: map/reduce join of the web log (userID, file accessed) with the user table (userID, First Name), with odd and even userIDs routed to different reducers]


The Challenge

A particularly interesting problem that arises during this join is the case of skewed values in the web server logs (called the fact dataset in analytics parlance): what if a very few values occur a disproportionate number of times, say about 80% of the time? In this example, the userID “0” occurs disproportionately often (a very valid case, too: logging the ‘unknown’ or ‘anonymous’ user in the server logs is quite common). With skewed values, note that the single reducer process handling this value will be swamped with a huge number of records, whereas the rest of the reducers have very little to process.

This problem is really exciting for any high-scalability junkie: you’ll soon realize that throwing any number of compute nodes at your Map-Reduce cluster will not solve it, as the lone reducer handling the skewed value will still be left to process all of its records. The join operation in this particular case does not scale horizontally!
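To see the imbalance concretely (without spoiling the fun of solving it), here is a small, illustrative sketch with synthetic data and made-up names: a key-based join shuffles both sides by key, so every record carrying the hot key lands in the same partition, and counting joined records per partition makes the skew obvious.

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-skew-sketch").getOrCreate()
val sc = spark.sparkContext

// Synthetic web log: roughly 80% of the rows carry the "unknown user" key 0
val weblog = sc.parallelize(1 to 1000000).map { i =>
  val userId = if (i % 10 < 8) 0L else (i % 100000).toLong
  (userId, s"/file/${i % 500}")
}
val users = sc.parallelize(0L to 100000L).map(id => (id, s"user-$id"))

// The join shuffles both sides by key; all records with key 0 end up in one partition
val joined = weblog.join(users, new HashPartitioner(64))
val recordsPerPartition = joined
  .mapPartitionsWithIndex((p, it) => Iterator((p, it.size)))
  .collect()
recordsPerPartition.sortBy(-_._2).take(5).foreach(println)  // one partition dwarfs the rest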

Wrapping up

This problem clearly highlights the fact that while it is possible to translate popular algorithms easily into Map-Reduce, a lot of attention to detail is required to make them really scalable for all forms and distributions of data.

It is fair to say that, since Platfora is the #1 scalable data analytics platform running natively on top of Hadoop, our customers use joins extensively in our data processing pipeline. Identifying and solving these kinds of algorithmic problems in the platform itself frees the business analyst from having to deal with these kinds of data distribution challenges – Platfora simply takes care of it!

And, if you’re still reading, engineering problems at scale certainly excite you…why don’t you check out our job openings?


September 29, 2014

Eric Rowell



Introducing Canteen, the Ultimate HTML5 Canvas Testing Library


Let’s be honest.  Writing HTML5 Canvas unit tests is hard.  Your first idea for testing HTML5 Canvas apps was probably to try out image comparisons.  After a bit more thought, you may have realized that you could do simple string comparisons by encoding the canvas into data URIs via the canvas.toDataURL() method.  If you’ve tried either of these approaches, you’ve probably, and painfully, realized that most of your tests will fail in different browsers, and even in different versions of the same browser.  Why?  Because all browsers and browser versions have slightly different outputs for the same canvas commands due to internal calculations, anti-aliasing, fonts, and other factors.

So what’s a developer to do?  Should you manage different comparison images or data URIs for every browser, and every version, for every test?  Should you use image recognition programs to try and detect graphical patterns in your canvas outputs, or even try to formulate fancy threshold algorithms to reduce the chance of test failures?  It’s okay if you’ve answered “umm, no thanks” to any of these questions.  Is there a sane way to test HTML5 Canvas?

Test Smarter, not Harder

So how can we test the visual output of the HTML5 Canvas without using the error-prone methods of image comparisons or data URI string comparisons?  What if we could record all of the drawing instructions that go into making up the visual output?  These instructions would not only yield a testable output, but they would also work across browsers and browser versions.  As it turns out, we can.

We can essentially record all of the drawing instructions, such as method calls and property changes, by creating a wrapper around the HTML5 Canvas context object that records all of these instructions and then proxies them to the native HTML5 Canvas context for rendering.  At a high level, this is exactly what Canteen does, and it makes testing HTML5 Canvas outputs incredibly easy.

The Canteen Project

Canteen is short for “Canvas in Test Environments”, and can be thought of as a mechanism to “store” drawing instructions.  Canteen was created at Platfora to make HTML5 Canvas testing as easy as possible.  We’ve open sourced it so that the developer community can benefit from it as well.

Since Canteen automatically wraps the HTML5 Canvas context whenever the canvas getContext() method is called from your application code, or from third party libraries and frameworks, all you need to do is include the Canteen library somewhere in your page, like this:


<script src="canteen.min.js"></script>


Once that’s in place, you can use Canteen like this:


var canvas = document.getElementById('canvas');
// when getContext() is called, Canteen automatically instantiates
// and returns a Canteen canvas context wrapper
var context = canvas.getContext('2d');

// draw stuff
context.beginPath();
context.arc(50, 50, 30, 0, Math.PI * 2, false);
context.fillStyle = 'red';
context.fill();

// return a strict array of the instruction stack
var stack = context.stack();

// return a strict json string of the instruction stack, i.e. [{"method":"beginPath","arguments":[]},{"method":"arc","arguments":[50,50,30,0,6.283185307179586,false]},{"attr":"fillStyle","val":"red"},{"method":"fill","arguments":[]}]
var json = context.json();

// return a strict md5 hash of the instruction stack, i.e. "593812a5c4abaae60c567bf96e59631d"
var hash = context.hash();

// example unit test assertion
assert.equal(context.hash(), '593812a5c4abaae60c567bf96e59631d'); // test passes

// clear the stack
context.clear();

Pretty cool, right?  The general idea is to draw stuff like you would normally, and then at any point access the instruction stack via the stack(), json(), or hash() method.  The hash() method is particularly useful for unit tests because it’s really easy to copy and paste the hash string into your unit test suite whenever you’ve made a change to your app that produces a change in visual output.

Next Steps

You can find the source code or download the build from the Github repo here:


Let us know what you think!  Now get out there and make the world a slightly less buggy place.


September 15, 2014

Jerry Lu



Telemetry – Development on Reality

Good, fast, cheap: every company wants it all. Yet where do we draw the line on over-engineering, on over-simplifying the product, or on whether a design performs well enough? Telemetry has the answers.

In the early days of Platfora, before it was first released, we were stuck with a problem: how do we confirm, realistically, that features built into the product were done correctly? It’s easy to say “we build a user-friendly product”, but “user friendly” can encompass many things, such as performance (reaction time on clicks, page loads), feature visibility (whether visual elements can be found and are frequently clicked), and scalability (how “big” is Big Data in our perception versus our customers’?).

After all, we aren’t a cloud company, so we don’t have the advantage of having customer data on our own servers. That is why we built in a system to send telemetry to our servers from day 1, gaining customer insight as long as customers opt in.

How do we use it?

We’ve written an earlier post on how we use telemetry for user experience, so I’ll skip that. Below are some use cases showing how telemetry has helped us in development. There’s more than what’s listed below, and we’re adding new telemetry every day.

Automated Performance Collection

We have automated machines that run our test cases (UI/Selenium, JUnit/API testing) against boxes with static data. Every day, we get to see how performance has regressed or improved over time.

Since telemetry is baked in and does NOT rely on instrumentation tools (e.g. JProfiler), the metrics are real, without profiler lag. We also get the same metrics from customer usage, allowing us to compare our performance test scenarios to reality.

UI Performance times

This includes JavaScript execution time, page load times, and DOM processing times. We developers tend to have powerful machines and modern browsers; this data gives us a realistic browser User-Agent breakdown, with the associated execution times and actual latencies. What’s fast in our development environment may not be fast in the customer’s environment.

Consumption patterns for scalability

How “big” is “Big Data” for our customers? While we continue to push the limits of Big Data, we often need to balance high performance against realistic customer data sizes and specific types of data skew. A car that achieves the highest top speed of 250 mph may not have the best 0-60 acceleration time. Therefore we frequently look at customers’ aggregated data sizes, and at how we perform at different tiers.

How do we collect it?

Rules of telemetry

Given that telemetry collection involves collecting customer usage patterns, we need to be very careful about what we collect. The following principles apply when we collect telemetry:

  • Telemetry collected MUST be anonymous: During the collection phase, all origin information is obfuscated from the telemetry data.
  • Metrics are all we collect: We never collect actual data, which may contain sensitive information; metrics are all that is needed to see usage patterns.
  • Telemetry collection should not obstruct regular execution: Telemetry collection should have a minimal footprint; it should not impact performance, nor cause failures in regular functionality.


Here’s an architectural overview of how telemetry works. In Platfora’s running process, we have a separate thread that is responsible for spilling telemetry lines to disk. The collection process goes as follows:

  • The thread drains a queue: when new telemetry data is collected, the data is placed on the queue and the execution thread continues. This allows execution to proceed without waiting if the telemetry load builds up (a minimal sketch of this hand-off follows the list).
  • Every 5 minutes, the telemetry files are rolled over and sent to our telemetry servers over SSL-encrypted connections, with client certificates for authentication. This way only actual customers can submit telemetry, and the encrypted connection provides added security.
  • The telemetry servers receive the files and write them to disk immediately, preventing data loss. The servers are also stateless, allowing us to spin up more servers dynamically behind a load balancer as the telemetry load increases.
  • Every 6 hours, we concatenate the received files and place them in HDFS, and then use Platfora itself to visualize them.
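The hand-off in the first step can be sketched as follows; this is only an illustration with made-up names, not Platfora’s actual implementation, and it covers just the enqueue-and-spill part of the pipeline:

import java.io.{BufferedWriter, FileWriter}
import java.util.concurrent.LinkedBlockingQueue

// Callers enqueue telemetry lines and return immediately;
// a single daemon thread drains the queue to disk.
class TelemetrySpiller(path: String) {
  private val queue = new LinkedBlockingQueue[String]()

  private val spiller = new Thread(() => {
    val out = new BufferedWriter(new FileWriter(path, true))
    try {
      while (!Thread.currentThread().isInterrupted) {
        val line = queue.take()            // blocks only this background thread
        out.write(line); out.newLine(); out.flush()
      }
    } catch {
      case _: InterruptedException => ()   // shutting down
    } finally out.close()
  })
  spiller.setDaemon(true)
  spiller.start()

  // Called from request threads: never blocks on I/O
  def record(line: String): Unit = queue.offer(line)
}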

August 19, 2014

Max Seiden



Promise[ing] Cache Concurrency

Platfora is a full-stack platform for interactive, collaborative big-data analytics on top of Hadoop. As such, it is composed of many complex sub-systems and components. On the data processing front, this includes a MapReduce-based data pipeline and an in-memory columnar MPP query processor. For end users, this includes multiple complex single-page apps – covering ETL, data cataloging, visualization, and system administration – which are all powered by a feature-and-function-rich web application. In isolation, each of these systems has its own set of challenges and concerns; in concert, these systems must both make and hold equally challenging assumptions regarding performance and reliability. Needless to say, we have a wide variety of exciting engineering and product challenges.

For this post, however, I’ll focus on the design and implementation decisions behind one of our system’s numerous caches: our “dfs cache”. An instance of this component resides on each node in a Platfora cluster and manages loading artifacts, produced by our MR pipeline, from a distributed file system (Hadoop / S3) onto local disk. Depending on the size of the data, the number of distinct artifacts can be quite large, and individual artifacts can reach GBs in size. Additionally, the cache ensures consistency between local disk and the DFS, enforces constraints (e.g. disk usage), and synchronizes both concurrent consumers and internal operations. It’s also a very popular resource (so to speak): almost every major component of the system that touches data uses it in some way. Thus it must provide high-performance, reliable, and correct behavior in a variety of situations, including concurrent query execution and cluster rebalancing.

Concurrency is not a Solved Problem

Of the responsibilities listed above, achieving high performance in the face of numerous, varied, concurrent consumers is both necessary and challenging, and it introduces three primary requirements. The first is that the cache must respond quickly to all incoming requests – particularly those from the query processor – so as not to become a performance bottleneck. The second is that it cannot undertake exclusive operations; in practice this means that cluster rebalancing (which is generally an expensive operation) must not block or noticeably impact query execution. Lastly, it must properly synchronize concurrent consumers that request the same data, particularly when the load time is non-negligible and the number of consumers can become quite large.

Given the description above, it is clear that the cache must have a highly concurrent internal implementation that maximizes data-load throughput. It must also provide an interface that transparently synchronizes concurrent consumers, without exposing the cache’s internal concurrency mechanisms, and without unnecessarily impacting a consumer’s performance, or blocking it altogether. In short, it must be both fast and correct, to fulfill its requirements. Multithreaded programs are historically difficult to write though, and arguably even harder to test thoroughly. In order to guarantee atomicity, one must employ synchronization mechanisms such as locks, while in order to achieve high performance, one must reason about which computations can run concurrently. This balancing act typically results in complex bookkeeping, nuanced race-conditions, and gnarly heisenbugs, all of which can result in both increased development time, as well as higher incidences of bugs or regressions. As such, it’s possible that chasing concurrency bugs can soak up precious developer time, which would be better applied towards core engineering and product needs.

Scala Promises a Better Future

For our use case, waiting for data to load has the potential to block a consumer’s thread, which unnecessarily ties up one of the consuming component’s resources. Furthermore, given that many of the distributed, concurrent components in our system are built on top of the Akka framework, operations that may block a thread for any significant amount of time are generally considered an anti-pattern. Thus, in order to obviate the possibility of blocking a consumer thread, a callback-based model of “synchronization” is advantageous. This technique is hardly new or obscure: Node.js has certainly capitalized on its benefits, and message passing in the Akka Actor framework yields a similar property. It’s also not a panacea for addressing blocking or contention. It is, however, a handy tool to have in the toolbox when the right situation arises.

Using Scala’s Future[V] and Promise[V] primitives for synchronization and cache state management (respectively) allows for a callback-based design – spray.io’s cache module works on this principle. More importantly, however, the use of these primitives greatly simplifies the internal implementation of the component itself. An instance of Future represents an asynchronous computation that will eventually complete with either success or failure. An instance of Promise is the mechanism through which an associated Future can be completed. Additionally, when a consumer registers a callback with a Future instance, it must provide its own execution context (e.g. a thread pool) on which to run the callback – this explicitly decouples the computation represented by the Future from the computation embodied by the callback. (For a more thorough overview of these primitives, check out Scala’s documentation.)

Going back to the cache, the description above has the following implications. First, the internal hash table that maintains the cache entries now maps keys to promises; an entry need not be replaced regardless of whether it is loading, has loaded successfully, or has failed while loading. This alone obviates the need for an explicit, synchronized “loading-vs-loaded” bit in the hash table. Second, since the promise never needs to be replaced, the cache can simply return its associated future to all consumers that request the entry. If the entry has completed loading, a consumer’s callback executes immediately; otherwise the callback is put onto a queue and executed when the cache completes the load by “fulfilling” the promise. In both cases, the result (success or failure) propagates to all consumers in a consistent and decoupled manner. Lastly, while the use of promises and futures removes the need for explicit locking of cache entries and ensures that consumers are not blocked, synchronization is still required for cache-internal data structures, such as the hash table or cache metadata. However, the scope of this synchronization is much more contained, and can easily be addressed with Akka Actors – a coordinator maintains the hash table and metadata, and delegates entry loading to a pool of workers.

As a side note, for those who prefer actual code to written descriptions, I’ve posted a Gist with a basic example of this pattern.
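To give a feel for the shape of the pattern without opening the Gist, here is a condensed, illustrative sketch (not Platfora’s code: names are made up, and it guards the map with a plain lock where the real design delegates that coordination to an Akka actor):

import scala.collection.mutable
import scala.concurrent.{ExecutionContext, Future, Promise}

// Promise-per-entry cache sketch: the first requester triggers the load,
// and every later requester for the same key shares the same Future.
class PromiseCache[K, V](load: K => V)(implicit loadCtx: ExecutionContext) {
  private val entries = mutable.Map.empty[K, Promise[V]]

  def get(key: K): Future[V] = {
    val (promise, isNew) = entries.synchronized {
      entries.get(key) match {
        case Some(existing) => (existing, false)
        case None =>
          val p = Promise[V]()
          entries.put(key, p)
          (p, true)
      }
    }
    // Only the first caller kicks off the load; success or failure
    // propagates through the shared Future to every consumer.
    if (isNew) promise.completeWith(Future(load(key)))
    promise.future
  }
}

Consumers attach callbacks to the returned Future on their own execution context, and failures reach every waiter the same way successes do, which matches the behavior described above.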

Wrapping Up

As described above, the use of Scala’s Promise and Future primitives can alleviate some of the complexities that typically arise when implementing concurrent systems. Furthermore, when coupled with the Akka framework, the Actor model yields analogous benefits in the context of implementing distributed systems. In both cases, the benefits are not just seen in reduced component complexity; they also manifest as increased code readability and as concurrency models that are generally easier to reason about and more idiomatic in the context of Scala’s best practices. Most importantly, leveraging these types of tools has enabled Platfora Engineering to spend less time on solved problems, and more time attacking the real challenges of building a powerful data analytics platform.