A fair benchmark for Impala 2.0 - results
Monday, Dec 1st, 2014
We have benchmarked Impala 2.0 and found that it scales very well in terms of dataset size, number of nodes and a limited number of users. It lacks some advanced SQL functionality, and we experienced serious stability issues when testing multiple users querying the system concurrently.
As discussed in our previous blog post, we've taken a critical look at the original TPC-DS benchmarks of Impala. A lot has happened since. For one, Cloudera has responded. We don't agree on everything, but they appear to be changing some aspects of their next benchmarks. We're not sure whether Cloudera is applying those changes because of our blog post or whether they were already on the roadmap. Either way:
- Numbers will be stored as DECIMAL instead of DOUBLE, as the specification states (the short example after this list illustrates why this matters)
- Query modifications will be limited in the future. We would like to see NO modifications at all, but it is a start
- More queries will be added to the benchmark
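Why the DECIMAL point matters: binary floating point cannot represent most decimal fractions exactly, which can skew aggregates over monetary columns. A quick illustration in plain Python:

```python
from decimal import Decimal

# Binary floats accumulate representation error on decimal fractions...
print(sum([0.1] * 10))              # 0.9999999999999999

# ...while DECIMAL-style exact arithmetic does not, which matters for the
# price and cost columns that TPC-DS queries aggregate over.
print(sum([Decimal("0.1")] * 10))   # 1.0
```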
Also, Cloudera announced their commitment to work on BigBench, a new industry benchmark for Big Data systems. We're excited to see what the future brings.
All code is available on GitHub. Our setup is a small 5-node cluster:
- Intel Core i5
- 16 GB RAM
- 500 GB SSD
- 1 Gbit Ethernet
We installed Cloudera CDH 5.2 with Impala 2.0 and used 1 node as the master node; the other 4 nodes all run Impala daemons. No other workloads were running on the cluster while we executed the benchmark.
We ran tests on 1GB, 10GB, 50GB, 100GB, 200GB and 300GB of data. This can be considered a small dataset; some would argue too small. This all depends on your use case, of course. For us, this is an interesting testing range for 3 reasons:
- We wanted to know whether running Impala on 5 commodity nodes could be an alternative to running a proprietary database on a single, more expensive node. Many 'traditional' data warehouses are between 5 GB and 500 GB in size.
- We wanted to detect a performance difference between in-memory datasets and disk-based datasets. The smallest datasets easily fit in memory; the bigger ones definitely don't. The 4 Impala daemons in our setup have a total of 48 GB available: we assigned 12 GB to each daemon and left 4 GB per node for the OS and other overhead. The short calculation after this list makes the boundary explicit.
- We assumed our results could be extrapolated to bigger datasets and better hardware because of the linear scalability properties of Cloudera Impala. These assumptions were partly confirmed in our tests, but deserve further validation.
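A trivial back-of-the-envelope calculation, using the numbers from our setup, shows where the in-memory boundary sits:

```python
daemons = 4
mem_per_daemon_gb = 12                       # 12 GB per daemon, 4 GB per node for the OS
total_mem_gb = daemons * mem_per_daemon_gb   # 48 GB across the cluster

for size_gb in [1, 10, 50, 100, 200, 300]:
    verdict = "fits in memory" if size_gb <= total_mem_gb else "spills to disk"
    print(f"{size_gb:>3} GB dataset: {verdict}")
```

Note that this ignores file-format overhead and the working memory that joins and aggregations need, so in practice 50 GB is already borderline.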
We tested Impala on three aspects:
- Functionality: How many queries can Impala successfully run?
- Scalability: How does Impala behave with increasing data, increasing workloads and increasing resources?
- Stability: How consistent are the results? How many crashes did we experience?
We've run all 99 queries of the benchmark. Of those 99 queries, we were able to execute 53, slightly more than half. The queries that failed often required advanced SQL functionality that is not available in Impala. This is not necessarily an issue: it's good to be aware of the limitations of a platform and to think of ways to work around them. In particular, the following missing functionality caused the most failures (the probe sketch after this list shows how such gaps can be checked):
- No support for a specific kind of JOIN
- No support for ROLLUP
- No support for correlated subqueries
- No support for INTERSECT and EXCEPT
- No support for subqueries in the HAVING clause
- ORDER BY errors
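To get a feel for which constructs fail, a minimal probe along these lines can be used. This is a sketch, not our actual harness: the host name is hypothetical, and the probe queries merely exercise each feature against a made-up table `sales(region, amount)`.

```python
from impala.dbapi import connect

# Hypothetical host; port 21050 is the default HiveServer2 port for Impala.
conn = connect(host="impala-master.example.com", port=21050)
cur = conn.cursor()

# Tiny queries, each exercising one of the features listed above.
probes = {
    "ROLLUP": "SELECT region, SUM(amount) FROM sales GROUP BY ROLLUP(region)",
    "INTERSECT": "SELECT region FROM sales INTERSECT SELECT region FROM sales",
    "correlated subquery": """
        SELECT region FROM sales s
        WHERE amount > (SELECT AVG(amount) FROM sales WHERE region = s.region)
    """,
    "subquery in HAVING": """
        SELECT region, SUM(amount) FROM sales GROUP BY region
        HAVING SUM(amount) > (SELECT AVG(amount) FROM sales)
    """,
}

for feature, sql in probes.items():
    try:
        cur.execute(sql)
        cur.fetchall()
        print(f"{feature}: supported")
    except Exception as exc:  # impyla raises a driver-specific error on failure
        print(f"{feature}: failed ({exc})")
```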
To measure the scalability of a system, we considered 3 dimensions: increasing data, increasing workloads and increasing resources. Ideally, a database should scale linearly across all of these dimensions. Scaling linearly means a system behaves predictably. With linear scaling, we can estimate the impact of adding new users, adding new hardware or increasing the dataset size.
Impala scales very well with increasing data. The chart is quite clear about this: the scaling is linear in most cases, which makes for a very predictable system. The slowest queries, represented by the 75th percentile, are a bit more flaky, but most queries are very well-behaved, and sometimes a doubling of the data results in less than a doubling of execution time. This is probably because a lot of data can be discarded early by filtering.
What happens if we add more nodes to the cluster? How much speedup can we gain if we go from 2 nodes to 4? In the ideal case we would see a performance increase of 100%: doubling the number of nodes should cut the query times in half. But more nodes mean more communication overhead, so this ideal is rarely met.
The performance increase varies a lot by query. At the very least, we see a 10% performance improvement. A lot of queries perform 60-70% faster. Some queries actually get into the 90% range, which is really good.
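To be precise about how we express these percentages: a 100% improvement means the query time is cut in half, i.e. we report the relative increase in throughput (1/time), not the relative decrease in time. A quick worked example with made-up timings:

```python
def speedup_pct(t_before: float, t_after: float) -> float:
    """Relative throughput increase: 100% means the time was halved."""
    return (t_before / t_after - 1) * 100

print(speedup_pct(10.0, 5.0))   # 100.0 -> the ideal case for doubling the nodes
print(speedup_pct(10.0, 6.0))   # ~66.7 -> typical of many of our queries
print(speedup_pct(10.0, 9.0))   # ~11.1 -> the worst cases, around 10%
```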
It would be useful to test going from 4 nodes to 8, or from 8 to 16, but we didn't have the hardware to perform those tests. Note that we only increased the number of Impala daemons; in both cases we also used 1 Impala master, installed on a separate node.
Impala scales very well with increasing workload. Here we test multiple users concurrently executing a similar workload. Adding users adds a predictable extra load to the system. Again, the 75th percentile is a bit more flaky.
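For reference, the concurrency test boils down to something like the sketch below: one thread per simulated user, each running the same query stream through impyla. The host and the query list are hypothetical placeholders, not our actual benchmark code.

```python
import threading
import time
from impala.dbapi import connect

# Placeholder for a TPC-DS query stream; store_sales is a real TPC-DS table.
QUERIES = ["SELECT COUNT(*) FROM store_sales"]

def run_user(user_id: int, host: str = "impala-master.example.com") -> None:
    # Each simulated user gets its own connection, mirroring real clients.
    conn = connect(host=host, port=21050)
    cur = conn.cursor()
    for sql in QUERIES:
        start = time.time()
        cur.execute(sql)
        cur.fetchall()
        print(f"user {user_id}: {time.time() - start:.2f}s")
    conn.close()

# Launch 5 concurrent users and wait for all of them to finish.
threads = [threading.Thread(target=run_user, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```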
In the single-user scenarios we didn't experience any stability issues. However, when scaling to multiple users, Impala crashed so often that it became impossible to do any more measurements. We don't know if it's related to the driver we're using (impyla) or whether it's something in Impala itself. We do know that the entire Impala service goes down and needs to be restarted; the client doesn't crash. We've been in contact with Cloudera about this, and they say they can easily scale to 10s or 100s of users using the Impala shell. We have no doubt they can. It's just not something we could replicate on our test setup. We even tried with the smallest dataset size, 1GB, to make sure it wasn't some kind of memory issue. But even then, we were unable to complete a single TPC-DS run with more than 5 users.