A fair benchmark for Impala 2.0 - results

Monday, Dec 1st, 2014

We have benchmarked Impala 2.0 and found that it scales very well with dataset size, number of nodes and a limited number of users. It lacks some advanced SQL functionality, and we experienced serious stability issues when testing multiple users querying the system concurrently.

Context

As discussed in our previous blog post, we've taken a critical look at the original TPC-DS benchmarks of Impala. A lot has happened since then. For one, Cloudera has responded. We don't agree on everything, but they appear to be changing some aspects of their next benchmarks. We don't know whether Cloudera is applying those changes because of our blog post or whether they were already on the roadmap. Either way:

  • Numbers will be stored as DECIMAL instead of DOUBLE, as the specification states
  • Query modification will be limited in the future. We would like to see NO modification, but it is a start
  • More queries will be added to the benchmark

Also, Cloudera announced their commitment to work on BigBench, a new industry benchmark for Big Data systems. We're excited to see what the future brings.

Setup

All code is available on GitHub. Our setup is a small 5-node cluster:

  • Intel Core i5
  • 16 GB RAM
  • 500GB SSD
  • 1Gbit ethernet

We installed Cloudera CDH 5.2 with Impala 2.0 and used 1 node as the master node. The other 4 nodes all run Impala daemons. No other workloads were running on the cluster while we were executing the benchmark.
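
Queries were submitted from a Python client, impyla (more on that under Stability). Below is a minimal sketch of what such a connection looks like; the hostname and database name are placeholders, and 21050 is Impala's default HiveServer2 port.

    # Minimal sketch of submitting a query through impyla, the Python DB-API client
    # used in this benchmark. Hostname and database name are placeholders.
    from impala.dbapi import connect

    conn = connect(host='impala-daemon-1', port=21050)  # any of the 4 daemon nodes
    cur = conn.cursor()
    cur.execute('USE tpcds')                            # hypothetical database name
    cur.execute('SELECT COUNT(*) FROM store_sales')     # quick sanity check before a run
    print(cur.fetchall())
    cur.close()
    conn.close()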

Dataset sizes

We ran tests on 1GB, 10GB, 50GB, 100GB, 200GB and 300GB of data. This can be considered a small dataset; some would argue too small. That all depends on your use case, of course. For us, this is an interesting testing range for 3 reasons:

  • We wanted to know whether running Impala on 5 commodity nodes could be an alternative to running a proprietary database on a single, more expensive node. Many 'traditional' data warehouses are between 5GB and 500GB in size.
  • We wanted to detect a performance difference between in-memory datasets and disk-based datasets. The smallest datasets easily fit in memory; the bigger ones definitely don't. The 4 Impala daemons in our setup have a total of 48GB available: we assigned 12GB to each daemon and left 4GB per node for the OS and other overhead (see the short sketch after this list).
  • We assumed our results could be extrapolated to bigger datasets and better hardware because of the linear scalability properties of Cloudera Impala. These assumptions have partly been confirmed in our tests and will need further validation.
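
As a quick illustration of that sizing arithmetic, here is a minimal sketch (not part of the benchmark code); the only figures used are the 4 daemons and 12GB per daemon from the setup above.

    # Minimal sketch of the memory arithmetic above. The figures (4 daemons, 12GB each)
    # come from the setup described in this post; nothing here is measured.
    DAEMONS = 4
    MEM_PER_DAEMON_GB = 12
    aggregate_gb = DAEMONS * MEM_PER_DAEMON_GB  # 48GB across the cluster

    for dataset_gb in (1, 10, 50, 100, 200, 300):
        ratio = dataset_gb / aggregate_gb
        print(f"{dataset_gb:>3}GB dataset = {ratio:.2f}x the {aggregate_gb}GB of aggregate daemon memory")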

Results

We tested Impala on three aspects:

  • Functionality: How many queries can Impala successfully run?
  • Scalability: How does Impala behave with increasing data, increasing workloads and increasing resources?
  • Stability: How consistent are the results? How many crashes did we experience?

Functionality

We've run all 99 queries of the benchmark. Of those 99 queries, we were able to execute 53, slightly more than half. The queries that failed to execute often required advanced SQL functionality that is not available in Impala. This is not necessarily an issue: it's good to be aware of the limitations of a platform and to think of ways to work around them (a sketch of one such rewrite follows the list below). In particular, the following missing functionality caused the most failures:

  • No support for a specific kind of JOIN
  • No support for ROLLUP
  • No support for correlated subqueries
  • No support for INTERSECT and EXCEPT
  • No support for subqueries in the HAVING clause
  • ORDER BY errors
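
As an illustration of such a workaround, here is a sketch of how an INTERSECT (unsupported) could be rewritten with a LEFT SEMI JOIN plus DISTINCT, which Impala does support. The table and column names are invented for the example, it is not taken from the benchmark queries, and the rewrite is not strictly equivalent to INTERSECT when the join columns contain NULLs.

    # Hypothetical example: rewriting an unsupported INTERSECT as a LEFT SEMI JOIN.
    # Table and column names are invented; semantics differ from INTERSECT for NULLs.
    from impala.dbapi import connect

    # The unsupported form, kept here for comparison:
    # SELECT c_customer_id FROM store_customers
    # INTERSECT
    # SELECT c_customer_id FROM web_customers

    workaround = """
        SELECT DISTINCT s.c_customer_id
        FROM store_customers s
        LEFT SEMI JOIN web_customers w
          ON s.c_customer_id = w.c_customer_id
    """

    conn = connect(host='impala-daemon-1', port=21050)  # placeholder hostname
    cur = conn.cursor()
    cur.execute(workaround)
    print(cur.fetchall())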

Scalability

To measure the scalability of a system, we considered 3 dimensions: increasing data, increasing workloads and increasing resources. Ideally, a database should scale linearly across all of these dimensions. Scaling linearly means a system behaves predictably. With linear scaling, we can estimate the impact of adding new users, adding new hardware or increasing the dataset size.

Increasing data

Impala scales very well with increasing data. The chart is quite clear about this: the scaling is linear in most cases, which makes for a very predictable system. The slowest queries, represented by the 75th percentile, are a bit more flaky, but most queries are very well-behaved, and sometimes doubling the data results in less than a doubling of execution time. This is probably because a lot of data can be discarded early by filtering.

Increasing resources

What happens if we add more nodes to the cluster? How much speedup can we gain if we go from 2 nodes to 4? In the ideal case we would see a performance increase of 100%: doubling the number of nodes should cut query times in half. But more nodes also mean more communication overhead, so this ideal is rarely met.

The performance increase varies a lot by query. At the very least, we see a 10% performance improvement. A lot of queries run 60-70% faster, and some even get into the 90% range, which is really good.
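
For clarity, these percentages read as relative speedups: assuming the improvement is computed from the 2-node and 4-node elapsed times as below, a 100% increase means the query time was cut in half. The times in the sketch are made-up examples, not measurements.

    # Sketch of the speedup arithmetic, assuming improvement = (t_2nodes / t_4nodes - 1) * 100,
    # so 100% means the elapsed time was halved. Example times are illustrative only.
    def performance_increase(t_2_nodes: float, t_4_nodes: float) -> float:
        return (t_2_nodes / t_4_nodes - 1.0) * 100.0

    print(performance_increase(60.0, 30.0))  # 100.0 -> ideal linear speedup
    print(performance_increase(60.0, 55.0))  # ~9.1  -> around the 10% lower bound
    print(performance_increase(60.0, 35.0))  # ~71.4 -> in the 60-70%+ range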

It would be useful to test going from 4 nodes to 8, or from 8 to 16, but we didn't have the hardware to perform those tests. Note that we only increased the number of Impala daemons; in both cases we also used 1 Impala master, installed on a separate node.

Increasing workloads

Impala scales very well with increasing workload. Here we test multiple users concurrently executing a similar workload. Adding users adds a predictable extra load to the system. Again, the 75th percentile is a bit more flaky.
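
As a rough sketch of how such a multi-user run can be driven (this is not the actual benchmark harness, which is on GitHub; the hostname, query list and user count are placeholders):

    # Rough sketch of a concurrent-user run: each simulated user opens its own impyla
    # connection and executes the same query stream. All names below are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    from impala.dbapi import connect

    QUERIES = ["SELECT COUNT(*) FROM store_sales"]  # stand-in for a TPC-DS query stream
    NUM_USERS = 5

    def run_user(user_id: int) -> None:
        conn = connect(host='impala-daemon-1', port=21050)
        cur = conn.cursor()
        for sql in QUERIES:
            cur.execute(sql)
            cur.fetchall()                          # drain the results like a real client
        cur.close()
        conn.close()

    with ThreadPoolExecutor(max_workers=NUM_USERS) as pool:
        list(pool.map(run_user, range(NUM_USERS)))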

Stability

In the single-user scenarios we didn't experience any stability issues. However, when scaling to multiple users, Impala crashed so often that it became impossible to do any more measurements. We don't know whether it's related to the driver we're using (impyla) or whether it's something in Impala itself. We do know that the entire Impala service goes down and needs to be restarted; the client doesn't crash. We've been in contact with Cloudera about this, and they say they can easily scale to tens or hundreds of users using the Impala shell. I have no doubt they do; it's just not something we could replicate on our test setup. We even tried with the smallest dataset size, 1GB, to make sure it wasn't some kind of memory issue, but even then we were unable to complete a single TPC-DS run with more than 5 users.

Kris
Data architect

Comments

Thanks for sharing this information.
Have you been able to track down the concurrency reliability issue?
This sounds weird, as this is supposed, according to Cloudera, to be where their technology really shines.

When we sent you an email on November 12th, you mentioned you had trouble setting up partitioning and had system health issues in the graph you sent. Were you able to resolve these prior to publication, and if so, can you share updated details? It also looks like you have run into additional unexpected issues since my email. For the future, we'd be happy to engage if brought into the process early so we can help address the admitted issues in your setup prior to benchmarking. I'd like to highlight some additional key suggestions:
Setup - As described in the blog post, I do not think your node hardware reflects typical Hadoop nodes, as it has a desktop processor, a single SSD, and only 16GB of RAM. To do an accurate benchmark, I'd expect hardware reflective of realistic deployments.
Client - Impyla is a project in development and not yet ready for support. I suggest running future benchmarks using impala-shell, ODBC, or JDBC. You had indicated in your email on November 12th that you were planning to run impala-shell. Did you get a chance to do those runs?

We do appreciate your interest in Impala and are happy to work with you on your benchmark setup if you would like to engage ahead of time.

@Laurent: No, we were unable to track down the issues with concurrency. The first suspect would be Impyla, but it's not only the client that crashes: the Impala daemons themselves go down, taking the entire service with them.

@Dileep: Thanks for replying. I really appreciate the Cloudera team being present in the community. I don't mind involving you as early as possible, but we do have to be able to run independent benchmarks. Otherwise, we might end up with something like this blog post, where IBM is being helped by an 'independent company' to do benchmarking, yet there is not a single negative remark about the IBM offering.

- The system health issues remained. Those were caused by the concurrency crashes, by the way: the graph I sent was showing the Impala daemons repeatedly going down. So the system itself was as healthy as can be.

- You're right about the setup. Production Hadoop clusters don't run on desktop hardware. 32-core processors with 128GB of RAM will handle concurrency much better, no doubt. We are working with a research institution to replicate the results on that kind of server. Still, I do think some lessons can be learned from the current setup. For one, I would like to take the advances made in Big Data engineering and apply them to smaller dataset sizes (gigabytes instead of terabytes), where smaller, older servers would suffice.

- Client: Unfortunately, due to time constraints, we had no chance to implement another client. We will take this into account for future testing.

@Kris, what kind of network switch was used for the benchmark? Switches play a big role in Big Data performance, especially where operations such as distributed joins are being performed. This would affect scalability: the nodes could easily saturate the backplane bandwidth if it isn't high enough, so adding more nodes wouldn't give the expected performance increases.

For Impala-based workloads, 10Gbps is progressively becoming the standard; again, this would dramatically affect performance.

Below is a reference architecture for Hadoop workloads which gives some useful insights. I appreciate it's vendor-led, but the fundamentals apply regardless of supplier.

http://www.cisco.com/c/dam/en/us/td/docs/unified_computing/ucs/UCS_CVDs/...

@Justin: a standard 1Gbps consumer-grade switch. We are doing some repeat testing on bigger, more production-grade hardware. Happy to share once we have results.
