A fair benchmark for Impala 2.0 - part 1

Friday, Oct 31st, 2014

In the last couple of weeks, we've been studying Impala and trying to replicate Cloudera's TPC-DS benchmark results (here and here) on Impala 2.0, their latest release. This is important to us because some of our clients own a large enterprise data warehouse and are looking for options to migrate away from the expensive licenses offered by the dinosaurs. With Impala being a prominent, scalable, open-source SQL database, we need to understand it inside-out.

Cloudera published their code publicly, and they are clear about not comparing their results to official TPC-DS results. Yet I still feel they are at least bending the truth in their favour. This is unhelpful and unnecessary. Unhelpful because, with those tweaks, the tests performed are further away from actual real-life scenarios than the original TPC-DS tests; more on that later. Unnecessary because Impala 2.0 is actually a pretty good SQL database. I don't understand why they needed to hype up the results even more. Well, actually, I can guess: it's probably their marketing machine working overtime.

How Cloudera bends the truth and why it is unhelpful

In short:

  • they are not honest about the dataset sizes and structure
  • they cherry-pick and tweak the queries
  • their use of partitioning is too complex

This is unhelpful because it doesn't reflect real-life data warehouses anymore. Let's dive into the details.

1. They are not honest about dataset sizes

They are subtle about this. When they say 'a 10TB scale-factor dataset', they actually mean 'a 4TB dataset'. That's not a rounding error. How can this be? Because, with the 10TB scale factor, they only create 9 of the 18 dimension tables, and only 1 of the 6 fact tables. Here's an overview of the TPC-DS data warehouse at various scale factors, with the tables that are actually loaded highlighted in green:

2. They change the datatypes of the columns

In the one fact table they do load, store_sales, they change the data types of the facts from DECIMAL to DOUBLE. Typically, processing doubles is faster than processing decimals. But there's a reason why those facts are DECIMAL: monetary amounts call for exact arithmetic, and doubles accumulate rounding errors when you aggregate them. They also turn dates into strings.
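To make the change concrete, here is a sketch of what it amounts to for a couple of store_sales columns (column list abridged; the DECIMAL precisions follow the TPC-DS specification):

    -- store_sales as specified by TPC-DS (excerpt)
    ss_sales_price    DECIMAL(7,2),
    ss_net_profit     DECIMAL(7,2),

    -- store_sales as loaded for the benchmark (excerpt)
    ss_sales_price    DOUBLE,
    ss_net_profit     DOUBLE,

Summing billions of DOUBLE values silently accumulates rounding errors; DECIMAL keeps the cents exact, which is the whole point of a financial fact table.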

3. They cherry-pick the queries they choose to benchmark

Why only 24 or so of the 99 queries? We loaded the entire data warehouse and executed all of the queries, and the motive behind the cherry-picking soon became apparent: a lot of the queries simply can't be completed on Impala. The biggest reasons for failure were:

  • IllegalStateException: 'Illegal reference to non-materialized slot'. We're not sure what it means: queries 16, 33, 51, 56, 60, 64, 94, 95, 97
  • Memory limit exceeded. This happened even for datasets that fit in memory, although not always: queries 4, 11, 30, 72, 74, 78, 81
  • No support for certain JOIN constructs: queries 13, 28, 48, 61, 88, 90
  • No support for ROLLUP: queries 5, 18, 22, 67, 77, 80
  • No support for correlated subqueries: queries 6, 32, 58, 69, 92
  • No support for INTERSECT and EXCEPT: queries 8, 14, 38, 87
  • No support for subqueries in the HAVING clause: queries 23, 24, 44
  • ORDER BY error: queries 47, 57
  • Fetching DECIMAL results in the client. This affected several queries, but it was an issue with the impyla client, not the database itself, so we ignored it.

In the end, we managed to execute 54 of the 99 queries. We did do some syntactic rewriting where possible. For instance, date additions and string concatenations use a different syntax in Impala, and we built some regular expressions to fix that. However, we did not rewrite entire queries just to make them run. Which brings me to my next point.
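As an illustration of the kind of rewrite we did allow ourselves (the literals and column names below are made up; the pattern is what our regular expressions handle):

    -- As generated by TPC-DS
    (CAST('2000-03-11' AS DATE) + 14 DAYS)
    c_last_name || ', ' || c_first_name

    -- Rewritten for Impala
    DATE_ADD(CAST('2000-03-11' AS TIMESTAMP), 14)
    CONCAT(c_last_name, ', ', c_first_name)

These are mechanical, line-level substitutions; the structure and meaning of the query stay untouched.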

4. Queries are changed in an unrealistic way

If the syntax your database accepts differs slightly from what TPC-DS generates, it's fine to tweak the queries a bit, like we did for date additions and string concatenations. It is not fine, however, to delete parts of the query that you can't execute, like a ROLLUP. It's also not fine to add extra partitioning clauses, like they do almost everywhere. Why not? Partitioning is a powerful technique to speed up queries, yes, but almost no client will automatically generate the partitioning clause like they did, and you certainly won't do it when you're doing some ad-hoc querying. The database should be smart enough to query only the right partitions behind the scenes. More on that in point 6.
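To show what such an added clause looks like, here is a sketch (the surrogate-key range is made up for illustration; store_sales is partitioned on its date surrogate key):

    -- Join predicate as generated by TPC-DS
    WHERE ss_sold_date_sk = d_date_sk
      AND d_year = 2000

    -- With an extra partition filter of the kind added to the queries
    WHERE ss_sold_date_sk = d_date_sk
      AND d_year = 2000
      AND ss_sold_date_sk BETWEEN 2451545 AND 2451910   -- illustrative key range for 2000

No analyst doing ad-hoc work, and no BI tool generating SQL, knows those surrogate-key boundaries; the filter only exists to help the engine.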

5. The way queries are run is not published

In TPC-DS, you generate several query streams which execute the queries in a random order. This is done to prevent you from preparing for a particular query: like in real life, you don't know what's coming next. In contrast, if you execute the same query 10 times in a row, the database has probably done some kind of caching and your execution gets faster. It's not clear how the queries were executed, how many times, or in which order. This is not published.

Also, the query generation in TPC-DS makes sure the query parameters are random. In one stream, query 1 might filter on Year='2008'; the next stream filters on Year='1992'. These parameters are also dependent on the dataset size, so you should generate different query streams for each dataset size. This is not happening. Instead, they have a fixed set of queries they execute on all datasets.
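For example, the same query template instantiated in two different streams might come out as (made-up values, just to show the idea):

    -- Stream 0
    ... WHERE d_year = 2000 AND s_state = 'TN' ...
    -- Stream 1
    ... WHERE d_year = 1998 AND s_state = 'GA' ...

Different parameters hit different partitions and different data volumes, which is exactly what keeps the benchmark honest.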

6. Partitioning in Impala is not straightforward

As said before, partitioning is a powerful technique to speed up queries. If you have 10 years' worth of data but you only care about last year, a database can be smart enough to ignore 90% of the data and execute your query roughly 10 times faster. To enable this, you partition your data by year: every year gets its own partition, and when you query, the database smartly selects the right partitions for you.
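In SQL terms, the idea is simply this (a sketch with a toy table, not the TPC-DS schema):

    -- One partition per year
    CREATE TABLE sales (item_sk INT, amount DECIMAL(7,2))
    PARTITIONED BY (sale_year INT)
    STORED AS PARQUET;

    -- A filter on the partition column should let the database
    -- read only the 2013 partition and skip the other nine years
    SELECT SUM(amount) FROM sales WHERE sale_year = 2013;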

Except that's not how it works in Impala. There are some cases where Impala smartly selects the right partitions for you, but it's not clear to me when it does. For the other cases, your only solution is to manually add the needed partition clauses. Also, loading a lot of data into a partitioned table simply crashes: Impala runs out of memory. For the benchmark, Cloudera wrote a special Python script to avoid these crashes. We decided not to repeat that, because it's a no-go for a lot of production systems. These tables get created and loaded by various ETL tools and automated processes; you can't expect some Python script to stand in the middle and handle the loading for you. Well, you can, if you really, really need to, but we wouldn't call it 'standard operating procedure' anymore. So we left it out.
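For reference, this is the standard, one-statement way to load such a table in Impala, and it is this kind of dynamic-partition insert that ran out of memory for us on the larger datasets (table and column names are illustrative):

    -- Every distinct ss_sold_date_sk value becomes its own partition;
    -- Impala keeps an output buffer open per partition while writing,
    -- which, as far as we could tell, is where the memory goes when
    -- the date range is wide
    INSERT INTO store_sales_partitioned PARTITION (ss_sold_date_sk)
    SELECT ss_item_sk, ss_customer_sk, ss_sales_price, ss_net_profit,
           ss_sold_date_sk
    FROM store_sales_staging;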

Our benchmarks

In the next blog post, we'll share our first actual Impala results. After that, we also want to benchmark Hive 0.14, Spark SQL and maybe some proprietary databases. We're even in negotiations with an academic research institute to do some of the testing for us. They bring a lot of benchmarking expertise to the table, and they have some pretty awesome bare-metal machines which are simply out of reach for us. This collaboration should also make sure we remain as objective as possible.

Obviously, all procedures and results will be published on GitHub. Because you have to give it to Cloudera: they did publish everything in the open, and that allows for healthy discussion. We hope we've added to that discussion.

Kris
Data architect

Comments

This is good stuff. Looking forward to your next set of benchmarks.
We are currently evaluating Hive vs Impala vs Spark.

Great to hear you're comparing Hive and Spark as well. Please do share results if you have them. I'm especially curious about Hive 0.14, as they promise cost-based optimization. I wonder what kind of impact it has.

I am an author of Cloudera's latest performance blog post. Our blog posts and the published benchmark kit transparently disclose what we ran and the rationale for our choices, with no intent to obscure facts.

Some specific clarifications:
1. We clearly say "15TB scale-factor dataset" and use TPC's data generator with that specified scale factor.

2. Our next benchmarks will use decimal now that it's supported in a majority of the SQL-on-Hadoop engines. In our previous post only Impala and Hive had decimal support so we used double in order to include Spark SQL and Presto in the comparison.

3. As we documented, we selected a representative set of queries and documented the modifications made to comment out unsupported language elements. We're in the process of adding the additional queries with the same transparency on removed unsupported language elements.

4. The specific modifications to remove unsupported language elements are clearly documented in our blogs. We specifically commented out the language elements in the kit rather than delete them to make this very clear. As the SQL-on-Hadoop engines get more language support, we'll continue to uncomment these elements.

5. None of the engines in the post have caching. We also randomize query orders in our concurrency tests for the blog post in the same way across all engines.

6. The Python script is there to facilitate loading all partitions at once on a small, low-memory cluster. We documented the modification for dynamic partition pruning, a feature that is not available in any of the SQL-on-Hadoop solutions tested.

Please note that you have an open invitation to either ask us questions or discuss your benchmark testing methodology with us in advance in order to strive toward a credible result. We always do the same with respect to all platforms involved in our own testing, so as to ensure that every platform performs as well as possible.

I was the European SME for Impala at Cloudera, and even I think you've raised some excellent points in this article. I look forward to seeing your results and comparisons to other modern SQL-on-Hadoop implementations (especially the open-source ones).

I actually think there are 2 issues more important than Impala speed/benchmarks:

1. Reliability - Hive is way more reliable, and while I've been critical of Hortonworks' Stinger and "Stinger.next" as doing too much hard work for relatively modest performance improvements [compared to Impala], the reality for production is that I'd rather do things that are more likely to work at any scale and with query recoverability than wonder if/when it's going to randomly fail on me. I think this is a primary reason people always seem to take exception to running only a subset of TPC queries or modifying those queries significantly: they see it as [admittedly] cheating instead of failing.

2. Impala is not in the Apache Foundation - if it's not in Hortonworks then it's basically not a standard technology / won't become a standard and will inevitably get replaced by the wider community with another solution inside Apache (Drill, Hive-on-Spark and LLAP on the horizon). This is a tragedy, as I know and have worked with the Impala team; they've done some good work and made some good improvements since the GA release last year, but this ultimately looks like it's going to end up being a future-legacy technology.

Any plans on testing HAWQ?

@Jon, we have no intention of testing HAWQ, as we would need to get the proprietary licenses first. There is at least one big proprietary vendor who wants us to test their technology, though, so who knows?

HAWQ is now open source, so licensing should no longer be a problem. I'd be very interested in seeing a side-by-side comparison of Impala to HAWQ.