A fair benchmark for Impala 2.0 - part 1

Friday, Oct 31st, 2014
In the last couple of weeks, we've been studying Impala and trying to replicate their TPC-DS benchmark results (here and here) on Impala 2.0, their latest release. This is important to us because some of our clients own large enterprise data warehouses, and they are looking for options to migrate away from the expensive licenses offered by the dinosaurs. With Impala being a prominent, scalable, open-source SQL database, we need to understand it inside out.
Cloudera published their code publicly, and they are clear about not comparing their results to official TPC-DS results. Yet I still feel they are at least bending the truth in their favour. This is unhelpful and unnecessary. Unhelpful because, with those tweaks, the tests performed are further away from real-life scenarios than the original TPC-DS tests. More on that later. Unnecessary because Impala 2.0 is actually a pretty good SQL database; I don't understand why they needed to hype up the results even more. Well, actually, I can guess: it's probably their marketing machine working overtime.
How Cloudera bends the truth and why it is unhelpful
- they are not honest about the dataset sizes and structure
- they cherry-pick and tweak the queries
- their use of partitioning is too complex
This is unhelpful because it doesn't reflect real-life data warehouses anymore. Let's dive into the details.
1. They are not honest about dataset sizes
They are subtle about this: when they say 'a 10TB scale-factor dataset', they actually mean 'a 4TB dataset'. That's not a rounding error. How can this be? Because, at the 10TB scale factor, they only create 9 of the 18 dimension tables, and only 1 of the 6 fact tables. Here's an overview of the TPC-DS data warehouse at various scale factors, with the tables that are actually loaded highlighted in green:
2. They change the datatypes of the columns
In the one fact table they do load, store_sales, they change the data types of the facts from DECIMAL to DOUBLE. Typically, processing doubles is faster than processing decimals. But there's a reason why those facts are DECIMAL: decimal arithmetic is exact, which matters when you're aggregating monetary values, while doubles accumulate rounding errors. They also turn dates into strings.
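To see why the DECIMAL-to-DOUBLE switch matters, here's a small Python illustration (not part of the benchmark itself, just a sketch): summing a price of 0.10 a million times, the way a SUM() over a monetary fact column in store_sales would.

```python
from decimal import Decimal

# Sum a price of 0.10 a million times, as a SUM() over a fact column would.
float_total = sum([0.10] * 1_000_000)             # DOUBLE-style binary floating point
exact_total = sum([Decimal("0.10")] * 1_000_000)  # DECIMAL-style exact arithmetic

print(float_total)  # close to 100000, but not exact: rounding errors accumulate
print(exact_total)  # exactly 100000.00
```

Doubles are faster, but every one of those million additions can round; decimals trade speed for exactness, which is presumably why TPC-DS specifies them for monetary facts.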
3. They cherry-pick the queries they benchmark
Why only 24 or so of the 99 queries? We loaded the entire data warehouse and executed all of the queries, and the motive behind the cherry-picking soon became apparent: a lot of the queries simply can't be completed on Impala. The biggest reasons for failure were:
- IllegalStateException, 'Illegal reference to non-materialized slot' (we're not sure what this means): queries 16, 33, 51, 56, 64, 94, 60, 95, 97
- Memory limit exceeded, even for datasets that fit in memory (although not always): queries 4, 11, 30, 72, 74, 78, 81
- No support for specific JOINs: queries 13, 28, 48, 61, 88, 90
- No support for ROLLUP: queries 5, 18, 22, 67, 77, 80
- No support for correlated subqueries: queries 6, 32, 58, 69, 92
- No support for INTERSECT and EXCEPT: queries 8, 14, 38, 87
- No support for subqueries in the HAVING clause: queries 23, 24, 44
- ORDER BY error: queries 47, 57
- Fetching DECIMAL results into the client. This was a problem for several queries, but it was related to the impyla client, not the database itself, so we ignored these issues.
In the end, we managed to execute 54 of the 99 queries. We did do some syntactic rewriting when possible. For instance, date additions and string concatenations use a different syntax in Impala, so we built some regular expressions to fix that. However, we did not rewrite entire queries just to make them run. Which brings me to my next point.
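To give an idea of the kind of syntactic rewriting we mean, here's a simplified sketch in Python (the function name and patterns here are illustrative, not the actual ones we used, which have to handle more edge cases):

```python
import re

def rewrite_for_impala(sql: str) -> str:
    """Rewrite two TPC-DS syntax idioms into their Impala equivalents.

    Hypothetical, simplified illustration; real queries need more careful patterns.
    """
    # Date addition: "col + 30 days"  ->  "date_add(col, 30)"
    sql = re.sub(r"(\w+)\s*\+\s*(\d+)\s+days", r"date_add(\1, \2)", sql)
    # String concatenation: "a || b"  ->  "concat(a, b)"
    sql = re.sub(r"(\w+)\s*\|\|\s*(\w+)", r"concat(\1, \2)", sql)
    return sql

print(rewrite_for_impala("where d_date + 30 days < d_end"))  # where date_add(d_date, 30) < d_end
print(rewrite_for_impala("select s_city || s_state"))        # select concat(s_city, s_state)
```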
4. Queries are changed in an unrealistic way
If your syntax differs a bit from what TPC-DS generates, it's fine to tweak the queries a bit, like we did for date additions and string concatenations. It is not fine, however, to delete parts of a query that you can't execute, like a ROLLUP. It's also not fine to add extra partitioning clauses, like they do almost everywhere. Why not? Isn't partitioning a powerful technique to speed up queries? Yes, but almost no client will automatically generate partitioning clauses the way they did, and you certainly won't do it when you're doing some ad-hoc querying. The database should be smart enough to query only the right partitions behind the scenes. More on that in point 6.
5. The way queries are run is not published
In TPC-DS, you generate several query streams which execute the queries in a random order. This is done to prevent you from preparing for a specific query: just like in real life, you don't know what's coming next. In contrast, if you execute the same query 10 times, the database has probably done some kind of caching, so your execution goes faster. Cloudera doesn't say how the queries were executed, how many times, or in which order.
Also, the query generation in TPC-DS makes sure the query parameters are random. In one stream, query 1 might filter on Year='2008'; the next stream filters on Year='1992'. These parameters also depend on the dataset size, so you should generate different query streams for each dataset size. This is not happening: instead, they have a fixed set of queries they execute on all datasets.
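What stream generation does can be sketched in a few lines of Python (a hypothetical toy model; the real TPC-DS toolkit, dsqgen, works from its own query templates and substitution rules):

```python
import random

# Toy query templates; the real benchmark has 99 of them.
QUERY_TEMPLATES = {
    1: "select ... from store_returns where d_year = {year}",
    2: "select ... from store_sales where d_year = {year}",
    3: "select ... from web_sales where d_year = {year}",
}

def make_stream(stream_id, first_year=1990, last_year=2002):
    """Build one query stream: its own random order, its own random parameters."""
    rng = random.Random(stream_id)  # seeded, so each stream is reproducible
    order = list(QUERY_TEMPLATES)
    rng.shuffle(order)              # every stream runs the queries in a different order
    return [(q, QUERY_TEMPLATES[q].format(year=rng.randint(first_year, last_year)))
            for q in order]

# Each stream runs all queries, but in its own order with its own filter values.
for qid, sql in make_stream(0):
    print(qid, sql)
```

The point is that the database never sees the same ordered, parameterized workload twice, so it can't be tuned for one fixed set of queries.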
6. Partitioning in Impala is not straightforward
As mentioned before, partitioning is a powerful technique to speed up queries. If you have 10 years' worth of data but you only care about the last year, a database can be smart enough to ignore 90% of the data and execute your query 10 times faster. To enable this, you partition your data by year: every year gets its own partition, and when you query, the database selects the right partitions for you.
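The idea, sketched with a toy in-memory 'table' in Python (a hypothetical model of partition pruning, not Impala's implementation):

```python
# A toy table partitioned by year: one bucket of rows per year, 10 years of data.
partitions = {year: [("sale", year)] * 1000 for year in range(2005, 2015)}

def scan(year_predicate):
    """Partition pruning: only read the partitions that can match the filter."""
    selected = [y for y in partitions if year_predicate(y)]
    rows_read = sum(len(partitions[y]) for y in selected)
    return selected, rows_read

selected, rows_read = scan(lambda y: y == 2014)
print(selected, rows_read)  # [2014] 1000 -- 90% of the rows were never touched
```

That pruning decision is what the database should make automatically from the WHERE clause, without the client spelling out which partitions to hit.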
Except that's not how it works in Impala. There are some cases where Impala smartly selects the right partitions for you, but it's not clear to me when it does. In the other cases, your only solution is to manually add the needed partition clauses. Also, loading a lot of data into a partitioned table simply crashes: Impala runs out of memory. For the benchmark, Cloudera wrote a special Python script to avoid these crashes. We decided not to repeat that, because it's a no-go for a lot of production systems. These tables get created and loaded by various ETL tools and automated processes; you can't expect some Python script to stand in the middle and handle the loading for you. Well, you can, if you really, really need to, but we wouldn't call that 'standard operating procedure' anymore. So we left it out.
In the next blog post, we'll share our first actual Impala results. After that, we also want to benchmark Hive 0.14, SparkSQL and maybe some proprietary databases. We're even in negotiations with an academic research institute to do some of the testing for us. They bring a lot of benchmarking expertise to the table, and they have some pretty awesome bare-metal machines that are simply out of reach for us. This collaboration should also help us remain as objective as possible.
Obviously, all procedures and results will be published on GitHub. Because you have to hand it to Cloudera: they did publish everything in the open, and that allows for healthy discussion. We hope we've added to that discussion.