Our 3 favourite data analytics tools of 2015

Sunday, Jan 10th, 2016

In this blog post, we look back at the 3 data analytics tools that shaped 2015 for us. This is by no means an objective list, and I'm sure yours looks completely different. Looking forward to hearing your feedback.

Spark

Cliché, I know. I'm not even going to bother posting a link. Spark is not exactly a new framework; its initial commit dates from more than 5 years ago. Yet, for us, this was the first time we supported a client in bringing it to production. In true blitzkrieg style, Spark dominated the market in no time. Today, it seems like everybody and their grandmother is running Spark in some shape or form. There are tons of blog posts on what Spark is and what it does. Here, I'd like to focus on why, specifically, it has been so useful for us.

In one sentence: it makes it easy to develop, test and scale data pipelines. ETL has become quite complex, and simple visual ETL tools often don't offer the power and flexibility you need. We've witnessed the weirdest solutions to that problem. We've seen SQL stored procedures which generate SQL which executes an ETL job. We've dealt with C# code which generates SSIS packages which generate SQL which executes an ETL job. We've even seen MATLAB (!) scripts which call a Java API and some Python scripts to execute an ETL job. We've also seen clients squeezing visual ETL tools to their extremes. I could write entire blog posts about the challenges those solutions bring. Madness.

This client was using Pig before we helped them migrate to Spark. Don't get me wrong, I like Pig. It's a much better fit for ETL than SQL. But, like SQL, Pig has its limits. They had folders full of Pig scripts, glued together by Oozie workflows, with their UDFs in Java, and their script parameters updated by a shell script and stored in a MySQL database. And all of this was triggered by an external scheduler. Needless to say, this setup was very brittle. Debugging meant walking through tons of log files to try to figure out what happened, and bugs were very hard to reproduce. Every morning was a new surprise.

Moving to Spark made our work an order of magnitude easier. "Yes, but our devs don't know Scala." Really, that's no excuse. Your devs are still solving the same business problems, and still building very similar data pipelines. If they have the right mindset, they CAN learn the basics of Scala needed to build Spark applications. It helps if you have at least one Scala enthusiast in the team, who can do the needed evangelisation. For us, that was Mathias Lavaert. @Mathias, thanks again for teaching the rest of us Scala!

Spark on its own doesn't make anything easier; you have to use it well. We wanted to make sure everybody could develop and test features on their own machines. We had unit tests, integration tests and end-to-end tests. We worked with pull requests and code reviews. We had continuous integration and nightly builds. While this is a given in any professional dev environment, it definitely wasn't in a BI-inspired Big Data department. Using Spark also meant that all business logic was built in one place, not spread out over 4 different tools. This made debugging, and reproducing bugs, much easier.
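To make that concrete, here's a minimal sketch of the pattern that made testing easy. It's not the client's actual code: the record format ("customer_id,amount") and all function names are made up for illustration. The idea is to keep the business logic in plain functions, so unit tests don't need a Spark cluster, and the Spark job just hands those same functions to `map` and `filter`.

```python
# Illustrative sketch: business logic as plain, unit-testable functions.
# The record format ("customer_id,amount") and names are made up.

def parse_record(line):
    """Parse one raw line into a (customer_id, amount) tuple."""
    customer_id, amount = line.split(",")
    return customer_id.strip(), float(amount)

def is_valid(record):
    """Business rule: keep only strictly positive amounts."""
    return record[1] > 0

def run_pipeline(lines):
    """Run the pipeline on any iterable of lines.

    In production, the same functions are handed to Spark instead:
    rdd.map(parse_record).filter(is_valid). In a unit test, a plain
    list of strings is enough."""
    return [record for record in map(parse_record, lines) if is_valid(record)]
```

Because `parse_record` and `is_valid` are just functions, the nightly end-to-end run and a millisecond-fast unit test exercise exactly the same logic.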

Scaling Spark was relatively easy as well. In most cases, if a job worked on your laptop on limited data, it would also work in production on billions of records. Sure, we had our own scalability and out-of-memory issues. Spark is a mighty machine, and you need to spend time tuning it. That's one of its downsides. Compare this to my earlier experiences with HP Vertica, where the default parameters almost always resulted in the best performance. But hey, Vertica is not taking over any market any time soon. Also, with Spark, you get a free lunch every now and then. Upgrading from 1.2 to 1.3? Good for you, here are some performance improvements. From 1.3 to 1.5? Great, here are some more.

Airflow

This tool has not received enough love. Airflow is a workflow manager and scheduler from Airbnb. And before you skip ahead to the next section because "OMG... boooooring", please bear with me.

Workflow management is something that often pops up as an afterthought: "Oh yes, we need to run this job after that job, unless that other job gives an error". And very often, workflow management starts with a shell script, and then becomes a collection of Python scripts backed by 2 little tables in SQLite. If a traditional workflow manager, like Oozie, is involved at all, it's often misused and quickly hacked together. The absolute worst are the enterprise workflow management tools, shielded away behind a crappy Excel sheet managed through SharePoint, with a 14-day response SLA. Good luck defining your data pipeline! And yes, sadly, we've been there.
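For the sceptics: even the smallest workflow manager has to solve exactly that "run this job after that job, unless that other job gives an error" problem. Here's a hedged sketch of that core loop; the job names, the callables and the dependency graph (assumed acyclic) are all made up for the example.

```python
# Minimal sketch of dependency-aware job scheduling: run each job only
# after all its dependencies succeeded, and skip everything downstream
# of a failure. Assumes the dependency graph is acyclic.

def run_workflow(deps, jobs):
    """deps: {job: [upstream jobs]}; jobs: {job: callable returning True/False}.

    Returns the final status of every job: 'success', 'failed' or 'skipped'."""
    status = {}
    remaining = list(deps)
    while remaining:
        for job in list(remaining):
            if any(status.get(d) in ("failed", "skipped") for d in deps[job]):
                status[job] = "skipped"  # an upstream job went wrong
                remaining.remove(job)
            elif all(status.get(d) == "success" for d in deps[job]):
                status[job] = "success" if jobs[job]() else "failed"
                remaining.remove(job)
    return status
```

Twenty lines gets you this far; handling retries, backfills, time slots and operator visibility is where a real tool like Airflow earns its keep.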

Workflow management is neglected so much because the business doesn't see it as "a feature". So whatever you do, make it quick. And developers don't care either. "Whatever, let me go back to writing real code." This often results in annoying crashes and bugs, and wrong outputs being generated in production. Also, nobody takes responsibility, because their respective components did the jobs they were supposed to do. It's a shame. In our experience, good workflow management can make or break your data pipeline.

Airflow is an order of magnitude better than all of the above for several reasons. Firstly, it deeply integrates the notion of time. You have daily and hourly batch jobs, and you want one overview of which jobs have run for which time slots, so you know which data has been processed correctly. Secondly, you can define your dependencies in Python code. This saves your eyes from some ugly, massive XML, and at the same time it opens the door for dynamic directed acyclic graphs (DAGs), which are very convenient. Last but not least, Airflow offers a nice web UI for operators and support staff to keep an eye on all your production workflows.
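The dynamic-DAG point deserves an example. In real Airflow you'd build DAG and operator objects; the sketch below uses a plain dict of task to upstream tasks (and made-up table names), just to show why generating the graph from code beats maintaining it by hand in XML.

```python
# Illustrative sketch: generating an ingest -> clean -> aggregate chain
# per table from a plain list. Table names are made up. In real Airflow,
# each entry would be an operator; a dict of task -> upstream tasks
# shows the same pattern without needing Airflow installed.

def build_dag(tables):
    dag = {}
    for t in tables:
        dag[f"ingest_{t}"] = []                 # no upstream tasks
        dag[f"clean_{t}"] = [f"ingest_{t}"]
        dag[f"aggregate_{t}"] = [f"clean_{t}"]
    # one final report that waits for every aggregate
    dag["daily_report"] = [f"aggregate_{t}" for t in tables]
    return dag

pipeline = build_dag(["orders", "customers", "invoices"])
```

Add a table to the list, and the whole chain plus the report dependency appears; try expressing that in a static XML workflow definition.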

Airflow is actively used by at least two of our clients, where it replaced their homegrown shell and Python scripts. So far, results have been very positive. Unfortunately, we couldn't convince one other client to even consider an existing open-source workflow manager, let alone something as hip as Airflow, so we had to write our own. I hate reinventing the wheel, and I was concerned we would build a crappy, unusable tool, because, obviously, we were given very little time to do it. Luckily, Joris Mees did an excellent job there, building a stable and easy-to-use workflow manager in record time. @Joris, thanks for saving us there!

Infrastructure-as-code

Not strictly speaking a tool. It's a concept. Or maybe a collection of tools: Ansible, Docker, AWS. Choose your poison (Chef, Azure, Vagrant, Kubernetes, ...). Pascal Knapen is our cloud specialist, and while he was guiding me through a solution he was building for a client, he casually mentioned this 'ansible' folder in his git repo. "Wait, what does it do?" "Well, it spins up my AWS EC2 instances, defines VPCs, sets up a secure connection with S3, configures the firewall, uses ECS and autoscaling groups to run Docker containers which are automatically deployed from GitHub, sets up a VPN with the client intranet, ..." I had to sit down for a minute. I've heard of and/or worked with most of these components individually. But never before had I seen all these things come together, and never before had I seen so much functionality in a single little git folder called 'ansible'. It was an awakening for me.
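To give a flavour of what such a folder contains: below is a small, illustrative Ansible playbook sketch. It is not Pascal's actual code; the module names come from Ansible's AWS modules, but every value (region, AMI id, names) is a placeholder.

```yaml
# Illustrative sketch only -- not the actual playbook from the repo.
# All names, ids and the region are placeholders.
- hosts: localhost
  connection: local
  tasks:
    - name: Create a VPC for the project
      ec2_vpc_net:
        name: analytics-vpc
        cidr_block: 10.0.0.0/16
        region: eu-west-1

    - name: Spin up the EC2 instances
      ec2:
        image: ami-123456          # placeholder AMI id
        instance_type: m4.large
        count: 3
        region: eu-west-1
        group: analytics-sg        # security group acting as the firewall
```

A few dozen declarative tasks like these, checked into git next to the application code, replace what used to be a ticket queue and a week of manual clicking.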

Infrastructure-as-code allows you to be agile. Never before have we been able to put so much power and control in the hands of the developer. With a single push of a button, a developer can set up an entire infrastructure and deploy their product. You want to do several releases per day? No problem. This is tightly coupled with the cloud. Sure, you don't strictly need the cloud; you can do parts of it in an on-premise data center, and sometimes they even allow you to automatically provision a VM. Yet often it's a job-protection and cultural issue: "We don't let developers define their own systems. What do they know? And besides, that's our job." In the cloud, you can avoid these non-productive discussions, and of course have true on-demand flexibility and scalability.

Of course, with great power comes great responsibility. Good coding hygiene and a healthy dev environment are key to success. Additionally, we've seen larger clients set up a dedicated cloud team that defines a perimeter, to limit security exploits caused by inexperienced developers. Another client of ours detects problems as early as possible by spinning up a Hadoop cluster on AWS and running all their integration and end-to-end tests as part of the nightly build. We ourselves have sought advice from external experts when we saw the need for it. In particular, we've had very pleasant experiences with Jeroen Jacobs from Head in Cloud. @Jeroen, thanks for helping us out in the past! Looking forward to working with you again in 2016.

Yet, devops can only be devops if devs actually take responsibility for their own operational tasks. So, while it is OK to seek expert advice when you need it, it's important that devs start taking ownership of this aspect of development as well. The more end-to-end your devs can work, the faster you can move as an organisation. A good way to start is to take the devops course offered by Edward Viane, from another cool Belgian devops company, IN4IT.

We currently see advanced forms of automation, devops, or infrastructure-as-code at most of our clients. We have done a lot of the work ourselves. Besides the efforts of Pascal, another cool example is how Patrick Varilly automated a highly secure deployment of Hortonworks Hadoop on the AWS GovCloud for one client. I guess I am the one at Data Minded lagging behind. Bigboards.io is also pushing the boundaries here. They dramatically lower the barrier to entry for Big Data by making it easy to switch tech stacks on their Bigboards Hex and start experimenting and learning. This is definitely an aspect of writing code that has truly been "disrupted" by new technology.

Conclusion

We sure do live in interesting times! The world of data analytics is moving fast. Most of the tools we use in production today barely existed a couple of years ago, and there is no end in sight yet. In upcoming blog posts, we will be talking about our outlook on 2016, but also our greatest disappointments of 2015.

Another nice observation is that there are actually plenty of companies here in Belgium that are well on track with using data analytics and big data to improve their business. I'm not just talking about our own clients and partners, because, yes, of course, they are awesome. I'm also talking about the meetups, the events, the incubators, the Data4Good projects, ... We're no Silicon Valley yet, but at least we're heading in the right direction.

In the meantime, please do share your own experiences in the comments. We only ever see a thin slice of reality. And I would love to hear your war stories.

Kris
Data architect
