Predicting occupancy on NMBS trains

Friday, Nov 11th, 2016

A couple of weeks ago, Nathan Bijnens published a blog post showing how Azure's ML capabilities can be applied to occupancy data from Belgian trains. As I'm often confronted with busy NMBS trains myself, this sounded like a fun challenge. All code is available on GitHub.

Data, data, data

Nathan already got some pretty nice results. How can we improve on them? There are two options: collect more data or build better models. The belief in the industry is that more data often beats better models. I'll save you the long Quora discussion; the relevant quote is "In other words, data is important. But, data without a sound approach becomes noise." The message is clear: don't collect crazy amounts of data just for the sake of it. Although companies like our very own Data Minded might benefit from implementing yet another big data project, it doesn't always get you closer to your goal. BUT... if we are smart about collecting more data, we might get somewhere.

Which data might have an impact on the occupancy of trains? Ask anyone who takes trains on a regular basis in Belgium, and you will get plenty of ideas:

  • In rush hour, trains are often full
  • Weekend trains are often quieter
  • Trains going to the big cities are often full, especially in the mornings
  • IC trains carry more people than L trains
  • If a train has just stopped at a big city and is only going to small towns from there, it's probably empty
  • If the last train to the same destination has just passed, then this train is probably less full
  • If a train has fewer wagons than usual, it's going to be crowded
  • The sun is shining and it's a weekend or holiday? Trains to the coast will be packed

Most of this data is not available in the original dataset. Time to do some crawling. You can find the code here.

  • The training set: Obviously, we can't live without it. I've written some Python to make sure we can rerun the analysis when new data becomes available.
  • Stations: A small reference file giving the latitude, longitude, and readable names of all stations in Belgium. Convenient. Retrieval is simple in Python; see the sketch after this list.
  • Connections: The real goldmine is the connection data. Which train are you on? Where is it coming from? Where will it stop next? How much traffic do you get in each station? That's available at spitsgids.be. For ~3 months of data, it's ~2GB of json files. Not small. I've made a zip available for August - October 2016 here.
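As an illustration of how simple that stations retrieval can be, here is a minimal sketch. It assumes the iRail stations dataset as the source; the exact URL and column layout are my assumptions, and the actual retrieval code is in the repo linked above.

    # Minimal sketch: fetch the stations reference file.
    # The URL points at the iRail stations dataset; treat it as an
    # assumption, not as this post's own source.
    import csv
    import io
    import requests

    STATIONS_URL = "https://raw.githubusercontent.com/iRail/stations/master/stations.csv"

    response = requests.get(STATIONS_URL)
    response.raise_for_status()

    stations = list(csv.DictReader(io.StringIO(response.text)))
    # Each row holds, among other fields, a readable name and
    # latitude/longitude coordinates.
    print(stations[0])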

Features, features, features

With all this data at hand, let's build some features that make sense. For me, the easiest way is to load all the source data into a SQL database. Old-fashioned, I know. But I love SQL. Once the data is loaded, it's so much easier to join, filter, and group. Installing and configuring a SQL database is out of scope for this blog, but it shouldn't be too complex. I prefer Postgres. Loading the 3 data sources is straightforward: just create 3 tables and generate a bunch of INSERT statements. I do a small magic trick for the connection data, as that is 1.8GB: I bulk insert every 20000 rows, since doing it row by row would be too slow.
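For illustration, a sketch of that bulk-insert trick with psycopg2. The table layout, the database name, and the assumption that the connection data is line-delimited JSON are mine; the real loading code is in the repo.

    # Sketch of batched inserts with psycopg2. Column names and the
    # line-delimited JSON format are assumptions for illustration.
    import json
    import psycopg2
    from psycopg2.extras import execute_values

    BATCH_SIZE = 20000

    conn = psycopg2.connect("dbname=nmbs")  # hypothetical database name
    cur = conn.cursor()

    def flush(batch):
        # One multi-row INSERT per batch instead of 20000 round trips.
        execute_values(
            cur,
            "INSERT INTO connections (vehicle, station, time) VALUES %s",
            batch)

    batch = []
    with open("connections.json") as f:
        for line in f:
            record = json.loads(line)
            batch.append((record["vehicle"], record["station"], record["time"]))
            if len(batch) >= BATCH_SIZE:
                flush(batch)
                batch = []
    if batch:
        flush(batch)

    conn.commit()
    cur.close()
    conn.close()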

Distances

For each occupancy record, we need to know which stations the train stopped at before, and which stations it is heading to next. Code here. It's a simple three-step process:

  1. Get all (stationfrom, vehicle, date) combinations in the occupancy file. E.g. I'm in Leuven on train IC1234 on 2016/10/12. I take the date into account because on some days a train might have had different stops. Maybe the schedule was changed throughout the year. Maybe a bit of overkill, as the combination (Leuven, IC1234) is probably enough to identify all the other stations in 99% of the cases. Anyway...
  2. For that train on that date, look up the other stations in the connections table, and sort them by time
  3. Calculate the offset in indices between stationfrom and all the stations in the list. E.g. (Leuven, Brussel-Centraal, IC1234, 2016/10/12, 2) means that Brussel-Centraal was 2 stops after Leuven on that train. A sketch of this step follows below.
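For illustration, step 3 boils down to something like the following (the stop list is made up):

    # Given one train's stops on one date, sorted by departure time,
    # compute each station's offset relative to stationfrom.
    def station_offsets(stops, station_from):
        origin = stops.index(station_from)
        return {station: index - origin
                for index, station in enumerate(stops)}

    stops = ["Hasselt", "Diest", "Leuven", "Brussel-Noord", "Brussel-Centraal"]
    print(station_offsets(stops, "Leuven"))
    # {'Hasselt': -2, 'Diest': -1, 'Leuven': 0,
    #  'Brussel-Noord': 1, 'Brussel-Centraal': 2}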

Turning those distances into features

So, now what? The simplest thing you can do is pivot the distance table, so that every row in the occupancy training set gets ~1000 more columns, most of them NA. E.g. for the Leuven train, it will have a distance_to_bxl_central column, but also a distance_to_verviers, etc. for every station in Belgium. A single train only stops at a few stations, so most fields will be NA. It works, but it's kind of ugly. That's why I do without these columns at first, and only add them later.
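For illustration, here is what that pivot could look like with pandas (the table and column names are made up):

    # Pivot a long distance table into one wide row per occupancy record.
    import pandas as pd

    distances = pd.DataFrame({
        "record_id": [1, 1, 2],
        "station":   ["bxl_central", "verviers", "bxl_central"],
        "distance":  [2, 11, -3],
    })

    wide = (distances
            .pivot(index="record_id", columns="station", values="distance")
            .add_prefix("distance_to_"))
    # Stations a train never stops at show up as NA columns.
    print(wide)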

An alternative is to calculate specific metrics based on this distance information. The code is now here. Here's what I did:

  • Calculate the frequency of trains for each station. The more trains stop at a station, the bigger it probably is.
  • Play a bit with those frequencies
    • SUM(absolute_freq) is the sum of all the absolute frequencies on a train. That means, if a train stops at a lot of big stations, it is probably busy.
    • SUM(weighted_freq) is the sum of all the frequencies, each divided by its (signed) distance from the current station. That means that if the next stop is Brussels, this feature goes up, and if the previous stop was another big city, it goes down. If Brussels is only 8 stops down the line, the train is probably not so busy yet, so give it less weight. A SQL sketch of these frequency features follows below.
    • am_weighted_freq is -weighted_freq before noon, and just weighted_freq in the afternoon. Reasoning: in the morning, it's probably busier going TO the big stations, while in the afternoon, it's probably busier going AWAY from them.
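Here's a sketch of those two frequency features in SQL, run from Python. Table and column names are assumptions on my part; the real (admittedly gigantic) query lives in the repo.

    # Sketch: frequency-based features per occupancy record.
    # The `connections` and `distances` tables are assumed to exist
    # with these columns; see the repo for the real schema.
    import pandas as pd
    import psycopg2

    QUERY = """
    WITH station_freq AS (
        -- How many trains stop at each station: a proxy for its size.
        SELECT station, COUNT(*) AS absolute_freq
        FROM connections
        GROUP BY station
    )
    SELECT d.record_id,
           SUM(f.absolute_freq) AS sum_absolute_freq,
           -- Signed distance: stations ahead add weight, stations
           -- behind subtract it, and far-away stations count less.
           SUM(f.absolute_freq::float / d.distance) AS sum_weighted_freq
    FROM distances d
    JOIN station_freq f ON f.station = d.station
    WHERE d.distance <> 0
    GROUP BY d.record_id;
    """

    conn = psycopg2.connect("dbname=nmbs")  # hypothetical database name
    features = pd.read_sql(QUERY, conn)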

I add a bunch of other features (sketched after the list), such as:

  • Is it the morning commute?
  • Is it the evening commute?
  • What's the day of the week?
  • What's the train type?
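For illustration, these could be derived in pandas roughly like this (the commute windows and the vehicle-id convention are my guesses, not the post's):

    # Sketch: calendar and train-type features on a DataFrame with a
    # datetime departure column. Column names are illustrative.
    import pandas as pd

    df = pd.DataFrame({
        "departure_time": pd.to_datetime(["2016-10-12 07:45",
                                          "2016-10-15 14:10"]),
        "vehicle": ["IC1234", "L556"],
    })

    hour = df["departure_time"].dt.hour
    weekday = df["departure_time"].dt.dayofweek
    df["morning_commute"] = hour.between(6, 9) & (weekday < 5)
    df["evening_commute"] = hour.between(16, 19) & (weekday < 5)
    df["day_of_week"] = weekday
    # Assuming vehicle ids look like "IC1234": the leading letters
    # give the train type (IC, L, P, ...).
    df["train_type"] = df["vehicle"].str.extract(r"^([A-Z]+)", expand=False)
    print(df)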

I know, I know, it's a gigantic ugly SQL query. I could probably have simplified it and stored the result in a new table, so that the entire predict.py file would be a bit easier to read. But hey, like everyone, my time is limited. I have to do some real client work as well :-)

Time for predictions

Either way, time for those predictions. Of course, I use sklearn. Quite straightforward. I tried feature selection, feature scaling, logistic regression, naive Bayes, ... I even ran a grid search, but I didn't spend ages finding and tuning the best model. In the end, I'm convinced it's the features that do most of the heavy lifting. No pun intended. The result is here.
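In outline, that modelling step looks something like the sketch below: scaling plus logistic regression in a pipeline, tuned with a small grid search. The file name and column names are hypothetical; this is not the actual predict.py.

    # Sketch: pipeline + grid search over the features built above.
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Hypothetical export of the feature query, with an `occupancy`
    # label column (low / medium / high).
    features = pd.read_csv("features.csv")
    X = features.drop(columns=["occupancy"])
    y = features["occupancy"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])

    grid = GridSearchCV(pipeline,
                        param_grid={"model__C": [0.01, 0.1, 1, 10]},
                        cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.score(X_test, y_test))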

Next steps

  • As collateral damage, I've now got quite a rich dataset for visualizing a lot of the train traffic in Belgium. I would love to make some of those visualizations available to the public.
  • Moooooore features. Always more features. An obvious one is weather. It seems like Gilles Vandewiele is already doing that. Another good one would be the time of the last train to the same destination: if that one left just 5 minutes ago, this train should be relatively quiet. Or the number of wagons on a train on a particular date. But I have no clue where to find that info.
  • Bigger training set. Like Gilles, I'm convinced that more training data will help improve predictions. I guess it will come, over time.

I hope this work is helpful. I enjoyed doing it. Although I realize now that this blog is a big wall of text, without a lot of helpful visualisations. So if you made it all the way to this point: congratulations. :-)

Kris
Data architect