Feature engineering

Context:

Data is loaded, scripts are ready, palms are sweaty...

And your data is so huge it doesn't fit in memory, and Python crashes.

Before buying an expensive GPU that won't help at all, here are some pointers.

Feature engineering at scale

Great datasets beat better models

Let's face it: heavy models don't beat feature engineering. A carefully crafted feature that captures a specific aspect of the business is highly valuable. Apart from the obvious Pandas, there are some packages out there that can help create new features, among which featuretools:

import featuretools as ft

# creating an entity set 'es'
es = ft.EntitySet(id='sales')

# adding a dataframe ('combi' is the raw input dataframe, indexed by 'id')
es.entity_from_dataframe(entity_id='bigmart', dataframe=combi, index='id')
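
From there, Deep Feature Synthesis can generate candidate features automatically. A minimal sketch, assuming the single 'bigmart' entity created above (with only one entity, DFS mostly applies transform primitives):

# run Deep Feature Synthesis on the entity set
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity='bigmart',
    max_depth=2,  # how many primitives may be stacked on top of each other
)
print(feature_matrix.head())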

Other great packages include tsfresh for time series and featureforge. To address class imbalance in the dataset, imbalanced-learn is a good package too.
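
For instance, imbalanced-learn ships resampling strategies such as SMOTE. A minimal sketch on a made-up toy dataset (the data here is purely illustrative):

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# a toy imbalanced dataset: roughly 95% of samples in class 0, 5% in class 1
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# oversample the minority class with SMOTE
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))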

All of these packages help build datasets that last and push simple models to the best possible results.

The troubles of distributed computing

One of the core concerns of pandas users is scalability (link). Wes McKinney, the author of the library, wrote a great article about his vision of why pandas doesn't scale well. Some pointers can still be found to optimise a run (here and there), but doing feature engineering on several gigabytes of data is still somewhat complex. Fortunately, Pandas has an ecosystem that allows scaling through what is called "out-of-core" processing.
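
Before reaching for a new framework, a lot can already be gained by shrinking dtypes. A minimal sketch (the file and column names are made up for illustration):

import pandas as pd

df = pd.read_csv("sales.csv")            # hypothetical input file
print(df.memory_usage(deep=True).sum())  # bytes currently used in RAM

# downcast numeric columns and turn low-cardinality strings into categories
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
df["price"] = pd.to_numeric(df["price"], downcast="float")
df["store_id"] = df["store_id"].astype("category")

print(df.memory_usage(deep=True).sum())  # usually much smaller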

When loading a pandas dataframe, it gets stored in RAM, basically the fast memory. However, this memory usually tops out at 16-32 GB, which is sometimes not much. Out-of-core processing instead uses the computer's hard drive, which is usually around 1 TB, but read/write operations are slower there. That's why having an SSD is a good idea to ease this bottleneck. The main framework for this kind of operation is Dask.

Another point of attention is how to optimise the computation. For example, a row-wise operation on the dataframe can be distributed easily, while a shuffle operation (moving all rows around, e.g. sorting or re-indexing) is one of the most expensive. So think about how you can compute your features.
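
To make that concrete, here is a small Dask sketch (the toy data and column names are illustrative): the element-wise operation runs independently on each partition, while set_index forces a full shuffle.

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"fare_amount": [10.0, 7.5, 52.0, 3.3],
                    "passenger_count": [1, 2, 1, 3]})
df = dd.from_pandas(pdf, npartitions=2)

# cheap: row-wise arithmetic, each partition is processed on its own
df["fare_amount_euro"] = df["fare_amount"] * 0.9

# expensive: a shuffle, every row may have to move to another partition
df = df.set_index("passenger_count")

print(df.compute())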

The Dask team put together some great best practices, a worthwhile read indeed.

Dask Example

Dask is the most obvious choice when it comes to replacing pandas dataframes. It has wide support, with 200+ contributors and 5000+ commits, and can be used just the same from small datasets up to terabytes on clusters.

import os

import dask.dataframe as dd

path_data = "NYC_taxi_2009-2016.parquet"  # that's 35 GB FYI!

@timeit
def load_df():
    # lazily read every parquet file in the dataset folder
    df = dd.read_parquet([os.path.join(path_data, f) for f in os.listdir(path_data)])
    return df

@timeit
def describe(df):
    # count() only builds a lazy task graph; compute() actually runs it
    counts = df.count()
    return counts.compute()

@timeit
def fare_to_euro(df):
    # simple row-wise feature: convert the fare from dollars to euros
    df["fare_amount_euro"] = df.fare_amount * 0.9
    df.compute()  # trigger the computation just to time it
    return df

(A decorator to time functions was taken from here)
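
For reference, a minimal version of such a decorator could look like this (an illustrative stand-in, not the exact snippet linked above):

import functools
import time

def timeit(func):
    """Print how long the wrapped function took, in milliseconds."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"'{func.__name__}'  {elapsed_ms:.2f} ms")
        return result
    return wrapper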

import dask.dataframe as dd

# ... here the functions
df = load_df()
# 'load_df'  510.72 ms
counts = describe(df)
# 'describe'  468980.57 ms

Dask also offers a distributed version, which brings notable performance gains and task tracking.

from dask.distributed import Client
import dask.dataframe as dd

client = Client()

# ... here the functions
df = load_df()
# 'load_df'  463.89 ms
counts = describe(df)
# 'describe'  269501.17 ms
measure                                  time
Simple metadata computation              0:08:32.429791
Simple dask column computation (add)     0:10:34.628188
Distributed metadata computation         0:04:35.804317
Distributed dask column computation      0:06:18.56741
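
The task tracking mentioned above comes from the distributed scheduler's diagnostic dashboard; once the Client is created, you can grab its URL like this (a sketch, assuming the default local cluster):

from dask.distributed import Client

client = Client()             # starts a local cluster by default
print(client.dashboard_link)  # e.g. http://127.0.0.1:8787/status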

Other tools

Several tools offer similar potential. Test them and see which poison pleases you; a quick Modin sketch follows the list.

  • Vaex
  • pandarallel
  • Modin
  • Spark
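
Modin, for example, aims to be a drop-in replacement: swap the import and keep the pandas API. A minimal sketch (the CSV file is hypothetical):

import modin.pandas as pd  # drop-in replacement for `import pandas as pd`

# the rest of the code stays plain pandas syntax
df = pd.read_csv("NYC_taxi_2015.csv")  # hypothetical file
print(df.groupby("passenger_count").fare_amount.mean())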

Todo List

  1. If my computation takes time, think about how to optimise it
  2. If my data is too large for the RAM, use Dask
  3. If my data is too large for my computer, use Spark or Dask distributed

Reading List