As datasets grow over time, so does the time needed to update the flow with fresh incoming data, run preparation steps, and retrain models. Partitioning helps solve this issue. By splitting a dataset into subsets along meaningful dimensions (time or discrete dimensions), it lets you build the flow for the incremental data only, while keeping the historical data as it is.
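To make the idea concrete, here is a minimal sketch in plain Python. It does not use the Dataiku DSS API; the function names (`partition_by`, `build_incremental`) and the sample sales rows are hypothetical and only illustrate the principle of rebuilding the partitions that received new data while leaving historical results untouched.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group rows into partitions keyed by the value found under `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def build_incremental(results, partitions, to_build):
    """Recompute totals only for the partitions listed in `to_build`.

    Results for every other partition are copied over unchanged.
    """
    updated = dict(results)
    for partition_id in to_build:
        updated[partition_id] = sum(r["amount"] for r in partitions[partition_id])
    return updated


# Hypothetical retail sales, partitioned along a time dimension (day).
history = [
    {"day": "2024-01-01", "country": "FR", "amount": 10.0},
    {"day": "2024-01-01", "country": "US", "amount": 5.0},
    {"day": "2024-01-02", "country": "FR", "amount": 7.0},
]
partitions = partition_by(history, "day")

# Initial build: every historical partition is computed once.
totals = build_incremental({}, partitions, list(partitions))

# A new day arrives: only its partition is built.
new_rows = [{"day": "2024-01-03", "country": "US", "amount": 4.0}]
new_partitions = partition_by(new_rows, "day")
partitions.update(new_partitions)
totals = build_incremental(totals, partitions, list(new_partitions))

print(totals)
```

The same helper works for a discrete dimension: calling `partition_by(history, "country")` yields one subset per country, which is the shape needed to train one model per country.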
Malick Konate (Data Scientist, Dataiku) will explain in detail what partitioning is and how DSS users can use it to improve computation performance when dealing with large volumes of data. Using the example of a retail company, he will walk us through how partitioning can be used to build historical data, target data processes on new data, and train a partitioned machine learning model for each country. This will also be an opportunity to share best practices and common pitfalls of managing dependencies.
Note: Partitioning is not available in the Community edition of Dataiku DSS.