TERATEC 2016 Forum
Workshop 3 - Wednesday, June 29 from 9:00 to 12:30
Tools and algorithms for Big Data applications

Tackling challenges in Big Data analytics: a distributed random forest algorithm
Marc WOLFF, Application Engineer, MATHWORKS

Abstract: Machine learning is now a well-known and heavily used technique for turning data into value and automating decision making. Among machine learning algorithms, random forests are attracting growing interest because they are efficient and easy to interpret. However, training random forest models on large datasets remains a technical challenge. We present a random forest implementation that addresses this challenge and makes it possible to run machine learning analyses on Big Data problems.

Random forests are built as a combination of many decision trees, typically several hundred. A frequently used approach for speeding up the training of a random forest is to train many of the underlying decision trees in parallel. Although efficient on traditional datasets, this method cannot be used in a Big Data context, since it requires loading and replicating the dataset of interest multiple times. A better approach is to develop a decision tree algorithm that can itself handle large datasets. Any random forest model built upon such decision trees then inherits the ability to handle Big Data problems.

To take advantage of HPC clusters and to operate on distributed datasets, the decision tree algorithm we propose is based on SPMD (Single Program, Multiple Data) parallelism and on MATLAB's MPI (Message Passing Interface) API. We will present the results obtained with this approach in terms of performance and supported data size.
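As an illustration only, the following Python sketch shows one common way a single decision tree split can be chosen in SPMD style without replicating the data. It is not the implementation presented in the talk, which relies on MATLAB's MPI API; the function names (`local_histograms`, `all_reduce`, `best_split`), the use of Gini impurity, and the fixed list of candidate thresholds are all assumptions made for this example. Each worker scans only its own shard and produces small class-count histograms, and only those histograms are summed across workers. The sum is simulated here with a plain loop standing in for an MPI reduction.

```python
from collections import Counter


def gini(counts):
    """Gini impurity of a mapping from class label to count."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def local_histograms(shard, thresholds):
    """Per-worker step: class counts at or below each threshold, plus totals.

    `shard` is the list of (feature_value, label) pairs held by one worker.
    """
    left = [Counter() for _ in thresholds]
    total = Counter()
    for x, label in shard:
        total[label] += 1
        for i, t in enumerate(thresholds):
            if x <= t:
                left[i][label] += 1
    return left, total


def all_reduce(per_worker):
    """Hypothetical stand-in for an MPI sum-reduction over all workers."""
    n_thresholds = len(per_worker[0][0])
    left = [Counter() for _ in range(n_thresholds)]
    total = Counter()
    for worker_left, worker_total in per_worker:
        total.update(worker_total)  # Counter.update adds counts
        for i in range(n_thresholds):
            left[i].update(worker_left[i])
    return left, total


def best_split(shards, thresholds):
    """Return (threshold, weighted Gini impurity) of the best split, or None."""
    left, total = all_reduce([local_histograms(s, thresholds) for s in shards])
    n = sum(total.values())
    best = None
    for t, l in zip(thresholds, left):
        r = total - l  # class counts on the right side of the split
        n_l, n_r = sum(l.values()), sum(r.values())
        if n_l == 0 or n_r == 0:
            continue  # skip splits that leave one side empty
        score = (n_l * gini(l) + n_r * gini(r)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Toy data: two shards, as if held by two different workers.
shard0 = [(1.0, "a"), (2.0, "a"), (6.0, "b")]
shard1 = [(3.0, "a"), (7.0, "b"), (8.0, "b")]
thresholds = [2.5, 4.5, 6.5]
result = best_split([shard0, shard1], thresholds)
print(result)
```

In this sketch the data exchanged between workers grows with the number of candidate thresholds and classes, not with the number of rows, which is why the pattern suits datasets too large to replicate. Splitting the same rows across one shard or several gives the same chosen threshold.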

Biography: Marc Wolff is an Application Engineer at MathWorks specializing in parallel computing and Big Data. After graduating in Scientific Computing from the University of Strasbourg, Marc completed a Ph.D. in Applied Mathematics at the CEA. During his Ph.D., he contributed to the development of simulation codes that ran on very large computing infrastructures.

