Project Details

Extensions of the random forest algorithm and a simple inference procedure for machine learning approaches

Applicant: Dr. Roman Hornung
Subject Area: Epidemiology and Medical Biometry/Statistics; Medical Informatics and Medical Bioinformatics
Term: since 2014
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 266459004
 
This is a renewal proposal for a project in which several extensions of the random forest (RF) algorithm were developed to solve practically relevant problems. The new proposal consists of three parts. RF methodology is at the core of the first two parts and plays a less important role in the third. In the first part, we will develop an RF variant tailored to multi-class outcomes. While this variant may improve the predictive performance of RFs, its clear advantage will be that its variable importance measure better accounts for the multi-class nature of the outcome. This fills an important gap, as to date there appear to be no established variable importance measures tailored to multi-class outcomes. The proposed RF variant uses the previously developed diversity forest algorithm. In the second part, we will develop another RF variant, global forests. The trees in global forests will improve on the structure of classical trees by considering interdependent splits, which makes it possible to better exploit interaction effects between the covariates. This is expected to lead to improved variable importance values for covariates that have a strong effect through their interaction with other covariates, and it may also improve predictive performance. The third part will develop a simple, general inference procedure for machine learning (ML) algorithms. This procedure is proposed in light of growing concern that conclusions drawn from ML models are often treated as fixed without questioning their statistical significance. The proposed procedure is conservative, computationally feasible, applicable to any ML method, very easy to implement, and intuitively understandable. It uses bootstrap sampling but is dramatically less computationally expensive than classical bootstrap analysis. For the first and second parts of this proposal, we will perform extensive simulation studies and real data analyses to study the properties of the proposed RF variants. Both variants will be implemented in our R package 'diversityForest'. The key properties of the inference approach proposed in the third part can be easily derived analytically. Therefore, only illustrative analyses will be performed in this part. Here, we will demonstrate the applicability of the proposed approach to various concepts in ML that are usually not covered by classical inference techniques. We will use RFs in all but one of these illustrative analyses.
DFG Programme: Research Grants
 
 

