# Prostates, MRIs, forests

This post is on my undergrad research with the Image Processing and Analysis Group at Yale, which I’m presenting at SPIE Medical Imaging 2016. The goal is to predict the location of cancer within the prostate using multiparametric magnetic resonance imaging.

###A crash course in prostate cancer diagnosis

• The bad: The current, run-of-the-mill way to diagnose prostate cancer being used at the vast majority of hospitals is an ultrasound-guided needle biopsy. This is not a great system. Tumors don’t really show up in ultrasound, so the biopsy is performed in an undirected fashion using a grid over the prostate. Moreover, this tells us nothing about the size or shape of tumors when we get back a positive core.
• The good: By changing the way the magnets behave during an MRI scan, we get multiparametric MRI (mpMRI): images of the same region of the body with different medical significances. We call these different image types parameter maps or channels, and an example is down below. Unlike ultrasound, we can see tumors using mpMRI. And some really awesome research on ultrasound/MRI fusion imaging allows us line up incoming ultrasound images in real-time during a procedure with MR images we’ve already taken (since they can’t be done in real-time). That means we can use regions suspected of being cancer (ROIs) on an MR image as targets for the biopsy procedure. This leads to intelligent biopsies and intelligent treatment, since we know something about the tumors ahead of time.
• The goal: To identify ROIs on MR images, highly skilled radiologists need to analyze the MR images in-depth. MRI is done in 3D by taking 2D image slices at different heights through the pelvic region (where the prostate is). So we may have 80 images per each mpMRI channel. Most hospitals don’t have the resources for this. But using supervised learning, we might be able to identify the ROIs automatically.

###Learning to detect cancer What we want is to predict the probability of cancer being present at every voxel (a 3D pixel) position in the MR image volume of the prostate. The plan is to acquire mpMRI data that will be useful in discriminating between normal and cancerous tissue, develop a feature set which adequately describes the complex appearance of tissue, and then come up with a suitable learning model.

#####The image volumes The example above shows a slice from each of the mpMRI parameter maps (or channels) we’re using.

• T2 MRI is the classic MRI showing larger tissue structure
• Diffusion weighted MRI measures how water molecules diffuse through tissue, which gives an idea of the microstructural tissue properties. The apparent diffusion coefficient (ADC) is a useful metric computed from DWI.
• Dynamic contrast enhanced MRI is a fancy new technique. A contrast agent (i.e. a fluid which lights up on MRI) is injected into the prostate blood supply at time zero and an MRI movie (so now we’re in 4D) is taken as the contrast/blood moves around. What we’re interested in is how the blood gets taken up by different areas of tissue; since we injected contrast, this corresponds to the time-varying brightness. We can summarize this high-dimensional characterization by fitting a pharmacokinetic model to each pixel, which results in the two parameters (Ktrans and kep) shown above.

Since normal and cancerous tissue have different structural (T2), microstructural (ADC), and vascular (Ktrans and kep) properties, this mpMRI data set should give us the starting point we want.

#####All about feature engineering The simplest feature set for a voxel would be to use the intensity values for the four channels: $$[t, a, k, p]$$ where $$t$$ is T2, $$a$$ is ADC, $$k$$ is Ktrans, and $$p$$ is kep. We can do something smarter by exploiting the natural structure in images. Instead of just taking the values at the given voxel, take all the values within a 5 by 5 by 5 patch in each channel centered at the voxel. We linearize the patches to form $$[t_1, … , t_{125}, a_1, … , a_{125}, … , k_1, … , k_{125}, p_1, … , p_{125}]$$.

This is a first order approach to describing appearance, and humans need at least second order statistics to discern between two textures. Think of trying to discriminate between two different normal distributions: what we really need are the means and variances. For each of the four channel image volumes, we’ll generate seven more image volumes. Each voxel location in the new volumes corresponds to a different description of the texture in the vicinity of the voxel in the original image volume. For the image processing buffs out there, I’m using median filtering, standard deviation filtering, the Sobel gradient, and four Haralick texture features. See an example below. Now we have 32 image volumes (including the original) and we’re doing the patch thing in each. That’s 32 feature types and 4000 features for each voxel.

Let’s make it 33 and 4001. The final feature is a bit different. For diagnostic purposes, the prostate is divided into three zones with different tissue characteristics. The zones also show different tumor appearance signatures. We don’t know exactly where the zones are, but we can approximate each voxel’s zonal membership by finding its distance to the prostate boundary. This distance is the final feature.

#####The model: GDA and random forests

Let’s recap the problem. Given mpMRI data from several training MRI volumes, predict the probability of cancer being present at every voxel location in a test volume. We’re posing this as binary classification, and here are the steps:

1. Generate training data
2. For every voxel within the yellow prostate boundaries in the training volumes, generate a feature vector as described above
3. Label feature vectors as cancer if the voxel is inside a red boundary confirmed as cancer by biopsy, and label as normal otherwise
4. Learn model
5. Train two-class Gaussian discriminant analysis (GDA) model on a small subset of the features
6. Predict the GDA response for each training voxel, and append it to the feature vector
7. Train a random forest model on the GDA-enhanced training data, and compute out-of-bag feature importances
8. Select a new feature set using the out-of-bag feature importances
9. Train a final random forest model using only selected features
10. Predict
11. For every voxel within the yellow prostate boundary in the test volume, generate a feature vector as described above
12. Predict the GDA response using trained model for each test voxel, and append it to the feature vector
13. Predict the cancer probability for each voxel using final random forest model and selected features

We’ve covered steps 1 and 3.1. We just need to take care of step 2 (steps 3.2 and 3.3 follow). In designing our learning approach, I want to keep two things in mind: problems like this are often both simple and complex and not all features are created equal.

The first point means that a simple model will get us most of the way there, but to push our predictive accuracy to the max we’ll need to employ a more complex model. This motivates the use of an initial GDA model and then a random forest. For those not familiar with the basics of GDA or random forests, see the appendix below. We first learn a three-dimensional normal/cancer GDA model using the voxel intensities in the three most discriminative maps. These are almost always the median-filtered maps for T2, ADC, and kep. We compute the probability of cancer as given by the GDA model for every training example, and append these values to the feature vector. Then we learn a random forest model using the GDA-enhanced training data. On its own, the GDA model does ok but not great. We use it as a way to quickly generate initial probability estimates that will seed the forest in a sense. A forest using an initial GDA model also requires far fewer trees to converge to a minimal prediction error than one without an initial model.

The second point means that we’ll want to trim down our 4001 features to find the most powerful ones. When training the random forest on the GDA-enhanced data, we’ll compute the out-of-bag predictor importances (details in the appendix) and then average them within each feature type to get a measure of importance for each feature type. We cut out the least important ones to train our final random forest model on. See the figure below.

One small implementation note. For the distance feature and the learned GDA “feature”, there’s only a single value for each voxel. But with all the other feature types, there are 125 values (from taking the 5 by 5 by 5 patches). We want to keep the sampling proportions even in the random predictor subspace step for random forest learning, and the easiest way to do that is repeat those two features 125 times each. That means our full feature vectors have 4500 predictors.

###Results in brief We tested it using leave-one-patient-out cross-validation on our 12 patient data set. That means holding each patient out once as the test MRI volume, and training our model on the remaining 11. This mimics a real world case of using the predictions - based on previous patient cases - to guide a biopsy procedure. In short, it works pretty well.

#####Qualitative assessment The best way to assess how well the method works is to compare the probability maps to the expert tracings done by a radiologist. See the figure below.

The performance shown in the first four rows is representative of almost all of the cases. On a region level, we see almost perfect sensitivity (all tumors detected) and specificity (no false positive regions). The final row shows a case where the tumor is correctly detected, but it’s not perfectly specific: other areas show a high probability of cancer. In total, 21 of 22 tumors in the 12 patients were detected, and specificity was high.

#####Quantitative metrics and model analysis We can also look at quantitative metrics, which are more helpful for model comparison than really assessing the system per se. Here we’ll use the receiver-operator characteristic curve (ROC curve), the standard for binary classification where class probabilities are output. It judges how well sensitivity is balanced with specificity. We’re using the ROC on a voxel-by-voxel level, meaning we’re asking how good the prediction for each individual voxel was. To obtain an overall ROC curve for the model, we can vertically average the 12 ROC curves generated by cross-validation. The ROC curve is summarized by the area under the curve (AUC): an AUC of 1 means perfect prediction, an AUC of 0.5 is as good as coin-flipping, and an AUC of 0 means you probably switched your class labels by accident (or really need to rethink your approach). Our approach achieved an average ROC AUC of 0.9228. See the figure below.

For some model analysis, we made predictions using GDA alone, a random forest alone, and a related support vector machine approach (Moradi et al., 2012). Using visual analysis and ROC comparisons, we found the joint GDA-forest model performed best. See below.

Model GDA-Forest GDA Random forest SVM
Average ROC AUC 0.923 0.843 0.916 0.874
ROC AUC std. dev. 0.064 0.186 0.097 0.102

We can also use the model to investigate the contributions of the different mpMRI channels. For instance, we can rerun the entire procedure pretending we only had one channel available and then compare the ROC curves. We found that they mpMRI channels do indeed complement each other. See the figure below.

#####Conclusions and future work Let’s wrap up with some conclusions. The combination of multiparametric MRI and machine learning looks to be a powerful diagnostic tool. We also learned something about mpMRI itself: the structural diffusive, and vascular tissue characterizations it provides indeed complement each other in a diagnostic setting. There’s still plenty of work to be done. Using an automated prostate gland and zone segmenter would be a big improvement, and there’s always room for more specific pathological verification. Incorporating these improvements would create a more robust system. Fast, automated tumor detection for targeted prostate biopsies is certainly achievable using this model, and we’re closer to a day when rigorous quantitative decision making for cancer diagnosis is widespread.

###Appendix: A tale of two classifiers Gaussian discriminant analysis (GDA) and random forests are very different learning approaches. GDA is a fairly simple generative model. A random forest is a fairly complex discriminative model. For our purposes, they complement each other. I’ll give a general overview of GDA assuming knowledge of linear discriminant analysis (if you want more info, try the link above), and more specific discussion of random forests.

#####Gaussian discriminant analysis Chances are you’ve heard of linear discriminant analysis (LDA), where we assume the class-conditional PDFs are normally distributed. This is the basis for GDA. LDA restricts the covariance matrices of the different classes be the same, but in our formulation of GDA, we make no such restriction. This generates a quadratic decision boundary, and for this reason, it is sometimes referred to as quadratic discriminant analysis. As with any relaxation of generative assumptions, GDA calls for more data than LDA in order to learn the covariance matrices.

Here, we use GDA to output class probabilities, not class assignments. So for me, it’s more helpful to think of GDA in terms of Gaussian mixture models (GMMs). In the two-class problem, we’ll think of each observation as being drawn from the mixture of two classes. We want to learn the two class distributions - assumed to be normal - in order to output the mixing proportions. This is the same setup as a classic unsupervised GMM, except that we know the class assignments of the training data. So the learning step turns into a simple maximum likelihood estimation where we compute the mean and covariance matrices as normal for a normal distribution (i.e. no need for expectation-maximization).

#####Random forests I could talk a lot about the random forest learning algorithm, too often thought of as a blackbox these days. Sure, it’s really powerful, but that’s only part of the fun. It has a beautiful statistical motivation, it’s a nice culmination of simpler ideas, has some clever tools for model selection, and it really isn’t that tricky when you get to know it. I’m going to synthesize material from two sources - Leo Breiman himself and The Elements of Statistical Learning - and try to be brief. However, both are really accessible, so definitely take a look at them.

If you don’t know what CARTs (classification and regression trees) are, read the chapter in The Elements of Statistical Learning. Decision trees are the basis for random forests (hence the name). Trees are really great for some things: they’re fast, have zero bias if fully grown, and are robust to mislabeled inputs. However, with small bias comes great variance. That is, they notoriously overfit when fully grown with the leaves containing potentially very few training examples.

Breiman first addressed this problem with bootstrap aggregation for trees, or tree bagging. Given our data set $$\mathbf{X}$$, we generate $$T$$ bootstrap samples $$\mathbf{X}^\ast_1,…,\mathbf{X}^\ast_T$$. For each replicate $$\mathbf{X}^\ast_t$$, we then learn a decision tree $$f_t$$ on it. The set $$F = \lbrace f_1,…,f_T\rbrace$$ forms the ensemble. To classify a new case $$x$$, we classify it with each tree individually, and then take the majority vote. That is $$F(x) = \mathrm{mean}_t f_t(x)$$. Let’s see why this works. Since we used the bootstrap, the trees are identically distributed. However, they are obviously not independent; we used very similar samples to learn them. Thinking about the ensemble, we note that the bias of the average is the same as the bias of any of the individual trees (which is very low if the trees are grown sufficiently deep). However, the variance isn’t as simple. If we assume they each have variance $$\sigma^2$$ and non-negative pairwise correlation $$\rho$$, then $\mathrm{var}(F) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$ We can kill off the second term by using more and more bootstrap samples. But we need to alter the tree learning phase to shrink the first term $$\rho\sigma^2$$. That’s where random forests come in. The random forest algorithm is the same as the tree bagging algorithm, except that each time we make a new split during the growing of each of the trees, we consider only a random subset of the possible predictors instead of all of them. Breiman even showed there’s an optimal subset size for classification: if we have $$p$$ predictors, select $$\sqrt{p}$$ of them uniformly at random for the subset. In effect, this reduces $$\rho$$. Breiman suggests growing the trees to full depth, meaning the ensemble has zero bias and variance shrunk from a large $$B$$ and smaller $$\rho$$. However, you could sacrifice some bias for a reduction in variance $$\sigma^2$$ by limiting the depth of the tree either explicitly or by setting a minimum leaf size.

Random forests come with a couple clever techniques for model selection. Both rely on the out-of-bag (OOB) samples: the OOB samples for tree $$f_t$$ are the observations in $$\mathbf{X}$$ which are not in $$\mathbf{X}^\ast_t$$. Breiman introduces the following two measures:

• OOB error: Have each tree classify its OOB cases. For training example $$x_j$$ with true class $$c_j$$, let $$c’_j$$ be the most frequent classification of $$x_j$$ when it is an OOB case. The OOB error for the forest is then $\mathrm{err}\_\mathrm{OOB} = \mathrm{mean}\_j\; 1[c\_j \neq c'\_j]$
• OOB feature importance: Suppose we want to compute the importance of predictor $$i$$. Have each tree classify its OOB cases. Let $$m_t$$ be the number of OOB cases correctly classified by tree $$f_t$$. Now randomly permute the values of predictor $$i$$ in the OOB cases of tree $$f_t$$, and have it classify them again. Let $$m’_t$$ be the number of OOB cases correctly classified by $$f_t$$ after permutation. The raw importance of feature $$i$$ is then $\delta\_i = \mathrm{mean}\_t \; m\_t - m'\_t$The scores are normalized by assigning a significance level after converting to z-scores.

The OOB error is primarily used to decide how many trees to learn for the forest. If the OOB error stops decreasing after 100 trees, there’s no need to learn 10,000. The OOB feature importances are used for model selection. For instance, we can learn a forest with all the features and compute their importances, then remove the least important ones. This can be repeated until the feature set is satisfactory. Computing these measures can significantly increase the learning run time, so use them intelligently.