Performance on any task depends on how the data are represented. A good representation captures the factors of variation relevant to the task at hand while discarding nuisance variables. Because this requirement is task-specific, representations were traditionally hand-engineered using domain knowledge. With the advent of deep learning, this paradigm has shifted toward learning representations jointly with the task. Although representation learning with deep networks has made remarkable progress on natural images, medical images have not benefited to the same degree. This is due to several factors particular to the domain, including relative data scarcity, class imbalance (e.g., many more “normal” images than images showing abnormalities or disease), and objects or patterns of interest that occur at multiple scales and lack clear boundaries. A further challenge is that the tolerance for error is often lower than in tasks involving natural images. As a result, representation learning for medical images still requires solutions tailored to the data and task at hand.
In this thesis, we develop and study methods for learning representations from complex medical data that enable high performance on several downstream tasks, e.g., sequence classification and semantic segmentation. Motivated by the limitations of current approaches, we then turn to a more abstract question in deep learning methodology, generalization in Variational Autoencoders (VAEs), to better understand the relationship between the available training data and the learned representation of the broader population of images from which those data were sampled.
The medical imaging modality we study is Reflectance Confocal Microscopy (RCM), an effective, non-invasive pre-screening tool for skin cancer diagnosis. However, assessing RCM images accurately requires extensive training and experience. Few quantitative tools are available to standardize image acquisition and analysis, and those that exist are not interpretable. In the first part of this work, we use an RNN with attention over CNN features to delineate, in an interpretable manner, the skin strata in vertically oriented stacks of transverse RCM image slices. We introduce a new attention mechanism called Toeplitz attention, which constrains the attention map to have a Toeplitz structure. Testing our model on an expert-labeled dataset of 504 RCM stacks, we achieve 88.07% image-wise classification accuracy, which is the current state of the art.
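To make the Toeplitz constraint concrete: a Toeplitz matrix is constant along each diagonal, so the attention weight between slices i and j depends only on the offset j - i and not on absolute depth. The sketch below is only an illustration of that structure, not the parameterization used in the thesis. The function names, the banded form, and the absence of any normalization are all assumptions made for the example.

```python
from typing import List, Sequence


def toeplitz_attention(offset_weights: Sequence[float], length: int) -> List[List[float]]:
    """Build a length x length map whose entry (i, j) depends only on j - i.

    Hypothetical helper: offset_weights holds 2*k + 1 values for offsets
    -k..k; entries farther than k from the main diagonal are set to zero.
    """
    if len(offset_weights) % 2 == 0:
        raise ValueError("offset_weights must have odd length (offsets -k..k)")
    k = len(offset_weights) // 2
    return [
        [offset_weights[j - i + k] if abs(j - i) <= k else 0.0 for j in range(length)]
        for i in range(length)
    ]


def apply_attention(attention: Sequence[Sequence[float]],
                    features: Sequence[Sequence[float]]) -> List[List[float]]:
    """For each slice, return the attention-weighted sum of all slice features."""
    dim = len(features[0])
    return [
        [sum(a * f[d] for a, f in zip(row, features)) for d in range(dim)]
        for row in attention
    ]


# Three shared weights define the whole 4 x 4 map for a 4-slice stack.
attention_map = toeplitz_attention([0.25, 0.5, 0.25], length=4)
context = apply_attention(attention_map, [[1.0], [2.0], [3.0], [4.0]])
```

Because every row reuses the same few weights, the number of free attention parameters does not grow with stack depth. Note that in this unnormalized sketch the rows at the top and bottom of the stack do not sum to one.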
In the second part of this work, we developed two automated semantic segmentation methods, MU-Net and MED-Net, that provide pixel-wise labeling of RCM images into classes of cell structure patterns. The novelty of our approach is the modeling of textural patterns at multiple resolutions, mimicking the traditional procedure for examining pathology images, which routinely starts at low magnification (low resolution, large field of view) and proceeds to closer inspection of suspicious areas at higher magnification (higher resolution, smaller field of view). We trained and tested our models on non-overlapping partitions of 117 RCM mosaics of melanocytic lesions, an extensive dataset for this application, collected at four clinics in the US and two in Italy. With patient-wise cross-validation, we achieved pixel-wise mean sensitivity and specificity of 70% and 95%, respectively, with a 0.71 Dice coefficient over six classes. In a second scenario, we partitioned the data by clinic of origin and tested the generalizability of the models across clinics. In this setting, we achieved pixel-wise mean sensitivity and specificity of 74% and 95%, respectively, with a 0.75 Dice coefficient. We compared MU-Net and MED-Net against state-of-the-art semantic segmentation models and achieved better quantitative segmentation performance than these previous approaches. Our results also suggest that, owing to their nested multiscale architecture, our models annotate RCM mosaics more coherently, avoiding unrealistically fragmented annotations.
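For readers unfamiliar with the reported metrics, the following is a minimal sketch of how pixel-wise sensitivity, specificity, and the Dice coefficient can be computed one-vs-rest per class and then averaged. It assumes flattened integer label maps and an unweighted mean over classes. The abstract does not state how the thesis aggregates across classes or images, so the averaging scheme and all function names here are assumptions.

```python
from typing import Dict, List, Sequence


def per_class_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                      num_classes: int) -> List[Dict[str, float]]:
    """One-vs-rest sensitivity, specificity, and Dice for each class label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    nan = float("nan")
    results = []
    for c in range(num_classes):
        tp = fp = fn = tn = 0
        for t, p in zip(y_true, y_pred):
            if t == c and p == c:
                tp += 1
            elif t != c and p == c:
                fp += 1
            elif t == c and p != c:
                fn += 1
            else:
                tn += 1
        results.append({
            "sensitivity": tp / (tp + fn) if tp + fn else nan,
            "specificity": tn / (tn + fp) if tn + fp else nan,
            "dice": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else nan,
        })
    return results


def macro_mean(results: Sequence[Dict[str, float]], key: str) -> float:
    """Unweighted mean over classes, skipping classes where the metric is undefined."""
    values = [r[key] for r in results if r[key] == r[key]]  # NaN != NaN
    return sum(values) / len(values)


# Toy example: six pixels, three classes.
truth = [0, 0, 1, 1, 2, 2]
prediction = [0, 1, 1, 1, 2, 0]
metrics = per_class_metrics(truth, prediction, num_classes=3)
mean_dice = macro_mean(metrics, "dice")
```

With imbalanced classes, specificity tends to be high because true negatives dominate, which is one reason to read it alongside sensitivity and Dice instead of in isolation.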
Last, we examine the generalization of the latent representations in VAEs. The VAE objective combines a reconstruction loss (the distortion) and a KL divergence term (the rate) that is often interpreted as a regularizer. Our work re-examines this view. We perform rate-distortion analyses in which we control the strength of the KL term, the network capacity, and the difficulty of the generalization problem. Lowering the coefficient of the KL term hurts generalization in low-capacity models but, paradoxically, improves generalization in higher-capacity models. Moreover, in easier generalization tasks (where training set examples closely approximate test set examples), lowering the coefficient improves generalization even in low-capacity models. These results indicate that the KL term does not reliably improve generalization as measured by reconstruction loss, and they motivate future work on which inductive biases can aid generalization in this class of models.
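The objective described above can be written in its commonly used weighted form as follows. The notation is an assumption, since the abstract defines no symbols: $q_{\phi}(z \mid x)$ is the encoder, $p_{\theta}(x \mid z)$ the decoder, $p(z)$ the prior, and $\beta$ the coefficient on the KL term that the analysis varies.

```latex
\mathcal{L}_{\beta}(\theta, \phi)
  = \underbrace{\mathbb{E}_{q_{\phi}(z \mid x)}\left[-\log p_{\theta}(x \mid z)\right]}_{\text{distortion } D}
  + \beta \, \underbrace{\mathrm{KL}\left(q_{\phi}(z \mid x) \,\|\, p(z)\right)}_{\text{rate } R}
```

Setting $\beta = 1$ recovers the standard negative evidence lower bound, and lowering $\beta$ weakens the penalty on the rate relative to the distortion.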