Each image segmented by one of the six thresholding algorithms was compared to its corresponding ground truth segmentation from the SWIMSEG data set. A pair of these images is shown below in Figure 1. For each image pair, the numbers of true positive, false positive, true negative, and false negative pixels were computed for the segmented image. In the context of the images in Figure 1, these categories represent the following (a short counting sketch appears after the list):

  • True positive: a pixel identified as cloud in both the segmented image and the ground truth image.
  • True negative: a pixel identified as sky in both the segmented image and the ground truth image.
  • False positive: a pixel identified as cloud in the segmented image but as sky in the ground truth image.
  • False negative: a pixel identified as sky in the segmented image but as cloud in the ground truth image.
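
As a rough illustration, these counts could be accumulated from a pair of binary masks along the following lines. This is only a sketch, assuming NumPy arrays that use 1 for cloud and 0 for sky; the function and variable names are mine, not taken from the project code.

```python
import numpy as np

def confusion_counts(segmented: np.ndarray, ground_truth: np.ndarray):
    """Count TP/TN/FP/FN pixels for a pair of binary cloud masks.

    Both masks are assumed to use 1 (True) for cloud and 0 (False) for sky.
    """
    seg = segmented.astype(bool)
    gt = ground_truth.astype(bool)
    tp = int(np.sum(seg & gt))      # cloud in both masks
    tn = int(np.sum(~seg & ~gt))    # sky in both masks
    fp = int(np.sum(seg & ~gt))     # cloud in segmentation, sky in ground truth
    fn = int(np.sum(~seg & gt))     # sky in segmentation, cloud in ground truth
    return tp, tn, fp, fn
```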

From these counts, the following metrics were calculated; the formula for each is given in brackets, and a sketch of the calculation follows the list.

  • Precision: How many of the detected cloud pixels are actually part of a cloud? [TP/(TP + FP)]
  • Recall/Sensitivity/True Positive Rate: What fraction of the total cloud pixels were detected? [TP/(TP + FN)]
  • F1 Score: Harmonic mean of precision and recall [2(P × R)/(P + R)]
  • Specificity/True Negative Rate: What fraction of the total sky pixels were detected? [TN/(TN + FP)]
  • Accuracy: Percentage of pixels that are correctly classified [(TP + TN)/(TP + TN + FP + FN)]
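
Building on the counting sketch above, these five metrics could be computed as shown below. Again, this is an illustrative sketch (the zero-denominator guards are my own convention), not the project's exact code.

```python
def segmentation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, F1, specificity, and accuracy from pixel counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
        "accuracy": accuracy,
    }
```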

Figure 1: Comparison to Ground Truth

For the fixed thresholding techniques, precision, recall (sensitivity), and specificity were averaged over the training images at each threshold and used to generate the Precision-Recall and ROC curves shown below (Figures 2 and 3). In both plots, the optimal operating threshold occurs where the curve most closely approaches (1, 1). For the adaptive thresholding techniques, an optimal threshold is calculated for each image, so only a single point, representing the average value over the training set, could be plotted.
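
To illustrate how the fixed-threshold curves could be generated, the sketch below sweeps a range of thresholds over a per-pixel feature and averages the metrics across the training set, reusing the helpers from the earlier sketches. The feature shown, a normalized blue-red ratio (b − r)/(b + r), and the convention that low values indicate cloud are assumptions for the example, not necessarily the exact feature used by every fixed-threshold method compared here.

```python
import numpy as np

def red_blue_ratio(image: np.ndarray) -> np.ndarray:
    """Normalized blue-red ratio; high for clear (blue) sky, low for cloud.

    `image` is an RGB float array with values in [0, 1].
    """
    r, b = image[..., 0], image[..., 2]
    return (b - r) / (b + r + 1e-6)

def sweep_thresholds(images, ground_truths, thresholds):
    """Average precision/recall/specificity over the training images at each
    fixed threshold, producing the points of a PR curve and an ROC curve."""
    curve_points = []
    for t in thresholds:
        per_image = []
        for img, gt in zip(images, ground_truths):
            seg = red_blue_ratio(img) < t  # low ratio => cloud (assumed convention)
            tp, tn, fp, fn = confusion_counts(seg, gt)
            per_image.append(segmentation_metrics(tp, tn, fp, fn))
        avg = {k: float(np.mean([m[k] for m in per_image])) for k in per_image[0]}
        curve_points.append((t, avg["precision"], avg["recall"], avg["specificity"]))
    return curve_points
```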

Figure 2: Precision-Recall Curves for Varying Thresholds

Figure 3: ROC Curves for Varying Thresholds

The values plotted above represent average performance over a randomly selected set of training images. However, it seemed likely that different algorithms would perform more favorably on certain types of images. To get an informal idea of how the techniques performed on different categories of clouds, I ran four of the six algorithms on a set of thirty images. The results in Figure 4 compare the performance of the four algorithms on each image using the F1 score as the metric. I’ve selected a few of these images (Figure 5) to show the high variability in performance across different types of cloud images. Certain pictures, such as number 23 below, contain thin, wispy clouds that are not identified well by the fixed thresholding methods. Others, such as number 14, include parts of the sky near the sun, where solar wash-out can lead to overestimates of the actual cloud cover. This variability in performance across cloud types motivated me to pursue some form of cloud categorization in addition to detection.

Figure 4: Performance By Image

Figure 5: Visual Comparison of Algorithms on Four Images

As a final comparison, the F1 score of each algorithm, evaluated at that algorithm's optimal threshold, is shown below in Figure 6. Based on all of these comparisons, I chose Red-Blue Ratio Adaptive Thresholding using Otsu's Algorithm. This technique performed well on the training set and was also more resilient to changes in cloud cover type and lighting than the fixed thresholding methods.
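
For reference, the chosen technique can be sketched as follows, using scikit-image's threshold_otsu to pick a per-image threshold on the same assumed blue-red ratio feature from the earlier sketch. The cloud/sky convention is again an assumption, not a detail confirmed by the figures.

```python
import numpy as np
from skimage.filters import threshold_otsu

def segment_clouds_otsu(image: np.ndarray) -> np.ndarray:
    """Adaptive cloud segmentation: Otsu's method on the normalized blue-red ratio.

    `image` is an RGB float array in [0, 1]; returns a boolean mask in which
    True marks pixels classified as cloud.
    """
    ratio = red_blue_ratio(image)   # helper from the earlier sketch
    t = threshold_otsu(ratio)       # per-image threshold chosen by Otsu's method
    return ratio < t                # low ratio => cloud (assumed convention)
```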

Figure 6: F1 Score Comparison of Algorithms