Generating data for model training.
This section makes use of the BQA quality channel we saw earlier.

Among the libraries loaded below, two may be less familiar: abind, which is used to manipulate subsetted arrays of imagery, and reticulate, which is used to move back and forth between R and Python. We need reticulate because we will save our final dataset as a collection of .npy numpy files – these are a convenient format for training our mapping model, which is written in Python. A short sketch of this save / load workflow follows the setup code below.

library("RStoolbox")
library("abind")
library("dplyr")
library("gdalUtils")
library("ggplot2")
library("gridExtra")
library("purrr")
library("raster")
library("readr")
library("reticulate")
library("sf")
library("stringr")
library("tidyr")
# setting up the python environment and importing numpy for .npy output
use_condaenv("notebook")
np <- import("numpy")

# helper functions used throughout this notebook (generate_patch, label_mask, ...)
source("data.R")

theme_set(theme_minimal())
set.seed(123)
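Since reticulate will carry our arrays into Python, it is worth a quick check of the round trip described above. This is a minimal sketch – the file name and array shape are arbitrary examples:

# save a small R array as .npy (reticulate converts it to a numpy array)
arr <- array(runif(2 * 3 * 4), dim = c(2, 3, 4))
np$save("example.npy", arr)

# load it back; with reticulate's default conversion, we get an R array again
arr2 <- np$load("example.npy")
all.equal(arr, arr2)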
For training, we use the Kokcha basin, and for evaluation, we use Dudh Koshi. In general, our notebook takes arbitrary lists of basins, specified by links to csv files through the basins parameter in the header. In practice, a larger list of training basins would be used to train the model, but working with that is much more computationally intensive.

Patch centers are sampled within the basin using the st_sample function. Sampling more centers translates into more patches for training the model, but it also increases the chance that training patches overlap. You will see a warning message about st_intersection – it's safe to ignore that for our purpose (we are ignoring the fact that the surface of the earth is slightly curved).

Notice that the plot below overlays an sf object (the glacier labels, y) together with an ordinary data frame (the centers for sampling). A sketch of the sampling step itself appears just after the plot.

p <- ggplot(y, aes(x = Longitude, y = Latitude)) +
geom_sf(data = y, aes(fill = Glaciers)) +
geom_point(data = as.data.frame(centers), col = "red", size = 2) +
scale_fill_manual(values = c("#93b9c3", "#4e326a"))
p
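As mentioned above, the centers come from sf's st_sample. Here is a minimal sketch of that step, assuming y is the sf object of basin polygons and using an illustrative sample size of 100; the Longitude / Latitude column names are set only to match the plotting code:

# sample candidate patch centers inside the polygons; st_sample warns
# that it assumes planar coordinates, which we deliberately ignore here
center_points <- st_sample(y, size = 100)

# convert the sfc point geometry to a plain coordinate matrix
centers <- st_coordinates(center_points)
colnames(centers) <- c("Longitude", "Latitude")
head(centers)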

Back to the map itself: if needed, its projection and bounding box could be adjusted with the coord_sf modifier.

Patches are extracted by generate_patch in the data.R script accompanying this notebook. This function also does all the preprocessing that we mentioned in the introduction. We'll see the effect of this preprocessing in a minute – for now, let's just run the patch extraction code. Note that we simultaneously extract a corresponding label, stored in patch_y. It's these preprocessed satellite image–glacier label pairs that we'll be showing to our model in order to train it. A rough sketch of the kind of cropping and rescaling generate_patch performs follows the panel of plots below.

patch <- generate_patch(vrt_path, centers[5, ])
# rasterize the glacier labels onto the same grid as the extracted patch
patch_y <- label_mask(ys, patch$raster)

p <- list(
  plot_rgb(brick(patch$x), c(5, 4, 2), r = 1, g = 2, b = 3),  # composite of bands 5, 4, and 2
  plot_rgb(brick(patch$x), rep(13, 3)),                       # channel 13 on its own, in grayscale
  plot_rgb(brick(patch_y), r = NULL)                          # the glacier label mask
)
grid.arrange(grobs = p, ncol = 3)
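generate_patch itself lives in data.R, so the code above treats it as a black box. As a rough illustration of the kind of cropping and per-band rescaling such a function might perform – a minimal sketch only, with an arbitrary window half-width, not the actual implementation – consider:

# crop a window around a sampled center and min-max rescale each band;
# the 0.05 degree half-width is an illustrative choice, not the real one
crop_and_normalize <- function(vrt_path, center, half_width = 0.05) {
  r <- brick(vrt_path)
  window <- extent(
    center[1] - half_width, center[1] + half_width,
    center[2] - half_width, center[2] + half_width
  )
  patch <- crop(r, window)

  # normalize each band separately so channels live on comparable scales
  bands <- lapply(1:nlayers(patch), function(i) {
    x <- patch[[i]]
    mn <- cellStats(x, "min")
    mx <- cellStats(x, "max")
    (x - mn) / (mx - mn)
  })
  stack(bands)
}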

# randomly subsample rows and columns so the histograms are cheap to compute
sample_ix <- sample(nrow(patch$x), 100)
x_df <- patch$x[sample_ix, sample_ix, ] %>%
  brick() %>%
  as.data.frame() %>%
  pivot_longer(cols = everything())

# one histogram per channel, to inspect the effect of the preprocessing
ggplot(x_df) +
  geom_histogram(aes(x = value)) +
  facet_wrap(~ name, scales = "free_x")
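To complement the histograms, we can summarize each channel numerically; this sketch reuses the subsampled x_df from above:

# per-channel range and mean on the subsampled pixels
x_df %>%
  group_by(name) %>%
  summarise(min = min(value), mean = mean(value), max = max(value))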

# to write the full dataset, uncomment these lines: write_patches saves all
# the image / label pairs to out_dir, and unlink removes the raw imagery
#write_patches(vrt_path, ys, centers, params$out_dir)
#unlink(params$raw, recursive = TRUE)
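write_patches is also defined in data.R. A minimal sketch of what such a helper might do – assuming patch$x and the label mask are plain R arrays, and with a hypothetical file-naming pattern – would loop over the centers and hand each pair to numpy:

# extract every patch / label pair and save each as a pair of .npy files;
# the x_%03d / y_%03d naming scheme here is an assumption for illustration
save_all_patches <- function(vrt_path, ys, centers, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_len(nrow(centers))) {
    patch <- generate_patch(vrt_path, centers[i, ])
    mask <- label_mask(ys, patch$raster)
    np$save(file.path(out_dir, sprintf("x_%03d.npy", i)), patch$x)
    np$save(file.path(out_dir, sprintf("y_%03d.npy", i)), mask)
  }
}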
For reference, in the ImageNet dataset, which is a standard benchmark for computer vision problems, images are usually cropped to 256 \(\times\) 256 pixels.↩︎