Generating data for model training.
BQA quality channel we saw earlier.

We load abind, which is used to manipulate subsetted arrays of imagery, and reticulate, which is used to navigate back and forth between R and Python. We need reticulate because we will save our final dataset as a collection of .npy numpy files – these are a convenient format for training our mapping model, which is written in Python.

library("RStoolbox")
library("abind")
library("dplyr")
library("gdalUtils")
library("ggplot2")
library("gridExtra")
library("purrr")
library("raster")
library("readr")
library("reticulate")
library("sf")
library("stringr")
library("tidyr")
# setting up python environment
use_condaenv("notebook")
np <- import("numpy")
source("data.R")
theme_set(theme_minimal())
set.seed(123)
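Since the bridge to Python is central to how the dataset gets saved, here is a minimal sketch of the round trip using the np handle imported above. The file name example.npy and the array shape are our own illustrations, not part of the notebook's pipeline.

```r
# Sketch: write an R array to .npy through the numpy handle, then read it back.
arr <- array(runif(3 * 4 * 2), dim = c(3, 4, 2))
np$save("example.npy", arr)         # reticulate converts the R array to numpy
reloaded <- np$load("example.npy")  # converted back to an R array on load
stopifnot(identical(dim(reloaded), dim(arr)))
```

This is the same mechanism used when the final patches are written out: each preprocessed array becomes one .npy file that the Python training code can load directly.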
For training, we use the Kokcha basin, and for evaluation, we use Dudh Koshi. In general, our notebook takes arbitrary lists of basins, specified by links to csv files through the basins parameter in the header. In practice, a larger list of training basins would be used to train the model, but working with one is much more computationally intensive.

We sample patch centers with the st_sample function. Sampling more centers translates into more patches for training the model, but it also increases the chance that training patches overlap. You will see a warning message about st_intersection – it's safe to ignore for our purposes (we are ignoring the fact that the surface of the earth is slightly curved).

The plot below draws an sf object (the basin polygons, y) together with an ordinary data frame (the centers for sampling).

for sampling).p <- ggplot(y, aes(x = Longitude, y = Latitude)) +
geom_sf(data = y, aes(fill = Glaciers)) +
geom_point(data = as.data.frame(centers), col = "red", size = 2) +
scale_fill_manual(values = c("#93b9c3", "#4e326a"))
p
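The center-sampling step above can be sketched in isolation. The unit-square polygon below is a stand-in for a basin outline, not data from the notebook, and the count of 50 is illustrative.

```r
library("sf")
# A unit square standing in for a basin polygon (illustrative only).
poly <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)
))))
# Draw candidate patch centers uniformly at random inside the polygon.
centers <- st_sample(poly, size = 50)
# st_sample targets the requested size but may return slightly more or fewer
# points, since it samples the bounding box and keeps the ones that fall inside.
length(centers)
```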
The two layers are placed on a common map projection by the coord_sf modifier.

Patch extraction is handled by the generate_patch function in the data.R script accompanying this notebook. This function also does all the preprocessing that we mentioned in the introduction. We'll see the effect of this preprocessing in a minute – for now, let's just run the patch extraction code. Note that we simultaneously extract a corresponding label, stored in patch_y. It's these preprocessed satellite imagery–glacier label pairs that we'll be showing to our model in order to train it.

sample_ix <- sample(nrow(patch$x), 100)
x_df <- patch$x[sample_ix, sample_ix, ] %>%
brick() %>%
as.data.frame() %>%
pivot_longer(cols = everything())
ggplot(x_df) +
geom_histogram(aes(x = value)) +
facet_wrap(~ name, scales = "free_x")
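A quick complementary check is the class balance of the label. This sketch assumes patch_y is a binary glacier mask array, which is our reading of the notebook rather than something it states.

```r
# Sketch: fraction of glacier vs. background pixels in the label patch
# (assumes patch_y is a binary mask).
table(patch_y) / length(patch_y)
```

A heavily imbalanced label (far more background than glacier) is common in this setting and worth knowing about before training.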
#write_patches(vrt_path, ys, centers, params$out_dir)
#unlink(params$raw, recursive = TRUE)
For reference, in the ImageNet dataset, a standard benchmark for computer vision problems, images are usually cropped to 256 × 256 pixels.