library(greta.gp)
library(tidyverse)
theme_set(theme_classic())
set.seed(479)Modeling Stellar Variability with GPs
Stellar light curves exhibit what astronomers call quasi-periodic variability (Angus et al. 2017). Periodicity comes from stellar rotation, and departures can be due to unevenness on the star surface (e.g., spots). We model an example light curve using Gaussian Processes. The associated kernel characterizes quasiperiodicity, which is of central scientific interest.
First, let’s download and plot the data.
# Create a data frame
data <- read_csv("https://zenodo.org/records/18371236/files/kplr005809890-2012179063303_llc.csv?download=1")
ggplot(data, aes(x = time, y = flux)) +
geom_line(color = "black", linewidth = 0.3) +
labs(
x = "Time [days]",
y = "Relative flux",
title = "KIC 5809890"
)Before fitting a Gaussian Process, we use a periodogram to identify some dominant frequencies. Note that it doesn’t give any uncertainty estimate or allow for any goodness-of-fit analysis, like a GP model.
# Compute the median time spacing (sampling rate) in days
delta_t <- median(diff(data$time))
spec <- spectrum(data$flux, pad = 6, plot = FALSE)
# Convert frequencies to actual periods in days.
periodogram_data <- tibble(
frequency = spec$freq / delta_t,
period = 1 / (spec$freq / delta_t),
power = spec$spec
) |>
filter(period > 2, period < 100)
period_max <- periodogram_data |>
slice_max(power) |>
pull(period)Peaks appear near 13 and 30 days. These are candidate rotation periods, and we’ll use them when initializing our model later on.
ggplot(periodogram_data, aes(x = period, y = power)) +
geom_line() +
scale_x_log10() +
labs(
x = "Period [days] (log scale)",
y = "Spectral Power"
)Model Definition
We subsample to reduce our later runtimes.
data_model <- data |>
arrange(time) |>
slice(seq(1, n(), by = 10))
ggplot(data_model) +
geom_point(aes(time, flux))The quasi-periodic kernel combines periodic structure with exponential decay:
\[k(t_i, t_j) = \sigma^2 \exp\left[-\frac{(t_i - t_j)^2}{2\ell^2} - \frac{2\sin^2(\pi|t_i - t_j|/P)}{\ell_p^2}\right] + \sigma_k^2\delta_{ij}\]
This decomposes as \(k = k_{\text{per}} \times k_{\text{rbf}} + k_{\text{noise}}\). The RBF component allows correlation strength across periods – it’s what makes the kernel quasiperiodic rather than just periodic.
We’ll put priors on hyperparameters. This provides an efficient alternative to cross validation that works nicely when the model is fully probabilistic, like in the case of GPs.
library(greta.gp)
logs2 <- normal(-1, 2)logamp <- normal(log(var(data_model$flux)), 2)
logperiod <- normal(log(25), 2) # Wide prior for period
logdecay <- normal(log(25 * 3), 2) # How long spots last (days)The code below constructs the kernel above using greta.gp.
k_periodic <- periodic(exp(logperiod), 1, exp(logamp))
k_decay <- rbf(exp(logdecay), 1)
full_kernel <- k_periodic * k_decay + white(exp(logs2))Estimation
With our kernel, we can now specify our Gaussian process model.
gp_model <- gp(data_model$time, full_kernel)
distribution(data_model$flux) <- normal(gp_model, data_model$flux_err)
m <- model(logperiod, logdecay, logamp, logs2)The readings discuss that the marginal likelihood may have multiple local maxima. To find a good optimum, it helps to run initialize with the periodogram result.
start_vals <- initials(logperiod = log(25))
ml_soln <- opt(m, initial_values = start_vals)
optimized_period <- exp(ml_soln$par$logperiod)
print(paste("Optimized rotation period:", round(optimized_period, 2), "days"))[1] "Optimized rotation period: 30.56 days"
We use MCMC to explore the full posterior.
draws <- mcmc(m, initial_values = do.call(initials, ml_soln$par), n_samples = 500, warmup = 1000, chains = 1)Posterior Analysis
The function project realizes the function \(\mathbf{f}\) at a finite collection of points time_grid. Our posterior mean is defined as the average
\[\mathbb{E}[\mathbf{f}_* \mid y] \approx \frac{1}{N}\sum_{i=1}^N \mathbf{f}_*^{(i)}\]
where each \(\mathbf{f}_*^{(i)}\) is the GP realization at the grid points for the \(i\)-th posterior sample of hyperparameters.
time_grid <- seq(min(data_model$time), max(data_model$time), length.out = 1000)
f_posterior <- project(gp_model, time_grid) |>
calculate(values = draws)Let’s plot the result.
fit_df <- data.frame(
time = time_grid,
flux = colMeans(f_posterior[[1]])
)
ggplot() +
geom_line(data = fit_df, aes(time, flux),
color = "#3bd3ccff", linewidth = 1) +
geom_point(data = data_model, aes(time, flux),
size = 1.2, alpha = 0.6) +
labs(
title = "Posterior Mean of f",
x = "Days",
y = "Flux"
)To answer domain questions, we should inspect the posterior over parameters. Compared to the earlier periodogram, we can now quantify uncertainty over plausible periodicity values.
data.frame(period = as.numeric(exp(draws[, "logperiod"][[1]]))
) |>
ggplot(aes(period)) +
geom_histogram() +
scale_y_continuous(expand = c(0, 0)) +
labs(x = "Period", title = "Posterior Period")data.frame(decay = as.numeric(exp(draws[, "logdecay"][[1]]))
) |>
ggplot(aes(decay)) +
scale_y_continuous(expand = c(0, 0)) +
geom_histogram() +
labs(x = "Decay", title = "Posterior Decay")