Species distribution modelling is a technique used by ecologists whereby survey data for a given species is used to predict presence and absence across a wider spatial area than the survyed data. An SDM achieves this by constructing a model that estimates the likelihood of presence at unsurveyed locations, based on similarity of conditions at locations of known occurrence (and perhaps of non-occurrence) for the species in question. A common application of this method is to predict species ranges with climate data as covariates.
We can use the Global Biodiversity Information Facility (GBIF) to gather presence records for the migratory hoverfly Volucella zonaria, also known as the hornet mimic hoverfly. In Great Britain it was only known from two specimens prior to 1940, so was regarded as rare. Since then, it has become increasingly widespread in many parts of the South and South East England, often in association with parks and gardens, where adults are usually seen visiting flowers. Elsewhere across England, only a few scattered records exist. Our research question within this vignette is to answer the question as to how V. zonaria is distributed across the UK, given the available observation data from the last twenty years.
First we load the libraries necessary for our study.
# Load libraries
library(tidyverse)
library(lubridate)
library(caret)
library(raster)
library(reshape2)
library(viridis)
# Clear environment and set working directory
rm(list = ls())
The observation records were retrieved from GBIF using the r API, rgbif
1, and the data cleanded and structured as follows.
# load the data from a query to GBIF, for V. Zonaria
gbif_response <- read_rds("~/Documents/GitHub/ComputationalEcology/data_analysis_files/clim_data/gbif_volucella_zonaria.rds")
# Create presence records across decades
zonaria_clean <- gbif_response$data %>%
# get decade of record from eventDate
mutate(decade = eventDate %>%
ymd_hms() %>%
round_date("10y") %>%
year() %>%
as.numeric()) %>%
rename(y = decimalLatitude,
x = decimalLongitude) %>%
# clean data using metadata filters
filter(
# only records with no issues
issues == "" &
# let's take data from 2000 to 2020
decade %in% c(2010,2020)) %>%
# retain only relevant variables
dplyr::select(x, y, decade,scientificName) %>% arrange(decade)
Now we need to grab some environmental data that we can use as covariates to help model the distribution of zonaria. For this vignette we will use the worldclim database as this is directly available via the raster
package. We can download 19 bioclimatic variables at 10’ resolution. They are coded as follows:
Let’s download bioclim data for the current year:
# Load UK background mask
bg <- raster("data_analysis_files/UK_bg.grd")
# Reproject to lat/lon
bg <- projectRaster(bg, crs= "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
# UK extent in lon/lat coordinates
ext_uk <- c(-12, 3, 48, 62)
# Get climate data for current date
filepath = "~/Documents/GitHub/ComputationalEcology/data_analysis_files/clim_data/"
bio_curr <- raster::getData('worldclim', var='bio', download=F, lon=-5, lat=55, res=5, path=filepath)
# Crop and reproject current climate
bio_curr <- crop(bio_curr, ext_uk)
bio_curr <- projectRaster(bio_curr, bg)
bio_curr <- resample(bio_curr, bg)
bio_curr <- mask(bio_curr, bg)
We can plot the current bioclim data - all 19 covariates:
# Function to generate individual plots
plot_bioclim <- function(df, name) {
ggplot(data = df, aes(x=x, y=y, fill = value)) +
geom_tile() +
coord_equal() +
scale_fill_viridis(name = name) +
theme(
plot.margin = rep(unit(0,"null"),4),
panel.spacing = unit(0,"null"),
legend.key.width = unit(0.01,"null")
) +
theme_bw()
}
# extract all bioclim values by xy
nested_bioclim_plots <- data.frame(rasterToPoints(bio_curr)) %>%
# Create single gouping variable by which to facet chart
melt(id = c("x","y")) %>%
# Nest data by bioclim variable
nest(-variable) %>%
# Create individual ggplots
mutate(plots = map2(.x = data,
.y = variable,
.f = plot_bioclim))
# Render plots
library(gridExtra)
margin = theme(plot.margin = rep(unit(0,"null"),4))
gridExtra::grid.arrange(grobs = nested_bioclim_plots$plots,
top="Current UK Bioclim variables ",
ncol = 2,
margin
)
Using adjacent species observation locations in building a model can cause problems with spatial autocorrelation. We can therefore thin the observations using the package spThin
, as shown below. We set a minimum spatial distance between observations of 10km. The plot below shows the retained spatially thinned V. zonaria observations.
# Load library
library(spThin)
# Generate thinned data
thinned_zonaria <- thin(
# Observtion data to thin
zonaria_clean,
# Longitude and Latitude data
lat.col = "y", long.col = "x",
# Column that reperesents species
spec.col = "scientificName",
# Minimum distance (in km) that records should be separated by
thin.par = 10,
# The process is random, so repeat three times
reps=10,
# We dont wish to see or write the output
locs.thinned.list.return = T,
write.files = F,
write.log.file = F)
## **********************************************
## Beginning Spatial Thinning.
## Script Started at: Sat Apr 18 15:06:01 2020
## lat.long.thin.count
## 90 91 92
## 2 6 2
## [1] "Maximum number of records after thinning: 92"
## [1] "Number of data.frames with max records: 2"
## [1] "No files written for this run."
# Select sample with largest number of thinned 0observations
sample_index <- which.max(sapply(thinned_zonaria,nrow))
thinned_zonaria <- data.frame(x = thinned_zonaria[[sample_index]]$Longitude,
y = thinned_zonaria[[sample_index]]$Latitude) %>%
mutate(obs = "thinned")
# Make dataframe of background raster
bg_df <- as(bg, "SpatialPixelsDataFrame") %>%
as.data.frame()
# Name the dataframe columns
colnames(bg_df) <- c("value", "x", "y")
# Generate a plot of observations a cross the UK
rbind(thinned_zonaria, zonaria_clean[1:2] %>% mutate(obs= "full")) %>%
ggplot() +
geom_raster(data=bg_df, aes(x=x, y=y)) +
geom_point(aes(x=x, y=y, colour = obs),
size = 0.5) +
theme_bw() +
facet_wrap(~obs) +
ggtitle("Species observations before and after spatial thinning")
For each bioclim variable, we want to extract its value for the location of each V. zonaria observation. We also need to create pesudo-absence data - this represents locations where no observations were made. The covariates then need to be joined to the location of each observation to give an overall dataset from which we can build our species distribution model for V. zonaria.
df_presence <- zonaria_clean[1:2] %>%
# Extract bioclim + lcm variables using long/lat
raster::extract(bio_curr, .) %>%
# Add extracted variables to long/lat
cbind(zonaria_clean) %>%
# Mark observation data as presence records
mutate(presence = 1) %>%
# Drop unecessary columns
dplyr::select(-decade, -scientificName)
# Function to pick raster cells at random from bioclim layers
raster_random_sample <- function(r_covars, n_samples, xy_coords) {
# Set seed for repeatable results
set.seed(12345)
raster::sampleRandom(
x = r_covars,
size = n_samples,
na.rm = TRUE,
xy = xy_coords)
}
# Sample background points (psedo absent?) from covariate raster
df_absence <- raster_random_sample(bio_curr, 100, T) %>%
as.data.frame() %>%
# Mark as background data
mutate(presence = 0)
# Create single dataframe for presence and absence data
df_zonaria <- rbind(df_presence, df_absence)
In order to avoid collinearity we should remove covariates with absolute correlation of 0.8 or higher. We can achieve this by calculating a correlation matrix, and then removing highly correlated covariates.
# Generate correlation matrix with covariates as correlates
cor_mat <- cor(df_zonaria %>% dplyr::select(-x,-y,-presence), method="spearman")
# Subset covariates that are highly correlated
highlyCorrelated <- findCorrelation(cor_mat, cutoff=0.8)
# Generate column names to remove
cols_to_remove <- names(df_zonaria[highlyCorrelated])
# Remove highly correlated variables from df
df_zonaria <- df_zonaria %>% dplyr::select(-cols_to_remove)
# What covarites are weekly correlated
head(df_zonaria)
## bio2 bio3 bio4 bio5 bio8 bio9 bio11 bio15 bio18 x y presence
## 1 64.48101 32.00000 4799.870 206.8308 75.69390 61.83748 39.77803 13.91587 162.2590 1.32073 52.59618 1
## 2 64.48101 32.00000 4799.870 206.8308 75.69390 61.83748 39.77803 13.91587 162.2590 1.32073 52.59618 1
## 3 61.46156 31.77255 4687.255 205.8329 80.00000 62.14391 43.53844 15.31099 168.0000 1.63029 52.76698 1
## 4 61.46156 31.77255 4687.255 205.8329 80.00000 62.14391 43.53844 15.31099 168.0000 1.63029 52.76698 1
## 5 72.06501 34.06501 4812.766 207.2257 78.24089 58.93499 34.00000 13.29069 147.0461 0.75289 52.01723 1
## 6 72.06501 34.06501 4812.766 207.2257 78.24089 58.93499 34.00000 13.29069 147.0461 0.75289 52.01723 1
Now we have joined our environmental and observation data together into one dataset, we can begin to build a species distribution model. First we need to create a categorical response variable as we want to model presence v absence in space:
# last pre-processing step
df_modelling <- df_zonaria %>%
# caret requires a factorial response variable for classification
mutate(presence = case_when(
presence == 1 ~ "presence",
presence == 0 ~ "absence") %>%
as.factor()) %>%
# drop all observations with NA variables
na.omit()
Before we undertake any machine learning, we need to split the data into training and test sets. We give 60% of the data over to training with the remainder used to test the model fit:
# for reproducibility
set.seed(12345)
# Create index for splitting
inTrain <- createDataPartition(y = df_modelling$presence, p = 0.6, list = FALSE)
# Create training dataset
training <- df_modelling[inTrain,]
# Create testing dataset
testing <- df_modelling[-inTrain,]
Now we have removed the most highly correalted variables, we have nine bio_clim
variables as covarites to fit a model. We’ll fit the model using a random forest. A Random Forest is a machine larning classifier that builds a number of decision trees. The idea is to build a forest of uncorrelated decisions, conditioned on our covariates. The resulting forest has a decision structure that can accurately classify the response data; in this case presence or absence at a location, given a set of bio_clim
covariates.
library(randomForest)
# Use five fold cross validation, repeated twice
tunecontrol <- trainControl(
# Use 5-fold cross validation
method = 'cv',
number = 5,
# Repeated cv 3 times
repeats = 3,
# Use a grid for hyper-parameters
search = "grid")
# Create tuning parameters as grid
tunegrid <- expand.grid(.mtry=c(3:9))
# for reproducibility
set.seed(12345)
# actual model build
model_fit <- train(
# Our formular for classifications
presence ~ .,
# Data to train the RF classifier
data = training,
# Train using a Random Forest
method = "rf",
# Return an accuracy metrix
metric = "Accuracy",
# Run the model using
tuneGrid = tunegrid,
trControl = tunecontrol,
# Normalise the covariate data
preProcess = c("center", "scale")
)
print(model_fit)
## Random Forest
##
## 348 samples
## 11 predictor
## 2 classes: 'absence', 'presence'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 279, 279, 278, 278, 278
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 3 0.9424845 0.7767934
## 4 0.9453830 0.7906962
## 5 0.9424845 0.7790826
## 6 0.9482402 0.7982826
## 7 0.9482402 0.8025696
## 8 0.9424845 0.7767934
## 9 0.9424845 0.7790826
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
plot(model_fit)
We can assess how good model predictions by generating a confusion matrix. This matrix shows a cross-tabulation of the observed and predicted classes, as well as a number of metrics that provide an inisght into how good the model predictions were compared to a reference data set. The reference dataset in our case was the reserved testing data.
# Make prediction
pred <- predict(model_fit, testing)
# Generate confusion matrix
confusionMatrix(
# Predicted classification results
data = pred,
# Reference data - use partioned test data
reference = testing$presence,
# Definition of positive result
positive = "presence")
## Confusion Matrix and Statistics
##
## Reference
## Prediction absence presence
## absence 26 7
## presence 14 184
##
## Accuracy : 0.9091
## 95% CI : (0.8644, 0.9428)
## No Information Rate : 0.8268
## P-Value [Acc > NIR] : 0.0002823
##
## Kappa : 0.6589
##
## Mcnemar's Test P-Value : 0.1904303
##
## Sensitivity : 0.9634
## Specificity : 0.6500
## Pos Pred Value : 0.9293
## Neg Pred Value : 0.7879
## Prevalence : 0.8268
## Detection Rate : 0.7965
## Detection Prevalence : 0.8571
## Balanced Accuracy : 0.8067
##
## 'Positive' Class : presence
##
Some definitions:
We can see the accuracy is around 90%. So the model is looking good.
Now we can now make a prediction for the current species distribution range for V. zonaria, using the selected bioclim covariate dat.
# Generate distribution data from raster to run model on
dist_dat <- data.frame(rasterToPoints(bio_curr)) %>%
# Select variables required for prediction
dplyr::select(head(names(training), -1)) %>%
na.omit()
# Set seed for repeatable results
set.seed(12345)
# Create a probability of presence/absence raster
r_pred <- rasterFromXYZ(
cbind(
# Coordinates to make prediction from
dist_dat %>% dplyr::select(x,y),
# Predict with RF species distribution model
predict(model_fit,
# Use covariates to run model against
dist_dat,
# Type of response variable
type='prob')))
# Create spatial dataframe for plotting
df_prob_presence <- as(r_pred[[2]], "SpatialPixelsDataFrame") %>% as.data.frame()
# Plot scenarios
df_prob_presence %>%
ggplot() +
geom_raster(aes(x=x, y=y, fill=presence)) +
coord_equal() +
scale_fill_viridis() +
theme_bw()
We see the range fo V.zonaira is largely confined to a line south of Liverpool and Hull. This suggests a very marked difference in the covariates used to fit the model, around this line. If we consider the top five bioclim variables used in the mode, we can therefore conclude that V. zonaria distribution must therefore be sensitive too:
As all of these covariates are related to temperature, it possibly explains the limt of the UK range of V. zonaria. V. zonaria migrates annually from the mediterranean; that is from south to north
Chamberlain S, Barve V, Mcglinn D, Oldoni D, Desmet P, Geffert L, Ram K (2020). rgbif: Interface to the Global Biodiversity Information Facility API. R package version 2.2.0, https://CRAN.R-project.org/package=rgbif.↩