Improving species distribution models for invasive non‐native species with biologically informed pseudo‐absence selection

We present a novel strategy for species distribution models (SDMs) aimed at predicting the potential distributions of range‐expanding invasive non‐native species (INNS). The strategy combines two established perspectives on defining the background region for sampling “pseudo‐absences” that have hitherto only been applied separately. These are the accessible area, which accounts for dispersal constraints, and the area outside the environmental range of the species and therefore assumed to be unsuitable for the species. We tested an approach to combine these by fitting SDMs using background samples (pseudo‐absences) from both types of background.


| INTRODUC TI ON
Human transport of species beyond their native ranges, leading to biological invasions, is an important driver of ecological change, impacting biodiversity and ecosystem function (Vilà et al., 2011).
Decision-making about the control and management of invasive non-native species (INNS) is often underpinned by scientific risk assessments, and species distribution models (SDM) are increasingly seen as a valuable tool for this (Jeschke & Strayer, 2008;Jiménez-Valverde et al., 2011;Václavík & Meentemeyer, 2009). The purpose of SDMs applied in this context is to generate risk maps that predict the potential distribution of an INNS as a function of climate and other environmental gradients . Specifically, these represent the relative likelihood of establishment should the species be introduced or disperse to each location in the modelled landscape (Elith, 2013). Risk maps can be used for prioritization of surveillance and management (Gormley et al., 2011;Peterson & Robins, 2003), to estimate the potential spread of emerging INNS in current and future climates (Branquart et al., 2016;Jiménez-Valverde et al., 2011) and to understand the biological and anthropogenic mechanisms governing invasions (Broennimann et al., 2007;Chapman, Haynes, Beal, Essl, & Bullock, 2014;Chapman, Scalone, Štefanić, & Bullock, 2017;Storkey, Stratonovitch, Chapman, Vidotto, & Semenov, 2014). Clearly, there is a need for robust and accessible SDM tools and methods to ensure the most accurate possible estimation of the potential distributions of INNS.
Species prioritized for risk assessment in one area have typically already established invasive non-native distributions in other parts of the world (Branquart et al., 2016;Roy et al., 2014;Tanner et al., 2017) necessitating global-scale models and the pooling of distribution data from native and already-invaded ranges (Broennimann & Guisan, 2008;Mainali et al., 2015). Unfortunately species' distributions are rarely documented comprehensively at the spatial resolutions of SDMs (Boakes et al., 2010). Therefore, global-scale models are typically developed using statistical algorithms that contrast the environmental conditions where the species is known to occur with those at "pseudo-absence" locations sampled from a background domain specified by the modeller.
Such SDMs are often referred to as presence-only models (Pearce & Boyce, 2006) but we use the term presence-background to differentiate them from "one-case" or true presence-only models that use only the species presences and not the background (Guillera- Arroita et al., 2015). We also differentiate the "pseudo-absence"based presence-background models that are the focus of this study from point process models for species distributions (Warton & Shepherd, 2010). Point process models generalise presencebackground models on a more formal statistical basis. However, to our knowledge they are not suitable for grid cell-resolution distribution data, have not been applied for global-scale modelling of INNS and are far less commonly used than well-known presencebackground models such as Maxent (Phillips, Dudík, Dudik, & Phillips, 2008) or the regression and machine learning approaches implemented through software platforms such as bioMod (Thuiller, Georges, Engler, & Breiner, 2016;Thuiller, Lafourcade, Engler, & Araújo, 2009).
One important issue when fitting presence-background models to INNS distribution data is that their global distributions are by definition in a non-equilibrium state and are structured by both the species' environmental tolerances and natural and anthropogenic dispersal constraints Elith, Kearney, & Phillips, 2010;Gallien, Münkemüller, Albert, Boulangeat, & Thuiller, 2010;Václavík & Meentemeyer, 2009). As a consequence, there are suitable but unoccupied regions in which climatic and environmental conditions would permit establishment by the species, but where invasion has not been realized through dispersal. If such regions are included in the background domain, then the model will conflate lack of presence of the species due to dispersal constraints with a lack of presence due to environmental unsuitability, potentially biasing the species-environment relationships and the prediction of potential distributions. Current approaches to reduce this bias emphasise restricting the background domain to an "accessible area" within dispersal range of the occurrences (Barve et al., 2011;Elith, 2013;Elith et al., 2010;Mainali et al., 2015). Although likely to lessen dispersal biases in presence-background models, we suggest that this may be overly restrictive for modelling aimed at risk mapping. If background samples are only drawn in close proximity to the occurrences then the range of environmental conditions used to train the model may be insufficient to fully characterise species-environment relationships, impeding the transfer of predictions into other regions (Fitzpatrick & Hargrove, 2009;Thuiller, Brotons, Araújo, & Lavorel, 2004).
Here, we propose a biologically informed approach to improve presence-background models for highly dispersal-limited species, such as those undergoing invasive range expansion. The goal is to exclude suitable but unoccupied regions while also maximising the range of environmental conditions used to train the model as well as prior biological knowledge about niche responses to environmental factors. The approach is based on combining two familiar types of background domain-an accessible background in proximity to  (Barve et al., 2011;Mainali et al., 2015) and an unsuitable background outside the environmental envelope of the species (Chefaoui & Lobo, 2007;Le Maitre, Thuiller, & Schonegevel, 2008;Thuiller et al., 2004). Those previous studies have tested both types of background in isolation, but the novel contributions of this study are to combine both types of background, and to emphasize the definition of the unsuitable background using biological knowledge of key limiting factors for the species, e.g. places that do not reach minimum growing temperatures or exceed maximum drought tolerance. By modelling the global distributions of five invasive non-native plants we demonstrate that this constrains the presence-background models to fit more biologically plausible response functions and increases the accuracy of distribution projections.

| Overview
Our aim was to compare global-scale presence-background SDMs for INNS developed using background domains defined as only the accessible region, only the unsuitable region, or through our proposed new approach of combining accessible and unsuitable background regions (Figures 1 and 2). Models were developed to predict the potential distributions of five plant species that are native to temperate and tropical east Asia, highly invasive in other parts of the world and have been prioritized for risk assessment as potentially emerging invasive non-native plant species in Europe (Branquart et al., 2016;Tanner et al., 2017). The species represent a range of life histories including an annual climbing vine (Humulus scandens), a perennial climbing fern (Lygodium japonicum), a perennial semi-woody forb (Lespedeza cuneata), a deciduous tree (Triadica sebifera) and an evergreen tree (Cinnamomum camphora). working groups conducting Pest Risk Analyses for the region. With these experts, we scrutinized occurrence records and removed any that appeared dubious, casual or cultivated (e.g. botanic gardens) or where the georeferencing was too imprecise (e.g. country or island centroids). The remaining records were gridded at a 0.25 × 0.25 degree resolution for global modelling and randomly partitioned into training and testing datasets comprising 80% and 20% of the grid cells, respectively. As a proxy for plant recording effort, the total number of vascular plant records (phylum Tracheophyta) per grid cell was also obtained from GBIF (see Appendix S1 in Supporting Information).

| Data for modelling
Three predictor variables, derived from WorldCliM 1.4 (Hijmans et al., 2005), were selected to represent basic constraints on plant distributions. These were mean temperature of the warmest quarter (Bio10, °C) reflecting the growing season thermal regime, mean minimum temperature of the coldest month (Bio6, °C) reflecting exposure to winter cold and the climatic moisture index (CMI, ratio of annual precipitation, Bio12, to potential evapotranspiration, then ln+1 transformed) reflecting drought stress. Potential evapotranspiration was estimated following Zomer, Trabucco, Bossio, and Verchot (2008).

| Definition of the background domains
Background samples (pseudo-absences) were drawn from two distinct regions-an accessible region and a region considered to be environmentally unsuitable for the species based on knowledge of its tolerances or requirements (Figures 1 and 2). Though both types of background represent established concepts within distribution modelling, to our knowledge, this is the first study to test whether modelling is improved by combining both types of background domain.
The accessible background attempts to cover only the region where the species has had opportunity to disperse and sample the environment (Barve et al., 2011;Mainali et al., 2015;Thuiller et al., 2004;VanDerWal, Shoo, Graham, & Williams, 2009). It has generally been defined as a zone around the occurrence data, which could be selected statistically or informed by dispersal abilities of the species (Elith, 2013;Senay, Worner, & Ikeda, 2013). For INNS, the size of the accessible region will generally be more limited in the invaded range than the native one, assuming stronger dispersal constraints associated with shorter residence time (Mainali et al., 2015). In our application, we defined the native accessible areas using a 400 km geodesic buffer around the minimum convex polygon bounding all native occurrences ( Figure 1a).
In the non-native region, we used a conservative 4-cell neighbourhood around each occurrence grid cell, equivalent to a ~30 km buffer ( Figure 1b). Though somewhat arbitrary, these buffer sizes are consistent with ones performing well in other presence-background SDM studies (Mainali et al., 2015;VanDerWal et al., 2009) and a sensitivity analysis showed model outputs were not strongly influenced by the choice of native buffer size (see Appendix S5).
The unsuitable background concept originates from existing ideas about sampling pseudo-absences only outside of the environmental envelope in which species' presences are found (Chefaoui & Lobo, 2007;Le Maitre et al., 2008;Senay et al., 2013;Thuiller et al., 2004). The rationale is to produce training datasets that maximise the distinctiveness of suitable environmental conditions from the background and therefore boost the model discrimination. However, it may also reduce model accuracy within the environmental and geographical range of the species (Acevedo, Jiménez-Valverde, Lobo, & Real, 2012). These previous studies simply screened out the ranges of all environmental variables at presence locations, or used preliminary modelling to determine unsuitable regions. However, in this study we instead used prior biological knowledge and expert opinion about the species' limiting factors to define the unsuitable conditions (Figures 1 and 2) in the expectation that this biological information would be captured in the fitted species-environment relationships. Appropriate rules to define unsuitability were determined in consultation with species experts participating in their EPPO expert working groups.
Their expert judgement informed us on the type of limit deemed to be most important for the species in different parts of its range (e.g. summer cold, drought), followed by identification of key thresholds from the literature and comparison with extreme values at the occurrence locations of the species (see Appendix S2).

| Sampling from the background domains
We obtained background samples from both the accessible region and from the unsuitable region outside of the accessible region for F I G U R E 2 Flow chart for implementing the biologically informed pseudo-absence selection for presencebackground modelling of invasive nonnative species

Combined background sample ('pseudo-absences')
Presence-background model fitting  (Figures 1 and 2). The effect was therefore to exclude potentially suitable but inaccessible regions from the combined background sample. For each of the five species in this study, ten replicate background samples were generated in order to reduce sampling variation (Barbet-Massin, Jiguet, Albert, & Thuiller, 2012).
Presence-background models were developed for each background sample and then their predictions were averaged.
The accessible region was sampled using target group sampling to reduce bias in the observed distribution due to spatial sampling effort variation (Phillips, 2009;Ranc et al., 2017). This involves weighting the background sampling by the recording density of a broader taxonomic group, which is assumed to represent recording bias for the focal species. In our modelling we used the GBIF record

| Ensemble presence-background modelling
For each species, presence-background models were developed using background samples from only the accessible area, only the unsuitable area or using the combined background samples from both the accessible and unsuitable areas. In all cases, model performance was evaluated by cross validation, using model predictions for 20% of the occurrences, accessible area and unsuitable area that were reserved from model fitting (the testing dataset).
Ensemble models were fitted using BIOMOD ('biomod2' R package v3.3-7) (Thuiller et al., 2009(Thuiller et al., , 2016 using seven statistical algorithms: generalised linear models (GLM) with linear and quadratic terms for each predictor, generalised additive models (GAM) with a maximum of four degrees of freedom per variable, multivariate adaptive regression splines (MARS), generalised boosting models (GBM), random forests (RF), artificial neural networks (ANN) and Maxent (Phillips et al., 2008). These were combined into an ensemble model by scaling their predictions with a binomial GLM and then averaging them weighted by predictive AUC scores within the training data (80:20% random split). AUC is commonly used for ensemble model weighting and is the BIOMOD default option (Thuiller et al., 2009(Thuiller et al., , 2016. Although AUC does not provide an objective measure of model performance for presence-only models (Lobo, 2008) it is informative about the relative discrimination abilities of different algorithms evaluated on the same data. It also provides a conservative model weighting scheme, since a perfect model (AUC = 1) will have only twice the weight of a random model (AUC = 0.5). Therefore, we ensured poorly performing algorithms did not disproportionately affect the weighted average by rejecting them from the ensemble.
Rejection was based on modified z-scores for their predictive AUC (Crosby, 1993) with algorithms with z < −1 being rejected.
The importance of each variable to model fitting was estimated through the BIOMOD default procedure (Thuiller et al., 2009(Thuiller et al., , 2016.  ground were equally or marginally more accurate than models using only the accessible background in four out of five species, and always performed better than models using only the unsuitable background (Table 1) able range of the species, and using the unsuitable background to identify conditions in which the species very rarely occurs. In some cases the models using combined accessible and unsuitable backgrounds yielded response curves that differed markedly from those of the accessible background models. This was most clearly seen in the responses of C. camphora and Lygodium japonicum to TA B L E 1 Cross-validated discrimination performance of ensemble model projections for the potential global distribution of five plant species developed using different background region specifications (A = accessible background, U = unsuitable background, A&U = combined accessible and unsuitable background). Discrimination performance is the cross-validated AUC (area under the receiveroperator curve) and its standard deviation in parentheses for model predictions on 20% of the occurrences, accessible background and unsuitable background that were reserved from model fitting. For presence-only data AUC is the probability that a species presence has a higher projected suitability than a background sample

| D ISCUSS I ON
Strategies for selecting background samples or pseudo-absences for presence-background SDMs have received a great deal of attention (e.g. Barbet-Massin et al., 2012;Barve et al., 2011;Chefaoui & Lobo, 2007;Thuiller et al., 2004;VanDerWal et al., 2009). The novel contribution of this study is to combine two different perspectives on defining the background region that have hitherto been considered separately. These perspectives are the accessible area (Barve et al., 2011) and the area outside the environmental range of the species, and therefore assumed to be unsuitable for the species (Thuiller et al., 2004). Previous work on modelling INNS has generally either emphasized the usefulness of the former for accommodating dispersal constraints (Mainali et al., 2015) or evaluated the latter as a way of boosting the discrimination between suitable and unsuitable habitat (Le Maitre et al., 2008). To our knowledge, the only previous attempt to jointly consider both perspectives did so in a more limited way than this study, by excluding parts of the accessible region that were outside the environmental range of the species (Senay et al., 2013). Here, we tested a new approach in which separate background samples were obtained from the accessible region, regardless of environmental values, and from an unsuitable region defined using prior biological knowledge. By modelling the global distributions of five invasive non-native plant species we conclude that the new strategy performed better for projection of regional and global potential distributions than when models were fitted with just the accessible region or just the unsuitable region.
This was evidenced by a consistent improvement in crossvalidated discrimination power when the modelling sampled from a background combining accessible and biologically informed unsuitable regions. This was most clearly seen in the global projections, where the combined background models always performed better than models using just the accessible or just the unsuitable background. For projections within the species' accessible range the combined background models gave consistently more accurate projections than models based only on the unsuitable background, and generally performed as well as or marginally better than models trained only on the accessible background. Our expectation was that the combined background modelling strategy would not improve discrimination within the range of a species over models trained on the accessible region. Indeed, previous studies have found that large geographical background domains increase the power of SDMs to model species' broad geographic ranges but decrease their representation of suitability gradients within the range (Thuiller et al., 2004;VanDerWal et al., 2009). Unlike previous studies, our approach may have resulted in marginally improved performance for both purposes because we explicitly tried to exclude "suitable-but-not-reached" locations from the larger background region by restricting it to locations considered environmentally unsuitable. As such, we suggest that biologically informed specification of a large modelling domain may reduce the trade-off between prediction of suitability gradients at large and small spatial scales. Further testing is required to determine whether a similar strategy would also benefit models for native as well as non-native SDMs, but in principle our new strategy should confer similar advantages.
The influence of the accessible and unsuitable backgrounds on species-environment relationships was clearly seen in the response curves and projections of the different models. The combination of unsuitable and accessible backgrounds had four clear effects, when compared to the models using only the accessible background. First, it "anchored" the curves by constraining the models to fit near-zero suitability where the climate variables exceeded the thresholds of the species, providing a more pronounced delineation of suitability gradients. Second, the response curves spanned a much wider range of environmental conditions than were found in the accessible background, which has previously been shown to be important for accurate spatial and temporal transfer of SDMs (Guevara, Gerstner, Kass, & Anderson, 2017). Sampling unsuitable conditions only from within the accessible part of the species range would therefore require a greater amount of model extrapolation than our strategy does. Third, the response curves were less complex or multi-modal than those from models using only the accessible background (see responses for high CMI), which is more consistent with niche theory (Austin, 2002). Fourth, the response curves generally reflected prior assumptions about environmental limitation of the species and as such were more consistent with ecological understanding of the species. For instance, combined background models for C. camphora, Lygodium japonicum and Triadica sebifera estimated a strong limitation by low moisture availability (CMI), precluding potential establishment in arid regions such as south west USA. These responses were not estimated by the model based only on the accessible background, but are consistent with empirical demonstrations of water stress reducing growth and survival of these species. For example, shoot growth of C. camphora is 30% lower at 40% field water capacity than at 80% (Zhao, Wang, Shen, Zhang, & Qiu, 2006), water restriction suppresses T. sebifera seedling growth by 30%-80% (Barrilleaux & Grace, 2000) and its seedlings wilt and die in arid western USA unless planted in moist micro-habitats such as river banks (Bower, Aslan, & Rejmánek, 2009). Similarly, combining the accessible and unsuitable backgrounds led to models that strongly limited suitability of Lespedeza cuneata by very cold winters, consistent with known frost sensitivity of the species especially in relation to late spring frosts (Gucker, 2010). The only case where the response curves did not always follow the rules defining unsuitability was for limitation by extremely high summer temperature. This may be because of a correlation between high summer and winter temperatures, the latter being limiting when high summer temperature was not. This suggests our approach may have sensitivity to collinearity in model predictors that requires further investigation (Dormann et al., 2012).
Nevertheless, the broader conclusion is that sampling from an unsuitable background, in addition to an accessible background, forces the statistical models to learn species-environment relationships that reflect the prior knowledge of the species' tolerances or niche requirements used to define the unsuitable domain. As such, our approach offers a simple way of incorporating prior biological knowledge into correlative SDMs, and as such can address the common criticism that they lack strong biological underpinning (Austin, 2002;Chapman et al., 2014;Dormann et al., 2011). While there are more sophisticated approaches available for doing this using Bayesian models in which prior estimates of niche parameters can be specified (Talluto et al., 2015), a major advantage of the approach developed here is that it is implemented by manipulating the input data to standard distribution model software such as bioMod (Thuiller et al., 2009) or Maxent (Phillips et al., 2008) and all regression and machine learning methods. As such it is simple to implement with techniques that most modellers are already familiar with and can quickly be applied in a standard way across species. This is especially useful when risk assessments are being performed across large numbers of INNS and require consistent judgements about establishment risk (Branquart et al., 2016;Tanner et al., 2017).
Sensitivity analyses suggested that our findings were not overly sensitive to the size of the accessible region, number of background samples or precise rules for determining unsuitable conditions (see Appendix S5). We recommend that similar sensitivity analyses are performed when applying our approach to other species, since previous studies have found these factors can strongly influence distribution model performance (Barbet-Massin et al., 2012;Barve et al., 2011). However, success of the modelling approach likely relies on careful selection of the appropriate environmental limits to define the unsuitable region in the modelling (Le Maitre et al., 2008). A strength of this study is that it was done in consultation with experts performing risk assessments for invasion of Europe by the species. These experts were able to provide guidance on the key limiting factors relevant for different parts of the invaded and native ranges of the species. Some of the species have been well-studied in their other invaded ranges and we were able to draw upon previous experimental studies that had determined tolerance thresholds for the species (see Appendix S2). Where this information was lacking, we used upper or lower bounds on the environmental values at the species presences to define thresholds for modelling. Even where empirical estimates of threshold values were available, we still recommend checking for consistency with environmental values at the distribution data, since speciesenvironment relationships are highly scale-dependent (Siefert et al., 2012) and species can occupy broadly unsuitable regions if suitable micro-habitats are available. Given the reliance on prior studies or expert judgement about species' limiting factors or tolerances, our methods are probably most suitable for relatively well known species and less applicable to species where knowledge of its environmental limits are lacking. However, regional risk assessments for emerging INNS generally prioritise species that behave invasively in other parts of the world (Branquart et al., 2016;Roy et al., 2014;Tanner et al., 2017) suggesting that our modelling approach might be widely applicable for species of concern.
Risk assessment is a critical tool in the management of emerging INNS and requires robust prediction of where is vulnerable to ongoing species establishment and spread Keller, Lodge, & Finnoff, 2007). This study shows that defining the model background to accommodate considerations of accessibility as well as prior biological knowledge of environmental unsuitability has the potential to improve global-scale presencebackground models for emerging INNS. The methods developed and tested here are fully implemented by manipulating the model input data, and as such they can be implemented simply using standard presence-background modelling software. Furthermore, they result in presence-background models that are more strongly underpinned by biological knowledge rather than being solely driven by distribution data, which are often incomplete and biased. As such, wider adoption of these approaches should improve global-scale modelling of INNS distributions, contributing to more accurate risk assessment and better management of their impacts.

This research was funded by European Union Life Programme
Preparatory project LIFE15 PRE/FR/000001. We thank the Expert Working Groups who performed EPPO Pest Risk Analyses for the five study species and provided us with data and species information to build our models.

DATA ACC E S S I B I L I T Y
A data file containing the 0.25 × 0.25 gridded data on climate, recording effort, species occurrence, accessibility and unsuitability is included in the Supporting Information.