Eﬀective use of evolutionary computation to parameterise an epidemiological model

. Predictive epidemiological models are able to be used most eﬀectively when they have ﬁrst been shown to ﬁt historical data. Finding the right parameters settings for a model is complex: the system is likely to be noisy, the data points may be sparse, and there may be many inter-related parameters. We apply computational intelligence and data mining techniques in novel ways to investigate this signiﬁcant problem. We construct an original computational model of human papilloma virus and cervical intraepithelial neoplasia with the ultimate aim of predicting the outcomes of varying control techniques (e.g. vaccination, screening, treatment, quarantine). Two computational intelligence techniques (genetic algorithms and particle swarm optimisation) are used over one-stage and two-stage optimisations for eight real-valued model parameters. Rigorous comparison over a variety of quantitative measures demonstrates the explorative nature of the genetic algorithm (useful in this parameter space to support the modeller). Correlations between parameters are drawn out that might otherwise be missed. Clustering highlights the uniformity of the best genetic algorithm results. Prediction of gender-neutral vaccination with the tuned model suggests elimination of the virus across vaccinated and cross-protected strains, supporting recent Scottish government policy. This preliminary study lays the foundation for more widespread use of computational intelligence techniques in epidemiological modelling.

small change to a parameter value or the model structure can be the difference between a large outbreak, endemic disease, or elimination of the disease. Computational intelligence approaches are highly suitable tools to support modellers to fit a model to historical data, as shown in our previous work [1], and provide a base for future predictions. We compare and contrast two such tools (genetic algorithm (GA) and particle swarm optimisation (PSO)) in a realistic case study of international importance. In addition, we extend our use of clustering techniques with process algebra [2] to investigate the quality of the results of the optimisation.
The Human Papillomavirus (HPV) is a sexually transmitted infection which affects the skin and moist membranes of humans. There are over 200 strains of this virus which are split into two subcategories known as oncogenic (high-risk) and non-oncogenic (low-risk). Oncogenic strains of this virus play a pivotal role in the development of various cell abnormalities including tumours [3]. Almost all cases of cervical cancer in women are attributed to high-risk HPV strains [4]. The World Health Organisation has made a global call to eliminate cervical cancer worldwide [5], driven through better understanding of HPV control measures.
This paper contributes to that understanding by presenting computational intelligence and data mining approaches to optimising a model of HPV and pre-cancerous stages. The model is fitted to Scottish data to demonstrate the techniques. According to Kavanagh et al. [6] 80% of cervical cancer cases in Scotland are attributed to HPV 16 & 18. Females aged 12-13 have received a bivalent vaccine against those strains since September 1st 2008 in Scotland. This programme has reduced HPV 16 & 18 prevalence from 30% to 4% in females [6]. The bivalent vaccine provides cross-protection against strains HPV 31, 33 & 45; however, that protection may reduce over time. Predictive modelling is used to investigate equal vaccination policies and the effect of waning cross-protective immunity.
As a starting point, we use a real-valued Genetic Algorithm (GA) as a general-purpose heuristic solver and Particle Swarm Optimisation (PSO) as an alternative solver that is useful for searching continuous parameter spaces. The model has eight parameters, which is a modest challenge for the chosen optimisation techniques but has produced some unexpected observations of the correlations between the parameters. The relationships between the optimal solutions within and between each approach are further explored through the use of data mining (clustering).
We use the optimised model to predict future outcomes of variants of the vaccination programme. We predict that a gender-neutral vaccination program can eradicate high-risk oncogenic strains of HPV even where immunity to those strains wanes. This work establishes a framework for applying computational intelligence to process algebra modelling, with effective and rigorous approaches to evaluating the resulting solution set.

Tools
The Evolving Process Algebra toolkit (EPA) [1] was developed with the goal of bridging the gap between evolutionary computation and the formal modelling domain, specifically using process algebra for modelling due to its parsimony and suitability for biological systems of interacting components. EPA is built on the ECJ Evolutionary Computation toolkit and the Bio-PEPA (deterministic or stochastic) simulation engine [7]. For this study, EPA was extended with a parameter sweep component and a particle swarm optimisation (PSO) module, complementing the existing GA approach to optimisation of numeric model parameters. EPA is able to optimise model structure and numeric parameters [1]. Here, the core aspects of the model are standard and only numeric parameters need to be optimised.

Model
The HPV model is a classic SEIR (Susceptible, Exposed, Infected, Recovered (Immune)) compartmental model [8] extended with binary gender and the potential precursor stages of cervical cancer (three stages of cervical intraepithelial neoplasia (CIN). Epidemiological studies [6,9] informed model development. Fig. 1 shows the compartments of the model and the available transitions between compartments. The Bio-PEPA code for the model is archived online [10].
There are over 200 strains of HPV. For this investigation our model groups together the two oncogenic strains HPV16 and HPV18 which are targeted by the bivalent vaccine. The cross-protected strains HPV 31, 33 and 45 (some level of protection is conferred by the bivalent vaccine) are identified as "other" in the model. Further strains of HPV have been omitted.
Vaccination has been added in a staged manner via a rolling vaccination programme for girls aged 12-13 and a targetted catch-up programme for older girls (14-18) (who may have already been exposed to HPV strains, so vaccination may be unsuccessful). The catch-up programme ran for 3 years; in 2014, the eligible routine age for vaccinations was amended to 11-12 year olds for logistical purposes. The model gradually increases both vaccine takeup and vaccine efficacy to capture these features. The timescale is one simulation tick per day, therefore rate constants are expressed as daily rates. The population is open, with births, deaths and immigration. The birth rate is chosen to balance up the death rate in the Scottish population (keeping the population constant for the duration of the study). Immigration (average number of imported infections per year) is given by the standard formula 0.02 √ N , where N is the total population [13]. For optimisation, the simulation time is seven years, matching the data of Kavanagh et al [6]. The vaccine is introduced after the first simulation year.
HPV is a sexually-transmitted infection. Couplings are simplified as being female-female, female-male, male-female, and male-male, in proportions reflecting sexual orientation demographics in Scotland  fixed, reflecting fluidity of orientation and behaviours. This simplification is a pragmatic modelling choice suiting the Markov chain semantics of Bio-PEPA. The two routes from susceptible to exposed in Fig. 1, e.g. fS nv to fExp reflect that the infection can come from a female or a male partner. HPV can be described as using frequency-dependent transmission: the number of infectious contacts an individual can make in a single day is proportional to SI/N , where S and I are the number of susceptible and infected individuals respectively and N is the total population. We assume an average of 3.5 days between exposure and becoming infectious [12].

Optimisation
The rates of progression and regression between the CIN stages is unknown, therefore these were targeted for optimisation together with the contact rate (Cr) and the rate of developing immunity to strains 16 & 18 (Immunity). Domain knowledge was used to set appropriate upper and lower bounds for these rates. The target data was extracted from Kavanagh et al. [6] (HPV, 7 years population sampling) and Pollock et al [9] (CIN, 5 years population sampling).
Four separate experiments were carried out: 1. two-stage optimisation where infection parameters Cr, Immunity were obtained by a parameter sweep followed by CIN parameters CIN1, CIN1r, CIN2, CIN2r CIN3, CIN3r obtained using a GA, 2. two-stage optimisation as 1. but using PSO, 3. one-stage optimisation of all eight unknown parameters using a GA, 4. one-stage optimisation as 3. using PSO.
Our goal is to explore the potential use of GA and PSO with algorithm parameters that allow for a balance of exploration versus exploitation and that could be completed on our local compute cluster in a reasonable time (e.g. 1 day). The GA used a population of 100 individuals over 100 generations, with tournament selection (5% tournament pool), 1% elitism, 1-point crossover (100%), 5% mutation rate, and generational replacement. The PSO used 100 epochs, 100 particles, 0.728844 inertia, 1.49 cognitive component and 1.49 social component. For this comparison standard parameter settings were used. Ideally we would adjust the GA/PSO parameters to make the search process more efficient. We have moderated for this by using long runs and repeating the runs. Given the number of independent valid solutions we generated that converged to similar fitness scores, we can be reasonably confident that our solution set is appropriate. Optimising the algorithm parameters first may have resulted in quicker convergence but this had to be balanced with the time taken to find these best-case algorithm parameters. If our solution set had been more diverse and had not been a good fit to the data then we would have refined our algorithm parameters. For both the GA and the PSO, inputs to be optimised were internally represented by real values as used by the HPV model. Prevalence data for HPV in Scotland [6] provides relevant target values at fixed points over time. For both optimisers, fitness was calculated as the Euclidean distance score between the data and the equivalent predicted model values. This was chosen as a straightforward standard linear measure for this data.

Parameter Optimisation
To illustrate optimisation results against historical data, the best match trace is shown in Figs. 2 and 3 (GA two-stage results). Mean, median and standard deviation for all eight parameters across twenty best-fit solutions are shown in Tab. 1, comparing GA and PSO results, and two-and one-stage processes.
Results are largely equivalent in terms of fit but there is considerable variation in some parameter values in these solutions. Distribution and correlation information (Figs. 4-7) draw out the model's susceptibility to variance in these parameters. Where there is a strong correlation, the model produces a consistent response and is sensitive to a parameter's value. Where there is little to no correlation, the model is insensitive to the parameter with respect to overall fitness.        In contrast, the PSO seems more sensitive to the Cr and Immunity parameters in the 2-stage process.

Clustering the Optimal Solutions
Several experiments were carried out using k-means clustering to further investigate the relationships between the produced optimal solutions. Clustering was applied to analyse the stability of the GA results only, and of the PSO results only. Finally, clustering was applied to GA and PSO results across the 1-stage or 2-stage process to demonstrate which technique produced more stable solutions. Values of k from 2-5 were tried. Euclidean distance was used as the distance function, with k=3 producing optimal clusters. Distribution of data points across clusters were monitored to provide insight into how k-means splits the data for each k and for each optimisation approach. Cluster sizes are more unequal for k=3. For the other values 2,4 and 5 the number of instances are more equally distributed across clusters and therefore do not provide any additional information.
Tab. 2 shows the results of using k-means clustering (k=3) on the combined GA and PSO results for 1-stage or 2-stage optimisation. Clustering across the approaches highlights the similarity of solutions produced within each method. All 20 GA results are in cluster 0 (1-stage optimisation). All 20 PSO instances are in cluster 0 (2-stage optimisation), highlighting sensitivity to Cr, Immunity.

Model Predictions
Having optimised the model to match data [6,9], predictions of vaccination policies can be made. For example, a gender-neutral policy of vaccination is clearly equitable. Fig. 8 shows the predicted results of the gender-neutral policy and how it produces elimination of the vaccine-specific virus strains, elimination of crossprotected strains, and subsequent reduction of CIN with concomitant decrease in cervical cancer. As gender-neutral vaccination was partly (due to  implemented in Scotland from 2019 these predictions will soon be able to be tested against observed data. With no vaccination, the model suggests half of the population will be infected with HPV strains 16 & 18, and 31, 33 & 45. Female-only vaccination, shown in Fig. 8 (a), reduces HPV 16 & 18 prevalence to 6% (female) and 15% (male) of the population. In contrast, Fig. 8 (c) shows these strains are virtually eliminated with equal vaccination. Fig. 8 (a, c) shows that strains HPV 31, 33 & 45 also reduce due to cross-protection from the bivalent vaccine. If that cross-protection wanes over time, e.g. Fig. 8 (b) shows 9.4 years protection, then strains 31, 33 & 45 become dominant in a female-only vaccination regime with 10% females and 23% males infected. Equal vaccination eliminates these strains, shown in Fig. 8 (d).

Conclusion
Two computational intelligence techniques were used to refine a hand-built model of HPV and subsequent CIN stages. We originally took a two-stage approach with two different optimisation algorithms to cross-compare results and parameter variance. A further one-stage 8-parameter optimisation run for the the GA and PSO approaches was performed at the end of the study to investigate if this would lead to similar results. Clustering was used to provide further insight into the results generated.
Comparing GA performance to PSO performance, we see, as expected, the GA is more explorative (slightly wider range of fit results in Tab. 1 and Figs. 4 and 5) and a good general-purpose heuristic. The one-stage process provides strong correlations between CIN progression and regression parameters. The GA pulled out an interesting and unexpected strong positive correlation between Immunity and CIN1r: the state fImmune can be reached either directly from fInf, or via CIN1. Clustering shows the similarity of top GA results for the one-stage process.
For PSO the solutions (Figs. 6 and 7) showed no clear correlations between parameters except Cr and Immunity in the one-stage optimisation. Variance was lower overall in PSO results between one-stage and two-stage experiments. Clustering shows that the top PSO results are similar for the two-stage process: PSO gives a more uniform solution set when additional constraints are applied.
In future work, a range of algorithm parameters could be explored to improve GA or PSO performance. GA performance could most obviously be improved by running for longer, and decreasing mutation as the number of generations increases. The current work sets a baseline for such experiments.
The refined model can now be used to make computational predictions from health care data sets relating to gender policies on the bivalent vaccine. Varying vaccine efficacy or uptake would determine potential critical vaccination thresholds to ensure long-term eradication of relevant cancers. To calibrate our model we have used data relating to HPV in Scotland; however, due to a worldwide effort to eliminate cervical cancer, several data sets are available to validate such models. Future work will develop our model to apply to a range of data sets, testing predictions across vaccination scenarios, allowing more robust statistical analysis of the results. This would provide a contrast with the results of Simms et al [14].
Much further predictive work is possible if the model is extended, for example, with additional HPV strains and different vaccines. In addition, there is evidence that HPV vaccination for males can reduce head and neck cancers. Given appropriate data, it would be of value to show the impact of vaccination and screening on both males and females with respect to a variety of cancers.