Gin: genetic improvement research made easy

Genetic improvement (GI) is a young field of research on the cusp of transforming software development. GI uses search to improve existing software. Researchers have already shown that GI can improve human-written code, ranging from program repair to optimising run-time, from reducing energy-consumption to the transplantation of new functionality. Much remains to be done. The cost of re-implementing GI to investigate new approaches is hindering progress. Therefore, we present Gin, an extensible and modifiable toolbox for GI experimentation, with a novel combination of features. Instantiated in Java and targeting the Java ecosystem, Gin automatically transforms, builds, and tests Java projects. Out of the box, Gin supports automated test-generation and source code profiling. We show, through examples and a case study, how Gin facilitates experimentation and will speed innovation in GI.


INTRODUCTION
Genetic improvement (GI) is a young field of software engineering research that uses search to improve existing software. GI aims to improve both functional properties, notably bug fixing, and non-functional properties of software, such as runtime or energy consumption. The intersection of automated program repair (APR) and GI has had the greatest impact to date, from the release of the GI-based tool GenProg [27] to successful integration of APR into commercial development processes [19,20]. Non-functional improvement (NFI) is the branch of GI that, as its name suggests, improves non-functional properties without, in contrast to APR, needing an implicit specification or a user-provided test oracle, since it can use its input program as its functional oracle. NFI has also had significant industrial impact: BarraCUDA, a widely used sequence mapping tool, accepted GI-evolved patches in 2015 [24].
GI abounds with open problems. GI searches the space of program variants created by applying mutation operators. The richness of this space depends on the power and expressivity of the mutation operators; we have not yet identified mutation operators that simultaneously define a rich and dense search space. Given a set of operators, the GI space is usually vast and sparsely populated by variants that meet a specification or that a human might write. Efficiently traversing GI spaces under a resource bound remains an open problem. A key subproblem here is how to efficiently integrate testing program variants into the search.
Working to close these problems requires experiment-driven innovation; experimentation necessitates engineering, some novel, but much that is not. The time researchers currently take to build a GI substrate (either writing from scratch or finding, adapting, and binding together existing tools) involves reimplementing many wheels, like parsing and program transformation. This is because existing work relies on bespoke tools that are not designed for reuse or modification. For example, some tools require expertise in programming languages, such as Lisp [37], that many software engineering researchers do not often use. The lack of shared tooling is hampering GI research, especially into NFI; it hinders reproducibility and slows innovation.
The potential benefits of a shared tooling substrate for GI experimentation are enormous. We need look no further than the impact such tooling has had on other areas of computer science. The Evolutionary Computation Library ECJ [29] is a general-purpose, extensible framework for evolutionary computation (EC); anecdotally, its release facilitated experimentation and reproduction in EC [29]. SimpleScalar [9] is an open source set of tools for simulation of modern processor architectures. Prototyping processors in hardware is simply prohibitive for most academics; SimpleScalar is the simple and efficient testbed on which academic research in computer architecture rested for over a decade [17]. A more recent example is Google's TensorFlow [5], a library for numerical computation and large-scale machine learning. It has democratized machine learning, leading to an explosion of papers (see footnote 1) and, anecdotally [7], providing a key capability for some AI startups.
To reproduce this success and accelerate research in GI, we introduce Gin, an experimental substrate for GI. We have instantiated Gin in Java for the Java ecosystem. We chose Java to facilitate the application of GI to a prominent object-oriented language and because Java is a lingua franca for software engineers, so its adoption gives Gin a large set of potential contributors and users. Further, Java also allows Gin to leverage powerful off-the-shelf tooling, such as JavaParser [11], JUnit [12] and Surefire [13].
Gin is necessarily both extensible and modifiable because it must not constrain scientific inquiry into GI. Thus, Gin is a toolbox rather than a framework, since a framework is only extensible. GI aspires to automate code improvement tasks. This goal, coupled with GI's open problems, has a number of immediate consequences for Gin's design: Gin must build and scale to industrial code and it must smoothly and easily support adding new search strategies, sampling strategies, and mutation operators.
To smoothly run tests on program variants, Gin understands the two currently dominant Java build systems, Maven and Gradle; as outlined in Section 2.6, it builds such projects automatically, obviating shell commands. To scale, Gin utilises dynamic compilation, which recompiles only changed classes and their dependencies, and online classloading. These features allow it to modify, recompile, and execute large-scale systems within a single virtual machine (see Section 2.3).
Out of the box, Gin supports an array of program transformations and two representations of code: ASTs and token streams. Both representations are extremely flexible and support operations at multiple granularities (subtrees or grammatical units); in contrast with other approaches, Gin presents the raw representation, without filtering, to a mutation operator. These unfiltered representations free researchers to define custom operators (see Section 3.1), like one that considers comments. Gin's design carefully separates search from applying transformations and evaluating fitness, which involves building and testing a variant. As a consequence of this separation, one need only specify a sequence of transformations to define a new search strategy. Gin also provides a sampling feature that reports the test case results for the single application of any operator (Section 3.3). With the notable exception of De Souza et al. [14], which uses dynamic analysis to consider intermediate execution state, all GI work thus far assumes Boolean test cases in their fitness evaluation. Gin is the first to record the expected and actual results, allowing researchers to define more fine-grained fitness functions and smooth the search space landscape.
Footnote 1: Over 8000 citations in Google Scholar for the cited article.
In addition to its feature set for general GI, Gin is the first to support two innovative features for NFI: built-in profiling and automated test generation. A key to scaling GI is narrowing its search to code fragments. APR has successfully used fault localisation for this purpose. In NFI, the natural analog is profiling. Thus, profiling [39] is integral to Gin, freeing researchers to narrow the search for non-functional improvements to, for instance, the most time-consuming methods. GI usually relies on testing to measure the fitness of evolved software variants [32]. In NFI, one can always use the original program as a test oracle and use it to automate test case generation. Gin leverages this insight to be the first GI tool to incorporate an automated test case generation tool (EvoSuite [15]).
A university course on GI has already been delivered using the first release version of Gin, demonstrating Gin's ease of use and flexibility (Section 3.6). This paper makes two principal contributions: the design and architecture of Gin and its instantiation in Java for the Java ecosystem. Gin is open source and available online at http://github.com/gintool/gin/. Figure 1 presents a high-level overview of Gin's two main pipelines, and a UML class diagram of the core classes is given in Figure 2. Gin's core functionality is divided between the manipulation of source code and unit test execution. Tools that can be used independently of source editing and evaluation, such as test generation and profiling, are grouped together in the gin.util package and omitted from the class diagram.

ARCHITECTURE
The pipelines in Figure 1 give two example uses of Gin: preprocessing to identify code of interest within a project, and search space analysis. A complete profiling pipeline is provided by the gin.util.Profiler class, which will output 'hot methods' as suitable targets for improvement.
The analysis of GI search spaces is of increasing research interest [8,26,35] and Gin facilitates this process: the toolkit includes several examples of search space tools that sample and enumerate the space of program edits. Adding new edit types and reusing this code is straightforward. Gin will sample the patch space, running the specified tests against each patch and recording the result: whether the patch is valid, the result of compilation, the test output, run times, and error details. Test suites can be generated in any manner for use in Gin, provided they are in JUnit format. Most previous GI work only considered Boolean test case results during fitness evaluation; by recording more detailed test output, Gin supports the implementation of more fine-grained fitness functions.
A major use case of Gin is to apply GI to improve code: Gin deliberately delegates the design of search algorithms to the user, but a simple example of a local search algorithm is included. As the code examples in Section 3 show, it is straightforward to incorporate Gin features into other search algorithms and applications.

Patch-Edit Model and Representation
Following standard practice in GI [28], the basic representation used by Gin is a patch to be applied to the source code. Each patch is a list of edits, and each edit is the application of a single operator to the target source code. The original source code is loaded into a SourceFile object. There are two subclasses of SourceFile: SourceFileLine, focused on line-level edits, and SourceFileTree, focused on edits to the Abstract Syntax Tree (AST). Each line in the source file and each node in the AST is allocated a unique ID; these IDs are referenced by edits, simplifying the problem of resolving patches containing multiple edits to the same location(s). For example, if an edit applies to a particular ID, but that ID no longer exists due to a previous delete, the edit will gracefully degrade to a no-op.
SourceFile is immutable: any methods that modify the source return a modified copy rather than changing the internal state of the SourceFile. Thus a patch is a sequence of edits, each producing a new SourceFile, which simplifies the implementation of new edits: an edit must simply accept a SourceFile and return a new one with the edit applied.
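The patch-edit model described above can be illustrated with a minimal, self-contained sketch. It is not Gin's code: LineSource, LineEdit and LinePatch are hypothetical stand-ins, and the sketch assumes only what the text states, namely stable IDs, immutable sources, and edits that degrade to a no-op when their target ID has disappeared.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a line-based source file; not Gin's actual class.
final class LineSource {
    private final Map<Integer, String> lines; // line ID -> text, kept in source order

    private LineSource(Map<Integer, String> lines) {
        this.lines = Collections.unmodifiableMap(lines);
    }

    // IDs are allocated once, when the original source is loaded.
    static LineSource of(String... text) {
        Map<Integer, String> m = new LinkedHashMap<>();
        for (int i = 0; i < text.length; i++) {
            m.put(i + 1, text[i]);
        }
        return new LineSource(m);
    }

    // Returns a modified copy; an unknown ID leaves the source unchanged (no-op).
    LineSource delete(int id) {
        if (!lines.containsKey(id)) return this;
        Map<Integer, String> m = new LinkedHashMap<>(lines);
        m.remove(id);
        return new LineSource(m);
    }

    // Returns a modified copy; an unknown ID leaves the source unchanged (no-op).
    LineSource replace(int id, String text) {
        if (!lines.containsKey(id)) return this;
        Map<Integer, String> m = new LinkedHashMap<>(lines);
        m.put(id, text); // re-putting an existing key keeps its position
        return new LineSource(m);
    }

    String render() {
        return String.join("\n", lines.values());
    }
}

// An edit accepts a source and returns a new one with the edit applied.
interface LineEdit {
    LineSource apply(LineSource source);
}

// A patch is an ordered list of edits applied one after another.
final class LinePatch {
    private final List<LineEdit> edits;

    LinePatch(List<LineEdit> edits) {
        this.edits = new ArrayList<>(edits);
    }

    LineSource apply(LineSource original) {
        LineSource current = original;
        for (LineEdit edit : edits) {
            current = edit.apply(current);
        }
        return current;
    }
}
```

In this sketch, a patch that deletes line 2 and then replaces line 2 simply drops the line: the second edit finds no such ID and returns its input unchanged, while the original LineSource is never modified.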
Gin includes subclasses of the Edit class that implement line and node operators commonly used in the literature, and examples of more fine-grained operators that replicate operators commonly found in the mutation testing domain (see Section 2.2). Edits may be targeted to specific locations. For example, a copy may copy a statement or line from anywhere in the source, but limit its target location to locations of a certain type, or within a given method. SourceFile can be instantiated with a target method or methods, and will then provide a list of locations limited to those methods.
SourceFile provides methods for manipulating source:
• Accessors return lists of IDs corresponding to a given language construct, e.g. an if statement or all block statements.
• Getters return a copy of a line or AST node specified by an ID.
• Setters update the source file by deleting, inserting, or replacing at a specified location.
• Convenience methods perform common tasks, for example selection of a random statement.
SourceFile also provides methods to generate the modified Java source for compilation and execution.

Operator Sets
Gin currently implements four sets of Edit operators: (1) Line edits: Delete, Replace, Copy, Swap, Move. The first two represent canonical transformations from the GI program optimisation and program repair literature respectively. The line edits can be found in the work of Petke and Langdon, particularly in the GISMOE tool [24,25,33,34]. The statement edits were first used in the seminal GenProg [27] automated program repair tool, and the others are proposed in this paper.
Constrained edits limit the canonical transformations to compatibility within the Java grammar: for example, swapping a 'do statement' with another 'do statement'; the intuition behind such operators is that they are more likely to make replace and swap operations between related program sites, and are less likely to lead to program disruption. A more refined analogue to constrained edits can be found in ARJA [42], a Java APR tool, which limits replacements to program elements that are both structurally and type-compatible. The fourth type is similar to the micro-mutations of [18], and to numerous examples in the mutation testing literature (such as [30]). For example, binary operator replacement will consider replacing == with !=.
Providing implementations of all these operators within one toolkit simplifies experimental comparisons and analysis. Gin is designed so that adding new operators is simple: an example of one of the existing implementations is given in Section 3.

Dynamic Class Loading and Test Execution
Once source code has been edited, it must be evaluated. Gin invokes test cases using JUnit and provides the information needed to target functional objectives and run-time performance. It reports the wall clock and CPU time of test execution over multiple measures, and returns details of the unit test outcome: whether the test passed, the expected and actual results, and details of any encountered errors, such as exceptions.
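Because expected and actual values are reported alongside the pass/fail flag, a fitness function can grade failures instead of treating them as equally bad. The following is one possible sketch under stated assumptions: TestOutcome and its fields are hypothetical (they are not Gin's result class), values are assumed to arrive as strings, and the distance measure is an illustrative choice rather than anything prescribed by Gin.

```java
import java.util.List;

// Hypothetical record of one test outcome; class and field names are assumptions.
final class TestOutcome {
    final boolean passed;
    final String expected;
    final String actual;

    TestOutcome(boolean passed, String expected, String actual) {
        this.passed = passed;
        this.expected = expected;
        this.actual = actual;
    }
}

final class GradedFitness {
    // 0.0 for a passing test; for a failure, a value in (0, 1] that shrinks
    // as a numeric actual value approaches the expected one.
    static double error(TestOutcome t) {
        if (t.passed) return 0.0;
        try {
            double distance = Math.abs(Double.parseDouble(t.expected) - Double.parseDouble(t.actual));
            if (!(distance > 0.0) || Double.isInfinite(distance)) return 1.0;
            return distance / (distance + 1.0);
        } catch (NumberFormatException | NullPointerException e) {
            return 1.0; // non-numeric or missing values give no gradient
        }
    }

    // Lower is better; 0.0 means every test passed.
    static double total(List<TestOutcome> outcomes) {
        double sum = 0.0;
        for (TestOutcome t : outcomes) {
            sum += error(t);
        }
        return sum;
    }
}
```

With this measure, a variant that returns 9 where 10 is expected scores better than one that returns 7, which a Boolean pass/fail count could not distinguish.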
Compilation and test execution are performed entirely within memory to improve performance: there are no external command invocations and no JVMs are created. To achieve this, Gin uses a custom fork of the InMemoryCompilation project [40] to generate bytecode for the modified class, before loading the class in a custom ClassLoader that "overlays" the existing class hierarchy so that JUnit loads the modified class. This dynamic loading supports both individual source files and files contained within a larger project.
This complexity is hidden from the user, who instantiates and invokes the TestRunner with a patch, a reference to the original source file, and a list of unit tests. A collection of UnitTestResult objects is then returned indicating the outcome of the tests and the execution time. The existing utility classes for sampling and local search demonstrate how this can be done in practice with just a few lines of code (examples in Section 3).
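For readers unfamiliar with in-memory compilation, the general technique can be sketched with the standard javax.tools API. This is not Gin's implementation or its InMemoryCompilation fork: the class names below are hypothetical, and the loader delegates parent-first, so it only loads classes the parent does not already define. An "overlay" of existing classes, as described above, would additionally need child-first loading, which is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

// Illustrative sketch only; all class names here are hypothetical.
final class InMemoryCompileSketch {

    // Source code held in a String instead of a file on disk.
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    // Bytecode captured in a byte array instead of a .class file.
    static final class ByteCode extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        ByteCode(String className) {
            super(URI.create("mem:///" + className.replace('.', '/') + Kind.CLASS.extension), Kind.CLASS);
        }

        @Override
        public OutputStream openOutputStream() {
            return bytes;
        }
    }

    // Defines classes from the captured bytecode (parent-first delegation).
    static final class MemoryLoader extends ClassLoader {
        private final Map<String, ByteCode> classes;

        MemoryLoader(Map<String, ByteCode> classes, ClassLoader parent) {
            super(parent);
            this.classes = classes;
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            ByteCode compiled = classes.get(name);
            if (compiled == null) throw new ClassNotFoundException(name);
            byte[] b = compiled.bytes.toByteArray();
            return defineClass(name, b, 0, b.length);
        }
    }

    // Compiles one class from a String and loads it without touching the disk.
    static Class<?> compile(String className, String code) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) throw new IllegalStateException("a JDK (not just a JRE) is required");
        final Map<String, ByteCode> output = new HashMap<>();
        StandardJavaFileManager standard = compiler.getStandardFileManager(null, null, null);
        JavaFileManager manager = new ForwardingJavaFileManager<StandardJavaFileManager>(standard) {
            @Override
            public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location, String name,
                    JavaFileObject.Kind kind, FileObject sibling) {
                ByteCode compiled = new ByteCode(name);
                output.put(name, compiled);
                return compiled;
            }
        };
        boolean ok = compiler.getTask(null, manager, null, null, null,
                Collections.singletonList(new StringSource(className, code))).call();
        manager.close();
        if (!ok) throw new IllegalArgumentException("compilation failed for " + className);
        return new MemoryLoader(output, InMemoryCompileSketch.class.getClassLoader()).loadClass(className);
    }
}
```

A caller can pass a class name and its source text to compile() and then invoke methods on the returned class reflectively; no external process or second JVM is involved.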

Test Suite Generation
Test suites play a critical role in determining the outcome of GI [38,41]. By standardising on JUnit for testing, Gin can exploit the unit test suite provided with a project; such suites usually provide good coverage, and are used by developers to test realistic use-cases for the code. In addition, automated test suite generation is provided via integration with the EvoSuite [2] tool. In the case of NFI, this test generation can be used to produce an independent oracle.
In order to facilitate experimentation, we have preconfigured EvoSuite to produce deterministic results. Moreover, the implemented TestCaseGenerator works out-of-the-box for Maven projects, modifying the pom file automatically to add necessary dependencies and modifying the output directory for Maven's test task. Semi-automated test case generation is supported for Gradle.

Profiling
The search space for software transformation is vast [26], and restricting the subspace explored by any improvement or repair algorithm is therefore critical in reducing search run-time. One of the main innovations of the GenProg repair tool [27] was to use fault localisation to reduce the size of the search space. Similarly, Gin provides a profiling capability to identify those parts of the software most exercised by the project's unit tests; we make the assumption that the provided unit tests are representative of real-world use, or at least they exercise the code where improvement is to be targeted. As Gin accepts a JUnit test suite as input, it is straightforward for a developer to provide a test suite that can guide Gin's improvements. For example, if a particular part of a project is known to be problematic, a small test suite can be provided to Gin that includes tests extensively targeting the problematic code surface. If reducing execution time is the goal, this may simply require providing the existing performance tests that many projects include.
As detailed in Section 2.6, Gin will automatically integrate with popular Java build tools, and the profiler gin.util.Profiler uses this facility to invoke and profile a project's unit tests. First, Gin invokes the entire test suite via the build tool's API, and parses the test reports to produce a list of tests, their containing classes, parameters, and whether they passed or failed. It then profiles individual tests by invoking them via the build tool API whilst enabling CPU sampling via the hprof profiler. The hprof profiler is somewhat dated, but it is sufficient for most projects; it is included in the Java 8 SDK that Gin requires, and at run-time provides a sample of the call stack every 10 ms, which enables Gin to provide a list of frequently used methods. We use hprof as opposed to VisualVM and other alternatives due to its simplicity and batch operation: VisualVM is an interactive tool but Gin's profiling is automated; alternative profilers either are similarly interactive or not freely available.
The profiles are parsed by Gin and combined into a simple CSV file for use by researchers or later stages of Gin's pipeline; this component is standalone and can be used for projects outside of GI. For each method, a count giving the number of times the method is seen at the top of the call stack is provided, along with a list of all unit tests where the method was seen at the top of the call stack during profiling. In order to provide the list of calling unit tests, Gin profiles each test individually rather than the whole suite: this is a time-intensive process that can take many hours for very large projects, but need only be run a single time. A sample of a project's unit tests may be requested instead. The profiler is separate from the core of Gin and therefore easily bypassed by researchers who do not require it.
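The aggregation step described above (per-method counts plus the tests in which each method appeared at the top of the stack) can be sketched as follows. The input shape, the class name HotMethodTable, and the CSV column layout are assumptions made for illustration; they do not reproduce the format written by gin.util.Profiler.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative sketch; class name and CSV layout are hypothetical.
final class HotMethodTable {

    // Input: for each unit test, the method seen at the top of the stack in each sample.
    // Output: CSV rows, hottest method first (ties broken alphabetically).
    static List<String> toCsv(Map<String, List<String>> topFramesByTest) {
        final Map<String, Integer> counts = new HashMap<>();
        Map<String, TreeSet<String>> tests = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : topFramesByTest.entrySet()) {
            for (String method : entry.getValue()) {
                counts.merge(method, 1, Integer::sum);
                tests.computeIfAbsent(method, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        List<String> methods = new ArrayList<>(counts.keySet());
        methods.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        List<String> rows = new ArrayList<>();
        rows.add("method,count,tests");
        for (String method : methods) {
            // Fields are quoted because method signatures may contain commas.
            rows.add("\"" + method + "\"," + counts.get(method)
                    + ",\"" + String.join(";", tests.get(method)) + "\"");
        }
        return rows;
    }
}
```

Profiling per test is what makes the last column possible: a single suite-wide profile would yield the counts but not the mapping from methods to the tests that exercise them.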

Build Tool Integration
One of the goals of Gin is to enable systematic experimentation on real-world code; this requires the ability to compile, package and test a diverse set of large projects. Fortunately, the Java ecosystem has converged to a small number of build tools that support these requirements and provide functionality through APIs. In particular, the Gradle [3] and Maven [1] build tools are very popular and used by over 95% of developers responding to one recent survey [4]; Gradle is the default build tool of the Android ecosystem, and almost all the GitHub projects we have examined during empirical work with Gin use one of the two tools. This standardisation enables Gin to accept most Java projects without modification, and run tasks without resorting to simply invoking shell commands.
Despite their popularity, the documentation of the APIs for both Gradle and Maven is somewhat sparse, and requires a certain amount of experimentation and reverse engineering; most of what we learnt during the process has subsequently been captured in the gin.util.Project class, which can be used outside of Gin to examine and manipulate projects, lowering the overhead for other researchers. For example, the Project class will provide the classpath for a project, find a particular source file within a project's file hierarchy, provide a standard method signature for a given method, provide a list of project tests, or run a unit test given its name. The Project class is used by the Profiler and other parts of Gin to interrogate and manipulate a project, and thus support for a new build tool can be added by modifying just this class.
Gin can infer the necessary classpath and dependencies for running unit tests from a Maven or Gradle project, or these can be specified manually.

IN PRACTICE
We now demonstrate the simplicity and extensibility of Gin with code examples for common use-cases.

Implementing New Edits
Whilst Gin contains canonical edit operators from the literature and some novel operators, development of such operators remains an area of active research; implementation of new edit types in Gin is therefore made as simple as possible. Code for a ReplaceStatement is given in Listing 1. An edit must provide:
• a constructor returning a random instance of the edit; we use methods in SourceFile to select two random statement IDs. The boolean argument to getRandomStatementID specifies whether the ID should be within the target method;
• an apply() method to apply the edit on a given SourceFile.
Here, the method replaces the statement at destinationID with a clone of the statement at sourceID.
Listing 2 shows an implementation of the matched equivalent of a replace statement edit. This extends the existing ReplaceStatement edit, constraining the source statement to be of the same type as the destination statement.
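Since the listings are not reproduced here, the difference between an unconstrained and a matched replace can be illustrated with a small standalone sketch. It does not use Gin's classes: Stmt and MatchedReplaceSketch are hypothetical, statements carry a simple string "kind" instead of a real AST node type, and the fallback to a no-op when no same-kind statement exists is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical statement node: an ID, a grammatical kind, and its source text.
final class Stmt {
    final int id;
    final String kind;
    final String code;

    Stmt(int id, String kind, String code) {
        this.id = id;
        this.kind = kind;
        this.code = code;
    }
}

final class MatchedReplaceSketch {

    // Unconstrained choice: any statement may be the source.
    static Stmt pickAny(List<Stmt> stmts, Random rng) {
        return stmts.get(rng.nextInt(stmts.size()));
    }

    // Matched choice: the source must have the same kind as the destination.
    // Falls back to the destination itself (a no-op) when nothing else matches.
    static Stmt pickMatching(List<Stmt> stmts, Stmt destination, Random rng) {
        List<Stmt> candidates = new ArrayList<>();
        for (Stmt s : stmts) {
            if (s.id != destination.id && s.kind.equals(destination.kind)) {
                candidates.add(s);
            }
        }
        return candidates.isEmpty() ? destination : candidates.get(rng.nextInt(candidates.size()));
    }

    // Returns a new list in which the destination holds a clone of the source's code.
    static List<Stmt> replace(List<Stmt> stmts, int destinationId, Stmt source) {
        List<Stmt> out = new ArrayList<>();
        for (Stmt s : stmts) {
            out.add(s.id == destinationId ? new Stmt(s.id, source.kind, source.code) : s);
        }
        return out;
    }
}
```

The matched variant differs from the unconstrained one only in how the source is chosen, which mirrors the structure described above: the constrained edit extends the plain replace and narrows its random selection.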

A Simple Search Algorithm
A condensed version of the local search example provided in Gin is given in Listing 3. The search starts with a single-edit random patch and at each step a random edit is removed or a new randomly-generated edit is added. If the new patch offers an improvement, it is retained and the process repeated. When the search is invoked from the command line, additional arguments allow the user to specify more unit tests, a classpath, target methods, operators and so on. The search can also be invoked programmatically: a call to localSearch = new LocalSearch("examples/Example.java") will create the local search object for the specified target source file, and then Patch result = localSearch.search(); will run the search, returning a reference to a Patch object with the best patch found.
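The search loop just described can be sketched independently of Gin. In this hypothetical version a patch is a list of integer edit codes, the cost function is supplied by the caller (in Gin it would come from compiling and testing the patched source), and toyCost is an invented stand-in used only to make the sketch executable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Illustrative sketch of first-improvement local search over patches; not Gin's LocalSearch.
final class LocalSearchSketch {

    // A patch is modelled as a list of edit codes; lower cost is better.
    static List<Integer> search(List<Integer> start, int editKinds, int steps,
                                ToDoubleFunction<List<Integer>> cost, Random rng) {
        List<Integer> best = new ArrayList<>(start);
        double bestCost = cost.applyAsDouble(best);
        for (int step = 0; step < steps; step++) {
            List<Integer> neighbour = new ArrayList<>(best);
            if (!neighbour.isEmpty() && rng.nextBoolean()) {
                neighbour.remove(rng.nextInt(neighbour.size())); // drop a random edit
            } else {
                neighbour.add(rng.nextInt(editKinds)); // append a random edit
            }
            double neighbourCost = cost.applyAsDouble(neighbour);
            if (neighbourCost < bestCost) { // keep strict improvements only
                best = neighbour;
                bestCost = neighbourCost;
            }
        }
        return best;
    }

    // Toy cost (an assumption of this sketch): edit 7 "breaks the tests",
    // while edits 2 and 5 each reduce the simulated run time.
    static double toyCost(List<Integer> patch) {
        if (patch.contains(7)) return Double.POSITIVE_INFINITY;
        double time = 100.0;
        if (patch.contains(2)) time -= 30.0;
        if (patch.contains(5)) time -= 20.0;
        return time;
    }
}
```

Because only strict improvements are accepted, the returned patch is never worse than the starting patch under the supplied cost function, and a fixed seed makes a run repeatable.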
Extension of this search algorithm to a population-based evolutionary algorithm is simple. The only additions required are selection (which can use the existing time and unit test methods to rank solutions) and a concept of crossover, which can be performed at the Patch level, recombining different combinations of edits.
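One way to realise crossover at the patch level is a one-point recombination of two edit lists, sketched below. This is an assumed design, not something Gin prescribes: PatchCrossoverSketch is hypothetical, and edits are represented by an arbitrary element type E so that the sketch stays independent of any particular Edit class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative one-point crossover over two patches, each modelled as a list of edits.
final class PatchCrossoverSketch {

    // Child = the first cutA edits of parent a, followed by parent b's edits from cutB onwards.
    static <E> List<E> onePoint(List<E> a, List<E> b, int cutA, int cutB) {
        List<E> child = new ArrayList<>(a.subList(0, cutA));
        child.addAll(b.subList(cutB, b.size()));
        return child;
    }

    // Same operation with both cut points drawn uniformly at random.
    static <E> List<E> onePoint(List<E> a, List<E> b, Random rng) {
        return onePoint(a, b, rng.nextInt(a.size() + 1), rng.nextInt(b.size() + 1));
    }
}
```

Because edits refer to stable IDs rather than positions, recombined edits that target locations removed by an earlier edit would simply become no-ops, so the child remains a well-formed patch.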

Sampling and Enumeration
Essential to search space analysis is the ability to systematically generate variants of the original program code. Gin gives examples for sampling and enumerating the search space and writing results to a comma-separated file (e.g. Figure 3): the intention is that these can easily be modified or extended to suit experimental needs. The user only needs to provide method names and associated unit tests in a file, which could simply be the Profiler's output file. We provide a helper abstract Sampler class for sampling and enumerating edits, as well as three sub-classes: EmptyPatchTester will run all unit tests through Gin, and save results to a file. RandomSampler will make a number of random edits, test the resulting source, and return the result. DeleteSampler will enumerate all possible DeleteLine and DeleteStatement edits for a method, test the resulting source, and save results to a file. Results are written to profiler_output.csv.

Implementing an Enumerator
Consider an enumerator to exhaustively apply an edit at every possible location in a code region, perhaps to perform landscape analysis. Taking the example of DeleteEdit, Listing 4 gives the requisite source code. The code here accepts a single class example program, but could be extended to large projects with a few lines specifying the working directory, classpath etc. In the example, we specify an array of UnitTests to be applied to the modified code. We set a number of repeats for each test. We then create SourceFileTree and TestRunner objects to perform the analysis. We create an empty patch and test that as a baseline. Finally, we get a list of all statement IDs in the source, and enter a loop that creates and tests a DeleteStatement for each statement. The results are written to file by an auxiliary method.
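The shape of such an enumerator (baseline first, then one deletion per location) can be sketched without Gin's classes. Here the program is a plain list of statement strings, the test suite is a caller-supplied predicate, and the CSV columns are invented for illustration; in Gin these roles are played by SourceFileTree, TestRunner and the result-writing helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative delete enumerator; names and output format are hypothetical.
final class DeleteEnumeratorSketch {

    // Tests the unmodified program, then every single-statement deletion.
    static List<String> enumerate(List<String> statements, Predicate<List<String>> passesTests) {
        List<String> rows = new ArrayList<>();
        rows.add("deletedIndex,passed");
        rows.add("none," + passesTests.test(statements)); // empty patch as baseline
        for (int i = 0; i < statements.size(); i++) {
            List<String> variant = new ArrayList<>(statements);
            variant.remove(i); // delete the statement at position i
            rows.add(i + "," + passesTests.test(variant));
        }
        return rows;
    }
}
```

Rows whose deletion still passes the tests point at statements the test suite does not constrain, which is the kind of observation a landscape analysis would collect.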

Case Study: An Application in Teaching
The ease with which Gin can be deployed and modified has been demonstrated by its use in teaching. In 2017 and 2018 two of the authors used the first release version of Gin as a vehicle to teach concepts in GI to two moderately sized classes of students (26 and 51 students respectively) in a fourth year Search Based Software Engineering course. In each class a group assignment required students to: (1) Download, build and run Gin; (2) Run Gin using the LocalSearch method to improve the runtime of four example programs; (3) Write a qualitative and quantitative analysis describing the type and distributions of patches in the best-performing programs; (4) Extend Gin to minimise the length of the best patches; and (5) Apply Gin to their own benchmark program and analyse the results.
Figure 3: Example output from a sampling run, split into three rows to save space.
Listing 4: Implementing a delete enumerator. This is the complete code excepting some straightforward processing in writeResults() to write out the results to a CSV file.
Each group produced a report outlining the findings from steps 2-4. There were 12 group reports submitted for the first cohort and 13 group reports for the second. All groups were able to quickly deploy Gin, run the four benchmarks in step 2 and reliably produce better variants of the example programs. In step 3 students were required to modify the Gin implementation. Students used a variety of approaches, ranging from brute-force enumeration of edit subsets to greedy algorithms through to search heuristics such as A*.
Students were able to modify Gin with some ease, with some groups simply extending the local search example code while others went so far as to implement patch minimisation. The extended implementations were able to verify both the preservation of code structure and application performance.
In step 4 groups used a variety of benchmarks and showed an awareness of code features that were amenable to the set of GI operators used in this assignment. Students sought out examples that were amenable to optimisations such as invariant hoisting and removal of redundant code. Some submissions also demonstrated the effectiveness of Gin in improving program performance when potentially useful raw materials (such as redundant conditionals) are introduced into code. In summary, Gin serves well in an educational setting because it presents so few barriers to experimentation.

RELATED WORK
Genetic improvement tools can be divided into two categories: those that focus on the improvement of functional properties (FI), and those that focus on non-functional software properties (NFI). The tools in the first category mainly come from the field of automated program repair (APR). The canonical example of these is GenProg [27], the first GI tool scaling to large real-world instances.
Early work in APR focused on fixing C and C++ programs; only more recently have other languages, such as Java, been considered. For example, Martinez and Monperrus released ASTOR, a program repair library for Java that implements several program repair approaches [31]. It allows for the addition of new tools, but does not facilitate more fine-grained extension, such as the addition of a single mutation operator or search strategy.
Genetic improvement has also been used to add new features to software. Such work has also mostly focused on C code; for example, the FI tool used by Barr et al. [6] in their automated software transplantation work is available online.
Several NFI frameworks have been developed, though few have been made open source. One of the largest is by Langdon et al. [21-24], and focuses on runtime improvement of C and C++ programs. Depending on the particular variant of their framework, line-level or expression-level changes are possible. The locoGP [10] framework developed by Cody-Kenny et al. evolves entire Java ASTs and acts as an off-the-shelf optimisation tool, while allowing limited customisation via its fitness function.
Several attempts have been made to provide a more extensible set of tools for optimising software using GI. GrammaTech's software evolution library (SEL) [36] enables the programmatic modification and evaluation of extant software. Its API defines software objects using the Common Lisp Object System (CLOS) to provide a uniform interface, allowing it to manipulate many software artifacts, ranging from C source code and compiled assembler to limited support for Java. It also allows for addition of new mutation operators. However, modifying the framework presents a steep learning curve, particularly for those not familiar with Lisp. A Genetic Programming microframework, MicroGP, has been used as a basis for a language-independent GI framework, but the source code is no longer available. Perhaps the work most closely sharing the goals of Gin is the Python GI framework PyGGI [16].

CONCLUSION
GI is a maturing research topic, with multiple examples of real-world deployment and a growing diversity of methods. We believe shared tooling is essential to further advance in this area. As such, we have described Gin, a platform for GI experimentation in Java.
Gin offers great extensibility, yet remains simple to use. It integrates with the industry-standard Gradle and Maven build tools, allowing experimentation with real-world software projects. Gin also integrates with established tools such as EvoSuite to automatically generate unit tests, as well as Surefire and hprof for profiling. As a further contribution, we have captured the experience our team has gained in integrating with Gradle and Maven builds in the gin.util.Project class, which can also be used in isolation by researchers interested in other aspects of software experimentation.
We now call for participation: researchers are encouraged to download the tool from http://github.com/gintool/gin/, and experiment with the example programs we have included. Anyone working in GI is also encouraged to report bugs, raise feature requests, and contribute documentation, examples and additional features to the platform via our GitHub project.
Immediate plans for future development of the platform include more seamless automated integration of generated tests with Gradle; further edit operators, search methods and objective functions; more landscape sampling and enumeration tools, and additional use-case scenarios.
Data Access Statement. The source code of Gin can be obtained from https://github.com/gintool/gin.