Heterogeneity in phenotype, disease progression and drug response in type 2 diabetes

Type 2 diabetes (T2D) is a complex chronic disease characterized by considerable phenotypic heterogeneity. In this study, we applied a reverse graph embedding method to routinely collected data from 23,137 Scottish patients with newly diagnosed diabetes to visualize this heterogeneity and used partitioned diabetes polygenic risk scores to gain insight into the underlying biological processes. Overlaying risk of progression to outcomes of insulin requirement, chronic kidney disease, referable diabetic retinopathy and major adverse cardiovascular events, we show how these risks differ by patient phenotype. For example, patients at risk of retinopathy are phenotypically different from those at risk of cardiovascular events. We replicated our findings in the UK Biobank and the ADOPT clinical trial, also showing that the pattern of diabetes drug monotherapy response differs for different drugs. Overall, our analysis highlights how, in a European population, underlying phenotypic variation drives T2D onset and affects subsequent diabetes outcomes and drug response, demonstrating the need to incorporate these factors into personalized treatment approaches for the management of T2D. A new analysis of patients newly diagnosed with diabetes uses a data dimenionality reduction approach to understand how phenotypic variation drives diseaese onset, clinical outcomes and responses to glycemic-lowering medications.

T 2D is a heterogenous progressive disease. The consensus from genetic studies of T2D is that development of T2D is influenced by multiple etiological processes 1 . Although some people may develop T2D because of extreme alteration in one or two molecular processes, in most, multiple pathways are involved, as described in the palette model of diabetes 1,2 . This considerable heterogeneity in the phenotype of T2D is largely ignored in individual patient management. In acknowledgement of this heterogeneity, attempts have been made to place patients into discrete phenotypic clusters [3][4][5][6][7] , which differ in terms of underlying genetic risk 8 and provide important insights into how phenotype and genetic etiology of T2D can vary. However, it can be argued that such approaches do not align with the molecular understanding of the disease architecture of T2D as a continuum of disease rather than discrete subtypes 1 . Furthermore, it has been established that allocating patients to subgroups reduces the power to predict the risk of complications or response to therapy compared to prediction based on continuous data 9 .
An alternative approach to evaluating variation in T2D phenotype is to consider genotypic variation 10,11 . T2D genetic risk variants are partitioned probabilistically into clusters based on common underlying etiological processes, where they are combined to yield process-specific partitioned polygenic scores (pPSs) 12 . One study used 94 established T2D genetic loci and 47 diabetes-related traits using Bayesian non-negative matrix factorization 13 and identified five pPSs. Two were related to beta-cell function but differed by proinsulin levels, and three were related to insulin resistance mediated through obesity, dyslipidemia and lipodystrophy 11 .
In recognition of the continuum of phenotypic and genetic heterogeneity seen in T2D, we applied a data dimensionality reduction approach using reverse graph embedding to routinely collected clinical data in patients close to the diagnosis of T2D. We aimed to reduce the complex phenotypic characteristics of T2D into a non-linear, two-dimensional tree structure, without allocating patients to subgroups, to visualize how diabetes outcomes and drug response vary across this phenotypic spectrum. We used pPSs to understand how the biological processes that contribute to diabetes risk differ between patients and associate with adverse outcomes. Our analysis highlights how underlying phenotypic variation drives T2D onset and affects subsequent diabetes outcomes and drug response, demonstrating the need to incorporate these factors into personalized treatment approaches for the management of T2D.

Results
We used electronic health record information of T2D cases from Tayside & Fife in Scotland 14 . To explore phenotypic heterogeneity, we used nine T2D-related phenotypes: HbA1c, body mass index (BMI), total cholesterol, high-density lipoprotein cholesterol (HDL-C), triglycerides, alanine aminotransferase (ALT), creatinine and systolic and diastolic blood pressure (SBP and DBP). The selection of the study sample is outlined in Extended Data Fig. 1a. In total, 23,137 patients were included in the analysis with complete data. The phenotypic characteristics of the study population at the T2D diagnosis are described in Supplementary Table 1.
To enable reduction of the phenotype data at diagnosis of T2D, we used the DDRTree algorithm 15,16 . DDRTree reduces Type 2 diabetes (T2D) is a complex chronic disease characterized by considerable phenotypic heterogeneity. In this study, we applied a reverse graph embedding method to routinely collected data from 23,137 Scottish patients with newly diagnosed diabetes to visualize this heterogeneity and used partitioned diabetes polygenic risk scores to gain insight into the underlying biological processes. Overlaying risk of progression to outcomes of insulin requirement, chronic kidney disease, referable diabetic retinopathy and major adverse cardiovascular events, we show how these risks differ by patient phenotype. For example, patients at risk of retinopathy are phenotypically different from those at risk of cardiovascular events. We replicated our findings in the UK Biobank and the ADOPT clinical trial, also showing that the pattern of diabetes drug monotherapy response differs for different drugs. Overall, our analysis highlights how, in a European population, underlying phenotypic variation drives T2D onset and affects subsequent diabetes outcomes and drug response, demonstrating the need to incorporate these factors into personalized treatment approaches for the management of T2D.
the multi-dimensional T2D phenotype data and projects into a low-dimensional space in the form of a tree structure with the distal ends of the tree branches having extreme phenotype and the tree center having mixed phenotype (Methods). The internal validation process resulted in an adjusted Rand index (ARI) 17 value of 0.79 (0.74-0.84) (median (interquartile range)), which is indicative of good stability of the algorithm. To enable external validation of the method and patient outcomes and to assess the contribution of genetic risk scores, we treated the tree derived from the Scottish data as a reference tree. We then developed a mapping function that maps individuals newly diagnosed with T2D to the reference tree using age of diagnosis, sex and the nine clinical characteristics that were used in the development of the reference tree. For external validation, we used UK Biobank (UKBB) primary care data 18 of patients with newly diagnosed T2D (n = 7,332; details of sample selection are provided in Extended Data Fig. 1b, and characteristics of the study population are provided in Supplementary Table 1) and the ADOPT (A Diabetes Outcome Progression Trial) randomized controlled trial of monotherapy with metformin, rosiglitazone and glibenclamide 19 (n = 4,150; details of sample selection are provided in Extended Data Fig. 1c, and characteristics of the study population are provided in Supplementary Table 1).
Phenotypic heterogeneity at T2D diagnosis. To visualize and characterize the phenotypic variation present across the tree, we overlaid the data for each of the nine separate clinical characteristics, as shown in Fig. 1a for the Scottish population. Other than the visual gradient, we used two metrics to assess the pattern of distribution of the phenotype across the tree: we regressed each phenotype against each of the two tree dimensions (Fig. 1b), and we plotted the global Moran's I-values for each phenotype representing the strength of spatial correlation across the tree (Fig. 1c). Based on all these measures, HDL-C, SBP and DBP were most strongly distributed across the tree, followed by total cholesterol and triglycerides and then HbA1c and BMI. There was minimal variation in creatinine or ALT across the tree. The individuals located in the left-upper part of the tree had elevated HDL-C levels; those in the right-upper tree had higher levels of blood pressure and cholesterol and moderate levels of hyperglycemia. The lower-right part of the tree contained patients who were obese, hyperglycemic, with high triglycerides levels and with low HDL-C. We saw no difference when we overlaid individuals whose blood pressure measures or lipid measures were untreated compared to individuals who were treated with statins and anti-hypertensives.
We allocated the individuals from the UKBB and the ADOPT study to the Scottish tree using the mapping function, which consists of two steps. In the first step, we used two models that predicted the x and y coordinates of the individuals in UKBB and ADOPT data. These models are described in Supplementary Table 2. We used the adjusted R 2 of the model as an indicator of model performance. The model used to predict Dimension 1 (x coordinate) had an adjusted R 2 of 0.76 and an adjusted R 2 of 0.86 for the model used to predict Dimension 2 (y coordinate). In the second step, using a distance estimating algorithm, we assigned the individual to reference tree. As expected, given that we mapped individuals to the Scottish reference tree, the phenotype distribution was the same for the UKBB participants (Extended Data Fig. 2) and the ADOPT clinical trial (Extended Data Fig. 3). However, patients recruited to the ADOPT trial did not map to the tree areas with particularly high HbA1c or BMI, as these patients were excluded from the trial.
A subset of Scottish individuals had measurements of C-peptide (n = 3,604), adiponectin (n = 965) and leptin (n = 742). Although these measurements were not made at diagnosis, when these data are overlaid on the tree (Extended Data Fig. 4a-d) they show that the adiponectin largely positively correlated with Dimension 2 (similar to HDL-C), whereas C-peptide and leptin were positively associated with Dimension 1 (similar to adiposity, hyperglycemia and dyslipidemia), with the highest C-peptide and leptin concentrations seen in the lower part of the tree. Fasting measures of beta-cell function and insulin resistance (HOMA B and HOMA IR) were available for the ADOPT trial at recruitment. When these data were overlayed on the tree (Extended Data Fig. 5), beta-cell function and insulin resistance showed a distribution opposite to HDL-C (with high HDL-C, lower HOMA B and lower HOMA IR in the top-left quadrant of the tree). Finally, for the Scottish individuals, we overlaid urinary microalbuminuria data on the tree (Extended Data Fig. 6) and show that this is slightly elevated at diagnosis of T2D in those individuals with high HbA1c and high blood pressure.
Heterogeneity of T2D outcomes. We then investigated how the phenotypic variation at diagnosis translates to variation in four diabetes outcomes: time to insulin requirement; time to severe or referable diabetic retinopathy (R3/R4) (DR); time to chronic kidney disease (CKD; estimated glomerular filtration rate (eGFR) <60); and time to major adverse cardiovascular event (MACE). For modeling, we used Cox proportional hazard models and competing risk (Fine-Gray) models 20 from diagnosis of T2D, with death as a competing risk and diabetes outcomes as events of interest. Using these models, we calculated the individual probabilities (event probabilities) of progression to each of these outcomes (over 5 years) for each patient using the coordinates of each individual in the two-dimensional space (that is, position in the tree). These probabilities for the discovery (Scottish, Tayside & Fife) cohort are overlaid on the tree in Fig. 2a-d. To assess the distribution of event probabilities, we report the sub-hazard ratio (sHR) from the competing risk model constructed with linear DDRTree dimensions (Fig. 2e); the strength of global spatial correlation Moran's I (Fig. 2f) showing strong significant spatial correlation (P < 0.001) of each outcome across the tree; and the local Moran's I (Extended Data Fig. 7) showing how strongly individuals within the tree are correlated with their neighbors for each outcome. The probabilities of insulin initiation, MACE and CKD had a very similar pattern across the tree, with the greatest risk observed in the most obese and dyslipidemic individuals (bottom right). By contrast, the risk of retinopathy showed a different pattern, with risk largely driven by the combination of increased blood pressure, hyperglycemia and dyslipidemia (top right). The use of a spline rather than a linear term of the DDRTree dimensions resulted in a similar global pattern but with a more marked gradient of event probability toward the periphery of the tree-ends (Extended Data Fig. 8). In addition, we constructed the competing risk models directly with age, sex, nine clinical characteristics (for example, HbA1c, BMI, HDL-C, etc.) and drug use as covariates and derived the event probabilities for each individual. These probabilities are overlaid on the tree (Extended Data Fig. 9a-f), with a similar pattern seen for those derived from models developed using the DDRTree dimensions. For validation, we derived MACE, CKD and insulin requirement outcomes for UKBB ( Fig. 3a-c) and CKD for ADOPT (Fig. 4a,e) and show a very similar risk distribution for these outcomes across the tree. Competing risk model summaries for each outcome in the Scottish data using the clinical characteristics as covariates are provided in Supplementary Table 3.
Heterogeneity in failure of anti-diabetic drug therapy. We then investigated how drug response varied with phenotypic variation.
Here, we focused on the ADOPT study, where patients were randomized soon after diagnosis to metformin, glibenclamide (sulphonylurea) or rosiglitazone (thiazolidinedione (TZD)). To assess the drug response, we developed three Cox proportional hazard models for each drug using ADOPT data, and we overlayed the predicted drug failure probability over the constructed tree. We show a difference in monotherapy failure for these drugs, with metformin and

Causal processes underlying phenotypic heterogeneity of T2D.
Having established a visual representation of the phenotypic variation at diagnosis and how this affects diabetes outcome and drug failure, we explored the heterogeneity in the genetic etiology of diabetes to provide insight into causal processes underlying the phenotypic heterogeneity represented by the tree. We constructed pPSs for beta-cell dysfunction (with high proinsulin), proinsulin (beta-cell dysfunction with low proinsulin secretion), BMI, lipodystrophy and liver and lipid, as described in Udler et al. 11 . We overlaid the pPSs over the tree structure where we had genetic data (Scottish, Tayside & Fife = 3,512 and UKBB = 7,145). Given the expected small genetic effect relative to the large variation in phenotype, there was no clear visible gradient of the pPSs across the tree. However, when regressing against the tree dimensions (Fig. 5a), Dimension 1 (x axis) was positively associated with obesity pPS and inversely associated with beta-cell, proinsulin and liver pPS, whereas Dimension 2 (y axis) was positively associated with beta-cell, proinsulin and liver pPS and inversely associated with lipodystrophy pPS. Figure 5b shows that the beta-cell pPS (Moran's I = 0.003, P = 0.004) and obesity pPS (Moran's I = 0.003, P = 0.009) were significantly spatially autocorrelated. As depicted in Fig. 5c (based on linear regression results), the lower right of the tree had higher genetic obesity, and the upper left of the tree had elevated genetic beta-cell dysfunction and increased diabetes risk mediated via liver and lipid-mediated insulin resistance. The bottom left was characterized by genetic lipodystrophy with lower genetic obesity; this is also the quadrant where TZDs were least durable.
Visualizing individual risk in relation to their phenotype. As a potential aid for assisting clinicians and their patients to demonstrate and visualize individual patient profiles at T2D diagnosis and how this indicates risks of disease progression and complications, we developed an app (https://atn-uod2018.shinyapps.io/Prediction_ diabetes_outcome_18082021/). Newly diagnosed patients can be placed in the diabetes continuum (tree structure) with their risk of microvascular and macrovascular complications predicted for a 5-year period, derived using age, sex and the nine other continuous phenotypic measures used to define the tree. Using these continuous measures, along with statin and anti-hypertensive drugs status and year of diagnosis, the estimated receiver operating characteristic area under the curve (ROC AUC) of the four outcome models (for time to insulin, referrable retinopathy, MACE and CKD) in internal validation ranged from 0.68 to 0.78, with external validation ROC AUC from 0.65 to 0.74 (Supplementary Table 4) with good calibration. The model performance using the DDRTree dimensions (rather than the nine clinical characteristics) with age and sex performed slightly less well than using all the continuous clinical data (ROC AUC 0.60-0.75); however, the model performance improved when the DDRTree dimensions were added to the continuous data with age and sex (ROC AUC 0.68-0.83), especially for CKD.
A graphical representation of the complete analysis is depicted in Extended Data Fig. 10.

Discussion
We applied a novel approach to data dimensionality reduction of health record data of 23,137 individuals with T2D with a complete set of clinically relevant phenotype data available at the time of diagnosis. The resulting tree structure provides a two-dimensional representation of the increasingly recognized complex phenotypic all predictions are from models with DDRTree dimensions. a, Predicted probability of insulin initiation (use of insulin for more than 6 months or a clinical requirement for insulin, indicated as two or more Hba1c readings of ≥8.5% more than 3 months apart while taking two or more oral anti-hyperglycemic agents) over a 5-year period from the diagnosis of T2D (n = 7,303). b, Probability of incident MaCEs (identified from SMR based on ICD-9 and ICD-10 codes) over a 5-year period (n = 6,258). c, Probability of incident CKD (eGFR ≤60 ml/min/1.73 m 2 on at least two readings that were 90 days apart) over a 5-year period. For all outcomes (a-c), probabilities were generated from a competing risk model constructed with DDRTree dimensions (n = 6,562). d, sHRs (95% CI) of DDRTree dimensions for each outcome from competing risk models. e, Spatial autocorrelation of three diabetes progression outcomes. Moran's I-statistic is shown on the x axis, with higher values representing phenotypes that are more strongly autocorrelated; all values were at P < 0.0001.
variation in patients with T2D and allows an overlay of clinically relevant phenotypes and complications. We validated the diabetes-related outcomes in two independent datasets. In addition, this analysis provides insights into heterogeneity in diabetes at the genetic level through the significant association of pPSs with the tree dimensions, which imply that some of the observed heterogeneity is mediated via causal etiological processes for T2D. We have demonstrated here how this approach enables a visual representation of how phenotypic variation translates to variation in drug response, glycemic deterioration and risk of microvascular and macrovascular disease. In this analysis, individuals with an elevated risk of glycemic deterioration, CKD and MACE were in the lower-right region of the phenotype tree (characterized by adiposity, dyslipidemia and high HbA1c), but, in contrast, those with a greater risk of referrable DR were in the right-upper region of the tree (characterized by high blood pressure with moderately elevated HbA1c). A similar distribution was observed when we used any retinopathy (including background retinopathy) as an endpoint. Consistent with these findings, the UK Prospective Diabetes Study (UKPDS) found that blood pressure lowering from 159 mmHg to 144 mmHg on average reduced the progression of retinopathy 21 , although this contrasts with the Action to Control Cardiovascular Risk in Diabetes (ACCORD) study where lowering blood pressure from 138 mmHg to 114 mmHg had little effect on retinopathy progression 22 . Scottish individuals in the top right of the tree had SBP of 160-180 mmHg, which is more in keeping with patients included in the UKPDS.
We showed that glycemic deterioration to insulin requirement is greatest toward the bottom right of the tree, an area where individuals are more insulin resistant (both genetically and measured), have higher HbA1c, are more obese and have low HDL but are not more genetically or phenotypically beta-cell deficient. Although some studies show that subtypes of patients with beta-cell deficiency progress rapidly to requiring insulin 3 , for patients with a clinical diagnosis of T2D included in our studies the results are consistent with work from the IMI-DIRECT consortium that shows that insulin resistance is the strongest determinant of glycemic deterioration (along with more weakly increased BMI and HbA1c and reduced HDL and beta-cell glucose sensitivity) 23 .
We also showed that anti-diabetic treatment response in ADOPT, assessed as time to monotherapy failure, differed across the phenotypic tree between those treated with metformin and sulphonylureas and those treated with TZDs. Individuals with greater failure of TZDs were located in the bottom left of the tree, an area associated with high lipodystrophy and low obesity pPSs. This might be explained by the mechanism of action of TZDs to increase subcutaneous fat mass. TZDs work well in obese women 24 where subcutaneous adipose tissue is prevalent and can readily expand, but they are a poor treatment for patients with monogenic lipodystrophy 25 who have reduced or absent subcutaneous fat, although we acknowledge that there are likely to be no such cases within our cohort.
Our approach differs in several ways from other studies that have derived subtypes of T2D 3,5,10,26,27 . First, the phenotypic data used in deriving the tree were adjusted for age and sex. We chose to do this as age, in particular, is a major driver of adverse outcomes, especially for risk of insulin initiation, MACEs and CKD. Thus, the overlay of MACE and CKD risk on the tree is not mediated by an underlying distribution of age or sex. Second, although the DDRTree method groups together patients who are similar within the tree branches, it is clear from the figures that the phenotype and the associated risks are distributed continuously along the whole tree. This highlights, based on the phenotypes included in the study, how any discrete binning into subgroups will lose information, as demonstrated by Dennis et al. 9 , and misrepresent the continuum that is T2D. An analysis of the Swedish National Diabetes Register with 114,231 individuals with newly diagnosed T2D cases also reported the use of prediction models with simple clinical phenotypes to outperform the models with cluster labels and failed to identify T2D clusters while using nine continuous phenotypes at diagnosis 28 . We are not advocating the use of DDRTree to improve prediction but, rather, to reduce a complex multi-dimensional disease into a simpler, understandable two-dimensional model that can be readily visualized and used to enhance the therapeutic process between clinicians and individual patients, to see how their personal T2D profile compares to others of similar age and sex. The web application enables the exploration of how lifestyle or pharmacological intervention to improve modifiable risk factors, such as blood pressure, lipids, BMI and HbA1c, might influence a patient's position in the tree and their subsequent risk of complications. Current clinical guidelines 29 for the management of T2D generally do not consider individual patient phenotype when considering what is the optimal treatment or what are the risks of progression to insulin or microvascular disease; however, here, we highlight how outcome and drug failure vary considerably across the complex phenotypic spectrum. Although the demonstration of the clinical value of such information will require validation in a prospective trial, our data support the concept that the management of individual patients with T2D should be informed by their specific phenotypic profile-for example, with respect to retinal screening intervals, monitoring of HbA1c and consideration of optimal diabetes treatment and patient behavioral choices. The incorporation of individual phenotypic variation into clinical practice has clear potential to make a substantial contribution to a precision approach to the management of T2D.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/ s41591-022-01790-7.   14 . We excluded individuals who were known to be GAD positive and those with an age of T2D diagnosis before 35 years. We used 11 phenotypes that are well-captured in routine clinical care and reflect a range of measures known to be altered in T2D and associated with variation in risk of adverse diabetes outcomes, which include age of diagnosis, sex, HbA1c, BMI, HDL-C, triglycerides, total cholesterol, ALT, creatinine and SBP and DBP at diagnosis. All covariates were measured within 1 year from the date of diagnosis, and, if multiple recordings were available, the measurement closest to the date of diagnosis was used. All phenotypes were normalized and residualized for age and sex before dimensionality reduction. For a subpopulation, C-peptide, adiponectin and leptin measurements were available; these measurements were taken at recruitment rather than at the diagnosis of diabetes. A subset had also consented for genetic analysis (genome-wide genotyping) as part of the GoDARTS study (1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015) 30 , and this was used for derivation of pPSs.
UKBB primary care data. The UKBB is a large prospective epidemiological resource with consented data of 500,000 individuals recruited from 2006 to 2010. The UKBB contains information from electronic health records as well as from interviews and questionnaires, and a proportion of individuals who consented were also genotyped. The UKBB resource was designed to improve the prevention, diagnosis and treatment of non-communicable diseases, including cardiovascular diseases, diabetes and cancer 18,31 . Longitudinal primary care data of around 45% of the UKBB cohort have been made available to facilitate disease progression-based analysis 32 .
We identified T2D cases from the longitudinal and cross-sectional data based on their diagnosis labels and anti-hyperglycemic prescriptions, and we collected genetic data for these T2D cases. The phenotype at diagnosis was defined as recorded values in the primary care data set 1 year before or after the diagnosis of T2D. The genetic data available on all UKBB participants were used to derive pPSs.
ADOPT trial data. ADOPT was a multi-center randomized controlled trial of rosiglitazone, glibenclamide (glyburide) and metformin in recently diagnosed cases with T2D 19 . This trial had a multi-ethnic study population with the majority (88.3%) being Caucasian. The trial follow-up period was 4 years (2002)(2003)(2004)(2005)(2006), and inclusion criteria for the trial were the age of T2D diagnosis between 30 and 75 years with fasting glucose levels of 7-13 mmol L −1 . During the follow-up period, HbA1c, change in beta-cell function and insulin sensitivity and other diabetes-related phenotypes were assessed at regular time points.
Statistical analysis. DDRTree. DDRTree is a dimensionality reduction approach with a reverse graph structure embedding algorithm and a mapping function. It has been used for the analysis of high-dimensional single-cell transcriptomics data to understand the role of gene regulation in cell fate 16,33 . We used the implementation of DDRTree from Monocle package from the Bioconductor repository 15,34,35 . In this analysis, data dimensionality reduction was done with DDRTree, which reduces multi-dimensional data and projects into a low-dimensional space in the form of a minimal spanning tree structure. A principal tree trunk is at the center of these data, with the algorithm identifying a branch clustering structure of the data points in the reduced dimension. Each resulting branch will include individuals with an increasingly similar and distinctive pattern of phenotypes extending to the tip, whereas individuals located more proximally in the principal tree trunk will have mixed characteristics. Although data are continuous across the tree, consistent with the palette model of T2D 1 , the patients with the most extreme distinct phenotypes are found at the distal end of each branch. The DDRTree method showed better accuracy and normalized mutual information compared to local linear embedding and principal component analysis with different datasets 16 .
We excluded outliers from data based on 5-s.d. values, and data were transformed with rank normalization. In a second step, we residualized each phenotype for age and sex using linear regression analysis. This ageand sex-residualized matrix of phenotypes was entered into the algorithm without any initial dimensionality reduction. The DDRTree algorithm reduces high-dimensional data (residualized nine phenotypes) and projects it into a two-dimensional space. Later, a smoothed graph structure is constructed from the reduced data, and then similar data points are positioned adjacently to obtain a tree structure with tree branches, and a group of individuals located in the tree trunk is considered as the center of the data.
To validate DDRTree, we estimated the ARI, which is a measure of agreement for positioning the individual in the same tree branch at different executions of the algorithm. The range of ARI varies between −1 and 1, and a value close to 1 shows perfect similarity in positioning one individual in each execution. We considered an ARI of values above 0.75 as an indication of the stability of the method. We ran the DDRTree algorithm on 10,000 randomly selected individuals' phenotype data and re-ran it for 500 times, excluding 10% of individuals randomly in each run. While estimating the ARI, we excluded the contribution from individuals located in the principal tree trunk or the center of the tree.
We used R version 3.5.2 for all data management and statistical analysis 36 and the 'monocle' package from Bioconductor for implementing the DDRTree algorithm.
Endpoint definitions. Insulin initiation: This was defined as sustained use of insulin for more than 6 months or a clinical requirement for insulin, indicated as two or more HbA1c readings of ≥8.5% more than 3 months apart while taking two or more oral anti-hyperglycaemic agents 37 .
DR: We used the Scottish Diabetic Retinopathy Grading Scheme, and a grade of R3 or R4 (severe or referrable DR) was considered as the incidence of DR 38 .
MACE: We identified MACEs from Scottish Morbidity Records (SMR) and the General Register Office (GRO) for the Scottish cohort and hospital episode statistics (for the UKBB) using ICD-10 and ICD-9 codes (Supplementary Table 5).
CKD: CKD was identified from the electronic health records based on an eGFR of ≤60 ml/min/1.73 m 2 on at least two readings that were 90 days apart 39 . eGFR was calculated from serum creatinine using the CKD-EPI equation 40 .
Drug failure: In the ADOPT trial, monotherapy failure was indicated by fasting plasma glucose (FPG) >180 mg dl −1 (>10 mmol L −1 ) after at least 6 weeks of initiation of treatment with a tolerated dose of anti-hyperglycemic drugs.
Modeling of risk of glycemic deterioration, microvascular and macrovascular complications and drug failure. We assessed the risk of diabetes progression using a competing risk model (Fine-Gray model) and derived sHRs with death as a competing event in the Scottish cohort and the UKBB 41 . For this purpose, we developed four competing risk models, one for each diabetes-related outcome: insulin initiation, referrable DR, MACE and CKD. In all competing risk models, death was considered as a 'competing event' that hinders the incidence of the 'event of interest' (insulin initiation or DR or MACE or CKD).
In all models, we considered T2D diagnosis as the baseline, and we excluded individuals who did not have the follow-up data or who had already experienced the event of interest (insulin initiation or DR or MACE or CKD) from the corresponding competing risk models. Individuals who did not develop the event of interest and died in the follow-up period of 10 years were considered to have experienced a 'competing event' , and those who moved out of the study area or reached the 10-year follow-up were considered as censored. We then constructed three Fine-Gray models for each outcome-the first one using the tree dimensions for each individual from DDRTree; the second one using DDRTree dimensions with spline function as covariates; and the third one using age, sex, nine clinical characteristics, year of T2D diagnosis and drug use as covariates in the model. To assess the time to CKD and drug failure in ADOPT trial data, we used the Cox proportional hazard method. In these Cox models, the baseline was the date of recruitment into the ADOPT study, and individuals were censored at the end of the trial period.
To obtain the individual probability for developing each diabetes outcome (for example, insulin initiation or DR or MACE or CKD) for each study participant at a 5-year follow-up period, we used the previously constructed competing risk models with dimensions from DDRTree as covariates. For example, to estimate the probability of incidence of CKD for a study participant, we used the CKD competing risk model (event of interest: CKD; competing event: death) constructed with DDRTree dimensions as covariates. Using this model and each individual's position in reduced space, we predicted event probability for that individual. Similarly, we calculated individual-level probability for the incidence of insulin initiation, MACE and DR using corresponding competing risk models. These event probabilities were overlaid on the tree diagram to visualize the heterogeneity in diabetes progression. A similar approach was followed in ADOPT data while using a Cox proportional hazard model for deriving event probabilities.
External validation of analysis. Development of a mapping function. To map individuals with newly diagnosed T2D to the Scottish tree, we constructed a 'mapping function' . This function uses the 11 diabetes-related phenotypes (age of T2D diagnosis, sex, HbA1c, BMI, HDL-C, total cholesterol, triglycerides, SBP, DBP, ALT and creatinine) and assigns each individual a position in the Scottish tree. The mapping function is built with two components: (1) two generalized additive models with smooth terms fitted with cubic regression splines that will predict the DDRTree dimensions (Dimension 1 and Dimension 2, the coordinates in the reduced two-dimensional space) with 11 phenotypes as covariates (Supplementary Table 2); and (2) a distance-estimating algorithm, which will estimate the Euclidean distance between two points in two-dimensional space. So, when all phenotypes are given for a newly diagnosed T2D, the trained spline model will predict Dimension 1 and Dimension 2 for the individual, which is a provisional position for the individual. In the second step, a distance-estimating algorithm was used to calculate the distance between the provisional position of the individual and all the positions in the reference tree. Then, the mapping function will re-assign a new position to the individual in the reference tree, which will have the shortest distance to the provisional point. In this way, we identify the individual (from 23,137 people) who is most similar to the new individual being mapped.
We used the UKBB primary care data for the external validation. By using age of diagnosis, sex and the other nine phenotypes, we mapped the individuals in the UKBB data to the reference tree. Later, we overlayed the nine phenotypes over this tree and assessed the distribution of phenotype across the tree and compared it with the reference tree. To assess diabetes progression, we used three endpoints available in UKBB data: (1) time to insulin initiation, (2) time to CKD and (3) time to MACE. We followed similar definitions for the diabetes outcomes as described earlier. We constructed competing risk models with death as a competing event and used Fine-Gray models as before. The predicted probability of each event was overlayed on the tree. Retinopathy was not included as an endpoint in UKBB as this was poorly recorded.
We applied the 'mapping function' for ADOPT trial participants and followed the same steps as for UKBB data. In addition to nine phenotypes, we assessed the distribution of beta-cell function and insulin resistance over the ADOPT tree. In ADOPT, we then assessed the time to CKD and time to metformin, sulphonylurea and TZD failure using Cox proportional hazard models and plotted the predicted event probabilities from these models over the tree. The duration of ADOPT meant that we were unable to include MACE and retinopathy endpoints.
The model performance of competing risk models developed in Tayside & Fife data was assessed with internal validation (insulin initiation, MACE, DR and CKD) by using bootstrap samples with replacement. External validation was completed by fitting these models (insulin initiation, MACE and CKD) in UKBB primary care data. The performance of these competing risk models was assessed at 5 years with the ROC AUC, which is a measure of discrimination, with higher values indicating better model discrimination and calibration plots, and which shows the accuracy of the predicted risk 42,43 .

Statistical evaluation of phenotypes or outcome distribution across the tree.
Although most phenotypes and outcomes could clearly be seen to vary across the tree, we quantified this variation in two ways. First, for phenotype distribution, we regressed each phenotype against the tree dimensions and plotted the regression coefficient and 95% confidence interval (CI). To assess distribution for each diabetes outcome probability, we used the sHRs from the competing risk model. From each competing risk model, we derived sHR with 95% CI, and this sHR indicated how the risk of diabetes outcomes (probability of insulin initiation/DR/ MACE/CKD) varied across the two-dimensional space (x axis and y axis). Second, we undertook spatial autocorrelation analysis using global and local Moran's I-statistics. For this, we considered each individual's location (with their coordinate in space) and assessed its relationship with the phenotypic or event probability values associated with these locations 44 . A positive value of Moran's I suggests that similar values (phenotype/pPS/event probabilities) are located together and close, whereas a negative value indicates that dissimilar values are located close. For the estimation of local Moran's I, we considered 1,000 neighbors, and individual-level Moran's I was calculated, and it was visualized by considering local Moran's I as a continuous variable.
Genetic analysis. Based on Udler et al., we constructed five pPSs in Scottish (GoDARTS) patients and UKBB patients as previously defined and labeled as beta-cell dysfunction, proinsulin, BMI, lipodystrophy and liver/lipid. We assessed the relationship between these five pPS and DDRTree dimensions (Dimension 1 and Dimension 2) with linear regression models in both GoDARTS and UKBB data. A significant positive relation with pPS as a dimension indicates that individuals further along the axis have greater genetic risk mediated by that pPS. In GoDARTS data, we also assessed the spatial distribution of five pPSs using the spatial autocorrelation method (Moran's I).
Development of the R Shiny app and prediction models for microvascular and macrovascular complications. As we described above, we use the 'mapping function' to find the position of a newly diagnosed T2D patient in the tree. To predict the risk of microvascular and macrovascular complications, we used the Fine-Gray models constructed with all continuous variables (age of diagnosis, sex, HbA1c, HDL-C, BMI, triglycerides, total cholesterol, ALT, creatinine, SBP and DBP and year of T2D diagnosis) for each outcome. For each model, we assessed the shape of the relationship between phenotype and outcome and undertook transformations as required. Similarly, the time-varying effects of the phenotype were assessed, and those with time-varying effects were included in the models. In addition to the phenotypes, we assessed the use of statins or anti-hypertensives at the time of diabetes diagnosis and used them in the model-building process. Then, we predicted the event probabilities for our event of interest at each of multiple time points from diagnosis through to 5 years after T2D diagnosis.
Ethics declarations. Data provision and linkage were carried out by the University of Dundee Health Informatics Centre (HIC), with analysis of anonymized data performed in an ISO27001 and Scottish Government-accredited secure safehaven. HIC standard operating procedures were reviewed and approved by the NHS East of Scotland Research Ethics Service, and consent for this study was obtained from the NHS Fife Caldicott Guardian. This population-level database contains longitudinal data, including demographic, clinical and biochemical, linked by a unique identifier (PROCHI) from the Scottish Care Information-Diabetes Collaboration (electronic repository with details of all patients with diabetes in Scotland). Data for the ADOPT trial were accessed through the Clinical Trial Data Transparency Portal under approval from GlaxoSmithKline (proposal 930). Data from the UKBB were analyzed under project approval 20405.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The data that support the findings of this study are from anonymized real-world medical records available through the Scottish Care Information-Diabetes Collaboration, Tayside & Fife, Scotland unit (https://www.sci-diabetes.scot. nhs.uk/). UKBB primary care data are not publicly available but are accessible for research on approval from the UKBB (https://www.ukbiobank.ac.uk/ enable-your-research). ADOPT trial data are not publicly available due to governance limitations but are available for research by approval from GlaxoSmithKline.

Code availability
All the R code that supports this analysis is specific to the Scottish Care Information-Diabetes Collaboration data and UKBB and ADOPT trial variables; thus, data fields are not made available. Codes used for implementing the DDRTree (version 0.1.5) algorithm are available publicly in the 'monocle' package in the Bioconductor repository (https://www.bioconductor.org/packages/release/bioc/ html/monocle.html). Fig. 2 | A visual representation of the phenotypic characteristics of (n = 7332) patients at diagnosis of T2D from uKBB. a mapping function was used to position individuals in UKBB (n = 7332) onto the Scottish (reference tree) tree. The phenotype values are overlaid on the tree structure to visualise the distribution of nine phenotypes (Hba1c, BMI, HDL-c, TC, TG, aLT, creatinine, and SBP and DBP) over the reduced tree structure. Each point in the figure represents one individual. The magenta colour of the point indicates a higher value of the phenotypic variable for that individual and the green colour indicates lower values. Fig. 4 | Distribution of C-peptide, adiponectin, and leptin over the tree structure. For a sub-population of Scottish patients, c-peptide (a), adiponectin (b) and leptin (c) measurements were available and are overlaid on the tree. The magenta color indicates higher values of the markers, and the green color indicates lower values (a-C). (d) shows the beta for the linear regression (with 95% CI) of C-peptide against the tree dimensions. (e) shows the beta for the linear regression (with 95% CI) of adiponectin against the tree dimensions. (f) shows the beta for the linear regression (with 95% CI) of leptin against the tree dimensions. C-peptide and leptin showed a significant positive correlation with dimension 1 and negative correlation with dimension 2 whilst adiponectin was significantly positively correlated with dimension 2 and negatively correlated with dimension 1. Fig. 8 | Visualizing the heterogeneity in diabetes progression in Scottish patients with T2DM. all predictions are from models with DDRTree dimensions fitted with a spline function. a. Predicted probability of insulin initiation (use of insulin for more than 6 months or a clinical requirement for insulin, indicated as two or more Hba1c reading > =8.5% more than three months apart while taking two or more oral antihyperglycaemic agents) at 5 year from the diagnosis of T2D. b. Probability of severe or referrable (R3/R4) diabetic retinopathy at 5-year period. c. Probability of incident major adverse cardiac events (identified from SMR based on ICD 9 and ICD 10 codes) at 5-year period. d. Probability of incident chronic kidney disease (eGFR < = 60 ml/min/1.73m2 on at least 2 readings which were 90 days apart) at 5-year period. all outcomes (a-D) probabilities were generated from a competing risk model constructed with the spline function of DDRTree dimensions. Fig. 9 | Pattern of distribution of event probabilities derived from competing risk models constructed with continuous variables as covariates. a. Predicted probability of insulin initiation (use of insulin for more than 6 months or a clinical requirement for insulin, indicated as two or more Hba1c reading > =8.5% more than three months apart while taking two or more oral antidiabetic agents) at 5-year period from the diagnosis of T2D (n = 22595). b. Probability of incident diabetic retinopathy (R3/R4) at 5-year period (n = 22759). c. Probability of incident major adverse cardiac events (identified from SMR and GRO based on ICD 9 and ICD 10 codes) at 5-year period (n = 18239). d. Probability of incident chronic kidney disease (eGFR < = 60 ml/min/1.73m2 on at least 2 readings which were 90 days apart) at 5-year period (n = 19956). For all outcomes (a-D) probabilities were generated from a competing risk model constructed with continuous variables (age of diagnosis, sex, Hba1c, BMI, HDL-C, TG, TC, aLT, BP, and Creatinine) and competing risk of death. e. Linear regression estimates (with 95% CI) between the DDRTree dimensions and the four diabetes outcomes probability f. Spatial autocorrelation diabetes outcome probability; The Moran's I statistic is shown on the X-axis, with higher values representing phenotypes that are more strongly autocorrelated; all values are p < 0.001.