Multivariate techniques enable a biochemical classification of children with autism spectrum disorder versus typically‐developing peers: A comparison and validation study

Abstract Autism spectrum disorder (ASD) is a developmental disorder that is currently diagnosed only through behavioral testing. Impaired folate-dependent one carbon metabolism (FOCM) and transsulfuration (TS) pathways have been implicated in ASD, and a recent study involving multivariate analysis based upon Fisher Discriminant Analysis returned very promising results for predicting an ASD diagnosis. This article takes another step toward the goal of developing a biochemical diagnostic for ASD by comparing five classification algorithms on existing data of FOCM/TS metabolites, and by validating the classification results with new data from an ASD cohort. The comparison results indicate a high sensitivity and specificity for the original data set and up to an 88% correct classification of the ASD cohort at an expected 5% misclassification rate for typically-developing controls. These results form the foundation for the development of a biochemical test for ASD which promises to aid diagnosis of ASD and provide biochemical understanding of the disease, applicable to at least a subset of the ASD population.

the United Kingdom, the average age of diagnosis is estimated to be 55 months with no evidence of decreasing. 6 Part of this discrepancy can be attributed to the difference in positions on national screening between the two countries: in 2007, the American Academy of Pediatrics in the U.S. called for general screening followed by a comprehensive evaluation for ASD by 24 months, 7,8 while the National Health Service in the U.K. advocates against universal screening. 9 Camarata 10 indicates that this difference in recommendations largely lies in the reliability of ASD diagnosis at 24 months 11,12 and that the stimulus behind the case for universal screening lies in early intervention. Numerous studies (e.g., Refs. 13-17) have related improved clinical outcomes to early intervention, providing motivation for diagnosing individuals as early as accurate diagnosis is possible.
The "spectrum" nature of ASD coupled with the rapid, highly variable development processes present early in life elevates the challenges of early diagnosis of ASD. Miller et al. 18 illustrate this point by comparing two hypothetical children: "An active, verbose child who speaks primarily in stereotyped phrases and is preoccupied with train schedules might be immediately recognized as autistic.
Likewise, a child who is nonverbal, does not respond to his name despite normal hearing, and who spins things repetitively might also be immediately recognized as autistic. Both children have underlying impairments in social communication and restricted interests, but the surface presentation is quite different." A wealth of psychometric tools available to healthcare professionals aid in diagnosing ASD; however, a biological signature of the disorder promises to lower the age of diagnosis without the challenges associated with a behavioral diagnosis of a developmental disorder.
Heterogeneity in typical development patterns limits the earliest age at which ASD is reliably diagnosed; prospective studies on social behaviors such as gaze to faces, shared smiles, and vocalizations to others found that these behaviors were not different at 6 months of age, but group differences began to appear at 12 months. 19 However, biomarker signatures of ASD, such as imaging of white matter tract organization 20 and EEG complexity, 21 have been observed as early as 6 months of age. Recent reports from NeuroPointDX suggest amino acid panels are predictive of ASD status in children aged 4-6 years 22 and 18-48 months, 23,24 and these promising results suggest extensions down to even younger participants to evaluate the earliest age at which these signatures appear. Such biomarker-based metrics have been shown to aid in the diagnosis of other disorders traditionally diagnosed solely by behavioral observations, such as major depressive disorder. 25
Biomarkers come with their own set of challenges before they reach clinical translation: less than 0.1% of cancer biomarkers reported in the literature ever enter clinical practice, 26 and scores of genome-wide association studies seeking predictive ASD biomarkers have yielded few significant findings, most of which are specific to individual studies. 27 Reconciling the "holy grail" potential of successful biomarkers with the poor predictive power among those reported in the literature calls into question the manner in which biomarkers are identified. A better framework is clearly needed for identifying predictive biomarkers that can accurately and differentially diagnose ASD.
Classic biomarker development measures a plethora of candidate biomarkers but evaluates this panel by a series of univariate tests that consider each measurement as independent from all others. Furthermore, the population-level hypothesis tests that are almost synonymous with this approach are ill-suited for quantifying the separation of two or more groups. 28 Multivariate biomarkers evaluated by how well they separate individuals (e.g., C-statistic, confusion matrix) have become increasingly popular since they can incorporate many pieces of information to arrive at a diagnosis. However, since they require more parameters than their univariate counterparts, special attention has to be paid to avoiding overfitting when investigating multivariate biomarkers.
The folate-dependent one-carbon metabolism (FOCM) and transsulfuration (TS) pathways comprise a promising source for a multivariate biomarker for ASD. These pathways incorporate both genetic and environmental factors linked to ASD liability. 29

| Training data
The training data used in this study have been published previously. 29 Briefly, data come from the Arkansas Children's Hospital Research Institute's autism IMAGE study; the detailed study design, inclusion/exclusion criteria, and demographic information have been published elsewhere. 30 Children between the ages of 3 and 10 years were recruited locally and enrolled to assess levels of oxidative stress. ASD was assessed by a diagnosis of "Autistic Disorder" as defined in the

| Validation data
The validation data are taken at baseline from three previously published studies investigating pharmaceutical interventions to normalize metabolic abnormalities of children with ASD 31 : (1) a combination of methylcobalamin and low dose folinic acid, 32,33 (2) high dose folinic acid, 34 and (3) sapropterin. 35 Given that these studies all focused on evaluating treatment strategies for ASD, all participants had a confirmed diagnosis of ASD. FOCM/TS metabolites were available for 154 (76% male) participants with ASD with a mean age of 8.8 years (range 2-17 years). These ages differ from those reported by Delhey et al. 34 because this study only required that measurements be available at baseline, rather than both at baseline and at the conclusion of the treatment phase. Furthermore, stratifying patients by age or gender did not reveal any differences in the univariate metabolite distributions.
The first two studies were approved by the IRB at the University of Arkansas for Medical Sciences and the third study was approved by the IRB at the University of Texas Health Science Center at Houston.
All parents gave written, signed consent and patients provided assent when appropriate.

| Metabolites
The metabolites under investigation are presented in Table 1, and additional details of these measurements and derivations are presented in Melnyk et al. 30 This is only a subset of the measurements investigated previously 29 because "% DNA methylation" and "8-OHG" were absent from the validation data set and were therefore removed from this study to ensure that a consistent set of metabolites is used for training and testing.

| Kernel density estimation
Kernel density estimation (KDE) is a nonparametric density estimation technique that overcomes many shortcomings of the common histogram, including discontinuities at bin boundaries, sensitivity to the choice of origin, and zero-valued estimates outside of a certain range. 36 In this work, all KDE procedures use Gaussian kernels. The probability density function (PDF) estimates provided by KDE are then used to evaluate both the C-statistic and the misclassification errors at specific one-sided thresholds on the p value for membership in the ASD class, in order to characterize the various statistical models described below.
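As an illustration only, the following Python sketch shows how a Gaussian-kernel density estimate and a KDE-based C-statistic might be computed for one-dimensional scores. The bandwidth rule (a normal-reference rule of thumb), the function names, and the toy scores are all assumptions made for this sketch; the source does not state how the bandwidth was chosen or how the C-statistic was numerically evaluated.

```python
import math

def bandwidth(xs):
    # Normal-reference rule of thumb. This choice is an assumption:
    # the source does not report which bandwidth selector was used.
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return 1.06 * sd * n ** (-0.2)

def kde_pdf(t, xs, h):
    # Gaussian-kernel density estimate at point t.
    scale = 1.0 / (len(xs) * h * math.sqrt(2.0 * math.pi))
    return scale * sum(math.exp(-0.5 * ((t - x) / h) ** 2) for x in xs)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def c_statistic(pos, neg):
    # Probability that a draw from the KDE of `pos` exceeds a draw from
    # the KDE of `neg`. For Gaussian kernels each pair of kernels
    # contributes Phi((p - q) / sqrt(h_pos^2 + h_neg^2)).
    s = math.sqrt(bandwidth(pos) ** 2 + bandwidth(neg) ** 2)
    total = sum(normal_cdf((p - q) / s) for p in pos for q in neg)
    return total / (len(pos) * len(neg))

# Hypothetical standardized scores, not data from the study.
td_scores = [-1.2, -0.4, 0.0, 0.3, 0.9]
asd_scores = [2.1, 2.8, 3.0, 3.5, 4.4]

print(kde_pdf(0.0, td_scores, bandwidth(td_scores)))
print(c_statistic(asd_scores, td_scores))
```

A C-statistic of 0.5 corresponds to two indistinguishable groups, while values close to 1 indicate nearly complete separation.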

| Statistical techniques
Multivariate classification for ASD diagnostic status was explored through classification and regression trees, principal component analysis, Fisher discriminant analysis, and logistic regression. The presented techniques can be extended to classification tasks with more than two classes; however, only binary classification (i.e., classification into two different groups) will be discussed below. A sample $x$ can belong to one of two classes, $P_1$ or $P_2$.

| Univariate classification
Perhaps the simplest way to develop a classifier for a diagnostic biomarker is to place a simple threshold on a single measurement. For multivariate data, the modeler would then evaluate each measurement independently and choose the measurement with the best discriminating power. In this work, single measurements are mean-centered and normalized to unit variance before estimating the PDFs of the ASD and TD groups.

| Classification and regression trees
When univariate techniques fall short, the modeler must turn to multivariate techniques (i.e., techniques that incorporate multiple features to determine the classification). One intuitive extension from the simple univariate classification scheme is to sequentially place thresholds on many variables in the data set. The most common application of this principle is through recursive partitioning via the classification and regression tree (CART) methodology. 37 Since sequential thresholds are placed on variables and multiple thresholds on the same variable are permitted, CART-based classifiers are generally nonlinear.
The tree-growing process begins with a node $s$ and a node impurity function $i(s)$. A proposed split of $s$ generates two daughter nodes $s_L$ and $s_R$ that contain proportions $p_L$ and $p_R$ of the samples in $s$. Defining the node impurity function $i(s)$ to be the conditional probability that a sample is in $P_1$, the change in impurity is given by
$$\Delta i(s) = i(s) - p_L \, i(s_L) - p_R \, i(s_R),$$
and the split with the greatest reduction in impurity over all variables and all thresholds is chosen. This procedure is repeated for each node until each node contains fewer samples than some minimum splitting threshold.
Each terminal node (i.e., a node with no daughter nodes) is associated with either the ASD or TD class. Next, the tree is pruned upwards by estimating the misclassification rate or risk $R(s)$ of the entire tree versus subtrees with one terminal node removed, regularized by the number of terminal nodes $T_s$ via parameter $\alpha$:
$$R_\alpha(s) = R(s) + \alpha \, T_s.$$
With $\alpha$ chosen via 10-fold cross-validation (see section "Avoiding Overfitting: Cross-validation"), the tree that minimizes $R_\alpha(s)$ is chosen.
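The split search at a single node can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the study: the Gini index is assumed as the impurity function, only one variable is searched, and the function names and toy data are hypothetical.

```python
def gini(labels):
    # Gini impurity of a list of 0/1 class labels (assumed impurity
    # function; the source's exact choice is not recoverable here).
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    # Search all midpoints between consecutive distinct values and return
    # (threshold, impurity_reduction) for the rule "value <= threshold".
    parent = gini(labels)
    n = len(values)
    best = (None, 0.0)
    candidates = sorted(set(values))
    for lo, hi in zip(candidates, candidates[1:]):
        thr = 0.5 * (lo + hi)
        left = [y for v, y in zip(values, labels) if v <= thr]
        right = [y for v, y in zip(values, labels) if v > thr]
        # Impurity drop: i(s) - p_L * i(s_L) - p_R * i(s_R)
        drop = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if drop > best[1]:
            best = (thr, drop)
    return best

# Hypothetical measurements for one variable; 1 = ASD, 0 = TD.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
threshold, reduction = best_split(values, labels)
print(threshold, reduction)
```

A full tree would repeat this search over every variable at every node and then prune using the cost-complexity criterion.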

| Principal component analysis
Rather than making many sequential decisions through CART methodology, the original data can be projected onto a line and a single binary threshold can be applied to the resulting univariate score.
Under the naïve assumption that the most favorable projection for separating the two classes coincides with the projection with maxi-

| Fisher discriminant analysis
Since the data contain class labels (i.e., ASD or TD) for each panel of measurements, a potentially better way to determine the projection direction is to use the class membership information directly to maximize the separation between the two classes of data. The FDA was conducted through routines developed in-house in MATLAB.
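For two classes, the Fisher direction has the well-known closed form $w = S_W^{-1}(m_1 - m_2)$, where $m_1$ and $m_2$ are the class means and $S_W$ is the pooled within-class scatter matrix. The Python sketch below illustrates this for two variables only; it is not the in-house MATLAB routine, and the function names and toy data are assumptions.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, m):
    # 2x2 scatter matrix: sum over samples of (x - m)(x - m)^T.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def fisher_direction(class1, class2):
    # w = Sw^{-1} (m1 - m2), using the explicit inverse of a 2x2 matrix.
    m1, m2 = mean_vec(class1), mean_vec(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sw = [[s1[a][b] + s2[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
            (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]

def project(w, rows):
    # Univariate score for each sample along direction w.
    return [w[0] * r[0] + w[1] * r[1] for r in rows]

# Hypothetical two-variable data: the classes overlap on each variable
# separately but separate cleanly along the Fisher direction.
class1 = [(2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
class2 = [(2.0, 0.1), (3.0, 0.9), (4.0, 2.2), (5.0, 2.8)]

w = fisher_direction(class1, class2)
scores1, scores2 = project(w, class1), project(w, class2)
print(w, min(scores1), max(scores2))
```

In this toy example neither variable alone separates the classes perfectly, yet the projected scores do, which is the motivation for using class labels when choosing the projection.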

| Logistic regression
Using a probabilistic approach, the conditional probability of membership in class $P_i$ given the data point $x$ is written as $p(P_i \mid x)$. The odds ratio that $P_1$ is the correct class is then
$$\frac{p(P_1 \mid x)}{p(P_2 \mid x)} = \frac{p(P_1 \mid x)}{1 - p(P_1 \mid x)}.$$
Logistic regression (LR) then assumes that the logarithm of this odds ratio can be modeled as a linear function of $x$,
$$\ln \frac{p(P_1 \mid x)}{1 - p(P_1 \mid x)} = w^T x + w_0,$$
where $w$ and $w_0$ are estimated through maximum likelihood estimation (MLE). Then, the probability distributions can be directly determined as
$$p(P_1 \mid x) = \frac{1}{1 + \exp\left(-(w^T x + w_0)\right)}, \qquad p(P_2 \mid x) = 1 - p(P_1 \mid x).$$
There are many theoretical and experimental studies comparing FDA and LR, resulting in the following outcomes: (a) LR performs better than FDA for non-normal data, 41 (b) LR requires more data to achieve the same asymptotic error rate as achieved by FDA, 42 though it is possible for LR to achieve its asymptotic error rate with less training data than FDA, 43 and (c) MLE of parameters in LR is unstable for separable data, requiring regularization approaches or alternatives to MLE. In practice, these algorithms can be compared on specific data sets to determine the best algorithm for each scenario. LR analysis was conducted using the "glm" function in R.
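To make the maximum likelihood step concrete, the following Python sketch fits the LR parameters by plain gradient ascent on the mean log-likelihood. This is an illustrative assumption, not the algorithm used by R's "glm" (which relies on iteratively reweighted least squares); the learning rate, iteration count, function names, and toy data are hypothetical. The toy classes deliberately overlap, since the MLE does not exist for perfectly separable data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, w0, x):
    # p(P_1 | x) under the fitted model.
    return sigmoid(w0 + sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(xs, ys, rate=0.1, steps=10000):
    # Gradient ascent on the mean log-likelihood; the gradient with respect
    # to w is the mean of (y - p) * x, and with respect to w0 the mean of (y - p).
    d, n = len(xs[0]), len(xs)
    w, w0 = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_0 = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = y - predict_proba(w, w0, x)
            grad_0 += err
            for j in range(d):
                grad_w[j] += err * x[j]
        w0 += rate * grad_0 / n
        w = [wj + rate * g / n for wj, g in zip(w, grad_w)]
    return w, w0

# Hypothetical one-variable data with overlapping classes (1 = ASD, 0 = TD).
xs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
ys = [0, 0, 1, 0, 1, 1]

w, w0 = fit_logistic(xs, ys)
print(w, w0, predict_proba(w, w0, [5.0]))
```

At the maximum likelihood solution the residuals $y - p(P_1 \mid x)$ sum to zero, which provides a simple convergence check.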

| Cross-validation
Since initial clinical investigations aiming to uncover diagnostic bio- | 159 groups available for model training. Then the model is validated on the group that is left out and the procedure is repeated such that every group is successively left out, so every sample is validated in a statistically independent manner. Here, a variation of k-fold cross-validation known as leave-one-out cross-validation is used such that k is equal to the number of samples, and this cross-validation strategy is used to estimate the model's predictive power within the training data set.
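The leave-one-out procedure can be sketched generically in Python as below. The `fit` and `predict` callables, the toy midpoint classifier, and the data are hypothetical placeholders for whichever model is being evaluated; the sketch assumes class 1 has the larger mean score.

```python
def leave_one_out(samples, labels, fit, predict):
    # Hold out each sample in turn, train on the rest, and return the
    # fraction of held-out samples that are classified correctly.
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += int(predict(model, samples[i]) == labels[i])
    return correct / len(samples)

def fit_midpoint(xs, ys):
    # Toy 1-D classifier: threshold midway between the two class means.
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return 0.5 * (sum(class0) / len(class0) + sum(class1) / len(class1))

def predict_midpoint(threshold, x):
    return 1 if x > threshold else 0

# Hypothetical univariate scores (1 = ASD, 0 = TD).
scores = [0.1, 0.4, 0.5, 0.9, 1.6, 1.8, 2.2, 2.5]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

accuracy = leave_one_out(scores, labels, fit_midpoint, predict_midpoint)
print(accuracy)
```

Because the held-out sample never influences the fitted model, the resulting accuracy estimates performance on unseen data rather than the quality of the fit.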

| Binary classification for projection-based methods
The PCA, FDA, and LR models all define a projection direction, but classification requires transforming the continuous-valued scores into a single ASD or TD classification. Throughout this work the threshold b is chosen to fix the estimated probability that a TD participant will be misclassified as ASD.
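One way such a threshold could be obtained is to invert the KDE-based cumulative distribution of the TD scores, sketched below in Python. This is an assumption about the procedure rather than a description of the authors' code: the bandwidth value, the function names, the convention that larger scores indicate ASD, and the toy scores are all hypothetical.

```python
import math

def kde_cdf(t, xs, h):
    # CDF of a Gaussian-kernel density estimate: the average of the
    # normal CDFs centered on each sample with standard deviation h.
    return sum(0.5 * (1.0 + math.erf((t - x) / (h * math.sqrt(2.0))))
               for x in xs) / len(xs)

def cutoff_for_rate(td_scores, h, td_error_rate=0.05, tol=1e-9):
    # Bisection for the score cutoff at which the estimated probability of a
    # TD score exceeding the cutoff equals td_error_rate.
    lo = min(td_scores) - 10.0 * h
    hi = max(td_scores) + 10.0 * h
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if 1.0 - kde_cdf(mid, td_scores, h) > td_error_rate:
            lo = mid  # too many TD scores above mid: move the cutoff up
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical TD scores and bandwidth, not values from the study.
td_scores = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.1, -0.7, 0.2]
h = 0.5

cutoff = cutoff_for_rate(td_scores, h)
print(cutoff, 1.0 - kde_cdf(cutoff, td_scores, h))
```

Participants whose score exceeds the cutoff would then be classified as ASD, with an estimated 5% of TD participants misclassified by construction.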

| RESULTS
The different classification methods are first compared with regard to their performance on the training data. Variable selection is used in some cases to determine final FDA and LR models. Once these final models are identified, the models with the highest classification accuracy are evaluated on the validation data.

| Performance of projection methods on the training data
Next, multiple measurements were combined in linear, projection- Using the group membership in developing a linear, multivariate classifier through FDA or LR promises to further enhance the separation and these methods provide a more solid statistical background for choosing the projection direction than using the direction obtained from PCA for classification. Using all variables, both the FDA and LR models achieve a fitted C-statistic of > 0.99 (Table 1; Figure 1c,d).
These results suggest that including the group membership in determining the separation direction improves the classification performance, as expected, which is also reflected by the low misclassification numbers of 7/159 and 13/159 at a threshold of b = 0.05 for the FDA-all and LR-all models, respectively.

| Classification trees
Instead of investigating a single binary threshold on single variables or scores, multiple thresholds on multiple variables were investigated

| Variable selection for the FDA and LR models
Since previous analysis and the CART results suggest that only a subset of variables is needed to effectively classify the participants into ASD and TD cohorts, variable selection was performed for both the FDA and LR models. All variable combinations were evaluated by their fitted C-statistic. The C-statistic began to saturate at five variables for the  including too many variables can lead to overfitting of these models to the training data. Note that the validation data, "VAL," only consists of data from a set of children with an ASD diagnosis and therefore significant overlap between VAL and ASD is expected and desired we did not find advantages of using nonlinear classification techniques.

| Validation performance
Aside from comparing the algorithms, it is equally important to discuss which metabolites were identified as contributing the most to the predictive performance. Univariate analysis identified $x_{22}$, "% oxidized glutathione," as the most important variable. This variable also appears in the FDA-sub, LR-sub, and CART models, further highlighting the importance of the contribution of "% oxidized glutathione" even when multivariate analysis is used. In addition to this variable, the subsets chosen for FDA-sub, LR-sub, and CART also have three other variables in common: $x_4$: "SAM/SAH," $x_8$: "Glu-Cys," and $x_{21}$: "fCystine/fCysteine." Furthermore, "% oxidized glutathione" highlights the importance of oxidative stress for classification, while the "SAM/SAH" ratio is directly linked to DNA methylation and epigenetic components. As suggested previously, 29,44 it is important to include variables that account for both FOCM (DNA methylation) and TS (oxidative stress) pathways in separating ASD from TD cohorts, and this is what the performed analysis returned regardless of which technique was used.
Although the results of this study are promising, there are several limitations that should be considered in future studies.
1. Including the same variables in the training and validation data.
Previous analyses found that "% DNA methylation" and "8-OHG" were two of the most important variables for separation, 29,44 but these data were not present in the validation set. Future studies should include these variables to allow for the highest possible classification accuracy as these classifiers are considered for clinical translation into a diagnostic test.
2. Including both ASD and TD populations. The validation set included in this study provides a first attempt to validate previous findings with a new data set of similar size, but it only includes ASD participants. While it would be preferable to have a validation set that includes measurements from ASD and TD cohorts, as compared to only from an ASD cohort, these data do not currently exist aside from the one which was used for training here. Future studies should collect additional data and evaluate both ASD and TD populations to confirm separation of these two groups. Finally, the slight improvements in classification obtained by the FDA-sub classifier in comparison with the other methods tested herein should be reaffirmed after evaluating these classifiers on additional TD data.
3. Analyzing younger participants. The training and validation sets comprise cohorts of 3-10 years and 2-17 years, respectively.
However, a young cohort comprised mainly of participants younger than about 3 years would provide more compelling evidence toward using classifiers such as these to aid in the diagnosis of ASD.
Successfully addressing these limitations would help to solidify the ability of multivariate statistical tools based on FOCM/TS measurements to accurately separate ASD and TD participants and ultimately allow these classifiers to be translated into the clinic.