Multivariate statistical data analysis of cell-free protein synthesis toward monitoring and control

The optimization and control of cell free protein synthesis (CFPS) presents an ongoing challenge due to the complex synergies and nonlinearities that cannot be fully explained in first principle models. This article explores the use of multivariate statistical tools for analyzing data sets collected from the CFPS of Cereulide monoclonal antibodies. During the collection of these data sets, several of the process parameters were modified to investigate their effect on the end-point product (yield). Through the application of principal component analysis and partial least squares (PLS), important correlations in the process could be identified. For example, yield had a positive correlation with pH and NH 3 and a negative correlation with CO 2 and dissolved oxygen. It was also found that PLS was able to provide a long-term prediction of product yield. The presented work illustrates that multivariate statistical techniques provide important insights that can help support the operation and control of CFPS processes.


| INTRODUCTION
The integration of machine learning techniques into the operation and control of cell free protein synthesis (CFPS) systems 1 offers significant potential for improving productivity and the quality of materials manufactured using this relatively new processing technique. CFPS offers advantages over in-vivo protein production for applications that require more precise control of product physiochemical properties, such as bispecific antibodies, antibody drug conjugates, vaccines, and membrane proteins. 2 It also offers direct access and control of the synthesis environment, giving potential for the development of highly productive CFPS platforms, for the rapid and efficient production of recombinant proteins. 3 Guidelines proposed by pharmaceutical regulatory authorities, such as those published by the International Council for Harmonization of Technical Requirements for Pharmaceuticals for Human Use (ICH), emphasize the importance of identification, control, design space, and process validation in the development and manufacture of pharmaceutical products. 4 In particular, Q8, Q9, Q10, and Q11 from the ICH highlight the importance of process modeling as a tool to implement quality by design.
Models used to describe CFPS production processes and in particular its application to quality by design can only be useful in practice if sufficient process knowledge is available to explain the effect of critical process parameters on critical quality attributes. In this respect, mechanistic models offer great value in determining causality to support the optimization of CFPS processes 5,6 ; however, the development of such models requires significant time and resources which typically make them impractical. 7 In addition, CFPS relies on a complex network of interacting reactions, reactants, and enzymatic catalysts, which are not yet fully understood. Although there are continuous efforts to improve CFPS mechanistic models such as less data intensive models based on flux balance analysis, 8 models developed from first principles are still scarcely being used to optimize the processes 9 and hence empirical modeling techniques have been investigated as a possible alternative. Unfortunately, challenges associated with the multidimensionality of the data being analyzed, as well as variations introduced by disturbance factors such as experimental error and noise can lead to problems for many empirical modeling techniques. However, multivariate data analysis techniques have been applied to many other processes where studies have demonstrated them to be capable of overcoming these challenges, allowing models to be developed that can be used to support process optimization. 10 One approach to modeling multidimensional data is to use artificial neural networks (ANNs). Multiple publications on bioprocess optimization, prediction, delivery and prognoses have shown the capabilities of ANNs, particularly their ability to predict behavior in CFPS processes. 11 The main advantage offered with ANNs is that they are better at capturing nonlinear relationships when compared with traditional statistical modeling techniques, such as partial least squares (PLS). However, they require greater computational resources and have a complex structure requiring model developers to explicitly identify possible causal relationships between process variables. In addition, model developers need to go through an empirical process of performing sensitivity analyses on parameters such as learning rates, momentum terms, and model structure. 12 As a result, ANNs models tend to be opaque and difficult to thoroughly validate and verify.
An alternative approach to modeling multidimensional data is to use multivariate statistical models, such as those commonly referred to as multivariate statistical data analysis (MSDA). MSDA techniques have been applied successfully to pharmaceutical production processes for optimization, monitoring, online control and detection of sensor faults in seed, batch, and fed-batch cultivations. 10,[13][14][15][16][17][18][19][20][21] For example, the capabilities of MSDA in a cell-culture process for small scale (2 L) and large scale (2000 L) batches were published in Kirdar et al. 22 This work aimed to evaluate whether MSDA was able to characterize the process through the analysis of process parameters such as CO 2 , O 2 , glucose, pH, lactate, ammonium ions, purity, viable cell density, viability, and osmolality. The proposed methodology included analyzing various control charts to identify fault conditions, with their results demonstrating that MSDA could be used as a tool for extracting knowledge from such processes.
There has been very little work published that has focused on the analysis of cell free expression systems using MSDA techniques. In Reference 23, the authors assessed the differences in the metabolite profiles of four lysates by analyzing the data collected from the process using principal component analysis (PCA), with the aim being to standardize lysate activity and to design an improved cell free expression system. The lack of published research in this area indicates a clear knowledge gap regarding the potential benefits of using MSDA tools to optimize and control such processes.
The aim of the present work was, therefore, to apply MSDA techniques to experimental data collected from CFPS processes to determine which process parameters, including pH, temperature, and O 2 have the most significant effect on end-point qualities, such as yield and aggregation. In addition, the work also aimed to identify the suitability of using MSDA techniques to provide long-term predictions of end-point quality metrics during operation of the process. The specific case study for the analyses presented in this article is the scalable cell-free synthesis of monoclonal antibodies (mAb), using the cell-free lysate system developed by Sutro Biopharma. 24,25 Additionally, the present work aims to provide a generic methodology that allows rapid process characterization and optimization to the manufacturing of other therapeutic proteins and biopolymers by developing data-based relationships between process parameters and process outputs such as yield, which are likely to be highly dependent on the cell extract type, target protein, construct design, and so forth. 26 Section 2 describes the methodology that was followed when applying the MSDA techniques, with Section 3 providing the methodology for operating the laboratory equipment and collecting data. Section 4 presents the results and provides a discussion and finally, Five sets of experimental data were analyzed in this work and a summary of these date are provided in Table 1. Four of these data sets were collected from experiments completed at UCL, D1-D4, with the fifth, D5, collected from a laboratory operated by Sutro Biopharma, Inc. All experiments were undertaken by the same person, Experimenter A, with the exception of D4 which was completed by Experimenter B. Each data set consisted of the mean process measurements collected every minute from observations from each well on a 24-well plate. The data sets were averaged to hourly data to make them robust to instrumental noise and to enable simpler analysis. In some wells, the reaction failed and for these wells, the data were discarded. The observations column provides the number of wells that were used in each data set after failures were discarded.
The values for each of the control parameters: temperature, run length and pH, for the five data sets (D1-D5) are shown in Table 1. Set-points for the temperature and pH controllers were held constant for D1, D2, and D5. However, after analyzing the data from these experiments and observing findings in relevant literature, 27 -pH was seen to be a key factor impacting the reaction yield and to analyze its effect further the set-point was varied through experiments D3 by introducing low-frequency pseudo random binary signals (PRBS). PRBS are often used in system identification to excite the process at different times and frequencies of the input variables trajectories such that it produces small changes in the output. Finally, the setpoints for pH during D4 were specified so that the effect that a predefined pH trajectory had on the reaction could be compared to the case were pH was maintained at a fixed level in reactions operating over a longer duration (12 h).

| Data preprocessing
Data were arranged into matrices of process measurements, X, and matrices of end-point qualities, Y. The matrices were then normalized prior to model identification by subtracting the mean and dividing by the standard deviation of each variable. For data analysis and predictions that considered temporal information, the matrix X, which contained three-dimensional information (variables of size J, time intervals of size K, and observations or repeats of size I) was unfolded into a two-dimensional array using the technique referred to as multiway unfolding 29 as shown in Figure 1(A). This technique has been used in several previous studies for monitoring batch processes. 30 In this article, the letter "M" will be used as a prefix of the model to indicate multiway models (e.g., MPLS will refer to multiway PLS).
In some cases, the total reaction length, and therefore the number of observations recorded during the run, or batch, varied, and as a result, the unfolded data matrix X was incomplete because of the difference in vector sizes. Uneven vector lengths creates difficulties when using multiway unfolding as it requires all vectors to be of equal length. To address this issue three approaches were used: Approach 1 compiled the unfolded X matrix using only those measurements that were recorded up to the shortest reaction length. Approach 2 compiled the unfolded X matrix using only the measurements that were recorded in the last few hours of each reaction, which all the observations had in common and Approach 3 compiled the unfolded X matrix using missing data (MD) techniques. Specifically, the technique of Projection to the Modal Plane 31 was used to estimate the progress of each run had the reaction been allowed to continue. Figure 1(B) shows the three approaches used to address uneven vectors.

| Methodology of the exploratory data analysis
The initial data analysis was provided using the relatively standard techniques of PCA and PLS and their multiway counterparts (MPCA and MPLS). Further details of these techniques can be found in Appendix B.
PCA was initially applied to the five data sets to help in the identification of critical process parameters that could potentially be used in the future to control the cell-free synthesis reaction. In addition, PCA was used to identify similarities and differences between the the matrix X included the initial and final measurements from the process variables and the end-point, quality variable(s). PCA models were then identified for this matrix using the NIPALS algorithm. 29 PLS and MPLS were also used for exploratory data analysis to observe the effect that each process variable had on the quality variable(s). The data used to construct the X matrix for PLS and MPLS were the process variable measurements through each batch. For the PLS model, the quality variable(s) was used to construct the Y matrix.
The PLS and MPLS models were identified using the SIMPLS algorithm 32 and the number of latent variables used in the models was chosen as being that which minimized the root mean square error of cross validation when using 10-fold cross validation. • Ordinary least squares (OLS) and multiway OLS (MOLS) regression.

| Methodology of the prediction assessment
• PLS and MPLS regressions.

| Methodology of monitoring using MSDA control charts
The most frequently used control charts for monitoring purposes are based on two statistics: the Hotelling's statistic, T 2 , which provides a measure of the deviation of an observation from the region covered by the identification data set and the squared prediction error (SPE) of x, SPE x , which is the error between the data vector of an observation, x, and its reconstructed value obtained using the MSDA model. The confidence limits for the charts were calculated using the bootstrapresampling technique used in Duran-Villalobos et al, 16 which infers confidence intervals from an empirical distribution function. Raw materials for the CFPS reaction were kindly provided by Sutro Biopharma, Inc. and this included a cell-free extract (XtractCF), a reaction mix containing amino acids, nucleic acids, salts and energy source (2x Supermix), plasmids for the heavy chain and light chain of a mAb and T7 RNA polymerase (prepared in-house and in the form of an Escherichia coli lysate). None of these components are commercially available. All materials were stored frozen at −80 C and sufficient quantities for each experiment thawed at room temperature immediately prior to use.

| METHODOLOGY OF CFPS
The laboratory procedures that were followed for the CFPS reaction were as follows 24,25 :    ters whose error bars cross the axis indicate that either the parameters were not significant or were unreliable for yield prediction. Additionally, the PLS regression coefficients for monomer % and aggregate % estimation are shown in Figure 4. In these experiments (D1, D2, and D5), the PI controller set-points for pH and temperature were kept constant using different experimental configurations; hence, "sp" was used instead of "end" following the process variable names.
A summary of the main correlation identified from the PCA and PLS models are shown in Table 2.  and/or pH rather than with yield; or that either the PCA or the PLS model was not able to capture the correlations of these process variables. Since CO 2 and NH 3 were used to control pH as the reaction proceeded, it is logical that there was a relationship between these three factors. The correlations shown in this article must be interpreted with caution as the impact of pH and temperature is expected to vary to some degree from one target product to another and extract type to another. Simplistically, the ideal temperature and pH will be that which allows the optimal conformation of the protein or proteins which catalyze the rate limiting reactions. Additionally, the impact of pH on aggregation will likely vary depending on the product isoelectric point (pI). 35  mRNA accumulates. 36 Furthermore, there is evidence that the optimum conditions for transcription and translation are different. 37 Added to this, exhaustion of resources, in terms of macromolecular building blocks will differ from one product or DNA construct to another, while with a change in reaction mix composition and/or extract the rate at which inhibitors accumulate will also differ, and thus so may the impact of reaction length. In this context, the use of the MSDA tools proposed in this work provides a generic methodology to help identify optimal conditions.

| Prediction assessment
The accuracy with which the various MSDA modeling techniques described in Section 2.4 were able to predict the end-point quality variables from the CFPS process was compared using the symmetric mean absolute percentage error 38 between the estimated and the measured end-point quality using leave-one-out predictions. Table 3 shows the results of the comparison of the prediction errors from different modeling techniques, with the lowest errors highlighted by intensity in green and the highest errors highlighted by intensity in red for each identification data set.
The results for the two-dimensional data in Table 3, which contained the least complex data sets, show that the performance of the OLS models for the prediction of a single variable was better than any other compared technique followed closely by that of the PLS models. On the other hand, the results for three-dimensional data, where the measurements collected throughout the reaction process are considered, show that MOLS models had the largest prediction errors. The reason for this is that the additional measurements included in the X matrix are highly correlated and this has an adverse impact on the accuracy of OLS models. 39 With respect to the computational speed of the studied modeling techniques, the difference in computing time was negligible relative to the long reaction times encountered in CFPS so the use of any of these techniques for predictive control should be feasible. Table 4 shows the results of the comparison of the prediction errors using the approaches to address uneven vectors mentioned in Section 2.2, with the lowest errors highlighted in green and the highest errors highlighted in red for each identification data set. The best results in Table 4 were obtained using the first approach, which consisted on using measurements that were recorded up to the shortest reaction length, whereas the worst results were obtained using the third approach which consisted on using MD algorithms.
Models that use the length of reaction as an input, such as those shown in Tables 3 and 4, can be used for optimization but they are not useful for online prediction of end-point parameters (yield). In contrast, the process measurements used to identify the models from Table 5 did not need to consider the time of reaction as an input to the model; hence, the model can be used for online monitoring and control. Table 5 shows a comparison of the prediction errors when MPLS models were used to provide long-term predictions of yield during the reaction, with the lowest errors highlighted by intensity in green and the highest errors highlighted by intensity in red for each identification data set. In each case, the errors that are reported are the errors when the end-point quality is predicted after the reaction had proceeded for a length of time, for example, after 1h, 2 h, and so forth. As discussed in Section 2.2, this introduces problems for some of the predictions later in the runs, as the lengths of the vectors for each run will vary depending on the length of time of each reaction run. Two techniques were applied to address this. The first technique identified multiple models as the reaction progressed. During the first hour of reaction, the model was identified using only measurements collected during the first hour of the reaction; during the second hour, measurements collected during the first and second hour of reaction were used to identify the model and so on. Data collected from the runs with shorter reaction lengths was then discarded as their reaction lengths were exceeded. Hence, this approach utilized a different model to predict end-point yield for each hour of reaction.
The second approach calculated a single model with all the available measurements from all the observations in the data set and used MD techniques to estimate the "missing" future values of measurements for the shorter reactions and at each new "online" observation, as described in Section 2.2 (these are labeled MD in Table 5).
The lowest prediction errors in Table 5 were observed later in the reaction, rather than near the start. This is to be expected as later in the run, more information is available regarding how the reaction has Another important observation from Table 5 is that predictions obtained using only data from the first hour of the reaction provided useful predictions of final yield using the multiple model approach. In contrast, predictions using MD algorithms seemed to be accurate only after about 3 h of reaction. However, the advantage of the MD procedure is that only one model was required.

| Postbatch analysis and monitoring using PLS charts
This section provides an example of CFPS batch analysis and online monitoring with MPLS statistical charts, with the intention being to  for the faulty and the validation observations. In both charts, the "faulty" batches exceeded the 95% confidence limit, whereas the validation observations stayed within the 95% confidence interval limit, suggesting that the technique could be used to characterize different batches. These charts also show that the confidence limit in T 2 is lower relative to the confidence limit for SPE. This difference was also observed in another study, 41 where the author suggested that limits on SPE may reduce Type I and Type II errors compared with limits on T 2 . Figure 6 shows that B1 and B2 are outside the 95% confidence limit. After further analysis of the individual contributions of each variable to the SPE in B1 and B2, it was observed that faulty observations had larger values corresponding to measurements of pH and Temperature along the entire reaction. These large individual SPE values corresponded to measurements of unusually high temperature and lower than average pH compared to those in the identification data set. B1 and B2 were obtained from D3; hence, it is likely that the PRBS in the DOE caused these two observations to have a low pH during critical times of the reaction.
A useful approach for online monitoring of product quality is the use of cumulative contributions of process variables over time to SPE and T 2 . These charts allow engineers to set operating windows for process measurement deviations, which if violated would suggest an adverse effect on CFPS performance. Figure 7 shows the cumulative SPE of faulty and validation observations through the batch, normalized to the magnitude of the confidence limits at each hour.
From the results in the chart, the SPE from B1 and B2 were consistently outside the confidence limit. On the other hand, the SPE of the validation observations stayed within the confidence limits, except the first hour of G3. A way to reduce false positives is to use more relaxed confidence limits (e.g., 99%), particularly in the early stages of the reaction. It is also important to consider that the observations used in this example are from experimental data in a research environment, and that the variability encountered among the observations is likely to be larger than that encountered in observations from standardized production.  experimental results showed that when temperature and pH were held constant throughout the reaction using a feedback controller, pH and NH 3 had a positive correlation with yield and CO 2 and DO had a negative correlation. This was not necessarily the case for experiments where the controller set-points varied through the reaction, where the effect of process variables on yield was shown to vary as the reaction proceeded. In addition, the length of reaction was found to have a significant, positive impact on yield; although this impact and the impact of most of the other process variables were found to only be relevant in the first 10 h of reaction. Temperature, in the range of values used in the experiments, did not have a statistically significant impact on yield. However, temperature was found to have a strong relationship with monomer % and aggregates % estimation.
Multivariate regression models, and in particular PLS, were shown to be able to provide accurate prediction of end-point product quality during the CFPS process. This offers future potential for the development of feedback predictive control systems able to regulate endpoint quality metrics by manipulating process variables, such as temperature and pH and is the subject of ongoing work.
Overall, the results presented in this article demonstrate that MSDA techniques offer a useful tool for extracting information from experimental data sets in CFPS and can be used for process characterization and identifying optimal experimental conditions. Moreover, it has been shown that these techniques can be effective in predicting and monitoring end-point quality parameters from measurements of process variables, even within the first hour of the synthesis reaction, which could be useful to improve large-scale manufacturing.
The applications for MSDA tools in CFPS range from process characterization to online automated multivariable control, regardless of the mode of operation (fed-batch or continuous). However, to effectively apply this technology, the implementation should consider which data from different unit operations are collected and analyzed to provide a robust support tool. In addition, if modeling techniques are to be successfully applied, automation and standardization are necessary since it is suspected that the largest sources of variation found in this study and others 42,43 were caused by the change of the operators, equipment and operating environment.