Peptide Retention in Hydrophilic Strong Anion Exchange Chromatography Is Driven by Charged and Aromatic Residues

Sven H. Giese,† Yasushi Ishihama,‡ and Juri Rappsilber*,†,‡,§

† Bioanalytics, Institute of Biotechnology, Technische Universität Berlin, 13355 Berlin, Germany
‡ Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto 606-8501, Japan
§ Wellcome Centre for Cell Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom

*S Supporting Information

Cite This: Anal. Chem. 2018, 90, 4635−4640. DOI: 10.1021/acs.analchem.7b05157. pubs.acs.org/ac
Received: December 11, 2017. Accepted: March 12, 2018. Published: March 12, 2018.
© 2018 American Chemical Society. This is an open access article published under a Creative Commons Attribution (CC-BY) License, which permits unrestricted use, distribution and reproduction in any medium, provided the author and source are cited.

ABSTRACT: Hydrophilic strong anion exchange chromatography (hSAX) is becoming a popular method for the prefractionation of proteomic samples. However, the use and further development of this approach are hindered by the limited understanding of its retention mechanism and the absence of elution time prediction. Using a set of 59 297 confidently identified peptides, we performed an explorative analysis and built a predictive deep learning model. As expected, charged residues are the major contributors to the retention time through electrostatic interactions. Aspartic acid and glutamic acid have a strong retaining effect, and lysine and arginine have a strong repulsion effect. In addition, we also find the involvement of aromatic amino acids. This suggests a substantial contribution of cation−π interactions to the retention mechanism. The deep learning approach was validated using 5-fold cross-validation (CV), yielding a mean prediction accuracy of 70% during CV and 68% on a hold-out validation set. The results of this study emphasize that not only electrostatic interactions but rather diverse types of interactions must be integrated to build a reliable hSAX retention time predictor.

Mass spectrometry (MS)-based proteomics is the driving technology for the characterization and quantification of complex protein samples. 1−3 With the current advancements in instrumentation and software solutions, the number of peptides and proteins that can be identified in a minimal amount of time has increased dramatically. 4 However, deep proteome coverage of higher eukaryotes, mammalian cell lines, or tissue is currently only feasible with extensive fractionation. 5,6 The wide dynamic range of all the expressed proteins in a cell remains a major challenge, leaving the least abundant proteins (and peptides) undiscovered. In these cases, online (1D) reversed-phase liquid chromatography (RP-LC) does not yield the necessary separation of the proteome. Instead, prefractionation is commonly applied to further reduce the complexity. Ideally, the combined separation methods are as orthogonal as possible 5,7,8 to ensure the separation of similar analytes. Interestingly, high-pH RP is often used as a prior fractionation method even though it is not truly orthogonal to standard RP (low pH). Importantly, there is no universal best prefractionation method. Rather, the optimal separation method needs to be chosen based on the analytes. 9,10

While fractionation methods offer great possibilities to reduce the sample complexity, they usually require larger sample amounts and more preparation time. Usually, most fractions are injected separately without pooling. Therefore, the peptide identification is fraction aware. This extra piece of information can be incorporated into the database search. 11−13 To fully utilize this information, a computational model needs to be developed that can confidently predict the retention time of a peptide based on its amino acid sequence.

The proteomics community has successfully developed accurate models for the prediction of the retention time in low-pH RP-LC, which is typically coupled directly to a mass spectrometer and therefore widely applied in proteomics. 14,15 Retention times have also been predicted for other chromatographic methods including high-pH RP-LC, 16,17 hydrophilic interaction liquid chromatography (HILIC), 18 and strong cation exchange chromatography (SCX). 19 Various algorithms have been applied for the described prediction task: simple linear regression models, 20 nonlinear models, 21 support vector regression models, 11,16 artificial neural networks, 22 or a physical model describing the chromatographic process. 23 For a comprehensive review, the reader is referred to Tarasova et al. 14 and Moruz and Käll. 15

For standard shotgun proteomics, hydrophilic strong anion exchange chromatography (hSAX) is largely orthogonal to RP-LC. 5 Currently, there is no model to predict the retention time for hSAX. Moreover, the sequence-specific features that influence the retention behavior of peptides during hSAX are still unknown. A common approach is to incorporate (limited) sequence information into the prediction model by creating position-specific retention coefficients 18 or neighboring amino acid effects.
24 It would be desirable to (1) better understand the mechanisms governing the retention behavior of peptides during hSAX and (2) build a predictive machine learning model that confidently predicts the retention time of a peptide based on its sequence information.

In this study, we analyzed the chromatographic behavior of 59 297 peptides based on 29 hSAX fractions. We aim to contribute new insights into the interaction of peptides during hSAX and quantify how sequence features affect the retention behavior. To accomplish this, a machine learning workflow is applied and validated using 5-fold cross-validation. We developed a neural network model that predicts the retention time for peptides from an hSAX fractionation. The predictive model and the preprocessing are available in the Python package DePART (https://github.com/Rappsilber-Laboratory/DePART).

■ METHODS

Experimental Details. The experimental data used for this study were published by Ritorto et al. 5 In brief, the authors performed hydrophilic strong anion exchange (hSAX) chromatography on macrophage cells from Mus musculus to test the peptide separation capabilities of hSAX followed by mass spectrometry. The tryptic digest of the cell lysate was analyzed with an LTQ Orbitrap Velos Pro (Thermo Fisher Scientific, West Palm Beach, FL). The fractionation was performed using an IonPac AS24 25 column (2 × 250 mm, 2000 Å pore size, Thermo Fisher Scientific, Part No.: 064153) with a 35 min gradient (0 to 1 M NaCl; solvent A, 20 mM Tris-HCl at pH 8.0; solvent B, 20 mM Tris-HCl at pH 8.0, 1 M NaCl). The functional group of the AS24 is an alkanol quaternary ammonium ion on a solid support that aims at minimal hydrophobicity. Details of the sample preparation protocols can be found in the original manuscript. 5

Data Processing. For our study, Ritorto et al. made the results of their previous experiments available as MaxQuant result files. We postprocessed the MaxQuant evidence file.
In total, 466 495 peptides were identified in 34 fractions. We applied stringent filtering to avoid ambiguity in the training data. This initial set of peptides was reduced by removing contaminants, decoys, “only by site” identifications, and modified peptides (other than carbamidomethylated cysteine). In addition, for peptides identified in two adjacent fractions, the identification with the lower intensity was removed from the data set. Peptides identified in more than two fractions or in fractions that were not adjacent were also removed from the data. Finally, fractions with fewer than 300 unique peptide identifications were removed, leaving 59 297 unique peptide sequences distributed over 29 fractions for the data analysis. As an independent data set, we used PXD006188, 26 which was analyzed using MaxQuant 27 (v. 1.6.1.0) and filtered as described above, resulting in 93 372 peptides identified in 32 fractions. All processing was performed in Python 3.5 using the packages numpy, scipy, matplotlib, scikit-learn, pandas, and seaborn.

Machine Learning. For the computational modeling of the retention time, we followed two separate strategies, a regression and a classification approach. In the regression case, a simple linear model (LM) with a length correction parameter (LCP) was used. The Python package pyteomics 20 with LCP optimization was used for the LM implementation. In the classification case, a logistic regression (LR) and a feedforward neural network (FNN) were used. In both cases, we evaluated (and trained) the model using the accuracy metric, defined as the proportion of correctly predicted fractions among all predictions. With the LM, such a metric is ill-defined since no discrete fraction is predicted. Therefore, we defined a forced accuracy metric by first rounding the predictions to the nearest integer and then computing the accuracy. The FNN was implemented using Keras 28 with the Theano 29 backend.
The network architecture consisted of four fully connected layers with 50, 40, 35, and 29 neurons. As final activation, the softmax function was used (Table S4). One strength of the simple additive model is the intuitive interpretation of the learned coefficients: a peptide’s elution time increases (or decreases) by a certain factor based on the amino acid count. For neural networks with nonlinear activation functions, the interpretation is not as straightforward. Therefore, we added peptide features (e.g., pI or aromaticity) based on the literature 11,30 and our initial exploratory data analysis to increase the predictive power in the classification task. The complete definition of features is available in Table S2.

Figure 1. Effect of the charged residues on peptide retention in hSAX. (a) Mean residue count per peptide for D/E (red) and K/R (blue) over fraction. Error bars denote the standard deviation. Peptide count per fraction is shown in orange (total 59 297 unique peptides). (b) Effect of D/E count (range 0−5) on peptide retention. (c) Extracted chromatogram of peptides with three and four D/E (red). Subpopulations were defined according to the number of K/R residues (one to three, blue tones for peptides with three D/E residues and green tones for peptides with four D/E residues). Crosses mark the mode of the respective distributions.

The evaluation of the prediction performance was based on a 5-fold cross-validation (CV) strategy (including 75% of the data, 44 471 peptides). In addition, a hold-out validation set was used for the final model assessment (25% of the data, 14 825 peptides). In the CV setup, the training splits had 35 578 observations, and the validation splits had 8894 observations.
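The evaluation setup described above can be illustrated with a short sketch. The Python code below is not taken from DePART; the function names (`split_holdout`, `k_fold_indices`, `accuracy`, `forced_accuracy`), the random seed, and the contiguous fold layout are assumptions made for illustration. It only shows how a 75/25 hold-out split, a 5-fold partition, and the accuracy and forced accuracy metrics could fit together.

```python
import random


def split_holdout(items, holdout_fraction=0.25, seed=0):
    """Shuffle items and return (cv_part, holdout_part).

    The seed and the truncating size calculation are illustrative choices.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def k_fold_indices(n_items, k=5):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    # The first (n_items % k) folds get one extra element.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, val_idx
        start += size


def accuracy(true_fractions, predicted_fractions):
    """Proportion of predictions that hit the observed fraction exactly."""
    hits = sum(t == p for t, p in zip(true_fractions, predicted_fractions))
    return hits / len(true_fractions)


def forced_accuracy(true_fractions, predicted_values):
    """Accuracy for continuous (regression) outputs after rounding.

    Python's round() resolves exact .5 ties to the even integer.
    """
    rounded = [int(round(v)) for v in predicted_values]
    return accuracy(true_fractions, rounded)
```

Because the sketch truncates when sizing the hold-out set and uses contiguous folds, its split sizes may differ slightly from the peptide counts reported above.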
We describe the machine learning workflow in more detail in the Supporting Information, including a performance comparison with other classifiers.

■ RESULTS

In the following section, we present our results and propose a model for the driving interactions in hydrophilic strong anion exchange chromatography (hSAX) for peptides. The results section is divided into four parts: (1) A general overview is given of the data and how the retention time during prefractionation is influenced by charged amino acids. (2) The influence of the charged amino acids is compared. (3) The influence of usually noncharged amino acids is compared, and finally, (4) a machine learning model is built to model peptide retention during hSAX.

Peptide Retention in hSAX Is Driven by the Charged Amino Acids. We first investigated the influence of acidic (E, D) and basic (K, R) amino acids on the retention behavior of peptides in an hSAX fractionation experiment. Note that histidine residues will be uncharged under the pH conditions used during fractionation. We used elution data of 59 297 tryptic peptides from murine macrophage cells separated into 29 fractions. Positively charged peptides elute early (fractions 1 and 2) and are separated from uncharged peptides (fractions 4 and 5), which in turn are separated from negatively charged peptides (fractions 7−29), where charge was calculated from the residues E, D, K, and R (Figure 1a). While the mean count of D or E (D/E) residues in a peptide increases with the fraction number, the mean count of K/R residues stays constant (Figure 1a). In agreement with this, missed cleavages are not enriched in any of the fractions (Figure S1). The average retention behavior of tryptic peptides appears to be mainly influenced by the occurrences of D/E residues in the peptide sequence.
These observations are also supported numerically by the Pearson correlation coefficients (PCC) between the summed residue charge per peptide and the observed fraction number: for D/E residues, the PCC is −0.75; for K/R, −0.03; and for D/E/K/R residues, −0.83. The peptide length, on the other hand, has a much smaller overall influence across all fractions (PCC 0.33). Peptides with 0, 1, 2, 3, 4, and 5 D/E residues correspond on average to the fractions 3, 6, 10, 14, 18, and 20, respectively (Table S1), thus leading to a mean increase of roughly three fractions in retention time per D/E residue. However, not only the mean fraction number but also the D/E peak width increases with the number of acidic residues (Figure 1b). In addition, the higher the number of D/E residues in the peptide, the more complex the distributions appear. Peptides with two D/E residues distribute over two peak fractions, while peptides with four D/E residues distribute over four to six peak fractions. Therefore, we investigated the influence of basic residues on the retention time. Positively charged residues, lysine and arginine, should weaken peptide retention during hSAX. Indeed, K and R residues explain the multiple peak fractions of peptides with a given D/E count (Figure 1c). With an increasing number of K/R residues, the retaining effect of D/E diminishes, and thus peptides elute earlier. Since the retention shift caused by a single K/R residue is quite strong, a repulsion mechanism is most likely involved. Interestingly, the eluting strength of K/R residues seems slightly stronger than the retaining effect of D/E residues: the mean fraction of peptides with four D/E residues and two K/R residues (summed residue charge of −2) is 16.5, while for peptides with three D/E residues and one K/R residue (summed residue charge also −2), the mean fraction is 18.1.
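The summed residue charge and its correlation with the fraction number can be sketched in a few lines of Python. The sign convention (+1 for K/R, −1 for D/E, with histidine and the termini ignored) is an assumption inferred from the negative PCC values reported above, and the function names and toy peptides are hypothetical; they are not part of the published data set.

```python
from math import sqrt

ACIDIC = set("DE")  # counted as -1 each (assumed sign convention)
BASIC = set("KR")   # counted as +1 each (assumed sign convention)


def summed_residue_charge(sequence):
    """Return (#K + #R) - (#D + #E) for a peptide sequence."""
    n_basic = sum(aa in BASIC for aa in sequence)
    n_acidic = sum(aa in ACIDIC for aa in sequence)
    return n_basic - n_acidic


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical toy data: made-up peptides and fraction numbers,
# chosen only to show the direction of the relationship.
toy_peptides = ["AAGK", "ADLK", "DDELR", "DEEDVK"]
toy_fractions = [2, 5, 12, 18]
toy_charges = [summed_residue_charge(p) for p in toy_peptides]
toy_pcc = pearson(toy_charges, toy_fractions)
```

With this convention, a peptide with four D/E and two K/R residues and a peptide with three D/E and one K/R residue both receive a summed residue charge of −2, which is the comparison made in the text.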
However, this additional information on the K/R distribution does not fully explain the observed substructures; peak tails are clearly visible, especially on the right side of the distributions (e.g., D/E, 4; K/R, 3 in Figure 1c).

Lysine Exhibits Stronger Electrostatic Repulsion than Arginine. We next evaluated whether R and K differed in their effect on peptide retention (Figure 2a). Peptides with four D/E residues were found in the fractions 22, 17, and 11 (median fraction values) if they had one, two, or three arginines, while they were found in the fractions 21, 15, and 10 if they had one, two, or three lysines. This means that lysines are more strongly repelled than arginines in hSAX (on average, by 1.3 fractions). Statistical analysis using a Mann−Whitney U (MWU) test supports this observation. However, since the observed effect size is rather small, the statistical significance should be interpreted with caution (Figure S2a). Similarly, we investigated possible differences between aspartate and glutamate in peptides with either two D or two E residues and either one, two, or three lysines (Figure 2b shows data for up to two lysines). For this subset, the rounded median fraction number for peptides with two D or two E residues is 12, 11, and 5 and 12, 11, and 5, respectively. This leads to an average increase of 0.33 fractions if there is an aspartate instead of a glutamate in the peptide sequence. For the negatively charged amino acids, we also conducted an MWU test: although the observable effect was even smaller, the test still resulted in a significant difference between the retention behavior of D and E (Figure S2b).

Aromatic Amino Acids Play a Key Role in Peptide Retention during hSAX. As expected, peptide retention during hSAX is dominated by charged residues. However, peptides with one set of charged residues elute over many fractions. Therefore, charged amino acids alone do not suffice to explain peptide retention.
Figure 2. Detailed comparison of relative contributions of positively (K/R) and negatively (D/E) charged residues on peptide retention in hSAX. (a) Effect size of K/R residues. Peptides with four D/E residues were divided according to their K and R count (K, green tones; R, blue tones). (b) Effect size of E/D residues. Peptides with either two E or two D residues are shown, split according to their number of K residues (1 or 2).

As a first step to search for additional contributions, a subset of peptides was selected (two D/E residues, one R/K residue). Then, the effect size of an amino acid on the retention time was estimated using the slope from a linear regression model. The response variable was set to the mean composition contribution of an amino acid, while the explanatory variable was set to the fraction number. On the basis of the regression slope and the derived p-value (under the null hypothesis that the slope is equal to zero), the remaining amino acids can be divided into three categories: (1) retaining if the slope is positive and the p-value is smaller than 0.05, (2) eluting if the slope is negative and the p-value is smaller than 0.05, and (3) no (significant) effect if the p-value is larger than 0.05. Accordingly, the (aromatic) amino acids F, Y, and W show the strongest retaining effect based on the regression slope (Figure 3, Figure S4). Interestingly, peptides with 0 aromatic residues are found in a sharp symmetrical distribution. With an increasing number of aromatic amino acids in the peptide sequence, the distributions shift to later retention, become broader, and develop a right tail (Figure S6). In contrast, the amino acid contributions of A, P, and S and Q, T, and V show an eluting effect. For these amino acids, the subpopulation peaks look very sharp, even with increasing residues of the same group.
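The slope-based grouping of amino acids described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ols_slope` and `classify_effect` are hypothetical helper names, the p-value is assumed to come from a separate test of the slope (for example, the one reported by `scipy.stats.linregress`), and a p-value exactly equal to 0.05 is treated as not significant, a case the text leaves open.

```python
def ols_slope(fractions, contributions):
    """Least-squares slope of the response (mean composition contribution
    of one amino acid) on the explanatory variable (fraction number)."""
    n = len(fractions)
    mean_x = sum(fractions) / n
    mean_y = sum(contributions) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(fractions, contributions))
    sxx = sum((x - mean_x) ** 2 for x in fractions)
    return sxy / sxx


def classify_effect(slope, p_value, alpha=0.05):
    """Map a regression slope and its p-value to one of three categories.

    p_value is taken as given; computing it requires a t-test on the
    slope, which is outside this sketch.
    """
    if p_value >= alpha or slope == 0:
        return "no significant effect"
    return "retaining" if slope > 0 else "eluting"
```

Keeping the slope estimate and the category rule in separate functions makes it easy to swap in a different significance threshold or a p-value from another test without touching the regression itself.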
The remaining amino acids C, I, N, G, L, H, and M do not show a clear trend and thus could be classified neither as eluting nor as retaining. Subtracting the weighted counts of the aromatic residues (0.8W + 0.6Y + 0.3F) from the residue charge strengthens the initial PCC from −0.83 to −0.86. Adding the weighted counts of the residues A, P, Q, S, T, and V (factor 0.1) further strengthens the retention PCC to −0.88.

A Neural Network Achieves the Highest Prediction Accuracy. As the final step in our analysis, we built a machine learning model to predict the retention time of a peptide based on its sequence features. After initial hyperparameter optimization for a set of classifiers and regressors (Supporting Information S3), we chose a linear regression model (LM), a logistic regression model (LR), and a feedforward neural network (FNN) for further analysis. The coefficients of the LM are shown in Figure 4a. As expected, the sign and magnitude of the coefficients largely match our manual analysis: First, the basic residues have a strong eluting effect on the retention time (large negative coefficient). Second, the acidic residues and the aromatic residues have a strong retaining effect on the retention time (large positive coefficient). In addition, the nuances regarding the effect size of the basic residues also fit our previous description that K is marginally more strongly repelled than R. This is most likely due to the lower basicity of K. Similar to the coefficient representation from the LM, FNNs can be used to approximately estimate the influence of the input features by analyzing the input weights of the first layer. Since we also used position-specific features in the machine learning workflow, the average of the input weights can be used to roughly measure these position-dependent contributions to the retention in hSAX.
Most importantly, it appears that the influence of D/E residues decreases with distance from the termini (Figure S7). Further, S/T/V/A/P/Q residues roughly follow a similar trend. In contrast, W/Y/F/H do not show decreasing weights for internal residues; the influence is rather stable across the positions. For the remaining amino acids (I/G/L/C/M/N), the weights are noisy and do not follow a clear pattern. This observation fits the estimation of their influence from the regression model. Therefore, the influence of these amino acids cannot be clearly defined.

Figure 3. The effect of neutral amino acids on peptide retention in hSAX. Amino acids were grouped according to their influence on peptide retention in hSAX by linear regression (Supporting Information). (a) Elution behavior of peptides with different numbers of F/Y/W and two D/E, one K/R residues. (b−e) Elution behavior of peptides with different numbers of the indicated amino acids (b, P/A/S; c, Q/T/V; d, I/G/L; e, C/M/N/H) and two D/E, one K/R, zero F/Y/W. Crosses mark the mode of the subpopulations.

Figure 4. Peptide retention time prediction for hSAX using machine learning. (a) Residue retention coefficients from a linear model with length correction parameter. (b) Fraction of correct predictions (accuracy) of different prediction methods, estimated by 5-fold cross-validation based on 35 578 (train) and 8894 (test) peptides in each split. (c) Elution time prediction for the hold-out validation set, FNN classifier (left) and LM (right); ρ indicates the Pearson correlation. Linear Model (LM), Logistic Regression (LR), Feedforward Neural Network (FNN).

A neural network was most successful in predicting the correct peptide fraction, as assessed by 5-fold cross-validation (Figure 4b).
With an accuracy of 70 ± 0.81% (mean ± standard error of the mean), the FNN classifier outperformed the linear regression model (22 ± 0.13% accuracy) and the logistic regression model (48 ± 0.07% accuracy). With a lower prediction resolution, e.g., evaluating the accuracy in a window of ±1 fraction (one-off accuracy), 92 ± 0.19% were correctly classified. Although optimization aimed for accuracy, the best performing FNN classifier also achieves a higher correlation coefficient on a hold-out validation set (never used for training) than the LM: the FNN achieves a PCC of 0.94, whereas the LM achieves a PCC of 0.9 (Figure 4c). The accuracy on this validation set was comparable to the CV accuracy, with 68% accuracy and 92% one-off accuracy. As the accuracy metric already indicates, the LM performs much worse, as seen in the marginal distributions (Figure 4c): the distribution of the predicted fractions does not resemble the observed fraction distribution. The FNN can better capture the nonlinear relationship and thus predicts the true fraction with a higher accuracy, which is supported by the similarity of the marginal distributions of the predicted and true fractions of the peptides in the validation set.

Finally, we wondered if the results obtained for the data by Ritorto et al. would also be obtained with a different data set from independent investigators. We downloaded an hSAX data set from ProteomeXchange (PXD006188 26) and repeated our analysis. For these data, the training set comprised 70 029 unique peptides and the validation set, 23 343 unique peptides. The accuracy was 69 ± 0.21% on the test data during CV and increased to 72% on the validation data. The one-off accuracy even increased to 96%, most likely due to the higher number of training instances.

■ DISCUSSION

Fractionation methods such as ion exchange chromatography (IEX) are popular tools for enrichment of certain analytes and separation of complex samples.
To perfect the separation process, a basic understanding of the underlying principles must be developed. For the principles behind the retention time of peptides in hSAX chromatography, a linear model is a useful starting point. Our exploratory analysis as well as the modeling approach showed that electrostatic forces, as expected, are the most important interactions in hSAX. A previous study that compared several fractionation methods for phosphopeptides also reported a strong correlation of the acidic amino acids with the elution time of peptides. 9 The resolution based on simply counting the D/E/R/K residues is enough to roughly map the elution time of a peptide to ±5 fractions (on average). This simple approach is supported by a good PCC (−0.83) between the summed residue charge and the elution time. However, differentiating the repelling (K/R) and retaining (D/E) effect sizes should further improve the resolution. Additional improvements can be achieved by including the influence of the aromatic amino acids (W, Y, F; PCC −0.86).

The retaining effect of the aromatic amino acids could be explained through cation−π interactions, a well-known interaction from organic chemistry. Since aromatic amino acids have a delocalized π electron system, the flat face of the aromatic ring has a partial negative charge, which attracts cations and thus enables strong electrostatic interactions. 31,32 Cation−π interactions are also essential for many biological processes and protein folding, in which K/R residues can also function as cations and thus reinforce bonds within a protein structure. Possibly, cation−π interactions also happen within a single peptide and therefore lead to a competition between the stationary phase and the side chains of K/R.
Multiple aromatic amino acids in a peptide sequence lead to nonlinearity in the retention behavior, i.e., multiple aromatic amino acids support the interactions with the stationary phase more than expected from adding individual contributions, possibly by forming sandwich complexes of two aromatic amino acids and a cation.

For tryptic phosphopeptides, it has been shown that the peptide C-terminus is likely oriented toward the stationary phase 33 during the separation in anion exchange chromatography. Presumably, this also holds true for peptides in hSAX. However, comparing the neural network weights revealed that the influence of, e.g., a D or E residue does not per se decrease from the N-terminus to the C-terminus as has been observed for the SCX model. 33 Thus, it is possible that the peptide orientation in hSAX is bidirectional or that D/E residues show a different elution behavior when near the termini. If the orientation of the peptide is indeed with the C-terminus toward the stationary phase, the decrease of the neural network weights is explainable by the limited accessibility of the acidic side chains when the residue is buried in the sequence. The same argumentation holds true for the orientation of the N-terminus toward the stationary phase. However, since we only analyzed tryptic peptides with basic side chains on the C-terminus, it seems unlikely that they would prefer this orientation. Another hypothesis is that the influence of C-terminal D/E residues is not directly through the interaction of the residues with the column but through intrapeptide interactions. For example, acidic side chains of D/E and basic side chains of K/R could form salt bridges. Thus, the closer the D/E residues are to the C-terminus, the larger is the contribution or effect in the determination of the retention time.
The retention time prediction field is fairly mature, and a selection of published tools achieved an R² ≥ 0.90, according to a recent literature review. 14 While most solutions achieve a very high correlation (and R²), the true accuracy (defined as true predictions/(true + false predictions)) is seldom evaluated. The models used to predict the fraction either do not provide an easily accessible probability or prefer to model the prediction task as a regression problem, 19 allowing R² to be calculated. We modeled the prediction in a classification setup, using a feedforward neural network (FNN). Here, accuracy is an appropriate evaluation metric. Accuracy is used to evaluate classification problems, and the algorithm was trained to optimize the accuracy and not R². With the current implementation, the FNN achieved an accuracy of 70 ± 0.81% during CV and 68% on the hold-out validation set. The accuracy is a stricter metric than the correlation coefficient or R²; the one-off accuracy increases on the CV data set to 92 ± 0.19% and on the hold-out validation data set to 92%. One additional advantage of the FNN is that each prediction is associated with a probability. This is a useful feature since it allows selection of more confident predictions or incorporation of the uncertainty in postprocessing.

■ CONCLUSION

We presented a first description of the parameters that influence the retention of peptides during hSAX chromatography. As expected, the charged amino acids largely define the retention behavior of tryptic peptides. However, the aromatic amino acids also have a large impact on the retention behavior, presumably through cation−π interactions, which makes the retention mechanism of hydrophilic anion exchange chromatography more challenging to describe.
Nevertheless, the proposed neural network model achieves a high accuracy of 68% on the hold-out validation set paired with a high correlation value of 0.94, which enables the usage of our model for statistical modeling of the confidence of peptide identifications based on prefractionation. In the future, we want to further improve our model with more training data, support for post-translational modifications, and incorporation into a robust scoring metric for peptide identification.

■ ASSOCIATED CONTENT

*S Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.7b05157.
Missed cleavage data, statistical comparison of the effect size of K/R and D/E residues, amino acid classification, and details on the machine learning workflow (PDF)

■ AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected].

ORCID
Sven H. Giese: 0000-0002-9886-2447
Yasushi Ishihama: 0000-0001-7714-203X
Juri Rappsilber: 0000-0001-5999-1310

Notes
The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS

We thank Matthias Trost (Newcastle, United Kingdom) for providing MaxQuant result files and Michael Bohlke-Schneider for fruitful discussions. This work was supported by the Wellcome Trust through a Senior Research Fellowship to J.R. [103139], a JSPS Invitational Fellowship for Research in Japan No. L16568 to J.R. and Y.I., and JSPS Grants-in-Aid for Scientific Research No. 17H05667 and 16K15107 to Y.I. The Wellcome Centre for Cell Biology is supported by core funding from the Wellcome Trust [203149].

■ REFERENCES

(1) Aebersold, R.; Mann, M. Nature 2003, 422, 198–207.
(2) Ong, S.-E.; Mann, M. Nat. Chem. Biol. 2005, 1, 252–262.
(3) Yates, J. R.; Ruse, C. I.; Nakorchevsky, A. Annu. Rev. Biomed. Eng. 2009, 11, 49–79.
(4) Hebert, A. S.; Richards, A. L.; Bailey, D. J.; Ulbrich, A.; Coughlin, E. E.; Westphall, M. S.; Coon, J. J. Mol. Cell. Proteomics 2014, 13, 339–347.
(5) Ritorto, M. S.; Cook, K.; Tyagi, K.; Pedrioli, P. G. A.; Trost, M. J. Proteome Res. 2013, 12, 2449–2457.
(6) Manadas, B.; Mendes, V. M.; English, J.; Dunn, M. J. Expert Rev. Proteomics 2010, 7, 655–663.
(7) Dowell, J. A.; Frost, D. C.; Zhang, J.; Li, L. Anal. Chem. 2008, 80, 6715–6723.
(8) Yang, F.; Shen, Y.; Camp, D. G.; Smith, R. D. Expert Rev. Proteomics 2012, 9, 129–134.
(9) Alpert, A. J.; Hudecz, O.; Mechtler, K. Anal. Chem. 2015, 87, 4704–4711.
(10) Leitner, A.; Reischl, R.; Walzthoeni, T.; Herzog, F.; Bohn, S.; Förster, F.; Aebersold, R. Mol. Cell. Proteomics 2012, 11, M111.014126.
(11) Moruz, L.; Tomazela, D.; Käll, L. J. Proteome Res. 2010, 9, 5209–5216.
(12) Käll, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Nat. Methods 2007, 4, 923–925.
(13) Klammer, A. A.; Yi, X.; MacCoss, M. J.; Noble, W. S. Anal. Chem. 2007, 79, 6111–6118.
(14) Tarasova, I. A.; Masselon, C. D.; Gorshkov, A. V.; Gorshkov, M. V. Analyst 2016, 141, 4816–4832.
(15) Moruz, L.; Käll, L. Mass Spectrom. Rev. 2017, 36, 615–623.
(16) Pfeifer, N.; Leinenbach, A.; Huber, C. G.; Kohlbacher, O. J. Proteome Res. 2009, 8, 4109–4115.
(17) Dwivedi, R. C.; Spicer, V.; Harder, M.; Antonovici, M.; Ens, W.; Standing, K. G.; Wilkins, J. A.; Krokhin, O. V. Anal. Chem. 2008, 80, 7036–7042.
(18) Krokhin, O. V.; Ezzati, P.; Spicer, V. Anal. Chem. 2017, 89, 5526–5533.
(19) Gussakovsky, D.; Neustaeter, H.; Spicer, V.; Krokhin, O. V. Anal. Chem. 2017, 89, 11795.
(20) Goloborodko, A. A.; Levitsky, L. I.; Ivanov, M. V.; Gorshkov, M. V. J. Am. Soc. Mass Spectrom. 2013, 24, 301–304.
(21) Krokhin, O. V. Anal. Chem. 2006, 78, 7785–7795.
(22) Petritis, K.; Kangas, L. J.; Yan, B.; Monroe, M. E.; Strittmatter, E. F.; Qian, W.-J.; Adkins, J. N.; Moore, R. J.; Xu, Y.; Lipton, M. S.; et al. Anal. Chem. 2006, 78, 5026–5039.
(23) Gorshkov, A. V.; Tarasova, I. A.; Evreinov, V. V.; Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A.; Gorshkov, M. V. Anal. Chem. 2006, 78, 7770–7777.
(24) Moruz, L.; Staes, A.; Foster, J. M.; Hatzou, M.; Timmerman, E.; Martens, L.; Käll, L. Proteomics 2012, 12, 1151–1159.
(25) Pohl, C.; Saini, C. J. Chromatogr. A 2008, 1213, 37–44.
(26) Yu, P.; Petzoldt, S.; Wilhelm, M.; Zolg, D. P.; Zheng, R.; Sun, X.; Liu, X.; Schneider, G.; Huhmer, A.; Kuster, B. Anal. Chem. 2017, 89, 8884–8891.
(27) Cox, J.; Mann, M. Nat. Biotechnol. 2008, 26, 1367–1372.
(28) Chollet, F.; et al. Keras, 2015.
(29) Al-Rfou, R.; Alain, G.; Almahairi, A.; Angermueller, C.; Bahdanau, D.; Ballas, N.; Bastien, F.; Bayer, J.; Belikov, A.; Belopolsky, A.; et al. arXiv e-prints, 2016, abs/1605.0.
(30) Krokhin, O. V. Anal. Chem. 2006, 78, 7785–7795.
(31) Dougherty, D. A. Science 1996, 271, 163–168.
(32) Dougherty, D. A. J. Nutr. 2007, 137, 1504S–1508S; discussion 1516S–1517S.
(33) Alpert, A. J.; Petritis, K.; Kangas, L.; Smith, R. D.; Mechtler, K.; Mitulović, G.; Mohammed, S.; Heck, A. J. R. Anal. Chem. 2010, 82, 5253–5259.