Web application

This section will describe a typical workflow of using the main web interface application. Detailed information on each parameter can be found in the web application reference docs. The web application consists of a sidebar with controls and input, divided into sections, and a main view area with graphs and visualization. We will go through the functionality of the web interface per section.

Settings

In this section general settings for handling HDX data in the web application can be changed.

In the Drop first field, choose the number of N-terminal residues of each peptide to ignore when calculating the maximum uptake of that peptide; these residues are considered to fully exchange back. Prolines are ignored by default as they do not have exchangeable amide hydrogens.
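
The effect of Drop first and of prolines on the maximum uptake can be illustrated with a minimal sketch. The helper name `count_exchanging_residues` is hypothetical and is not part of PyHDX; it only mirrors the counting rule described above.

```python
def count_exchanging_residues(sequence: str, drop_first: int = 1) -> int:
    """Count residues that can contribute to deuterium uptake.

    Hypothetical helper for illustration; not the PyHDX implementation.
    """
    # The first `drop_first` N-terminal residues are assumed to back-exchange fully.
    remaining = sequence[drop_first:]
    # Prolines have no exchangeable amide hydrogen, so they are skipped.
    return sum(1 for residue in remaining if residue.upper() != "P")


# Made-up peptide: 9 residues, 2 dropped, 1 proline among the remaining 7.
print(count_exchanging_residues("MPKSAPLTE", drop_first=2))
```

Raising Drop first lowers the maximum uptake of every peptide, which in turn changes any quantity normalized by that maximum.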

The field Weight Exponent controls how individual peptides are weighted when calculating residue-level weighted-average RFU values. The weight for each peptide is equal to \(\frac{1}{n^k}\), where \(n\) is the number of exchanging residues and \(k\) is the user-configurable weight exponent. Shorter peptides therefore contribute more to the averaging result, as they contain higher-resolution information. The default value for the exponent is 1, and increasing the value increases the relative weight of shorter peptides.
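
The weighting rule can be sketched as follows. The function names and the peptide tuple layout are assumptions made for this example and do not reflect the PyHDX API.

```python
def peptide_weight(n_exchanging: int, exponent: float = 1.0) -> float:
    """Weight 1 / n**k for a peptide with n exchanging residues."""
    return 1.0 / n_exchanging ** exponent


def weighted_residue_rfu(peptides, residue: int, exponent: float = 1.0) -> float:
    """Weighted-average RFU at one residue.

    `peptides` is a list of (start, end_exclusive, n_exchanging, rfu) tuples;
    this layout is hypothetical and chosen only for the illustration.
    """
    numerator = 0.0
    denominator = 0.0
    for start, end, n_exchanging, rfu in peptides:
        if start <= residue < end:  # peptide covers this residue
            weight = peptide_weight(n_exchanging, exponent)
            numerator += weight * rfu
            denominator += weight
    if denominator == 0.0:
        raise ValueError(f"residue {residue} is not covered by any peptide")
    return numerator / denominator


# A long peptide (10 exchanging residues) and a short one (5) both cover residue 7.
peptides = [(1, 11, 10, 0.40), (6, 11, 5, 0.80)]
for k in (1.0, 2.0):
    print(k, weighted_residue_rfu(peptides, residue=7, exponent=k))
```

With these made-up numbers, moving the exponent from 1 to 2 pulls the averaged value closer to the RFU of the shorter peptide.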

Peptide Input

Use Input Mode to switch between input modes. When 'Database' is selected, HDX-MS datasets can be downloaded from the HDX-MS database (currently hosted on GitHub here) and loaded directly into PyHDX.

You can load your own peptide D-uptake data into the web application in either 'Batch' or 'Manual' mode.

When using 'Batch', multiple measurements can be added quickly through a .yaml file specification. An example of a .yaml file can be found in tests/test-data on GitHub (here).

When using 'Manual', each measurement and its experimental metadata must be entered manually. Use the Browse button to select peptide data files to upload. These should be 'peptide master tables': long-format data in which each entry has at least the following fields:

start (inclusive residue number at which the detected peptide starts, first residue = 1)
stop (inclusive residue number at which the detected peptide stops)
sequence (sequence of the peptide in one letter amino acid codes)
exposure (time of exposure to deuterated solution)
uptake (amount/mass (g/mol) of deuterium taken up by the peptide)
state (identifier of the 'state' the peptide/protein is in, i.e. ligands, experimental conditions)
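
Before uploading, a file can be checked for the fields listed above with a short script. This is a sketch only: the column names follow the list above, whereas an actual DynamX export may use different header names, and `missing_columns` is a hypothetical helper.

```python
import csv
import io

# Field names taken from the list above; real export headers may differ.
REQUIRED_COLUMNS = {"start", "stop", "sequence", "exposure", "uptake", "state"}


def missing_columns(csv_text: str) -> set:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = {name.strip().lower() for name in (reader.fieldnames or [])}
    return REQUIRED_COLUMNS - header


# Made-up long-format table: one row per peptide, exposure and state.
example = """start,stop,sequence,exposure,uptake,state
1,9,MPKSAPLTE,0.5,2.31,apo
1,9,MPKSAPLTE,5.0,3.87,apo
"""
print(missing_columns(example))
```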

Currently, the only data format accepted is exported 'state data' from Waters DynamX, which is in .csv format. Exposure time units are assumed to be minutes. Support for other data formats (e.g. HDExaminer) can be added on request.

Multiple files can be selected; one file can contain the experimental peptides while another file has the fully deuterated control sample peptides. When using a single file, the peptides should be marked using the 'state' field.

Choose which method of back-exchange correction to use. Options are either to use a fully deuterated (FD) sample or to set a fixed back-exchange percentage for all peptides. The latter method should only be used if no FD sample is available. A percentage to set here can be obtained by running a back-exchange control once on your setup.
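
The two correction options can be compared with a simplified sketch. Both formulas are assumptions made for illustration (they ignore, for example, a non-deuterated control and the deuterium percentage of the labelling solution) and are not taken from the PyHDX source.

```python
def rfu_from_fd_control(uptake: float, fd_uptake: float) -> float:
    """Relative fractional uptake using a fully deuterated (FD) control peptide."""
    if fd_uptake <= 0:
        raise ValueError("FD control uptake must be positive")
    return uptake / fd_uptake


def rfu_from_fixed_back_exchange(
    uptake: float, n_exchanging: int, back_exchange_percent: float
) -> float:
    """Relative fractional uptake assuming one back-exchange value for all peptides."""
    # Fraction of incorporated deuterium that survives until detection.
    retained = 1.0 - back_exchange_percent / 100.0
    return uptake / (n_exchanging * retained)


# Made-up peptide: 3.0 Da measured uptake, 6 exchanging residues.
print(rfu_from_fd_control(uptake=3.0, fd_uptake=5.0))
print(rfu_from_fixed_back_exchange(uptake=3.0, n_exchanging=6, back_exchange_percent=20.0))
```

The FD control gives a per-peptide correction, whereas the fixed percentage applies the same factor to every peptide, which is why it is only the fallback option.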

When selecting FD Sample, use the fields FD File, FD State and FD Exposure to choose which peptides should be used as FD control. Note that these peptides will be matched to the ones in the experiment and peptides without control will not be included.

Choose the experimental peptides using the Exp File, Exp State, Exp Exposure and use Experiment Exposures to select which exposure times to include.

Next, specify the percentage of deuterium in the labelling solution in the field Deuterium percentage. This percentage should be as high as possible; typically >=90%.

Use the fields Temperature (K) and pH read to specify the temperature and pH at which the D-labelling was done. The pH is the value as read from the pH meter without any correction. Temperature units are Kelvin.
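
For preparing these two inputs, the conversions below may be useful. The 0.4 offset is the commonly used glass-electrode correction for D2O solutions; it is shown only for orientation and should not be applied to the value entered in pH read, which is the uncorrected meter reading. The function names are hypothetical.

```python
def celsius_to_kelvin(temperature_celsius: float) -> float:
    """Convert a labelling temperature from degrees Celsius to Kelvin."""
    return temperature_celsius + 273.15


def pd_from_ph_read(ph_read: float) -> float:
    """Approximate pD from an uncorrected pH meter reading in D2O.

    Uses the common +0.4 correction; shown for orientation only.
    """
    return ph_read + 0.4


print(celsius_to_kelvin(30.0))  # value to enter in Temperature (K)
print(pd_from_ph_read(7.6))     # for reference; enter 7.6 in pH read
```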

The next fields, N term and C term, specify the residue number indices of the N-terminal and C-terminal residues, respectively. For the N-terminus this value is typically equal to 1, but if N-terminal affinity tags are used for purification it might be a negative number. The value specified should match the residue indices used in the input .csv file. The C term value tells the software at which index the C-terminus of the protein is. This is needed because the protein may extend beyond the last residue included in any peptide, and the C-terminal residue exhibits a different intrinsic rate of exchange. A sequence for the full protein (in the N-term to C-term range as specified) can be added to provide additional sequence information, but this is optional.
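
A quick consistency check between N term, C term, an optional sequence and the peptide indices might look like the sketch below. It assumes contiguous integer numbering from N term to C term (so a negative N term implies that residue number 0 exists); that numbering convention and all function names are assumptions for this example.

```python
def residue_numbers(n_term: int, c_term: int) -> list:
    """All residue numbers from N term to C term, both inclusive."""
    if c_term < n_term:
        raise ValueError("C term must not be smaller than N term")
    return list(range(n_term, c_term + 1))


def sequence_matches_range(sequence: str, n_term: int, c_term: int) -> bool:
    """True if the supplied sequence has one letter per residue number."""
    return len(sequence) == len(residue_numbers(n_term, c_term))


def peptide_in_range(start: int, stop: int, n_term: int, c_term: int) -> bool:
    """True if an inclusive peptide interval lies within the protein range."""
    return n_term <= start <= stop <= c_term


# Made-up construct: a 3-residue tag numbered -2..0 followed by residues 1..3.
print(residue_numbers(-2, 3))
print(sequence_matches_range("HHMKTA", -2, 3))
print(peptide_in_range(2, 5, -2, 3))
```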

Then specify a name for the measurement (by default equal to the Experiment State value) and press Add Measurement to add the measurement with the current specifications. Repeat the process to add additional measurements, either by adding a new file or by changing the selection on the current file. The Download HDX spec button can be used to download a .yaml file with the full state specification. This file can be reused in future sessions by setting Input mode to 'Batch'.

Finally, press the button Load dataset to parse and load the full dataset.

Datasets currently cannot be removed. To remove datasets, press the browser 'refresh' button to start over.

Coverage (figure)

In the 'Coverage' figure in the main application area, rectangles are shown that correspond to the peptides of a single timepoint. Peptides are only included if they are present in all timepoints as well as in the fully deuterated control sample.
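
The inclusion rule amounts to a set intersection, sketched below. Identifying a peptide by its (start, stop) pair and the name `common_peptides` are assumptions made for this example.

```python
def common_peptides(timepoints: dict, fd_control: set) -> list:
    """Peptides present in every timepoint and in the FD control.

    `timepoints` maps an exposure time to a set of (start, stop) tuples.
    """
    shared = set(fd_control)
    for peptides in timepoints.values():
        shared &= set(peptides)  # keep only peptides seen at this timepoint too
    return sorted(shared)


# Made-up data: peptide (10, 20) is missing at the 5.0 min timepoint.
timepoints = {
    0.5: {(1, 9), (5, 14), (10, 20)},
    5.0: {(1, 9), (5, 14)},
}
fd_control = {(1, 9), (5, 14), (10, 20)}
print(common_peptides(timepoints, fd_control))
```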

By hovering the mouse over the peptides in the graph, more information is shown about each peptide:

peptide_id: Index of the peptide per timepoint starting at the first peptide at 0
start, end: Inclusive, exclusive interval of residue numbers in this peptide (Taking N-terminal residues into account)
RFU: Relative fraction uptake of the peptide
D(corrected): Absolute D-uptake, corrected by FD control
sequence: FASTA sequence of the peptide. Non-exchanging N-terminal residues marked as 'x' and prolines in lower case.
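
The hover fields for sequence and interval can be mimicked with the sketch below. Shifting the start by the number of dropped residues is one reading of "taking N-terminal residues into account" and is an assumption, as are the function names.

```python
def format_peptide_sequence(sequence: str, drop_first: int = 1) -> str:
    """Mark dropped N-terminal residues as 'x' and remaining prolines in lower case."""
    n_dropped = min(drop_first, len(sequence))
    rest = sequence[n_dropped:].replace("P", "p")
    return "x" * n_dropped + rest


def hover_interval(start: int, stop: int, drop_first: int = 1) -> tuple:
    """Convert an inclusive (start, stop) pair to an inclusive, exclusive interval.

    The start is shifted past the dropped N-terminal residues (assumed convention).
    """
    return start + drop_first, stop + 1


print(format_peptide_sequence("MPKSAPLTE", drop_first=2))
print(hover_interval(1, 9, drop_first=2))
```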

D-Uptake fit

This controller can be used to perform a fit of residue-level D-uptake, independently for each timepoint in each measurement. The advantage of these residue-level D-uptake values over residue-level RFU values is that the latter are calculated by weighted averaging, which results in loss of residue resolution, while the former are calculated by a least-squares fitting procedure, so higher-resolution results can be obtained depending on the peptide overlap. D-uptake fits can be advantageous over ΔG fits when HDX kinetics are not in the EX2 regime and the approximations made in the ΔG fit procedure with respect to HDX kinetics are thus not applicable.

D-uptake fits take back-exchange into account (by using the fully deuterated control sample) as well as the D-percentage of the labelling solution. Fully exchanged residues will therefore have a D-uptake value equal to the D fraction in solution (i.e. 0.9 for 90% deuterium).

As fitting of residue-level D-uptake also suffers from non-identifiability in underdetermined systems (more parameters than datapoints), fits are typically repeated N times with random initial guesses, and a smoothing regularization term is applied along the primary structure. The number of fitting repeats can be set with Repeats. The checkbox Bounds controls whether or not bounds (between 0 and 1) are applied to the fit, and should typically be checked. The value at R1 controls the degree of smoothing along the primary structure. A good starting value is 1 (for the SecB test data); different values should be tried, as the optimal value depends on the size of the protein and the number and sizes of available peptides. Finally, before starting a fit, choose a name for the fit with the current settings at Fit name and click Do Fitting to start the fitting process.
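
The roles of Repeats, Bounds and R1 can be illustrated with a toy fit. This is a sketch under stated assumptions, not the PyHDX algorithm: it uses projected gradient descent, a squared-difference smoothing penalty, 0-based residue indices, and hypothetical parameter names (`steps`, `lr`, `seed`).

```python
import random


def fit_d_uptake(peptides, n_residues, r1=1.0, repeats=5, bounds=True,
                 steps=3000, lr=0.01, seed=0):
    """Toy regularized least-squares fit of residue-level D-uptake.

    `peptides` is a list of (start, end_exclusive, observed_uptake) tuples.
    Returns one fitted list of residue values per repeat.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        # Each repeat starts from a different random initial guess.
        x = [rng.random() for _ in range(n_residues)]
        for _ in range(steps):
            grad = [0.0] * n_residues
            # Data term: (sum of residue uptake in the peptide - observed)^2.
            for start, end, observed in peptides:
                residual = sum(x[start:end]) - observed
                for i in range(start, end):
                    grad[i] += 2.0 * residual
            # Smoothing term: r1 * (x[i+1] - x[i])^2 along the primary structure.
            for i in range(n_residues - 1):
                diff = x[i + 1] - x[i]
                grad[i] -= 2.0 * r1 * diff
                grad[i + 1] += 2.0 * r1 * diff
            x = [xi - lr * gi for xi, gi in zip(x, grad)]
            if bounds:
                # Keep every residue value between 0 and 1.
                x = [min(1.0, max(0.0, xi)) for xi in x]
        results.append(x)
    return results


# Made-up data: three overlapping peptides covering ten residues.
peptides = [(0, 5, 1.0), (3, 8, 2.8), (5, 10, 4.0)]
fits = fit_d_uptake(peptides, n_residues=10, r1=0.1, repeats=3)
mean_fit = [sum(values) / len(values) for values in zip(*fits)]
print([round(value, 2) for value in mean_fit])
```

Residues covered by the same set of peptides cannot be distinguished by the data term alone; in this sketch the smoothing term and the spread between repeats are what indicate how well such regions are determined.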

When the fit finishes, the found mean of the D-uptake values of all repeats will be shown as a scatterplot under the tab 'D-uptake'. The scatterplot has two sets of errorbars, which show the 5, 25, 75 and 95 percentiles of the repeats; gray errorbars are 5-95 percentile and black errorbars are 25-75 percentiles.
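
The statistics shown in the scatterplot can be reproduced for a single residue with the standard library, as sketched below. The interpolation method used by the application is not stated here, so the `inclusive` method is an assumption, and `summarize_repeats` is a hypothetical name.

```python
import statistics


def summarize_repeats(values):
    """Mean and 5/25/75/95 percentiles of one residue's D-uptake across repeats."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "p5": cuts[0],    # lower end of the gray error bar
        "p25": cuts[4],   # lower end of the black error bar
        "p75": cuts[14],  # upper end of the black error bar
        "p95": cuts[18],  # upper end of the gray error bar
    }


# Made-up D-uptake values of one residue from 21 fit repeats.
repeat_values = [i / 20 for i in range(21)]
print(summarize_repeats(repeat_values))
```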

As with RFUs, differential D-uptake values between protein states can be calculated under Differential HDX. Both D-uptake and ΔD-uptake values can be directly visualized on the tertiary structure using Protein Control and the 'Protein' viewer. When calculating and interpreting ΔD-uptake values, users should be aware that artefactual differences can arise as the fitting can converge to different solutions if in specific regions not enough peptide overlap is available or if peptide coverage dramatically differs between protein states. Therefore, sanity checking of differential HDX results with input data (peptide D-uptake / RFU values) is recommended.

Initial Guesses

As a first step in the fitting procedure, initial guesses for the exchange kinetics need to be derived. This can be done through two options (Fitting model): 'Half-life' (fast but less accurate), or 'Association' (slower but more accurate).

Using the 'Association' procedure is recommended. This model fits two rate constants to the weighted-averaged uptake kinetics of each residue. At Lower bound and Upper bound the bounds of these rate constants can be specified, but in most cases the autosuggested bounds are sufficient. The bounds can be changed per dataset by using the Dataset field, or for all datasets at the same time by ticking the Global bounds checkbox. Rarely, issues might arise when the initial guess rates are close to the specified bounds, at which point the bounds should be widened. This can be checked by comparing the fitted rates k1 and k2 (File Export, Target dataset: rates). Both rates and their associated amplitudes are converted to a single rate value used for initial guesses. To calculate guesses, select the model in the drop-down menu, assign a name to these initial guesses and then press 'Calculate Guesses'. The fitting is done in the background. When the fitting is done, the obtained rate is shown in the main area in the tab 'Rates'. Note that these rates are merely a rough estimate of HDX rates; they should not be used for any interpretation and only serve to provide the global fit with initial guesses.
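
The two guess models can be sketched as follows. The bi-exponential parametrization with amplitude `r` and the rule for combining k1 and k2 into one rate (inverse of the amplitude-weighted mean lifetime) are assumptions made for this illustration and may differ from what PyHDX does internally.

```python
import math


def rate_from_half_life(t_half: float) -> float:
    """'Half-life' style guess: rate of a single exponential with this half-life."""
    return math.log(2) / t_half


def two_component_uptake(t: float, k1: float, k2: float, r: float) -> float:
    """Assumed 'Association' model: two exponentials with amplitudes r and 1 - r."""
    return 1.0 - (r * math.exp(-k1 * t) + (1.0 - r) * math.exp(-k2 * t))


def combined_rate(k1: float, k2: float, r: float) -> float:
    """Collapse both components into one rate via the amplitude-weighted lifetime."""
    tau = r / k1 + (1.0 - r) / k2
    return 1.0 / tau


# Made-up parameters: a fast and a slow component with equal amplitudes.
print(two_component_uptake(t=1.0, k1=1.0, k2=0.1, r=0.5))
print(combined_rate(k1=1.0, k2=0.1, r=0.5))
print(rate_from_half_life(t_half=5.0))
```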

ΔG Fit

After the initial guesses are calculated we can move on to the global fit of the data. Details of the fitting procedure can be found in the PyHDX publication in Analytical Chemistry.

At Initial guess, select which dataset to use for initial guesses (typically 'Guess_1'). Both previous fits (ΔG values) and estimated HX rates can be used as initial guesses. The initial guesses can be applied as 'One-to-one', where each protein state gets initial guesses derived from that state, or 'One-to-many', where one protein state is used as initial guesses for all states. Users can switch between both modes using Guess mode.

At Fit mode, users can choose either 'Batch' or 'Single' fitting. If only one dataset is loaded, only 'Single' is available. If 'Single' is selected, PyHDX will fit ΔG values for each dataset individually using the specified settings. In 'Batch' mode all data enters the fitting process at the same time. This allows for the use of a second regularizer between datasets. Note that when using 'Batch' mode, the relative magnitudes of the Mean Squared error losses and regularizer might be different, such that 'Batch' fitting with r2 at zero is not identical to 'Single' fits.

The fields Stop loss and Stop patience control the fitting termination. If the loss improvement is less than Stop loss for Stop patience epochs (fit iterations), the fitting will terminate. Learning rate controls the step size per epoch. For a typical dataset with 62 peptides over 6 timepoints, the learning rate should be 50-100. Smaller datasets require larger learning rates and vice versa.
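
The termination rule can be sketched in a few lines. Counting consecutive epochs and comparing each loss to the previous epoch are assumptions about the exact criterion; `run_with_early_stopping` is a hypothetical name, and the losses are supplied as a plain sequence rather than produced by an optimizer.

```python
def run_with_early_stopping(losses, stop_loss=0.05, stop_patience=3, max_epochs=100):
    """Return the number of epochs run before termination.

    `losses` is an iterable with one loss value per epoch.
    """
    stalled = 0      # consecutive epochs with too little improvement
    previous = None
    epochs_run = 0
    for epoch, loss in enumerate(losses, start=1):
        if epoch > max_epochs:
            break  # the Epochs limit was reached
        epochs_run = epoch
        if previous is not None:
            improvement = previous - loss
            stalled = stalled + 1 if improvement < stop_loss else 0
            if stalled >= stop_patience:
                break  # no meaningful improvement for `stop_patience` epochs
        previous = loss
    return epochs_run


# Made-up loss curve that flattens out after the third epoch.
losses = [10.0, 5.0, 2.0, 1.99, 1.98, 1.97, 1.96]
print(run_with_early_stopping(losses, stop_loss=0.05, stop_patience=3))
```

A larger Stop patience tolerates longer plateaus before giving up, at the cost of more epochs.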

Momentum and Nesterov are advanced settings passed to the PyTorch SGD optimizer.

The maximum number of epochs or fit iterations is set in the field Epochs.

Finally, the fields Regularizer 1 and Regularizer 2 control the magnitude of the regularizers. Please refer to our publication for more details. In short, r1 acts along consecutive residues and has a 'smoothing' effect along the primary structure. Higher values give a smoother result. This prevents overfitting and helps avoid problems related to the 'non-identifiability' issue, where in unresolved regions (no residue-level overlap) the correct kinetic components can be found (ΔGs of residues, given a correct choice of timepoints) but cannot be confidently assigned to residues because resolution is lacking. The regularizer r1 biases the fit result towards the residue assignment with the lowest variation along the primary structure. Typical values range from 0.01 to 0.5, depending on the size of the input data.
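
The idea behind r1 can be illustrated with a simple penalty function. Using the mean absolute difference between neighbouring residues is an assumption for this sketch; the exact form of the penalty is described in the publication.

```python
def r1_penalty(dg_values, r1: float) -> float:
    """Smoothing penalty: r1 times the mean absolute ΔG difference between neighbours."""
    if len(dg_values) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(dg_values, dg_values[1:])]
    return r1 * sum(diffs) / len(diffs)


# Made-up ΔG profiles (arbitrary units): a flat one and a jagged one.
print(r1_penalty([10.0, 10.0, 10.0], r1=0.5))
print(r1_penalty([10.0, 12.0, 9.0], r1=0.5))
```

Because the flat profile incurs no penalty, two residue assignments that explain the data equally well are ranked by how smooth they are.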

r2 acts between samples, minimizing variability between them. This is used in differential HDX, where users are interested in ΔG differences (ΔΔG). When measuring HD exchange under differing experimental conditions, such as differences in peptides detected, timepoints used, or D-labelling temperature and pH, the datasets obtained will have different resolution, both 'spatially' (degree of resolved residues) and 'temporally' (range/accuracy of ΔGs). This can lead to artefactual differences in the final ΔΔG result, as features might be resolved in one dataset and not in the other, which will show up as ΔΔG. The penalty from r2 can be calculated either with respect to a selected reference state (using R2 reference) or with respect to the average value between all states (set to None).
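
The two reference options for r2 can be sketched as below. The mean-absolute-deviation form of the penalty and the function signature are assumptions for illustration only.

```python
def r2_penalty(dg_by_state: dict, r2: float, reference=None) -> float:
    """Penalty on ΔG differences between states.

    `dg_by_state` maps a state name to a list of per-residue ΔG values of equal
    length. With `reference=None` deviations are taken from the per-residue mean
    over all states; otherwise from the named reference state.
    """
    states = list(dg_by_state)
    n_residues = len(dg_by_state[states[0]])
    if reference is None:
        target = [
            sum(dg_by_state[state][i] for state in states) / len(states)
            for i in range(n_residues)
        ]
    else:
        target = dg_by_state[reference]
    total = 0.0
    count = 0
    for state in states:
        for value, ref_value in zip(dg_by_state[state], target):
            total += abs(value - ref_value)
            count += 1
    return r2 * total / count


# Made-up single-residue ΔG values (arbitrary units) for three states.
dg = {"a": [10.0], "b": [12.0], "c": [17.0]}
print(r2_penalty(dg, r2=1.0))
print(r2_penalty(dg, r2=1.0, reference="a"))
```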

Specify a unique name at Fit name and press Do Fitting to start the fit. The 'Info log' window in the bottom right corner displays information on when the fit started and finished. The fitting runs in the background, and multiple jobs can be executed at the same time when processing multiple protein states with Fit mode set to 'Single'. However, please take into account that these fits are computationally intensive; if multiple users submit too many jobs, this might currently overwhelm our/your server.

The output ΔG values are shown in the 'ΔG' graph.

See also the fitting example notebook for more details on fitting and the effect of regularizers.