# Fitting of ΔGs#

[10]:

import proplot as pplt
from pyhdx import read_dynamx, HDXMeasurement
from pyhdx.fitting import fit_rates_half_time_interpolate, fit_rates_weighted_average, fit_gibbs_global
from pyhdx.process import filter_peptides, apply_control, correct_d_uptake
from pathlib import Path
from dask.distributed import Client


We load the sample SecB dataset, apply the control, and create an HDXMeasurement object.

[5]:

fpath = Path() / '..' / '..' / 'tests' / 'test_data' / 'input' / 'ecSecB_apo.csv'

fd = {'state': 'Full deuteration control',
'exposure': {'value': 0.167, 'unit': 'min'}}

fd_df = filter_peptides(data, **fd)
peptides = filter_peptides(data, state="SecB WT apo")  # , query=["exposure != 0."])
peptides_control = apply_control(peptides, fd_df)
peptides_corrected = correct_d_uptake(peptides_control)

hdxm = HDXMeasurement(peptides_corrected, temperature=303.15, pH=8., sequence=sequence)


The first step is to obtain initial guesses for the observed rate of D-exchange. By determining the timepoint at which 0.5 RFU (relative fractional uptake) is reached, and subsequently converting to rate, a rough estimate of exchange rate per residue can be obtained. Here, RFU values are mapped from peptides to individual residues by weighted averaging (where the weight is the inverse of peptide length).

[6]:

fr_half_time = fit_rates_half_time_interpolate(hdxm)
fr_half_time.output

[6]:

rate
r_number
10 0.054455
11 0.054455
12 0.035301
13 0.035301
14 0.035301
... ...
151 0.136605
152 0.136460
153 0.136460
154 0.136460
155 0.135486

146 rows × 1 columns

A more accurate result can be obtained by fitting the per-residue/timepoint RFU values to a biexponential association curve. This process is more time-consuming and can optionally be processed in parallel by Dask.

[7]:

client = Client()
fr_wt_avg = fit_rates_weighted_average(hdxm, client=client)


The return value is a KineticsFitResult object. This object has a list of models and corresponding intervals to where the protein sequence to which these models apply, and their corresponding symfit fit result with parameter values. The effective exchange rate (units $$s^{-1}$$) can be extracted, as well as other fit parameters, from this object:

[8]:

fr_wt_avg.output

C:\Users\jhsmi\pp\PyHDX\pyhdx\fit_models.py:229: RuntimeWarning: overflow encountered in exp
t = np.exp(t_log)

[8]:

rate k1 k2 r
r_number
10 0.064856 0.189182 0.001485 0.568473
11 0.064856 0.189182 0.001485 0.568473
12 0.055144 0.190785 0.001574 0.540079
13 0.055144 0.190785 0.001574 0.540079
14 0.055144 0.190785 0.001574 0.540079
... ... ... ... ...
151 0.828595 0.843758 0.000235 0.987547
152 1.948120 1.988972 0.000223 0.985773
153 1.948120 1.988972 0.000223 0.985773
154 1.948120 1.988972 0.000223 0.985773
155 0.570613 0.585851 0.000296 0.981978

146 rows × 4 columns

[13]:

fig, ax = pplt.subplots(aspect=3, axwidth='150mm')
ax.set_yscale('log')
ax.scatter(fr_half_time.output.index, fr_half_time.output['rate'], label='half-time')
ax.scatter(fr_wt_avg.output.index, fr_wt_avg.output['rate'], label='fit')

ax.set_xlabel('Residue number')
ax.set_ylabel('Rate (1/s)')
ax.legend()

C:\Users\jhsmi\pp\PyHDX\pyhdx\fit_models.py:229: RuntimeWarning: overflow encountered in exp
t = np.exp(t_log)

[13]:

<matplotlib.legend.Legend at 0x1f990e64460>


We can now use the guessed rates to obtain guesses for the Gibbs free energy. Units of Gibbs free energy are J/mol.

[14]:

gibbs_guess = hdxm.guess_deltaG(fr_wt_avg.output['rate'])
gibbs_guess

C:\Users\jhsmi\pp\PyHDX\pyhdx\fit_models.py:229: RuntimeWarning: overflow encountered in exp
t = np.exp(t_log)

[14]:

r_number
1               NaN
2               NaN
3               NaN
4               NaN
5               NaN
...
151    11573.755738
152     9971.367773
153    12396.595500
154    12676.112582
155    12676.112582
Length: 155, dtype: float64


To perform the global fit (all peptides and timepoints) use fit_gibbs_global. The number of epochs is set to 1000 here for demonstration but for actually fitting the values should be ~100000.

[15]:

gibbs_result = fit_gibbs_global(hdxm, gibbs_guess, epochs=1000)
gibbs_output = gibbs_result.output
gibbs_output

100%|██████████| 1000/1000 [00:01<00:00, 858.45it/s]

[15]:

state SecB WT apo
quantity sequence _dG dG pfact k_obs covariance
r_number
10 T 19766.954080 19766.954080 2546.262948 0.064621 1.408825e+03
11 F 19344.174152 19344.174152 2153.064540 0.063561 1.403107e+03
12 Q 20662.217289 20662.217289 3632.115795 0.054472 7.839711e+02
13 I 16943.348023 16943.348023 830.592061 0.053277 7.793776e+02
14 Q 19006.477914 19006.477914 1883.089717 0.053870 7.814748e+02
... ... ... ... ... ... ...
151 E 11573.542263 11573.542263 98.663097 0.828664 8.599279e+03
152 H 9998.940179 9998.940179 52.825821 1.927319 8.492284e+05
153 Q 12396.595493 12396.595493 136.763180 1.948120 9.218343e+05
154 D 12648.553955 12648.553955 151.141024 1.969396 1.013470e+06
155 A 12647.407114 12647.407114 151.072270 0.010342 5.208880e+03

146 rows × 6 columns

Along with ΔG the fit result returns covariances of ΔG and protection factors. The column k_obs is the observed rate of exchange without taking into account the intrinsic exchange rate of each residue. If users want to obtain a result truly independent of the intrinsic rate of exchange, the regularization value r1 should be set to zero (as this works on ΔG, which is obtained by taking intrinsic rate of exchange into account) and users should provide their own initial guesses for ΔG (as determination of initial guesses also uses intrinsic rates of exchange).

[23]:

gibbs_output['SecB WT apo']['dG']

[23]:

r_number
10     19766.954080
11     19344.174152
12     20662.217289
13     16943.348023
14     19006.477914
...
151    11573.542263
152     9998.940179
153    12396.595493
154    12648.553955
155    12647.407114
Name: dG, Length: 146, dtype: float64

[34]:

fig, ax = pplt.subplots(aspect=3, axwidth='150mm')
ax.scatter(gibbs_output.index, gibbs_output['SecB WT apo']['dG']*1e-3)
ax.set_xlabel('Residue number')
ax.set_ylabel('ΔG (kJ/mol)')
None


## Number of epochs and overfitting#

The returned fit result object also has information on the losses of each epoch of the fittng process. These are stored as a pd.DataFrame in the losses attribute. During a successful fitting run, the losses should decrease sharply and then flatten out.

[35]:

fig, ax = pplt.subplots(aspect=3, axwidth='150mm')
gibbs_result.losses.plot(ax=ax)

[35]:

CartesianAxesSubplot(index=(0, 0), number=1)


In the figure above, mse_loss is the loss resulting from differences in calculated D-uptake and measured D-uptake (mean squared error). The reg_1 is the loss resulting from the regualizer.

If the losses do not decrease, this is likely due to a too low number of epochs or a too low learning rate. Lets tune the fit parameters such that we obtain the desired result.

[36]:

gibbs_result_updated = fit_gibbs_global(hdxm, gibbs_guess, epochs=40000)

100%|██████████| 40000/40000 [00:41<00:00, 959.75it/s]

[45]:

fig, axes = pplt.subplots(ncols=2, refaspect=1.6, axwidth='70mm', sharey=False)
axes[0].scatter(gibbs_result_updated.output.index, gibbs_result_updated.output['SecB WT apo']['dG']*1e-3, s=10)
axes[0].set_xlabel('Residue number')
axes[0].set_ylabel('ΔG (kJ/mol)')
gibbs_result_updated.losses.plot(ax=axes[1])
axes[1].set_xlabel('Epochs')
axes[1].set_ylabel('Loss')

[45]:

Text(0, 0.5, 'Loss')


By increasing the learning rate and the number of epochs, our result improves as the final MSE loss is lower.

With stop_loss at 1E-6 and patience at 50 (=default), the fitting will not terminate until for 50 epochs the progress (loss decrease) has been less than 1E-6.

The default value of stop_loss is 1E-5. This will ensure optimization will stop when not enough progress is being made and thereby prevent overfitting.

Users can keep track of the ΔG values per epoch by using Checkpoint callbacks (See templates/04; advanced/experimental usage).

[40]:

gibbs_result_updated = fit_gibbs_global(hdxm, gibbs_guess, epochs=60000, lr=1e5, stop_loss=1e-6, patience=50)

100%|██████████| 60000/60000 [01:05<00:00, 916.75it/s]

[14]:

gibbs_result_updated.regularization_percentage

[14]:

57.09043167835134

[44]:

fig, axes = pplt.subplots(ncols=2, refaspect=1.6, axwidth='70mm', sharey=False)
axes[0].scatter(gibbs_result_updated.output.index, gibbs_result_updated.output['SecB WT apo']['dG']*1e-3, s=10)
axes[0].set_xlabel('Residue number')
axes[0].set_ylabel('ΔG (kJ/mol)')
gibbs_result_updated.losses.plot(ax=axes[1])
ax = axes[1].twinx()
ax.set_ylabel('R1 percentage')
percentage = 100 * gibbs_result_updated.losses['reg_1'] / (gibbs_result_updated.losses.sum(axis=1))
ax.plot(percentage, color='k')
axes[1].set_xlabel('Epochs')
axes[1].set_ylabel('Loss')

[44]:

Text(0, 0.5, 'Loss')

[46]:

gibbs_result_updated.losses.sum(axis=1)

[46]:

epoch
1        4.822009
2        4.819686
3        4.816977
4        4.814078
5        4.811085
...
59996    0.400785
59997    0.400782
59998    0.400781
59999    0.400780
60000    0.400776
Length: 60000, dtype: float64


With these settings, the losses and the result become highly influenced by the regularizer r1, which dampens the result and removes scatter in ΔG values.

## The choice of regularizer value r1#

The regularizer r1 acts between subsequent residues minimizing differences between residues, unless data support these differences. Higher values flatten out the ΔG values along residues, while lower values allow for more variability.

The main function of r1 is to mitigate the non-identifiability problem, where if multiple effective exchange rates (ΔG) values are found within a peptide, it is impossible to know which rate should be assigned to which residue. Among the possible choices, the regularizer r1 will bias the result towards the choice of rate per residue assignment with the least variability along residues.

The ‘best’ value of r1 depends on the size of the protein and the coverage, (the number/size of peptides). Below is an example of the differences with regularizer values 0.1, 2 and 5. In this dataset, despite the fact that for r1=2 contribution reg_1 is 50% of total_loss, most features in ΔG are still resolved. In this case, it is recommended to try different values of r1 (starting low and increasing) and find the optimal value based on the ΔG result and fit metrics (Total mse loss at fit termination, $$\chi^2$$ per peptide and fit curves per peptide (available in web interface)).

[47]:

r1_vals = [0.1, 2, 5]
results_dict = {}
for r1 in r1_vals:
print(r1)
result = fit_gibbs_global(hdxm, gibbs_guess, epochs=60000, lr=1e5, stop_loss=1e-5, patience=50, r1=r1)
results_dict[r1] = result

0.1

 32%|███▏      | 19452/60000 [00:20<00:42, 945.26it/s]

2

 35%|███▍      | 20734/60000 [00:20<00:39, 994.79it/s]

5

100%|██████████| 60000/60000 [01:03<00:00, 937.88it/s]

[52]:

colors = iter(['r', 'b', 'g'])
fig, axes = pplt.subplots(ncols=2, refaspect=1.6, axwidth='70mm', sharey=False, wratios=[2, 1])
for k, v in results_dict.items():
print(v.regularization_percentage)
color = next(colors)
axes[0].scatter(v.output.index, v.output['SecB WT apo']['dG']*1e-3, color=color, label=f'r1: {k}', s=10)
axes[0].set_xlabel('Residue number')
axes[0].set_ylabel('ΔG (kJ/mol)')
axes[1].plot(v.losses['mse_loss'], color=color)
axes[1].plot(v.losses['reg_1'], color=color, linestyle='--')
axes[1].set_xlabel('Epochs')
axes[1].set_ylabel('Loss')

axes[0].legend(loc='r', ncols=1)

10.073528769656814
55.81450645691346
62.08606140072798

[52]:

<matplotlib.legend.Legend at 0x1f9b14da250>


## Fit result covariances#

Covariances on ΔG are estimated from the Hessian matrix $$\mathcal{H}$$. This matrix describes the local curvature of $$\chi^2$$ (sum of squared residuals) at each residue for the set of ΔG values obtained from the fit. From the Hessian matrix, covariances are determined by taking the diagonal elements of the square root of the inverse of the negative Hessian ($$\sqrt{\left|(-\mathcal{H})^{-1}_{ii}\right|}$$).

When these covariances are high, the optimization landscape is flat and therefore it presents a higher difficulty to finding the exact minimum. High covariances frequently signal that the obtained ΔG values are at the extreme ends of the time resolution of the experiment. In these regions, the ΔG values found represent an upper or lower limit and in order to improve the ΔG range of the experiment shorter or longer timepoints need to be added.

Covariances are stored in the fit result and we can plot these on top of the ΔG values to see which regions are within the range of the experiment.

[57]:

colors = iter(['r', 'b', 'g'])
fig, ax = pplt.subplots(ncols=1, width='180mm', aspect=4)
v = results_dict[2]

ax.scatter(v.output.index, v.output['SecB WT apo']['dG']*1e-3, s=10)
ylim = ax.get_ylim()
ax.errorbar(v.output.index, v.output['SecB WT apo']['dG']*1e-3, yerr=v.output['SecB WT apo']['covariance']*1e-3, linestyle='none', color='k', zorder=-1)
ax.set_ylim(*ylim)
ax.set_xlabel('Residue number')
ax.set_ylabel('ΔG (kJ/mol)')

[57]:

Text(0, 0.5, 'ΔG (kJ/mol)')


In the graph above, covariances are high at SecB’s disordered c-tail, highlighting that ΔG values obtained here represent an upper limit where the actual ΔG values are likely to be lower. Likewise, high ΔG values also show high covariances which can be improved by adding longer D-exposure datapoints. Shorter D-exposure datapoints can be added in principle, but regions with low ΔG are more likely to exchagne via EX1 kinetics, where PyHDX kinetics approximations break down.

Finally, the obtained ΔG values can be plotted with PyHDX default setings and colors using dG_scatter_figure. More plotting options are shown in the next chapter.

[62]:

from pyhdx.plot import dG_scatter_figure
dG_scatter_figure(v.output)

[62]:

(Figure(nrows=1, ncols=1, refaspect=2.5, figwidth=6.3),
SubplotGrid(nrows=1, ncols=1, length=1),
[<matplotlib.colorbar.Colorbar at 0x1f9b8af3850>])