Fitting of ΔGs¶
import proplot as pplt
from pyhdx import read_dynamx, HDXMeasurement
from pyhdx.fitting import (
fit_rates_half_time_interpolate,
fit_rates_weighted_average,
fit_gibbs_global,
)
from pyhdx.process import filter_peptides, apply_control, correct_d_uptake
from pathlib import Path
from dask.distributed import Client
We load the sample SecB dataset, apply the control, and create an HDXMeasurement
object.
fpath = Path() / ".." / ".." / "tests" / "test_data" / "input" / "ecSecB_apo.csv"
data = read_dynamx(fpath)
fd = {"state": "Full deuteration control", "exposure": {"value": 0.167, "unit": "min"}}
fd_df = filter_peptides(data, **fd)
peptides = filter_peptides(data, state="SecB WT apo") # , query=["exposure != 0."])
peptides_control = apply_control(peptides, fd_df)
peptides_corrected = correct_d_uptake(peptides_control)
sequence = "MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQKDWQPEVKLDLDTASSQLADDVYEVVLRVTVTASLGEETAFLCEVQQGGIFSIAGIEGTQMAHCLGAYCPNILFPYARECITSMVSRGTFPQLNLAPVNFDALFMNYLQQQAGEGTEEHQDA"
hdxm = HDXMeasurement(peptides_corrected, temperature=303.15, pH=8.0, sequence=sequence)
The first step is to obtain initial guesses for the observed rate of D-exchange. By determining the timepoint at which 0.5 RFU (relative fractional uptake) is reached, and subsequently converting to rate, a rough estimate of exchange rate per residue can be obtained. Here, RFU values are mapped from peptides to individual residues by weighted averaging (where the weight is the inverse of peptide length).
fr_half_time = fit_rates_half_time_interpolate(hdxm)
fr_half_time.output
rate | |
---|---|
r_number | |
10 | 0.054455 |
11 | 0.054455 |
12 | 0.035301 |
13 | 0.035301 |
14 | 0.035301 |
... | ... |
151 | 0.136605 |
152 | 0.136460 |
153 | 0.136460 |
154 | 0.136460 |
155 | 0.135486 |
146 rows × 1 columns
A more accurate result can be obtained by fitting the per-residue/timepoint RFU values to a biexponential association curve. This process is more time-consuming and can optionally be processed in parallel by Dask.
client = Client()
fr_wt_avg = fit_rates_weighted_average(hdxm, client=client)
The return value is a KineticsFitResult
object. This object has a list of models and corresponding intervals to where the protein
sequence to which these models apply, and their corresponding symfit
fit result with parameter values. The effective
exchange rate (units $s^{-1}$) can be extracted, as well as other fit parameters, from this object:
fr_wt_avg.output
C:\Users\jhsmi\pp\PyHDX\pyhdx\fit_models.py:229: RuntimeWarning: overflow encountered in exp t = np.exp(t_log)
rate | k1 | k2 | r | |
---|---|---|---|---|
r_number | ||||
10 | 0.064856 | 0.189182 | 0.001485 | 0.568473 |
11 | 0.064856 | 0.189182 | 0.001485 | 0.568473 |
12 | 0.055144 | 0.190785 | 0.001574 | 0.540079 |
13 | 0.055144 | 0.190785 | 0.001574 | 0.540079 |
14 | 0.055144 | 0.190785 | 0.001574 | 0.540079 |
... | ... | ... | ... | ... |
151 | 0.828595 | 0.843758 | 0.000235 | 0.987547 |
152 | 1.948120 | 1.988972 | 0.000223 | 0.985773 |
153 | 1.948120 | 1.988972 | 0.000223 | 0.985773 |
154 | 1.948120 | 1.988972 | 0.000223 | 0.985773 |
155 | 0.570613 | 0.585851 | 0.000296 | 0.981978 |
146 rows × 4 columns
fig, ax = pplt.subplots(aspect=3, axwidth="150mm")
ax.set_yscale("log")
ax.scatter(fr_half_time.output.index, fr_half_time.output["rate"], label="half-time")
ax.scatter(fr_wt_avg.output.index, fr_wt_avg.output["rate"], label="fit")
ax.set_xlabel("Residue number")
ax.set_ylabel("Rate (1/s)")
ax.legend()
C:\Users\jhsmi\pp\PyHDX\pyhdx\fit_models.py:229: RuntimeWarning: overflow encountered in exp t = np.exp(t_log)
<matplotlib.legend.Legend at 0x1f990e64460>
We can now use the guessed rates to obtain guesses for the Gibbs free energy. Units of Gibbs free energy are J/mol.
gibbs_guess = hdxm.guess_deltaG(fr_wt_avg.output["rate"])
gibbs_guess
C:\Users\jhsmi\pp\PyHDX\pyhdx\fit_models.py:229: RuntimeWarning: overflow encountered in exp t = np.exp(t_log)
r_number 1 NaN 2 NaN 3 NaN 4 NaN 5 NaN ... 151 11573.755738 152 9971.367773 153 12396.595500 154 12676.112582 155 12676.112582 Length: 155, dtype: float64
To perform the global fit (all peptides and timepoints) use fit_gibbs_global
. The number of epochs is set to 1000
here for demonstration but for actually fitting the values should be ~100000.
gibbs_result = fit_gibbs_global(hdxm, gibbs_guess, epochs=1000)
gibbs_output = gibbs_result.output
gibbs_output
100%|██████████| 1000/1000 [00:01<00:00, 858.45it/s]
state | SecB WT apo | |||||
---|---|---|---|---|---|---|
quantity | sequence | _dG | dG | pfact | k_obs | covariance |
r_number | ||||||
10 | T | 19766.954080 | 19766.954080 | 2546.262948 | 0.064621 | 1.408825e+03 |
11 | F | 19344.174152 | 19344.174152 | 2153.064540 | 0.063561 | 1.403107e+03 |
12 | Q | 20662.217289 | 20662.217289 | 3632.115795 | 0.054472 | 7.839711e+02 |
13 | I | 16943.348023 | 16943.348023 | 830.592061 | 0.053277 | 7.793776e+02 |
14 | Q | 19006.477914 | 19006.477914 | 1883.089717 | 0.053870 | 7.814748e+02 |
... | ... | ... | ... | ... | ... | ... |
151 | E | 11573.542263 | 11573.542263 | 98.663097 | 0.828664 | 8.599279e+03 |
152 | H | 9998.940179 | 9998.940179 | 52.825821 | 1.927319 | 8.492284e+05 |
153 | Q | 12396.595493 | 12396.595493 | 136.763180 | 1.948120 | 9.218343e+05 |
154 | D | 12648.553955 | 12648.553955 | 151.141024 | 1.969396 | 1.013470e+06 |
155 | A | 12647.407114 | 12647.407114 | 151.072270 | 0.010342 | 5.208880e+03 |
146 rows × 6 columns
Along with ΔG the fit result returns covariances of ΔG and protection factors. The column k_obs
is the observed rate of
exchange without taking into account the intrinsic exchange rate of each residue. If users want to obtain a result truly
independent of the intrinsic rate of exchange, the regularization value r1 should be set to zero (as this works on ΔG,
which is obtained by taking intrinsic rate of exchange into account) and users should provide their own initial guesses
for ΔG (as determination of initial guesses also uses intrinsic rates of exchange).
gibbs_output["SecB WT apo"]["dG"]
r_number 10 19766.954080 11 19344.174152 12 20662.217289 13 16943.348023 14 19006.477914 ... 151 11573.542263 152 9998.940179 153 12396.595493 154 12648.553955 155 12647.407114 Name: dG, Length: 146, dtype: float64
fig, ax = pplt.subplots(aspect=3, axwidth="150mm")
ax.scatter(gibbs_output.index, gibbs_output["SecB WT apo"]["dG"] * 1e-3)
ax.set_xlabel("Residue number")
ax.set_ylabel("ΔG (kJ/mol)")
None
Number of epochs and overfitting¶
The returned fit result object also has information on the losses of each epoch of the fittng process. These are
stored as a pd.DataFrame
in the losses
attribute. During a successful fitting run, the losses should decrease sharply
and then flatten out.
fig, ax = pplt.subplots(aspect=3, axwidth="150mm")
gibbs_result.losses.plot(ax=ax)
CartesianAxesSubplot(index=(0, 0), number=1)
In the figure above, mse_loss
is the loss resulting from differences in calculated D-uptake and measured D-uptake
(mean squared error). The reg_1
is the loss resulting from the regualizer.
If the losses do not decrease, this is likely due to a too low number of epochs or a too low learning rate. Lets tune the fit parameters such that we obtain the desired result.
gibbs_result_updated = fit_gibbs_global(hdxm, gibbs_guess, epochs=40000)
100%|██████████| 40000/40000 [00:41<00:00, 959.75it/s]
fig, axes = pplt.subplots(ncols=2, refaspect=1.6, axwidth="70mm", sharey=False)
axes[0].scatter(
gibbs_result_updated.output.index, gibbs_result_updated.output["SecB WT apo"]["dG"] * 1e-3, s=10
)
axes[0].set_xlabel("Residue number")
axes[0].set_ylabel("ΔG (kJ/mol)")
gibbs_result_updated.losses.plot(ax=axes[1])
axes[1].set_xlabel("Epochs")
axes[1].set_ylabel("Loss")
Text(0, 0.5, 'Loss')
By increasing the learning rate and the number of epochs, our result improves as the final MSE loss is lower.
With stop_loss
at 1E-6 and patience
at 50 (=default), the fitting will not terminate until for 50
epochs the progress (loss decrease) has been less than 1E-6.
The default value of stop_loss
is 1E-5. This will ensure optimization will stop when not enough progress is being made
and thereby prevent overfitting.
Users can keep track of the ΔG values per epoch by using Checkpoint
callbacks (See templates/04; advanced/experimental usage).
gibbs_result_updated = fit_gibbs_global(
hdxm, gibbs_guess, epochs=60000, lr=1e5, stop_loss=1e-6, patience=50
)
100%|██████████| 60000/60000 [01:05<00:00, 916.75it/s]
gibbs_result_updated.regularization_percentage
57.09043167835134
fig, axes = pplt.subplots(ncols=2, refaspect=1.6, axwidth="70mm", sharey=False)
axes[0].scatter(
gibbs_result_updated.output.index, gibbs_result_updated.output["SecB WT apo"]["dG"] * 1e-3, s=10
)
axes[0].set_xlabel("Residue number")
axes[0].set_ylabel("ΔG (kJ/mol)")
gibbs_result_updated.losses.plot(ax=axes[1])
ax = axes[1].twinx()
ax.set_ylabel("R1 percentage")
percentage = 100 * gibbs_result_updated.losses["reg_1"] / (gibbs_result_updated.losses.sum(axis=1))
ax.plot(percentage, color="k")
axes[1].set_xlabel("Epochs")
axes[1].set_ylabel("Loss")
Text(0, 0.5, 'Loss')
gibbs_result_updated.losses.sum(axis=1)
epoch 1 4.822009 2 4.819686 3 4.816977 4 4.814078 5 4.811085 ... 59996 0.400785 59997 0.400782 59998 0.400781 59999 0.400780 60000 0.400776 Length: 60000, dtype: float64
With these settings, the losses and the result become highly influenced by the regularizer r1
, which dampens the result
and removes scatter in ΔG values.
The choice of regularizer value r1¶
The regularizer r1
acts between subsequent residues minimizing differences between residues, unless data support
these differences. Higher values flatten out the ΔG values along residues, while lower values allow for more
variability.
The main function of r1
is to mitigate the non-identifiability problem, where if multiple effective exchange rates (ΔG)
values are found within a peptide, it is impossible to know which rate should be assigned to which residue. Among the
possible choices, the regularizer r1
will bias the result towards the choice of rate per residue assignment with the
least variability along residues.
The 'best' value of r1 depends on the size of the protein and the coverage, (the number/size of peptides). Below is an
example of the differences with regularizer values 0.1, 2 and 5.
In this dataset, despite the fact that for r1=2
contribution reg_1
is 50% of total_loss
, most features in ΔG are
still resolved. In this case, it is recommended to try different values of r1
(starting low and increasing) and find
the optimal value based on the ΔG result and fit metrics (Total mse loss at fit termination, $\chi^2$ per peptide
and fit curves per peptide (available in web interface)).
r1_vals = [0.1, 2, 5]
results_dict = {}
for r1 in r1_vals:
print(r1)
result = fit_gibbs_global(
hdxm, gibbs_guess, epochs=60000, lr=1e5, stop_loss=1e-5, patience=50, r1=r1
)
results_dict[r1] = result
0.1
32%|███▏ | 19452/60000 [00:20<00:42, 945.26it/s]
2
35%|███▍ | 20734/60000 [00:20<00:39, 994.79it/s]
5
100%|██████████| 60000/60000 [01:03<00:00, 937.88it/s]
colors = iter(["r", "b", "g"])
fig, axes = pplt.subplots(ncols=2, refaspect=1.6, axwidth="70mm", sharey=False, wratios=[2, 1])
for k, v in results_dict.items():
print(v.regularization_percentage)
color = next(colors)
axes[0].scatter(
v.output.index, v.output["SecB WT apo"]["dG"] * 1e-3, color=color, label=f"r1: {k}", s=10
)
axes[0].set_xlabel("Residue number")
axes[0].set_ylabel("ΔG (kJ/mol)")
axes[1].plot(v.losses["mse_loss"], color=color)
axes[1].plot(v.losses["reg_1"], color=color, linestyle="--")
axes[1].set_xlabel("Epochs")
axes[1].set_ylabel("Loss")
axes[0].legend(loc="r", ncols=1)
10.073528769656814 55.81450645691346 62.08606140072798
<matplotlib.legend.Legend at 0x1f9b14da250>
Fit result covariances¶
Covariances on ΔG are estimated from the Hessian matrix $\mathcal{H}$. This matrix describes the local curvature of $\chi^2$ (sum of squared residuals) at each residue for the set of ΔG values obtained from the fit. From the Hessian matrix, covariances are determined by taking the diagonal elements of the square root of the inverse of the negative Hessian ($\sqrt{\left|(-\mathcal{H})^{-1}_{ii}\right|}$).
When these covariances are high, the optimization landscape is flat and therefore it presents a higher difficulty to finding the exact minimum. High covariances frequently signal that the obtained ΔG values are at the extreme ends of the time resolution of the experiment. In these regions, the ΔG values found represent an upper or lower limit and in order to improve the ΔG range of the experiment shorter or longer timepoints need to be added.
Covariances are stored in the fit result and we can plot these on top of the ΔG values to see which regions are within the range of the experiment.
colors = iter(["r", "b", "g"])
fig, ax = pplt.subplots(ncols=1, width="180mm", aspect=4)
v = results_dict[2]
ax.scatter(v.output.index, v.output["SecB WT apo"]["dG"] * 1e-3, s=10)
ylim = ax.get_ylim()
ax.errorbar(
v.output.index,
v.output["SecB WT apo"]["dG"] * 1e-3,
yerr=v.output["SecB WT apo"]["covariance"] * 1e-3,
linestyle="none",
color="k",
zorder=-1,
)
ax.set_ylim(*ylim)
ax.set_xlabel("Residue number")
ax.set_ylabel("ΔG (kJ/mol)")
Text(0, 0.5, 'ΔG (kJ/mol)')
In the graph above, covariances are high at SecB's disordered c-tail, highlighting that ΔG values obtained here represent an upper limit where the actual ΔG values are likely to be lower. Likewise, high ΔG values also show high covariances which can be improved by adding longer D-exposure datapoints. Shorter D-exposure datapoints can be added in principle, but regions with low ΔG are more likely to exchagne via EX1 kinetics, where PyHDX kinetics approximations break down.
Finally, the obtained ΔG values can be plotted with PyHDX default setings and colors using dG_scatter_figure
. More plotting options are shown in the next chapter.
from pyhdx.plot import dG_scatter_figure
dG_scatter_figure(v.output)
(Figure(nrows=1, ncols=1, refaspect=2.5, figwidth=6.3), SubplotGrid(nrows=1, ncols=1, length=1), [<matplotlib.colorbar.Colorbar at 0x1f9b8af3850>])