Protocol Citation: James M Fulcher, Yen-Chen Liao, Mowei Zhou, Ljiljana Paša-Tolić 2022. Proteoform Identification and Quantitation with TopPIC and TDPortal for Human Tissues. protocols.io https://dx.doi.org/10.17504/protocols.io.3byl4bpj2vo5/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: In development
We are still developing and optimizing this protocol
Created: August 03, 2022
Last Modified: August 31, 2022
Protocol Integer ID: 68089
Funders Acknowledgements:
National Institutes of Health (NIH) Common Fund, Human Biomolecular Atlas Program (HuBMAP)
Grant ID: UG3CA256959-01
Abstract
This protocol describes a workflow for top-down proteomics analysis. Top-down proteomics data are processed with two separate software packages, TopPIC and TDPortal. Proteoform identifications from the two packages are merged with a unified FDR to increase coverage. Separately, TopPICR is used to cluster TopPIC proteoforms and extract abundances for label-free quantitation.
TopPIC Processing
Convert instrument raw data to mzML using MSConvert.
Software: MSConvert
Analyze mzML files using the TopPIC Suite (version 1.4.13.1).
The protein FASTA file contains the human proteome from UniProt, including both Swiss-Prot and TrEMBL sequences. Decoy sequences were added as well. Unzip the attachment to use it.
TopPIC outputs proteoform spectrum matches (PrSMs) as tab-separated files (...toppic_prsm.tsv) and quantification data within MS1 feature files (..._ms1.feature). These are both imported into the R environment for post-processing with TopPICR.
TopPICR is used for post-processing to improve proteoform identification and quantification. All functions are documented within the TopPICR R package.
First, result files are read into R using the read_toppic(file_path = path, file_name = names) function in TopPICR, where "path" is the path to the directory containing the TopPIC PrSM files and "names" is a character vector specifying the PrSM files to import. This function can also be used to import the MS1 feature files into a separate object.
Next, the data are further processed with the augment_annotation() and rm_false_gene() functions to account for ambiguity in proteoform identifications.
False discovery rate (FDR) filtering is accomplished by finding the appropriate E-value cutoff to filter the results to 1% FDR at the isoform and protein level. This is provided by the find_evalue_cutoff() and apply_evalue_cutoff() functions.
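The idea behind choosing an E-value cutoff for a target FDR can be illustrated with a short Python sketch. This is not TopPICR's implementation: it assumes a simple target-decoy estimate (decoy matches divided by target matches at or below the cutoff), and the function names and the (e_value, is_decoy) tuple layout are hypothetical.

```python
def find_evalue_cutoff_sketch(matches, fdr_threshold=0.01):
    """Return the largest E-value cutoff whose estimated FDR stays within
    fdr_threshold, or None if no cutoff qualifies.

    matches: iterable of (e_value, is_decoy) pairs (hypothetical layout).
    FDR is estimated as decoys / targets among matches with
    e_value <= cutoff (an assumed, simplified target-decoy estimate).
    """
    ranked = sorted(matches, key=lambda m: m[0])
    best = None
    targets = decoys = 0
    i = 0
    while i < len(ranked):
        e_value = ranked[i][0]
        # Consume every match that shares this E-value so ties are
        # counted together before the FDR is evaluated.
        while i < len(ranked) and ranked[i][0] == e_value:
            if ranked[i][1]:
                decoys += 1
            else:
                targets += 1
            i += 1
        if targets > 0 and decoys / targets <= fdr_threshold:
            best = e_value
    return best


def apply_evalue_cutoff_sketch(matches, cutoff):
    """Keep target (non-decoy) matches with E-value at or below the cutoff."""
    if cutoff is None:
        return []
    return [m for m in matches if m[0] <= cutoff and not m[1]]
```

In this sketch the cutoff is re-derived separately for each level at which FDR is controlled, by passing in the best match per isoform or per protein rather than every PrSM.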
Proteoform inference is performed with the infer_pf() function, and the proteoform level is determined with the set_pf_level() function.
CITATION
Smith LM, Thomas PM, Shortreed MR, Schaffer LV, Fellers RT, LeDuc RD, Tucholski T, Ge Y, Agar JN, Anderson LC, Chamot-Rooke J, Gault J, Loo JA, Paša-Tolić L, Robinson CV, Schlüter H, Tsybin YO, Vilaseca M, Vizcaíno JA, Danis PO, Kelleher NL (2019). A five-level classification system for proteoform identifications. Nature Methods.
Retention time alignment is performed with the form_model() and align_rt() functions.
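As a rough illustration of what retention time alignment does, the Python sketch below maps retention times from one run onto a reference run using proteoforms identified in both. It assumes a piecewise-linear model between shared anchor points, which may differ from the model that form_model() actually fits; the function names and the dictionary inputs are hypothetical.

```python
from bisect import bisect_right


def build_rt_model_sketch(run_rt, ref_rt):
    """Build sorted (run_rt, ref_rt) anchor points from proteoforms seen in
    both runs. Inputs map a proteoform identifier to its retention time."""
    shared = set(run_rt) & set(ref_rt)
    by_x = {}
    for key in shared:
        by_x.setdefault(run_rt[key], []).append(ref_rt[key])
    # Average reference times for anchors that share the same run time,
    # so the x values of the model are strictly increasing.
    return sorted((x, sum(ys) / len(ys)) for x, ys in by_x.items())


def align_rt_sketch(rt, model):
    """Map a retention time onto the reference scale by linear interpolation
    between anchors; outside the anchor range, apply the nearest offset."""
    if not model:
        return rt
    xs = [x for x, _ in model]
    if rt <= xs[0]:
        return rt + (model[0][1] - xs[0])
    if rt >= xs[-1]:
        return rt + (model[-1][1] - xs[-1])
    j = bisect_right(xs, rt)  # xs[j - 1] <= rt < xs[j]
    (x0, y0), (x1, y1) = model[j - 1], model[j]
    return y0 + (y1 - y0) * (rt - x0) / (x1 - x0)
```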
Mass calibration is accomplished with the calc_error() and recalibrate_mass() functions.
Clustering and correction of deisotoping errors are performed with the cluster() and create_pcg() functions.
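The arithmetic behind mass calibration can be sketched in Python as follows. This is a simplified illustration, not the TopPICR code: it assumes a single systematic offset per dataset, estimated as the median parts-per-million (ppm) error between observed and theoretical masses, and the function names are hypothetical.

```python
from statistics import median


def ppm_error(observed_mass, theoretical_mass):
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def estimate_ppm_shift(pairs):
    """Median ppm error over (observed, theoretical) mass pairs.
    Assumes one systematic offset per dataset (a simplification)."""
    return median(ppm_error(obs, theo) for obs, theo in pairs)


def recalibrate_mass_sketch(observed_mass, ppm_shift):
    """Remove a systematic ppm offset from an observed mass."""
    return observed_mass / (1 + ppm_shift * 1e-6)
```

For example, a dataset whose identified proteoforms all read about 5 ppm heavy would have each observed mass divided by 1.000005.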
Metadata for each proteoform cluster is generated with the create_mdata() function.
Steps 4.5 and 4.6 are also applied to the MS1 feature files before features are matched and combined (for match-between-runs, MBR) with the match_features() and combine_features() functions.
The final tables of proteoform identification and quantitation results from TopPIC Suite and TopPICR are exported as comma-separated value (.csv) files.
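Matching MS1 features to identified proteoform clusters can be pictured with the Python sketch below. It assumes matching by mass (ppm) and retention time tolerances with the closest cluster winning; the tolerance defaults, field names, and function name are hypothetical and are not taken from TopPICR.

```python
def match_features_sketch(features, clusters, ppm_tol=10.0, rt_tol=2.0):
    """Assign each MS1 feature to the closest cluster within tolerance.

    features: list of dicts with "mass" and "rt" (hypothetical fields).
    clusters: list of dicts with "id", "mass" and "rt" (hypothetical fields).
    Returns {feature index: cluster id} for features that found a match.
    """
    matches = {}
    for i, feature in enumerate(features):
        best_id, best_score = None, None
        for cluster in clusters:
            ppm = abs(feature["mass"] - cluster["mass"]) / cluster["mass"] * 1e6
            d_rt = abs(feature["rt"] - cluster["rt"])
            if ppm <= ppm_tol and d_rt <= rt_tol:
                # Combine both distances, each scaled by its tolerance.
                score = ppm / ppm_tol + d_rt / rt_tol
                if best_score is None or score < best_score:
                    best_id, best_score = cluster["id"], score
        if best_id is not None:
            matches[i] = best_id
    return matches
```

Features matched this way can inherit an identification from a cluster even in runs where no MS/MS spectrum was identified, which is the purpose of match-between-runs.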
TDPortal Processing
Request TDPortal access and follow their instructions to set up an account.
Software: TDPortal (developer: Northwestern University)
TDPortal search process
6.1 Upload data
6.2 Search on TDPortal
Note
TDPortal has an option for label-free quantitation, but it is not used in this workflow.
Log in to TDPortal with your email address and password.
Each of your subfolders is shown as a dataset under the subfolder's name.
Under the selected dataset, add the files to be searched to “Input files”.
Select organism "human".
Set parameters as follows:
User empirical P-score: False
Filter by FDR: True
Create SAS input sheet for quant: True only when quantitation output is needed.
Precursor resolution: High resolution
Fragmentation Type: Auto (or the fragmentation type used during MS acquisition).
Code set: Standard 4.0.0
Include ProSight Error Tolerance Search: False (select True to allow one unknown mass shift in the proteoform).
Exporting TDPortal results
Software: TDViewer (developer: Northwestern University)
Download the *.tdReports file. Note: Two separate processes may be created in the queue: one for the ID results in the TDReport, and another for the CSV file for quantitation (if enabled).
Read and export proteoform ID results from TDViewer with a 1% FDR cutoff.
Combining Results
Results (proteoform spectral matches) from TopPIC and TDPortal are then merged using a function written in R that is openly available on GitHub. The input proteoform tables from each software package were pre-filtered with an FDR cutoff of 1% (adjusted FDR in TopPICR for TopPIC, and the default FDR in TDPortal).
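Conceptually, merging two pre-filtered identification tables is a keyed union that records which tool reported each proteoform. The Python sketch below illustrates that idea only; it is not the R function referred to above, and the choice of (accession, proteoform) as the key, the field names, and the function name are assumptions made for illustration.

```python
def merge_identifications_sketch(toppic_rows, tdportal_rows):
    """Union of two identification tables, tracking the reporting tool(s).

    Each row is a dict with "accession" and "proteoform" (hypothetical
    fields); both inputs are assumed to be filtered to 1% FDR already.
    """
    merged = {}
    for source, rows in (("TopPIC", toppic_rows), ("TDPortal", tdportal_rows)):
        for row in rows:
            key = (row["accession"], row["proteoform"])
            entry = merged.setdefault(
                key,
                {"accession": key[0], "proteoform": key[1], "sources": []},
            )
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return list(merged.values())
```

Keeping the "sources" list makes it straightforward to report how many proteoforms were found by both tools and how many were unique to one of them.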
Results for proteoform spectral matches (merged from TopPIC and TDPortal) and proteoform quantitation (TopPICR) are uploaded to HIVE.
Citations
Step 4.4
Smith LM, Thomas PM, Shortreed MR, Schaffer LV, Fellers RT, LeDuc RD, Tucholski T, Ge Y, Agar JN, Anderson LC, Chamot-Rooke J, Gault J, Loo JA, Paša-Tolić L, Robinson CV, Schlüter H, Tsybin YO, Vilaseca M, Vizcaíno JA, Danis PO, Kelleher NL (2019). A five-level classification system for proteoform identifications. Nature Methods.