Proteoform Identification and Quantitation with TopPIC and TDPortal for Human Tissues

James M Fulcher; Yen-Chen Liao; Mowei Zhou; Ljiljana Paša-Tolić

Aug 31, 2022

DOI: dx.doi.org/10.17504/protocols.io.3byl4bpj2vo5/v1

Pacific Northwest National Laboratory

Human BioMolecular Atlas Program (HuBMAP) Method Development Community
PNNL-TTD

Mowei Zhou

Pacific Northwest National Laboratory

Protocol Citation: James M Fulcher, Yen-Chen Liao, Mowei Zhou, Ljiljana Paša-Tolić 2022. Proteoform Identification and Quantitation with TopPIC and TDPortal for Human Tissues. protocols.io https://dx.doi.org/10.17504/protocols.io.3byl4bpj2vo5/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: In development

We are still developing and optimizing this protocol

Created: August 03, 2022

Last Modified: August 31, 2022

Protocol Integer ID: 68089

Funders Acknowledgements:

National Institutes of Health (NIH) Common Fund, Human Biomolecular Atlas Program (HuBMAP)

Grant ID: UG3CA256959-01

Abstract

This protocol describes a workflow for top-down proteomics analysis. Top-down proteomics data are processed with two separate software packages, TopPIC and TDPortal. Proteoform identifications from the two packages are merged with a unified FDR to increase coverage. TopPICR is used separately to cluster TopPIC proteoforms and extract abundances for label-free quantitation.

TopPIC Processing

Convert instrument raw data to mzML using MSConvert.
Software: MSConvert
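
As a rough illustration of this conversion step, the sketch below assembles an msconvert command line in Python. The file name, the output directory, and the use of vendor peak picking are assumptions made for illustration, since the protocol does not list its MSConvert settings. Centroiding is suggested only because TopFD is run here with the spectral data type set to Centroid.

```python
def build_msconvert_command(raw_file, out_dir):
    """Return an msconvert argument list that writes centroided mzML.

    The peak-picking filter is an assumption, not a setting stated in the protocol.
    """
    return [
        "msconvert", raw_file,
        "--mzML",                                      # write mzML output
        "--filter", "peakPicking vendor msLevel=1-",   # centroid all MS levels
        "-o", out_dir,                                 # output directory
    ]


# Hypothetical file and directory names.
command = build_msconvert_command("sample_01.raw", "mzml_out")
print(" ".join(command))
```

The list can be passed to subprocess.run(command, check=True) on a machine where msconvert is installed and available on the PATH.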

Analyze the mzML files using the TopPIC Suite (version 1.4.13.1).
Software: TopPIC Suite (developer: Xiaowen Liu; source: https://github.com/toppic-suite/toppic-suite)

TopFD Parameters---------------------------------
Spectral data type:                          	Centroid
Maximum charge:                              	30
Maximum monoisotopic mass:                   	50000 Dalton
Peak error tolerance:                        	0.02 m/z
MS1 signal/noise ratio:                      	3
MS/MS signal/noise ratio:                    	1
Thread number:                               	10
Precursor window size:                       	2 m/z
Use Env CNN model:                           	No
Miss MS1 spectra:                            	No
Generate Html files:                         	Yes
Do final filtering:                          	Yes

TopPIC 1.4.13 Parameters----------------------------------
********************** Parameters **********************
Protein database file:      ID_008032_8627C6BD.fasta.zip  
Spectrum file:                              	xxxxxxxxxxxxxxxxx_ms2.msalign
Number of combined spectra:                 	1
Fragmentation method:                       FILE	
Search type:                                	TARGET
Fixed modifications:                        	None
Use TopFD feature file:                     	True
Maximum number of unexpected modifications: 	1
Error tolerance for matching masses:        	15 ppm
Error tolerance for identifying PrSM clusters: 	0.8 Da
Spectrum-level cutoff type:                 	EVALUE
Spectrum-level cutoff value:                	0.05
Proteoform-level cutoff type:               	EVALUE
Proteoform-level cutoff value:              	0.05
Allowed N-terminal forms:                   	NONE,NME,NME_ACETYLATION,M_ACETYLATION
Maximum mass shift of modifications:        	275 Da
Minimum mass shift of modifications:        	-150 Da
Thread number:                              	14
E-value computation:                        	Generating function
Common modification file name:              	Dynamic_mods.txt  
MIScore threshold:                          	0.15
Executable file directory:                  	
Version:                                    	1.4.13


Note
The protein FASTA file contains the human proteome from UniProt, including both Swiss-Prot and TrEMBL sequences, together with added decoy sequences. Unzip the attachment before use.

TopPIC outputs proteoform spectrum matches (PrSMs) as tab-separated files (...toppic_prsm.tsv) and quantification data within MS1 feature files (..._ms1.feature). These are both imported into the R environment for post-processing with TopPICR.

TopPICR is used for post-processing to improve proteoform identification and quantification. All functions are documented within the TopPICR R package.
Software: TopPICR (developer: Evan Martin; source: https://github.com/evanamartin/TopPICR)

First, result files are read into R using the read_toppic(file_path = path, file_name = names) function in TopPICR, where "path" is the path to the directory containing the TopPIC PrSM files and "names" is a character vector specifying the PrSM files to import. This function can also be used to import the MS1 feature files into a separate object.
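
To make the import step concrete outside of R, here is a minimal Python sketch of reading a PrSM table from tab-separated text. It assumes that the file may begin with a parameter preamble bracketed by lines of asterisks, like the parameter listing above, and that the table header follows the last such line. The column names and values in the example are hypothetical, and this is not how read_toppic() is implemented.

```python
import csv
import io


def read_prsm_table(text):
    """Parse tab-separated text, skipping any preamble bracketed by '*' lines."""
    lines = text.splitlines()
    # Index of the last banner line; -1 when the text has no preamble.
    last_banner = max(
        (i for i, line in enumerate(lines) if line.startswith("*****")),
        default=-1,
    )
    # Keep only non-blank lines after the preamble: header row plus data rows.
    table_lines = [line for line in lines[last_banner + 1:] if line.strip()]
    return list(csv.DictReader(io.StringIO("\n".join(table_lines)), delimiter="\t"))


# Hypothetical miniature file content.
example = "\n".join([
    "********************** Parameters **********************",
    "Search type:\tTARGET",
    "********************** Parameters **********************",
    "Spectrum ID\tProtein accession\tE-value",
    "12\tP12345\t1.2e-10",
    "57\tP67890\t3.4e-04",
])
rows = read_prsm_table(example)
print(len(rows), rows[0]["Protein accession"])
```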

Next, the data are further processed with the augment_annotation() and rm_false_gene() functions to account for ambiguity in proteoform identifications.

False discovery rate (FDR) filtering is accomplished by finding the E-value cutoff that limits the results to 1% FDR at the isoform and protein levels, using the find_evalue_cutoff() and apply_evalue_cutoff() functions.
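
The idea behind an E-value cutoff for a target FDR can be sketched in Python as follows. This is a generic target-decoy estimate (decoy count divided by target count among matches at or below the cutoff), written under the assumption that each match carries a decoy flag. It is not the TopPICR implementation, and the function name and example values are hypothetical.

```python
def find_evalue_cutoff_sketch(matches, fdr_threshold=0.01):
    """Return the largest E-value cutoff whose estimated FDR is <= fdr_threshold.

    matches: iterable of (e_value, is_decoy) pairs.
    FDR is estimated as decoys / targets among matches with E-value <= cutoff.
    Returns None when no cutoff satisfies the threshold.
    """
    ordered = sorted(matches, key=lambda m: m[0])
    targets = decoys = 0
    best = None
    for i, (e_value, is_decoy) in enumerate(ordered):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Evaluate only at the last match of a run of tied E-values.
        if i + 1 < len(ordered) and ordered[i + 1][0] == e_value:
            continue
        if targets > 0 and decoys / targets <= fdr_threshold:
            best = e_value
    return best


# Hypothetical matches: (E-value, is_decoy).
matches = (
    [(1e-8, False)] * 150
    + [(1e-6, True)]
    + [(1e-5, False)] * 50
    + [(1e-3, True)] * 2
    + [(1e-2, False)] * 10
)
cutoff = find_evalue_cutoff_sketch(matches)
kept = [m for m in matches if m[0] <= cutoff]
print(cutoff, len(kept))
```

In this example the cutoff lands at the last E-value where one decoy among 200 targets keeps the estimate below 1%; admitting the next two decoys would push it above the threshold.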

Proteoform inference is performed with the infer_pf() function, and the proteoform level is determined with the set_pf_level() function.
Citation: Smith LM, Thomas PM, Shortreed MR, Schaffer LV, Fellers RT, LeDuc RD, Tucholski T, Ge Y, Agar JN, Anderson LC, Chamot-Rooke J, Gault J, Loo JA, Paša-Tolić L, Robinson CV, Schlüter H, Tsybin YO, Vilaseca M, Vizcaíno JA, Danis PO, Kelleher NL (2019). A five-level classification system for proteoform identifications. Nature Methods. https://doi.org/10.1038/s41592-019-0573-x

Retention time alignment is performed with the form_model() and align_rt() functions.

Mass calibration is accomplished with the calc_error() and recalibrate_mass() functions.
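
A minimal Python sketch of one common way to recalibrate masses is given below: compute the ppm error of each observed mass against its theoretical mass, take the median error for the dataset, and divide it out. Whether TopPICR uses exactly this correction is an assumption, and the masses are invented for illustration.

```python
from statistics import median


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def recalibrate(observed_masses, median_ppm):
    """Remove a systematic ppm shift from a list of observed masses."""
    return [m / (1 + median_ppm / 1e6) for m in observed_masses]


# Hypothetical (observed, theoretical) mass pairs from one dataset, in Da.
pairs = [(10000.05, 10000.0), (20000.10, 20000.0), (15000.075, 15000.0)]
errors = [ppm_error(obs, theo) for obs, theo in pairs]
shift = median(errors)  # systematic error for this dataset, in ppm
corrected = recalibrate([obs for obs, _ in pairs], shift)
print(round(shift, 3), [round(m, 4) for m in corrected])
```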

Clustering and deisotoping error correction is performed with the cluster() and create_pcg() functions. 

Metadata for each proteoform cluster is generated with the create_mdata() function.

Retention time alignment and mass calibration (Steps 4.5 and 4.6) are also applied to the MS1 feature files before features are matched and combined for match-between-runs (MBR) with the match_features() and combine_features() functions.
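
The matching idea can be illustrated with a short Python sketch that assigns each unidentified MS1 feature to the closest identified proteoform cluster within a mass tolerance (ppm) and a retention time tolerance (seconds). The tolerances, field names, and nearest-by-mass rule are hypothetical choices for illustration and may differ from what match_features() does.

```python
def match_features_sketch(clusters, features, ppm_tol=10.0, rt_tol=60.0):
    """Pair each feature with the closest cluster inside both tolerances.

    clusters: list of dicts with 'id', 'mass' (Da), and 'rt' (s).
    features: list of dicts with 'mass' (Da), 'rt' (s), and 'intensity'.
    Returns a list of (cluster_id or None, feature) tuples.
    """
    matched = []
    for feature in features:
        best_id, best_ppm = None, None
        for cluster in clusters:
            ppm = abs(feature["mass"] - cluster["mass"]) / cluster["mass"] * 1e6
            rt_diff = abs(feature["rt"] - cluster["rt"])
            if ppm <= ppm_tol and rt_diff <= rt_tol:
                if best_ppm is None or ppm < best_ppm:
                    best_id, best_ppm = cluster["id"], ppm
        matched.append((best_id, feature))
    return matched


# Hypothetical clusters and features.
clusters = [
    {"id": "A", "mass": 10000.0, "rt": 1200.0},
    {"id": "B", "mass": 15000.0, "rt": 1800.0},
]
features = [
    {"mass": 10000.02, "rt": 1215.0, "intensity": 3.0e6},  # 2 ppm, 15 s from A
    {"mass": 15000.50, "rt": 1805.0, "intensity": 1.0e6},  # about 33 ppm from B
    {"mass": 10000.01, "rt": 2000.0, "intensity": 5.0e5},  # 800 s from A
]
assigned = [cluster_id for cluster_id, _ in match_features_sketch(clusters, features)]
print(assigned)
```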

The final tables of proteoform identification and quantitation results from TopPIC Suite and TopPICR are exported as comma-separated value (.csv) files.

TDPortal Processing

Request TDPortal access and follow the provided instructions to set up an account.
Software: TDPortal (developer: Northwestern University)
TDPortal search process
6.1 Upload data
6.2 Search on TDPortal
Note
TDPortal has an option for label-free quantitation, but it is not used in this workflow.

Upload data
Connect to Northwestern through VPN (https://kb.northwestern.edu/page.php?id=94726).
Copy the files to your user folder (e.g., \\resfiles.northwestern.edu\NU-PCEDATA\external_users\XXXXX).
The system will ask you to log in. Use "ads\your id" with your password to access your folder.
Create a sub-folder under your user folder for each search.
Place the raw files in the corresponding sub-folder, and do not create further folders inside it (https://kb.northwestern.edu/page.php?id=70525).

Search on TDPortal (https://portal.nrtdp.northwestern.edu/static/TDPortalSOP_043_20180301.pdf)
Go to the TDPortal website (https://portal.nrtdp.northwestern.edu/).
Log in to TDPortal with your email address and password.
Each sub-folder name appears as a dataset.
Add the files to “Input files” under the selected dataset.
Select organism "human".
Set parameters as follows:
User empirical P-score: False
Filter by FDR: True
Create SAS input sheet for quant: Select True only when it is needed.
Precursor resolution: High resolution
Fragmentation Type: Auto (or the fragmentation type used during MS acquisition).
Code set: Standard 4.0.0
Include ProSight Error Tolerance Search: False (select “True” to allow one unknown mass shift in the proteoform).

Exporting TDPortal results
Software: TDViewer (developer: Northwestern University)
1. Download the *.tdReports file. Note: two separate processes may be created in the queue, one for the identification results in the TDReport and another for the quantitation CSV file (if enabled).
2. Click the download icon to download these files.
3. Open the TDReport in TDViewer 2.0 (http://tdviewer2.northwestern.edu/).
4. Read and export the proteoform identification results from TDViewer with a 1% FDR cutoff.

Combining Results

Results (proteoform spectrum matches) from TopPIC and TDPortal are then merged using a function written in R that is openly available on GitHub. The input proteoform tables from each software package were pre-filtered with an FDR cutoff of 1% (adjusted FDR in TopPICR for TopPIC, and the default FDR in TDPortal).
Software: TDPortal_TopPIC_Join (developer: James M Fulcher; source: https://github.com/Cajun-data/TDPortal_TopPIC_Join_Function)
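
For readers who want a feel for what such a merge involves, the Python sketch below takes the union of two pre-filtered identification tables and records which software reported each proteoform. The matching key (accession, first and last residue, and mass rounded to the nearest Da) and all field names are hypothetical. The actual R function in the repository above may match proteoforms differently.

```python
def merge_identifications(toppic_rows, tdportal_rows):
    """Union two identification tables, keyed by a simple proteoform signature.

    Each row is a dict with 'accession', 'first_residue', 'last_residue', 'mass'.
    Returns a dict mapping each key to the sorted list of reporting sources.
    """
    merged = {}
    for source, rows in (("TopPIC", toppic_rows), ("TDPortal", tdportal_rows)):
        for row in rows:
            key = (
                row["accession"],
                row["first_residue"],
                row["last_residue"],
                round(row["mass"]),  # hypothetical 1 Da grouping
            )
            merged.setdefault(key, set()).add(source)
    return {key: sorted(sources) for key, sources in merged.items()}


# Hypothetical rows, already filtered to 1% FDR in each tool.
toppic_rows = [
    {"accession": "P1", "first_residue": 1, "last_residue": 100, "mass": 11000.2},
    {"accession": "P2", "first_residue": 2, "last_residue": 50, "mass": 5400.1},
]
tdportal_rows = [
    {"accession": "P1", "first_residue": 1, "last_residue": 100, "mass": 11000.4},
    {"accession": "P3", "first_residue": 1, "last_residue": 80, "mass": 9000.0},
]
combined = merge_identifications(toppic_rows, tdportal_rows)
print(len(combined), combined[("P1", 1, 100, 11000)])
```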

Final output

Results for proteoform spectrum matches (merged from TopPIC and TDPortal) and proteoform quantitation (TopPICR) are uploaded to HIVE.

Citations

Step 4.4

Smith LM, Thomas PM, Shortreed MR, Schaffer LV, Fellers RT, LeDuc RD, Tucholski T, Ge Y, Agar JN, Anderson LC, Chamot-Rooke J, Gault J, Loo JA, Paša-Tolić L, Robinson CV, Schlüter H, Tsybin YO, Vilaseca M, Vizcaíno JA, Danis PO, Kelleher NL (2019). A five-level classification system for proteoform identifications. Nature Methods. https://doi.org/10.1038/s41592-019-0573-x
