pRRophetic is an R package developed by Paul Geeleher, Nancy Cox, and R. Stephanie Huang (University of Minnesota). Its core algorithm comes from the group's 2014 Genome Biology paper, "Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines" (https://link.springer.com/article/10.1186/gb-2014-15-3-r47). The package takes as input the cell line expression profiles from a large screening project together with the corresponding IC50 values, fits a ridge regression model on them, and then uses that model to predict chemotherapeutic response in clinical samples. In 2021 the same group released oncoPredict ("oncoPredict: an R package for predicting in vivo or cancer patient drug response and biomarkers from cell line screening data"), which can be seen as an upgrade of pRRophetic.
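To make the idea concrete, here is a minimal base-R sketch of ridge regression trained on "cell line" expression and IC50 values and then applied to new samples. Everything in it is an assumption for illustration: the data are simulated, the helper names ridge_fit and ridge_predict are hypothetical, and the penalty lambda = 1 is arbitrary. It is not the package's own implementation.

```r
# Minimal ridge-regression sketch on simulated data (not the pRRophetic code).
set.seed(1)
n_train <- 50; n_test <- 10; n_genes <- 20

# Rows are samples, columns are genes (simulated expression values).
train_expr <- matrix(rnorm(n_train * n_genes), n_train, n_genes)
test_expr  <- matrix(rnorm(n_test * n_genes), n_test, n_genes)

# Simulated log(IC50) values for the training "cell lines".
true_beta  <- rnorm(n_genes)
train_ic50 <- as.vector(train_expr %*% true_beta + rnorm(n_train, sd = 0.5))

# Closed-form ridge estimate on centered data:
# beta = (X'X + lambda * I)^-1 X'y
ridge_fit <- function(x, y, lambda) {
  x_means <- colMeans(x)
  y_mean  <- mean(y)
  xc <- sweep(x, 2, x_means)
  beta <- solve(crossprod(xc) + lambda * diag(ncol(xc)),
                crossprod(xc, y - y_mean))
  list(beta = as.vector(beta), x_means = x_means, y_mean = y_mean)
}

# Apply the fitted coefficients to new samples.
ridge_predict <- function(fit, newx) {
  as.vector(sweep(newx, 2, fit$x_means) %*% fit$beta + fit$y_mean)
}

fit <- ridge_fit(train_expr, train_ic50, lambda = 1)
predicted_ic50 <- ridge_predict(fit, test_expr)
print(round(predicted_ic50, 2))
```

A larger lambda shrinks the coefficients more strongly toward zero, which is what keeps the model stable when there are many more genes than cell lines.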
(Note: the package throws errors under R 4.2.1; R 4.0.2 or earlier is recommended.)
II. User Guide
1. Installing the package
GitHub repository: https://github.com/paulgeeleher/pRRophetic
However, most people seem to download the tar archive and install it locally:
OSF | pRRophetic R package Wiki
https://osf.io/5xvsg/wiki/home/
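A local install from the tar archive might look like the sketch below. The dependency list is an assumption recalled from the package's install notes and should be checked against its DESCRIPTION file. The helper missing_packages is hypothetical, and the archive path in the final comment is a placeholder for wherever the file was saved.

```r
# Packages assumed to be pRRophetic dependencies (verify before relying on this).
deps <- c("car", "ridge", "preprocessCore", "genefilter", "sva")

# Return the subset of packages that is not currently installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(deps)
print(to_install)

# Once the packages listed above are installed, install the downloaded archive.
# The path below is a placeholder, not a real file name:
# install.packages("path/to/pRRophetic.tar.gz", repos = NULL, type = "source")
```

With repos = NULL and type = "source", install.packages installs from the local file instead of looking the package up in a repository.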
Specify if you would like to train the models on only a subset of the CGP cell lines (based on the tissue type from which the cell lines originated). This can be any one of "all" (for everything, default option), "allSolidTumors" (everything except for blood), "blood", "breast", "CNS", "GI tract", "lung", "skin", "upper aerodigestive".
batchCorrect
How should training and test data matrices be homogenized? Choices are "eb" (default) for ComBat, "qn" for quantile normalization, or "none" for no homogenization.
powerTransformPhenotype
Should the phenotype be power transformed before we fit the regression model? Defaults to TRUE; set to FALSE if the phenotype is already known to be close to normally distributed.
removeLowVaryingGenes
What proportion of low-varying genes should be removed? 20 percent by default.
minNumSamples
How many training and test samples are required. Prints an error if the number is below this threshold.
selection
How should duplicate gene ids be handled? Default is -1, which asks the user; 1 to summarize duplicates or 2 to discard all duplicates.
printOutput
Set to FALSE to suppress output.
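As a rough illustration of what an option like removeLowVaryingGenes does, the base-R sketch below drops the 20 percent of genes with the lowest variance. The data are simulated, the genes-in-rows layout is an assumption, and remove_low_varying is a hypothetical helper, not a function from the package.

```r
set.seed(2)
# Simulated expression matrix: genes in rows, samples in columns.
expr <- matrix(rnorm(100 * 8), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:8)))

# Drop the given proportion of genes with the lowest variance across samples.
remove_low_varying <- function(mat, proportion = 0.2) {
  gene_var <- apply(mat, 1, var)
  cutoff <- quantile(gene_var, probs = proportion)
  mat[gene_var > cutoff, , drop = FALSE]
}

filtered <- remove_low_varying(expr, proportion = 0.2)
print(dim(filtered))
```

Genes that barely vary across samples carry little information for a regression model, so removing them reduces noise and computation.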
Value
a gene expression matrix that does not contain duplicate gene ids
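The two ways of handling duplicate gene ids can be sketched in base R as follows. Averaging duplicate rows is an assumption about what "summarize" means here. The matrix is made up, and the helpers summarize_by_mean and discard_duplicates are hypothetical names.

```r
# Made-up matrix in which the gene id "TP53" appears twice.
expr <- matrix(c(1, 3, 5, 10,
                 2, 4, 6, 20),
               nrow = 4,
               dimnames = list(c("TP53", "TP53", "EGFR", "MYC"), c("s1", "s2")))

# Option 1: average the rows that share a gene id.
summarize_by_mean <- function(mat) {
  sums <- rowsum(mat, group = rownames(mat))
  counts <- as.vector(table(rownames(mat))[rownames(sums)])
  sums / counts
}

# Option 2: discard every gene id that occurs more than once.
discard_duplicates <- function(mat) {
  ids <- rownames(mat)
  dup_ids <- unique(ids[duplicated(ids)])
  mat[!ids %in% dup_ids, , drop = FALSE]
}

print(summarize_by_mean(expr))
print(discard_duplicates(expr))
```

Either way, the result is a matrix whose row names are unique, which matches the return value described above.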
Pearsons correlation: 0.43 , P = 3.57572505890629e-14
R-squared value: 0.19
Estimated 95% confidence intervals: -4.41, 3.56
Mean prediction error: 1.61
#Plot the cross validation predicted phenotype against the measured IC50s.
plot(cvOut)
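Summary statistics like the ones printed above can be computed from any pair of predicted and measured vectors. The sketch below uses simulated values. Treating the mean prediction error as a mean absolute error, and the 95% interval as empirical quantiles of the errors, are assumptions about how the package defines them.

```r
set.seed(3)
# Simulated measured phenotype and a noisy prediction of it.
measured  <- rnorm(100)
predicted <- 0.5 * measured + rnorm(100, sd = 0.8)

# Pearson correlation and its p-value.
ct <- cor.test(predicted, measured, method = "pearson")
r  <- unname(ct$estimate)

# For a single predictor, R-squared is the squared correlation.
r_squared <- r^2

# Assumed definitions of the error summaries.
errors <- predicted - measured
mean_abs_error <- mean(abs(errors))
error_interval <- quantile(errors, probs = c(0.025, 0.975))

cat("Pearson correlation:", round(r, 2), ", P =", ct$p.value, "\n")
cat("R-squared value:", round(r_squared, 2), "\n")
cat("95% interval of errors:", round(error_interval, 2), "\n")
cat("Mean absolute error:", round(mean_abs_error, 2), "\n")
```

The reported R-squared of 0.19 is roughly the square of the reported correlation of 0.43, which is consistent with this reading.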
pRRopheticPredict
Given a gene expression matrix, predict drug sensitivity for a drug in CGP.
Based on the qqplot, it is likely acceptable to use these data for prediction of bortezomib sensitivity. Predict bortezomib sensitivity using all cell lines, then only cell lines from hematological cancers, and then only cell lines derived from solid tumors. (The selection argument controls how duplicate gene ids are handled: the default, -1, asks the user; 1 summarizes duplicates; 2 discards all duplicates.)
There are a very large number of cell lines resistant to Erlotinib (within the drug screening window), so a correlation is not an appropriate measure of concordance. Instead, let's do a t-test between some of the most sensitive and most resistant cell lines to assess whether signal is being captured by the predictions.
data: predictedPtype_ccle_erlotinib[resistant] and predictedPtype_ccle_erlotinib[sensitive]
t = 2.431, df = 32.127, p-value = 0.02081
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2421538 2.7431351
sample estimates:
mean of x mean of y
-5.184405 -6.677049
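The comparison above can be reproduced in outline with base R's t.test. The two groups below are simulated, with means chosen only to resemble the sample estimates in the output. The group sizes and standard deviation are arbitrary assumptions.

```r
set.seed(4)
# Simulated predicted phenotypes for resistant and sensitive cell lines.
pred_resistant <- rnorm(20, mean = -5.2, sd = 1.5)
pred_sensitive <- rnorm(20, mean = -6.7, sd = 1.5)

# t.test defaults to Welch's test (var.equal = FALSE), which does not
# assume equal variances and usually yields a non-integer df.
res <- t.test(pred_resistant, pred_sensitive)
print(res)
```

The non-integer degrees of freedom in the output above (df = 32.127) are typical of the Welch version of the test.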
Despite the fact that IC50 values are not correlated for this drug between these studies, the most sensitive and most resistant samples are separated highly significantly by this logistic model.
vertical=TRUE, method="jitter", ylab="Log-odds of sensitivity")
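For readers unfamiliar with the "log-odds of sensitivity" scale, here is a minimal logistic-regression sketch using base R's glm. It uses a single simulated expression score rather than a full expression matrix. The variable names and the effect size are assumptions for illustration, and this is not the package's own logistic routine.

```r
set.seed(5)
n <- 60
# One simulated expression feature; higher values make sensitivity more likely.
expr_score <- rnorm(n)
sensitive  <- rbinom(n, size = 1, prob = plogis(1.5 * expr_score))
train <- data.frame(expr_score = expr_score, sensitive = sensitive)

# Fit a logistic model of sensitive (1) versus resistant (0).
fit <- glm(sensitive ~ expr_score, data = train, family = binomial)

# type = "link" returns predictions on the log-odds scale.
new_samples <- data.frame(expr_score = c(-2, 0, 2))
log_odds <- predict(fit, newdata = new_samples, type = "link")

# plogis converts log-odds back to probabilities.
probs <- plogis(log_odds)
print(cbind(new_samples, log_odds = log_odds, prob = probs))
```

A log-odds of 0 corresponds to a probability of 0.5, so positive values lean toward sensitive and negative values toward resistant.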
Included here is an example of prediction from the bortezomib clinical data, where we try to predict the response categories CR, PR, MR, NC, PD from CR, PR, MR, NC, PD. This serves both as an example of predicting directly from clinical data and as an example of using a dataset other than the CGP as the training set.
First, prepare the training data and test expression data.