==========================
TenFoldCVTimesTen_VinaLC.R
==========================

- Description:
    - Takes VinaLC scores and adverse drug reaction (ADR) data in SIDER on our N=560 drug set and constructs a 560 drug(row) x 409 protein(column) data matrix and a 560 drug(row) x 10 ADR group(column) response matrix and trains an L1-regularized logistic regression model, performing model selection using ten-fold cross-validation, repeated ten times.
- User-specified parameters:
    - 'filepath'
        - Absolute path to where all the input files are stored. The output files will be generated in this same directory.
    - 'dockThresh'
        - User can either use the raw docking score as features, or use the 'dockThresh' variable to set a threshold such that docking scores above 'dockThresh' are considered 'unbound' and docking scores below 'dockThresh' are considered 'bound'.
- Requires the following R package(s):
    - glmnet
- Requires the following Input Files:
    - N560_p409_DockScore_Matrix.csv
    - N560_p10_DrugADR_Matrix.csv
- Produces the following Output Files:
    - Ten '_Betas.txt' files, one for each ADR group.
        - Beta coefficients of the median AUC model
    - tenAUCMatrix.csv
        - Contains the ten-fold cross-validation (CV) areas under the receiver operating characteristic curve (AUCs) for each of the ten separate CV runs for each of the ten ADR groups
    - tenCVResultsMatrix.csv
        - For each of the ten ADR groups, shows the following results for the median ten-fold cross-validation AUC of the ten performed
            - lambda_min = value of the regularization parameter that corresponds to the maximum AUC from the model-selection phase in a single glmnet ten-fold CV run
            - cvlo = lower confidence bound on the estimate of the median AUC
            - cvm = mean estimate of the median AUC
            - cvup = upper confidence bound on the estimate of the median AUC
            - nzero = number of non-zero parameters in the (maximum AUC) model corresponding to lambda_min

============================
TenFoldCVTimesTen_DrugBank.R
============================

- Description:
    - Takes DrugBank target protein information and ADR data in SIDER on our N=560 drug set and constructs a 560 drug(row) x 555 target protein(column) data matrix and a 560 drug(row) x 10 ADR group(column) response matrix and trains an L1-regularized logistic regression model, performing model selection using ten-fold cross-validation, repeated ten times.
- User-specified parameters:
    - 'filepath'
        - Absolute path to where all the input files are stored. The output files will be generated in this same directory.
- Requires the following R package(s):
    - glmnet
- Requires the following Input Files:
    - N560_p555_DB_Matrix.csv
    - N560_p10_DrugADR_Matrix.csv
- Produces the following Output Files:
    - Ten '_Betas.txt' files, one for each ADR group.
        - Beta coefficients of the median AUC model
    - tenAUCMatrix.csv
        - Contains the ten-fold cross-validation (CV) areas under the receiver operating characteristic curve (AUCs) for each of the ten separate CV runs for each of the ten ADR groups
    - tenCVResultsMatrix.csv
        - For each of the ten ADR groups, shows the following results for the median ten-fold cross-validation AUC of the ten performed
            - lambda_min = value of the regularization parameter that corresponds to the maximum AUC from the model-selection phase in a single glmnet ten-fold CV run
            - cvlo = lower confidence bound on the estimate of the median AUC
            - cvm = mean estimate of the median AUC
            - cvup = upper confidence bound on the estimate of the median AUC
            - nzero = number of non-zero parameters in the (maximum AUC) model corresponding to lambda_min

===============================================
TenFoldCVTimesTen_VirtualToxicityPanel_MMGBSA.R
===============================================

- Description:
    - Takes MM/GBSA re-scored docking scores for 33 protein structures that map to the sixteen proteins of the virtual toxicity panel and ADR data in SIDER on our N=560 drug set and constructs a 560 drug(row) x 16 protein(column) data matrix and a 560 drug(row) x 10 ADR group(column) response matrix and trains an L1-regularized logistic regression model, performing model selection using ten-fold cross-validation, repeated ten times.
- User-specified parameters:
    - 'filepath'
        - Absolute path to where all the input files are stored. The output files will be generated in this same directory.
    - 'dockThresh'
        - User can either use the raw docking score as features, or use the 'dockThresh' variable to set a threshold such that docking scores above 'dockThresh' are considered 'unbound' and docking scores below 'dockThresh' are considered 'bound'.
- Requires the following R package(s):
    - glmnet
- Requires the following Input Files:
    - N560_p33_MMGBSAScore_Matrix.csv
    - N560_p10_DrugADR_Matrix.csv
- Produces the following Output Files:
    - Ten '_Betas.txt' files, one for each ADR group.
        - Beta coefficients of the median AUC model
    - tenAUCMatrix.csv
        - Contains the ten-fold cross-validation (CV) areas under the receiver operating characteristic curve (AUCs) for each of the ten separate CV runs for each of the ten ADR groups
    - tenCVResultsMatrix.csv
        - For each of the ten ADR groups, shows the following results for the median ten-fold cross-validation AUC of the ten performed
            - lambda_min = value of the regularization parameter that corresponds to the maximum AUC from the model-selection phase in a single glmnet ten-fold CV run
            - cvlo = lower confidence bound on the estimate of the median AUC
            - cvm = mean estimate of the median AUC
            - cvup = upper confidence bound on the estimate of the median AUC
            - nzero = number of non-zero parameters in the (maximum AUC) model corresponding to lambda_min

=================================================
TenFoldCVTimesTen_VirtualToxicityPanel_DrugBank.R
=================================================

- Description:
    - Takes DrugBank target protein binding data for the sixteen proteins of the virtual toxicity panel and ADR data in SIDER on our N=560 drug set and constructs a 560 drug(row) x 16 target protein(column) data matrix and a 560 drug(row) x 10 ADR group(column) response matrix and trains an L1-regularized logistic regression model, performing model selection using ten-fold cross-validation, repeated ten times.
- User-specified parameters:
    - 'filepath'
        - Absolute path to where all the input files are stored. The output files will be generated in this same directory.
- Requires the following R package(s):
    - glmnet
- Requires the following Input Files:
    - N560_p16_DB_Matrix.csv
    - N560_p10_DrugADR_Matrix.csv
- Produces the following Output Files:
    - Ten '_Betas.txt' files, one for each ADR group.
        - Beta coefficients of the median AUC model
    - tenAUCMatrix.csv
        - Contains the ten-fold cross-validation (CV) areas under the receiver operating characteristic curve (AUCs) for each of the ten separate CV runs for each of the ten ADR groups
    - tenCVResultsMatrix.csv
        - For each of the ten ADR groups, shows the following results for the median ten-fold cross-validation AUC of the ten performed
            - lambda_min = value of the regularization parameter that corresponds to the maximum AUC from the model-selection phase in a single glmnet ten-fold CV run
            - cvlo = lower confidence bound on the estimate of the median AUC
            - cvm = mean estimate of the median AUC
            - cvup = upper confidence bound on the estimate of the median AUC
            - nzero = number of non-zero parameters in the (maximum AUC) model corresponding to lambda_min

====================
CalcPValue_Example.R
====================

- Description:
    - Script shows an example of how the p-values for the different protein features were calculated. Takes VinaLC scores and adverse drug reaction (ADR) data in SIDER on our N=560 drug set and constructs a 560 drug(row) x 409 protein(column) data matrix and a 560 drug(row) x 10 ADR group(column) response matrix. Also takes in the beta coefficients from a logistic regression model trained on this data and the list of potential predictors. Produces a list of features and corresponding p-values (uncorrected for false discovery rate). Uses Fisher's exact test for discrete features and the Wilcoxon rank-sum test for continuous real-valued features.
- User-specified parameters:
    - 'dockThresh'
        - User can either use the raw docking score as features, or use the 'dockThresh' variable to set a threshold such that docking scores above 'dockThresh' are considered 'unbound' and docking scores below 'dockThresh' are considered 'bound'.
    - 'adrIndex'
        - Specifies the ADR group of interest for the p-value calculation. Index runs from 1 to 10 and follows the order in which the ADR group input files appear in the code.
- Requires the following R package(s):
    - glmnet
- Requires the following Input:
    - n x p drug-protein design matrix called 'dataMatrix'
    - n x 10 drug-ADR group response matrix called 'Y'
- Produces the following Output:
    - p-dimensional vector containing p-values called 'proteinPValVect', also printed out.
    - p-dimensional vector containing q-values called 'proteinQValVect', also printed out.
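
The per-feature test described for CalcPValue_Example.R can be sketched roughly as follows. This is an illustration under stated assumptions, not the script itself: the data are synthetic stand-ins for 'dataMatrix' and 'Y', the helper 'featurePValue' and the threshold value are hypothetical, and the Benjamini-Hochberg adjustment used for the q-values is an assumption, since this README does not say how the q-values are computed. Only base R functions are used.

```r
# Illustrative sketch only: synthetic data stand in for 'dataMatrix' and 'Y'.
set.seed(1)
n <- 60   # number of drugs (the real data set has N = 560)
p <- 4    # number of protein features
dataMatrix <- matrix(rnorm(n * p, mean = -7, sd = 1.5), nrow = n,
                     dimnames = list(NULL, paste0("protein", 1:p)))
Y <- matrix(rbinom(n * 10, 1, 0.4), nrow = n)

adrIndex   <- 1     # ADR group of interest (1 to 10)
dockThresh <- -7.5  # hypothetical threshold; use NULL to keep raw scores

y <- Y[, adrIndex]

# Hypothetical helper: p-value for one feature column against one ADR group.
featurePValue <- function(x, y, dockThresh = NULL) {
  if (!is.null(dockThresh)) {
    # Scores below the threshold are 'bound' (1), the rest 'unbound' (0);
    # the discrete feature is tested with Fisher's exact test.
    bound <- factor(as.integer(x < dockThresh), levels = c(0, 1))
    adr   <- factor(y, levels = c(0, 1))
    fisher.test(table(bound, adr))$p.value
  } else {
    # Raw continuous scores: Wilcoxon rank-sum test comparing drugs
    # with the ADR against drugs without it.
    wilcox.test(x[y == 1], x[y == 0])$p.value
  }
}

proteinPValVect <- apply(dataMatrix, 2, featurePValue,
                         y = y, dockThresh = dockThresh)

# Assumption: q-values via Benjamini-Hochberg; the original script may differ.
proteinQValVect <- p.adjust(proteinPValVect, method = "BH")

print(proteinPValVect)
print(proteinQValVect)
```

Setting 'dockThresh' to NULL in this sketch switches every feature to the Wilcoxon rank-sum branch, mirroring the choice between thresholded and raw docking scores described above.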