==========================
TenFoldCVTimesTen_VinaLC.R
==========================
- Description: 
       - Takes VinaLC docking scores and adverse drug reaction (ADR) data from SIDER for our N=560 drug set, constructs a 560 drug (row) x 409 protein (column) data matrix
         and a 560 drug (row) x 10 ADR group (column) response matrix, and trains an L1-regularized logistic regression model. Model selection uses ten-fold cross-validation,
         repeated ten times.
- User-specified parameters:
       - 'filepath' - Absolute path to where all the input files are stored. The output files will be generated in this same directory.
       - 'dockThresh' - User can either use the raw docking score as features, or use the 'dockThresh' variable to set a threshold such that
                        docking scores above 'dockThresh' are considered 'unbound' and docking scores below 'dockThresh' are considered 'bound'.
- Requires the following R package(s):
       - glmnet 
- Requires the following Input Files:
       - N560_p409_DockScore_Matrix.csv
       - N560_p10_DrugADR_Matrix.csv 
- Produces the following Output Files:
       - Ten '<ADR_Group_Name>_Betas.txt' files, one for each ADR group.
            - Beta coefficients of the median AUC model
       - tenAUCMatrix.csv
            - Contains the ten-fold cross-validation (CV) area under the receiver operating characteristic curve (AUC) from each of the ten separate CV runs,
              for each of the ten ADR groups
       - tenCVResultsMatrix.csv
            - For each of the ten ADR groups, reports the following for the CV run with the median AUC among the ten ten-fold CV runs performed
                   - lambda_min = value of the regularization parameter that gives the maximum AUC in the model-selection phase of a single glmnet ten-fold CV run
                   - cvlo = lower bound on the estimate of the median AUC (cv.glmnet defines cvlo as cvm minus one standard error)
                   - cvm = mean cross-validated estimate of the median AUC (averaged over the ten folds)
                   - cvup = upper bound on the estimate of the median AUC (cv.glmnet defines cvup as cvm plus one standard error)
                   - nzero = number of non-zero coefficients in the (maximum AUC) model corresponding to lambda_min
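- Illustrative sketch (not the script itself):
       - The repeated ten-fold CV described above could look roughly like the following. Synthetic data stand in for the docking-score and ADR matrices, and the toy sizes,
         the variable names and the rule of taking the 5th-ranked of the ten runs as the "median" run are assumptions made for this example only.

```r
library(glmnet)

set.seed(1)                                   # reproducible toy example
n <- 200; p <- 20                             # toy sizes, not the real 560 x 409
X <- matrix(rnorm(n * p), n, p)               # stand-in for the docking-score matrix
y <- rbinom(n, 1, plogis(1.5 * X[, 1] - 1.5 * X[, 2]))  # stand-in for one ADR column

nRuns <- 10
runs <- vector("list", nRuns)
for (r in seq_len(nRuns)) {
  # alpha = 1 gives the L1 (lasso) penalty; AUC is the CV selection criterion
  runs[[r]] <- cv.glmnet(X, y, family = "binomial", alpha = 1,
                         nfolds = 10, type.measure = "auc")
}

# Best cross-validated AUC of each run (the value at lambda.min)
bestAUC <- sapply(runs, function(fit) max(fit$cvm))

# With ten runs the median falls between two runs; taking the 5th-ranked
# run is an assumption made for this sketch
medRun <- order(bestAUC)[5]
fit <- runs[[medRun]]
i <- which(fit$lambda == fit$lambda.min)

result <- c(lambda_min = fit$lambda.min,
            cvlo  = unname(fit$cvlo[i]),
            cvm   = unname(fit$cvm[i]),
            cvup  = unname(fit$cvup[i]),
            nzero = unname(fit$nzero[i]))
betas <- as.matrix(coef(fit, s = "lambda.min"))  # intercept plus p coefficients
print(result)
```

       - Each call to cv.glmnet draws new random folds, which is why the ten runs give slightly different AUCs and why a median run is reported.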


============================
TenFoldCVTimesTen_DrugBank.R
============================
- Description:
       - Takes DrugBank target protein information and ADR data from SIDER for our N=560 drug set, constructs a 560 drug (row) x 555 target protein (column) data matrix and a
         560 drug (row) x 10 ADR group (column) response matrix, and trains an L1-regularized logistic regression model. Model selection uses ten-fold cross-validation, repeated ten times.
- User-specified parameters:
       - 'filepath' - Absolute path to where all the input files are stored. The output files will be generated in this same directory.
- Requires the following R package(s):
       - glmnet
- Requires the following Input Files:
       - N560_p555_DB_Matrix.csv
       - N560_p10_DrugADR_Matrix.csv
- Produces the following Output Files:
       - Ten '<ADR_Group_Name>_Betas.txt' files, one for each ADR group.
            - Beta coefficients of the median AUC model
       - tenAUCMatrix.csv
            - Contains the ten-fold cross-validation (CV) area under the receiver operating characteristic curve (AUC) from each of the ten separate CV runs,
              for each of the ten ADR groups
       - tenCVResultsMatrix.csv
            - For each of the ten ADR groups, reports the following for the CV run with the median AUC among the ten ten-fold CV runs performed
                   - lambda_min = value of the regularization parameter that gives the maximum AUC in the model-selection phase of a single glmnet ten-fold CV run
                   - cvlo = lower bound on the estimate of the median AUC (cv.glmnet defines cvlo as cvm minus one standard error)
                   - cvm = mean cross-validated estimate of the median AUC (averaged over the ten folds)
                   - cvup = upper bound on the estimate of the median AUC (cv.glmnet defines cvup as cvm plus one standard error)
                   - nzero = number of non-zero coefficients in the (maximum AUC) model corresponding to lambda_min
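- Illustrative sketch (not the script itself):
       - One way to summarize tenAUCMatrix.csv after a run is shown below. The layout (one row per CV run, one column per ADR group), the placeholder ADR names and the
         "closest to the median" rule for picking a run are assumptions; the numbers are random and are not results.

```r
set.seed(2)
adrNames <- paste0("ADR", 1:10)                 # placeholder ADR group names
aucMat <- matrix(round(runif(100, 0.55, 0.75), 3), nrow = 10,
                 dimnames = list(paste0("run", 1:10), adrNames))

# Round-trip through a temporary CSV to mimic reading the output file
tmp <- tempfile(fileext = ".csv")
write.csv(aucMat, tmp)
tenAUC <- as.matrix(read.csv(tmp, row.names = 1))

medAUC <- apply(tenAUC, 2, median)              # median AUC per ADR group
# Run whose AUC is closest to the median (ties go to the first such run)
medRun <- sapply(seq_len(ncol(tenAUC)),
                 function(j) which.min(abs(tenAUC[, j] - medAUC[j])))
names(medRun) <- colnames(tenAUC)
print(data.frame(medianAUC = medAUC, medianRun = medRun))
```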


===============================================
TenFoldCVTimesTen_VirtualToxicityPanel_MMGBSA.R 
===============================================
- Description:
       - Takes MM/GBSA re-scored docking scores for the 33 protein structures that map to the sixteen proteins of the virtual toxicity panel, together with ADR data from SIDER for our N=560 drug set,
         constructs a 560 drug (row) x 16 protein (column) data matrix and a 560 drug (row) x 10 ADR group (column) response matrix, and trains an L1-regularized logistic regression model.
         Model selection uses ten-fold cross-validation, repeated ten times.
- User-specified parameters:
       - 'filepath' - Absolute path to where all the input files are stored. The output files will be generated in this same directory.
       - 'dockThresh' - User can either use the raw docking score as features, or use the 'dockThresh' variable to set a threshold such that
                        docking scores above 'dockThresh' are considered 'unbound' and docking scores below 'dockThresh' are considered 'bound'.
- Requires the following R package(s):
       - glmnet
- Requires the following Input Files:
       - N560_p33_MMGBSAScore_Matrix.csv
       - N560_p10_DrugADR_Matrix.csv
- Produces the following Output Files:
       - Ten '<ADR_Group_Name>_Betas.txt' files, one for each ADR group.
            - Beta coefficients of the median AUC model
       - tenAUCMatrix.csv
            - Contains the ten-fold cross-validation (CV) area under the receiver operating characteristic curve (AUC) from each of the ten separate CV runs,
              for each of the ten ADR groups
       - tenCVResultsMatrix.csv
            - For each of the ten ADR groups, reports the following for the CV run with the median AUC among the ten ten-fold CV runs performed
                   - lambda_min = value of the regularization parameter that gives the maximum AUC in the model-selection phase of a single glmnet ten-fold CV run
                   - cvlo = lower bound on the estimate of the median AUC (cv.glmnet defines cvlo as cvm minus one standard error)
                   - cvm = mean cross-validated estimate of the median AUC (averaged over the ten folds)
                   - cvup = upper bound on the estimate of the median AUC (cv.glmnet defines cvup as cvm plus one standard error)
                   - nzero = number of non-zero coefficients in the (maximum AUC) model corresponding to lambda_min
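- Illustrative sketch (not the script itself):
       - The 'dockThresh' option described above amounts to binarizing the score matrix. The helper name 'applyDockThresh', the use of NULL to mean "keep raw scores",
         the treatment of scores exactly equal to the threshold as unbound, and the threshold value -7.0 are all assumptions made for this example.

```r
# Hypothetical helper; its name and the NULL convention are assumptions
applyDockThresh <- function(scoreMatrix, dockThresh = NULL) {
  if (is.null(dockThresh)) return(scoreMatrix)   # keep raw scores as features
  # Scores below the threshold count as bound (1), all others as unbound (0)
  (scoreMatrix < dockThresh) * 1
}

# Toy scores for two drugs and two proteins (filled column by column)
scores <- matrix(c(-9.1, -4.2, -7.5, -6.0), nrow = 2,
                 dimnames = list(c("drugA", "drugB"), c("prot1", "prot2")))
binary <- applyDockThresh(scores, dockThresh = -7.0)
print(binary)
```

       - Multiplying the logical matrix by 1 keeps the row and column names, so the binarized matrix can be passed on in place of the raw scores.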


=================================================
TenFoldCVTimesTen_VirtualToxicityPanel_DrugBank.R
=================================================
- Description:
       - Takes DrugBank target protein binding data for the sixteen proteins of the virtual toxicity panel and ADR data from SIDER for our N=560 drug set, constructs a 560 drug (row) x 16 target protein (column)
         data matrix and a 560 drug (row) x 10 ADR group (column) response matrix, and trains an L1-regularized logistic regression model. Model selection uses ten-fold cross-validation, repeated ten times.
- User-specified parameters:
       - 'filepath' - Absolute path to where all the input files are stored. The output files will be generated in this same directory.
- Requires the following R package(s):
       - glmnet
- Requires the following Input Files:
       - N560_p16_DB_Matrix.csv
       - N560_p10_DrugADR_Matrix.csv
- Produces the following Output Files:
       - Ten '<ADR_Group_Name>_Betas.txt' files, one for each ADR group.
            - Beta coefficients of the median AUC model
       - tenAUCMatrix.csv
            - Contains the ten-fold cross-validation (CV) area under the receiver operating characteristic curve (AUC) from each of the ten separate CV runs,
              for each of the ten ADR groups
       - tenCVResultsMatrix.csv
            - For each of the ten ADR groups, reports the following for the CV run with the median AUC among the ten ten-fold CV runs performed
                   - lambda_min = value of the regularization parameter that gives the maximum AUC in the model-selection phase of a single glmnet ten-fold CV run
                   - cvlo = lower bound on the estimate of the median AUC (cv.glmnet defines cvlo as cvm minus one standard error)
                   - cvm = mean cross-validated estimate of the median AUC (averaged over the ten folds)
                   - cvup = upper bound on the estimate of the median AUC (cv.glmnet defines cvup as cvm plus one standard error)
                   - nzero = number of non-zero coefficients in the (maximum AUC) model corresponding to lambda_min
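- Illustrative sketch (not the script itself):
       - The role of 'filepath' (inputs are read from it, outputs are written to it) can be illustrated as below. A temporary directory and tiny toy files stand in for the real
         data, and the CSV layout (header row, drug names in the first column) is an assumption about the input files.

```r
filepath <- tempdir()   # stand-in; the real scripts expect the absolute path to the inputs

# Toy stand-ins for the two input files (3 drugs instead of 560)
drugs <- c("drugA", "drugB", "drugC")
X0 <- matrix(c(1, 0, 0, 1, 1, 0), nrow = 3,
             dimnames = list(drugs, c("prot1", "prot2")))
Y0 <- matrix(c(1, 0, 1), nrow = 3, dimnames = list(drugs, "ADR1"))
write.csv(X0, file.path(filepath, "N560_p16_DB_Matrix.csv"))
write.csv(Y0, file.path(filepath, "N560_p10_DrugADR_Matrix.csv"))

# Read both matrices back, using the first column as drug names
dataMatrix <- as.matrix(read.csv(file.path(filepath, "N560_p16_DB_Matrix.csv"),
                                 row.names = 1))
Y <- as.matrix(read.csv(file.path(filepath, "N560_p10_DrugADR_Matrix.csv"),
                        row.names = 1))

# The rows of both matrices must refer to the same drugs in the same order
stopifnot(identical(rownames(dataMatrix), rownames(Y)))

# Outputs go to the same directory, one beta file per ADR group
outFile <- file.path(filepath, paste0(colnames(Y)[1], "_Betas.txt"))
print(outFile)
```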


====================
CalcPValue_Example.R
====================
- Description:
       - Example script showing how the p-values for the individual protein features were calculated. Takes VinaLC scores and adverse drug reaction (ADR) data from SIDER for our N=560 drug set
         and constructs a 560 drug (row) x 409 protein (column) data matrix and a 560 drug (row) x 10 ADR group (column) response matrix. Also takes the beta coefficients from a logistic regression model trained
         on this data and the list of potential predictors. Produces a list of features with their p-values (not corrected for false discovery rate) and the corresponding q-values. Uses Fisher's exact test for discrete features
         and the Wilcoxon rank-sum test for continuous, real-valued features.
- User-specified parameters:
       - 'dockThresh' - User can either use the raw docking score as features, or use the 'dockThresh' variable to set a threshold such that
                        docking scores above 'dockThresh' are considered 'unbound' and docking scores below 'dockThresh' are considered 'bound'.
       - 'adrIndex' - Index of the ADR group of interest for the p-value calculation. Runs from 1 to 10, in the same order in which the ADR group input files appear in the code.
- Requires the following R package(s):
       - glmnet
- Requires the following Inputs:
       - n x p drug-protein design matrix called 'dataMatrix'
       - n x 10 drug-ADR group response matrix called 'Y'
- Produces the following Output:
       - 'proteinPValVect' - vector of length p containing the p-values, also printed out.
       - 'proteinQValVect' - vector of length p containing the q-values, also printed out.
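- Illustrative sketch (not the script itself):
       - The per-feature testing described above could be written roughly as follows, using the object names 'dataMatrix', 'Y', 'adrIndex', 'proteinPValVect' and 'proteinQValVect' from this section.
         The data are synthetic, the helper 'featurePValue' is hypothetical, and computing q-values with the Benjamini-Hochberg adjustment is an assumption; the script may use a different method.

```r
set.seed(3)
n <- 120; p <- 5                                  # toy sizes, not the real data
dataMatrix <- matrix(rnorm(n * p, mean = -6), n, p,
                     dimnames = list(NULL, paste0("prot", 1:p)))
Y <- matrix(rbinom(n * 10, 1, 0.4), n, 10)        # toy drug x ADR-group responses
adrIndex <- 1
y <- Y[, adrIndex]
dataMatrix[y == 1, 1] <- dataMatrix[y == 1, 1] - 1.5   # plant one associated protein

featurePValue <- function(x, y) {                 # hypothetical helper
  if (all(x %in% c(0, 1))) {
    # Discrete (bound/unbound) feature: Fisher's exact test on the 2 x 2 table
    fisher.test(table(factor(x, levels = 0:1), factor(y, levels = 0:1)))$p.value
  } else {
    # Continuous score: Wilcoxon rank-sum test between ADR and non-ADR drugs
    wilcox.test(x[y == 1], x[y == 0])$p.value
  }
}

proteinPValVect <- apply(dataMatrix, 2, featurePValue, y = y)
# Benjamini-Hochberg adjustment is an assumption of this sketch
proteinQValVect <- p.adjust(proteinPValVect, method = "BH")
print(cbind(p = proteinPValVect, q = proteinQValVect))
```

       - If 'dockThresh' is used to binarize the scores first, every column takes the Fisher's exact test branch; with raw scores every column takes the Wilcoxon branch.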