By Shrishaila Patil, Vice President, Statistical Programming, Navitas Data Sciences
Data Science has been fueling powerful business decisions by Industry leaders over the last few years. Data scientists are storytellers. They often need to dig into data, clean and transform it, build and validate models, understand patterns, generate insights, and, most importantly, communicate results effectively.
In the fields of Statistics, Analytics, and Visualization, the most talked-about languages, in addition to SAS, are R and Python. This article highlights the current status of R, the challenges observed with it, and proposed approaches for risk assessment of R packages, mitigation, and implementation for Clinical Trial Data Analysis.
Need of the Hour: It is of paramount importance to understand the bigger picture and “Need of the Hour” for the Pharmaceutical Industry.
“The Pharmaceutical Industry needs greater Innovation to reach patients faster, with affordable drug pricing and improved accessibility. With the ongoing COVID-19 crisis, this is even more important than ever before.”
The Industry is seeking better alternative technologies and tools that are sustainable and can address Industry challenges effectively. Are our current tools outdated? Do we have an alternative to SAS (to avoid high license and maintenance costs)? Is the Industry ready for R, Python, or another tool? We need efficient Data Science technologies and tools that can help us manage Data Lakes (for example, Big Data and Real World Data) and process them faster and more accurately. Efficiency in Data Analysis yields greater insight into the data and can help improve decision-making across Drug Development.
Innovation is needed to move away from traditional, inefficient processes and tools toward solutions that are efficient, simple, easy to implement, reliable, and cost-effective. Collaboration across Industry stakeholders is needed to develop better technology ecosystems and to agree on Validation and Regulatory benchmarks.
It is vital that we prepare our workforce with the skillsets needed for the future.
Current Trends of R in Pharmaceuticals: Looking at the current industry trends, R usage is less than 10% in activities related to Pharmaceutical Regulatory Submissions at this juncture. However, R is extensively used in public health projects, healthcare economics, exploratory/scientific analysis, trend identification, generation of Plots/Graphs, specific statistical analysis and machine learning. R is not widely used for CDISC (SDTM, ADaM) dataset creation.
One of the common questions from the Programming community is “Should we replace SAS with R, use both, or use another language (Python)?” I personally feel that, instead of choosing among SAS, R, and Python, one should leverage the best of each of these programming languages to solve the appropriate Data Science problems (one size does not fit all).
We have a few early adopters of R, and they have experienced some challenges. Ensuring regulatory compliance of R packages is one of the most common. If R is used in regulatory submissions, one needs to perform a Risk Assessment of R packages, conduct a feasibility analysis, and establish a process for R usage through Pilot Projects with the necessary documentation.
The Expansion of Toolsets:
“I have enough Tools, said no Data Scientist ever.”
In the last few years, we have witnessed the expansion of toolsets across various areas such as Analytics (SAS, R, and Python), Big Data (Hadoop, Hive, MongoDB, and Cassandra), Supporting Data Analytics Tools (Scala, Spark, and SQL), Data Transformation (Informatica, Ab Initio, and ETL), Visualization Tools (Tableau, TIBCO Spotfire, Power BI, Matplotlib, and ggplot2), Integrated Development Environments (PyCharm, Jupyter Notebook, RStudio, and IntelliJ), Web Scraping (Beautiful Soup and Scrapy), and Cloud Computing Platforms (Amazon Web Services and Microsoft Azure).
Often, we end up using several different technologies, which makes them difficult to integrate when needed.
The technology space continues to expand. We need to stay ahead in terms of the learning curve to take advantage of cutting-edge solutions required for appropriate Data Science problems.
Reasons why R can be a potentially powerful tool for Data Analysis:
R is a language and environment for statistical computing and graphics. It is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. As open-source software, R receives huge support from the Community. Source code availability provides superior and thorough documentation.
R compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed.
R has approximately 2 million users worldwide.
R has nearly three decades of history. It was created in 1993 by Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand, and is similar to the S language. Since 1997, there has been an R Core group with write access to the R source code.
The R Development Core Team founded “The R Foundation,” a not-for-profit organization working in the public interest. The R Foundation provides support for the R project and for innovations in statistical computing. It also holds and administers the copyright of R software and documentation. The R Foundation is seated in Vienna, Austria, and is currently hosted by the Vienna University of Technology.
The first official annual gathering of R users, called "useR!", was held in Vienna, Austria, in 2004. The latest CRAN (Comprehensive R Archive Network) repository has more than 15,500 R packages.
R Packages for Clinical Trial Design, Monitoring, and Analysis: R has many packages for Clinical Trial data analysis. The following are a few examples: atable (Creates Tables for Reporting Clinical Trials), compareODM (Comparison of medical forms in CDISC ODM format), CRTSize (Sample size estimation in cluster (group) randomized trials), blockrand (Creates randomizations for block random clinical trials), DoseFinding (Supports design and analysis of dose-finding experiments), pact (Predictive Analysis of Clinical Trials), SASxport (Read and write 'SAS' 'XPORT' Files), ADCT (Adaptive Design in Clinical Trials), clinPK, cpk (Clinical Pharmacokinetics Toolkit), randomizeR (Randomization for Clinical Trials), base R (Lots of functionality useful for design and analysis of clinical trials), greport (Graphical Reporting for Clinical Trials), and coronavirus (Provides a daily summary of the Coronavirus (COVID-19) cases by state/province).
R Implementation in the Pharmaceutical Industry: Real-world examples
Amgen integrates SAS and R using Microsoft DeployR: Although SAS has been the primary tool at Amgen, R (with ggplot) was considered because of a lack of SAS graph macros. Because the SAS Grid and the R environment were hosted on different physical servers at Amgen, an integration layer was needed, and Microsoft DeployR was chosen.
DeployR is an integration technology for deploying R analytics inside web, desktop, mobile, and dashboard applications. The SAS procedure PROC GROOVY enables SAS code to run Groovy code on the Java Virtual Machine (JVM). In this solution, PROC GROOVY is used to invoke Java code that calls the DeployR Java Client Library.
Roche’s bioWARP and other R Solutions:
bioWARP (biostatistical Web-Applications and R Procedures - Self-service statistics with Shiny): QA processes in Roche Diagnostics depend on statistical evaluations. The Biostatistics department decided to enable users to perform these evaluations themselves. The R-Shiny app bioWARP brings standard procedures such as linear regression or equivalence tests to people who cannot code in R. It saves them the time they would have spent consulting a Biostatistician or using validated Excel sheets.
Roche is developing an R-based ecosystem of tools, processes, and environments to create datasets, tables, listings, and graphs (TLGs). The rtables package (https://github.com/Roche/rtables) is used to generate tables as part of data analysis. Graphs are made with ggplot2, grid, and lattice.
Using the R Interface in SAS to Call R Functions and Transfer Data: In 2009, the SAS/IML Studio application introduced a mechanism for calling R functions from programs written in the IMLPlus language. As of SAS/IML 9.22, this feature is available in PROC IML. Example: The following program transfers a dataset from the Sashelp libref into an R data frame named df. The program then submits an R statement that displays the names of the variables in the data frame.
call ExportDataSetToR("Sashelp.Class", "df");
submit / R;
   names(df)
endsubmit;
The R names function produces the output shown below.
 "Name" "Sex" "Age" "Height" "Weight"
RStudio is one of the most widely used IDEs (Integrated Development Environments) among R programmers.
The cheatsheets make it easy to use some of the common R packages. RStudio offers many cheatsheets, including: base-r, advancedR, tidyverse (for data import, tidying, manipulation, visualization, and programming), dplyr (Data Transformation), ggplot2 (Data Visualization), lubridate (Dates and times), golem (a framework for building robust Shiny apps), Shiny (to build interactive web apps straight from R), mlr (Machine Learning with R), keras (R package enabling use of Keras and TensorFlow in R for Deep Learning), reticulate (using Python and R together seamlessly in R code, in R Markdown documents, and in the RStudio IDE), and R Markdown (.Rmd file to reproduce your work by rerunning code).
In May 2015, the US FDA released a Statistical Software Clarifying Statement. The FDA does not require use of any specific software for statistical analysis. However, software packages used for statistical analysis should be fully documented in the submission, including version and build identification. Also, documentation of appropriate software testing procedures should be available.
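One way to make this documentation requirement routine is to keep a small, machine-readable inventory of the software behind each analysis. The following Python sketch is illustrative only: the `SoftwareRecord` structure, the `format_inventory` helper, and the version and build strings are hypothetical placeholders, not anything prescribed by the FDA statement.

```python
from collections import namedtuple

# Hypothetical record: one entry per piece of software used in an analysis.
SoftwareRecord = namedtuple("SoftwareRecord", ["name", "version", "build"])


def format_inventory(records):
    """Return one documentation line per software item, sorted by name."""
    lines = []
    for rec in sorted(records, key=lambda r: r.name.lower()):
        # Refuse incomplete entries so gaps are caught before submission.
        if not (rec.name and rec.version and rec.build):
            raise ValueError(f"incomplete record: {rec!r}")
        lines.append(f"{rec.name} | version {rec.version} | build {rec.build}")
    return lines


# Placeholder values for illustration only.
inventory = [
    SoftwareRecord("R", "4.0.2", "x86_64-pc-linux-gnu"),
    SoftwareRecord("ggplot2", "3.3.2", "CRAN source"),
]
for line in format_inventory(inventory):
    print(line)
```

Rejecting records with an empty version or build is a design choice in this sketch; it turns a missing identifier into an immediate error rather than a gap discovered during review.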
In March 2018, the FDA released the Study Data Technical Conformance Guide. Regarding the delivery of software programs, paragraph 4.1.2.10 states: “Sponsors should provide the software programs used to create all ADaM datasets and generate tables and figures associated with primary and secondary efficacy analyses. Furthermore, sponsors should submit software programs used to generate additional information included in Section 14 CLINICAL STUDIES of the Prescribing Information (PI) if applicable. The specific software utilized should be specified in the ADRG. The main purpose of requesting the submission of these programs is to understand the process by which the variables for the respective analyses were created and to confirm the analysis algorithms. Sponsors should submit software programs in ASCII text format; however, executable file extensions should not be used.”
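The last sentence of that paragraph lends itself to an automated pre-submission check. The Python sketch below is a minimal illustration under stated assumptions: the `check_program_file` helper and the set of extensions treated as executable are hypothetical, and a sponsor would substitute its own policy.

```python
import tempfile
from pathlib import Path

# Assumed list of executable extensions; adjust to the sponsor's own policy.
EXECUTABLE_EXTENSIONS = {".exe", ".bat", ".cmd", ".com"}


def check_program_file(path):
    """Return a list of problems found; an empty list means the file passes."""
    path = Path(path)
    problems = []
    if path.suffix.lower() in EXECUTABLE_EXTENSIONS:
        problems.append(f"executable file extension: {path.suffix}")
    try:
        # Decoding as ASCII fails on the first byte outside the 0-127 range.
        path.read_bytes().decode("ascii")
    except UnicodeDecodeError as err:
        problems.append(f"non-ASCII byte at offset {err.start}")
    return problems


# Demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "adsl.txt"
    good.write_bytes(b"adsl <- data.frame()\n")
    bad = Path(tmp) / "tables.txt"
    bad.write_bytes("# caf\u00e9\n".encode("utf-8"))
    print(check_program_file(good))
    print(check_program_file(bad))
```

The function returns a list of problems instead of raising, so a whole program directory can be scanned and every issue reported in one pass.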
Challenges and Validation of R:
R is free, but it is an investment. The main challenge of using R is ensuring validation documentation. R needs to be programmed (How do we develop software for Clinical Science that enables collaboration across the enterprise and the industry?). R has too many packages (Which packages are validated?). R packages may come from anywhere, may be written by anyone, and may not follow a typical SDLC (Software Development Life Cycle).
R Foundation Documentation: The R Foundation has released two guidance documents, listed below.
- R: Regulatory Compliance and Validation Issues: A guidance document for the use of R in regulated clinical trial environments (March 2018). The focus is on ICH/FDA Guidelines. It is applicable to Base R plus Recommended Packages.
- R: Software Development Life Cycle: A description of R’s development, testing, release, and maintenance processes. Source code is maintained in Subversion and is available as archive files called “tarballs.” Changes are tracked in the NEWS file, which is updated regularly. There is always one current version of R. A major release happens every year in the month of April.
R Validation Hub: Enabling Use of R in Regulatory Setting:
The R Validation Hub is a cross-industry initiative. Its mission is to enable the use of R by the Bio-Pharmaceutical Industry in a regulatory setting, where the output may be used in submissions to regulatory agencies.
The R Validation Hub is comprised of participants from across the Pharmaceutical Industry (AbbVie, Amgen, Astellas, Bayer, Boehringer-Ingelheim, Celgene, Eli Lilly, FDA, Genentech, Gilead, GSK, Johnson & Johnson, Merck, Novartis, Novo Nordisk, Pfizer, Roche, RStudio, Sanofi, Teva Pharmaceutical Industries Ltd, and many more). Participants contribute to the effort through regular group meetings, as well as the various workstreams that make up the project.
The group focuses on designing a framework that assesses the quality of R packages (contributed by volunteers) and on creating a repository of “accepted” packages.
Risk Assessment Framework: The current technical checks in the “Checklist for CRAN Submission” do not necessarily guarantee the accuracy of an R package. It is therefore suggested that a risk assessment exercise be conducted to evaluate the likely accuracy/validity of an R package with respect to its intended use. For R, the primary challenge is in ensuring the accuracy of results. A risk-based approach to the adoption of R packages is highly recommended.
The Risk Assessment Framework should evaluate R packages based on four criteria:
- Purpose: Statistical packages pose greater risks as primary and secondary statistical analysis for a study might be based upon statistical models.
- Maintenance of Good Practice (Software Development Life Cycle): SDLC best practices help to reduce bugs/errors. One needs metrics to check whether a package has a website, whether it has a formal mechanism for bug tracking, whether its source code is publicly maintained, how often new versions are released, what type of license it uses, etc.
- Community Usage: The User Community plays an important role in open-source software development. More usage leads to more downloads and more testing. This helps gauge the level of risk that a package presents.
- Testing: This is a vital component of a well-established SDLC. More tests mean more confidence in the stability of the package over time. Packages should include unit tests. There should be a standard process for developing functions/macros internally. Requirements should be documented, and tests should be written against each requirement. Known results from literature are typically the best reference for testing complex statistical procedures.
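The four criteria above can be combined into a single indicative score. The Python sketch below is one hypothetical way to do so and is not the R Validation Hub's method: the weights, the 0-to-1 rating scale, the thresholds, and the example ratings are all assumptions made for illustration.

```python
# Hypothetical weights for the four criteria; they sum to 1.0.
WEIGHTS = {
    "purpose": 0.4,    # statistical packages carry the most weight
    "sdlc": 0.2,       # maintenance of good practice
    "community": 0.2,  # community usage
    "testing": 0.2,    # extent of testing
}


def risk_score(ratings):
    """Combine per-criterion risk ratings (0 = low risk, 1 = high risk)."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise KeyError(f"missing ratings: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= ratings[name] <= 1.0:
            raise ValueError(f"{name} rating must be between 0 and 1")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


def risk_level(score, low=0.33, high=0.66):
    """Map a score to a coarse label; the thresholds are illustrative."""
    if score < low:
        return "low"
    if score < high:
        return "medium"
    return "high"


# Made-up ratings for an imaginary statistical package.
example = {"purpose": 1.0, "sdlc": 0.25, "community": 0.5, "testing": 0.25}
score = risk_score(example)
print(round(score, 2), risk_level(score))
```

A weighted sum is the simplest possible aggregation. An organization might instead treat a high rating on Purpose as an automatic trigger for additional testing, regardless of the other three criteria.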
Data Science is Evolving Fast: Industry is looking for better alternative solutions to unlock valuable insights from rich and diverse data. Acknowledging and protecting time to learn and play with new languages is key.
Innovation and Collaboration: Innovation and Collaboration across Industry stakeholders are key to developing better technology ecosystems and to agreeing on Validation and Regulatory benchmarks.
SAS/R/Python?: Instead of choosing among SAS, R, Python, and other tools, one should leverage the best of each of these programming languages to solve the appropriate Data Science problems.
Data Quality and Scientific Integrity: Regulatory compliance is critical through necessary Risk Assessment Framework, Validation, and Documentation.
- The R Project for Statistical Computing can be found at https://www.r-project.org/
- Detailed list of R packages for Clinical Trial design, monitoring and analysis can be found at https://cran.r-project.org/web/views/ClinicalTrials.html
- List of all CRAN (Comprehensive R Archive Network) packages can be found at https://cran.r-project.org/web/packages/available_packages_by_name.html
- R packages for Covid19 can be found at https://cran.r-project.org/web/packages/available_packages_by_name.html
- PharmaSUG 2017 (PO022 - SAS & R Playing Nice Together, by David Edwards, Amgen) https://www.pharmasug.org/proceedings/2017/PO/PharmaSUG-2017-PO22.pdf
- Role of R in Drug Discovery, R&D – by Roche, Genentech https://rstudio.com/resources/webinars/the-role-of-r-in-drug-discovery-research-and-development/
- Roche.Diagnostics.bioWARP https://rinpharma.github.io/website2018/program/the-largest-shiny-application-in-the-world-roche-diagnostics-biowarp.html
- Reference Papers for “Using the R interface in SAS” can be found at https://documentation.sas.com/?docsetId=imlug&docsetTarget=imlug_r_sect001.htm&docsetVersion=14.2&locale=en https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3556-2019.pdf https://www.pharmasug.org/proceedings/2016/QT/PharmaSUG-2016-QT14.pdf
- Detailed list of useful R cheatsheets can be found at https://rstudio.com/resources/cheatsheets/
- Checklist for CRAN Submission can be found at https://cran.r-project.org/web/packages/submission_checklist.html
- Guidance for use of R in Regulated Clinical Trial Environment and R’s SDLC process https://www.r-project.org/certification.html
- R Validation Hub Cross Industry Initiative to enable use of R in Regulatory setting. For more information, please check https://www.pharmar.org/
- A Risk-based Approach for Assessing R package Accuracy within a Validated Infrastructure https://www.pharmar.org/white-paper/