Good Programming Practice Guidance
Good Programming Practice Project |
---|
The PHUSE Good Programming Practice guidance and associated documents are maintained by the PHUSE Good Programming Practice project team. |
Published Guidance Document |
---|
Version 1 of the guidance has been published and is available as an MS Word document for adoption: GPP Guidance Document v1.1. A one-page poster summary of GPP principles is also available for download. The guidance is open for public comment, and comments will be incorporated in periodic revisions, so please continue to review it and provide your feedback. |
Introduction |
---|
This document provides guidance for Good Programming Practices (GPP) for data manipulation, analysis and reporting of clinical data in health and life sciences organizations. GPP addresses the way in which you write code and comments. The key principles of Good Programming Practice are followed in this document.
This guidance is primarily aimed at SAS programmers; however, the principles of GPP also apply to other languages such as R and Stata. Although it was not written with SAS macros in mind, the same principles apply to macros too. We often have to update existing programs to add new rules, copy programs from one study to another, and take over programs written by others. This guideline aims to show how to produce well-structured and well-documented programs that are easy to read and maintain over time. It is meant to apply to all programs, and hence to all programmers regardless of experience. Specific rules may be of most use to novice programmers, but experienced programmers and mentors should also keep the principles in mind. |
Why Good Programming Practice? |
---|
Good Programming Practice (GPP) is important within the life sciences and healthcare industries because an increased need for efficiency means that clear, maintainable and efficient code is more important than ever. Efficient code and best practices should not conflict with one another. It is essential to have guidelines that govern code clarity, efficiency, re-usability, adaptability and robustness. Good Programming Practices:
|
What is Good Programming Practice? |
---|
Some examples of GPP from the above definition:
- Use of headers
- Comments
- Self documenting code
- Style conventions (such as indenting, clear definition of datasteps and procedures)
- Dynamic programming: writing programs to accommodate potential changes to data or specifications
- Test first design |
Good Programming Practice (GPP Project Team) |
---|
|
Articles on Programming Practices |
---|
Quick guide to GPP, courtesy of Shafi Consultancy
Useful Principles and Practices for Professional SAS Programmers |
Good Programming Practice Conference and Discussion Clubs |
---|
GPP discussion Club at PHUSE Conference London 2014 |
Study on Good Programming Practices in Health and Life Sciences |
---|
Through this first study on Good Programming Practice in health and life sciences, we are looking to understand
If so-
|
How can you contribute? |
---|
|
Getting Started With a New Project |
---|
When starting work on any new study, it is important to familiarize yourself with the study. Review the study documents and try to understand the following:
Study documents include:
Before you start programming, it is important that you familiarize yourself with all the relevant company IT systems, standards, SOPs and Guidelines on both programming and program validation. All these should be adhered to.
Now that you are ready to start programming, keep in mind some basic standards:
|
Language |
---|
The language used in programming code and within headers and comments is English. |
Program Header |
---|
A standard header should be used for every program. The purpose of the header is to identify the program and provide documentation including revision history. It provides the necessary information to a code reviewer to identify and understand the program and its development life cycle. Standardizing the header will allow the information contained in the header to be leveraged programmatically for things such as auditing, project documentation, macro and dataset use tracking, consistency checking, and revision history reporting. The elements included in a header will vary from organization to organization but below is a discussion of some of the most common elements.
Required elements
The following should be included in all program headers:
Recommended elements
The following are not required but are highly recommended in all program headers:
|
Revision History |
---|
The revision history section is critical for documenting the revisions made to a program once it is put into production. A well-designed revision history section should include the author of the change, the release date of the change, and a short description of the change. It may also include a version number for each change, which can be used as a reference in the code. |
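As an illustration of how a standardized revision history can be read programmatically, here is a minimal Python sketch. The guidance itself targets SAS and does not prescribe a layout; the line format used here (`* version | date | author | description`), the names `Revision` and `parse_revision_history`, and the example header contents are all assumptions made for the example.

```python
import re
from typing import List, NamedTuple


class Revision(NamedTuple):
    version: str
    date: str
    author: str
    description: str


# Hypothetical layout: "* <version> | <YYYY-MM-DD> | <author> | <description>"
REVISION_LINE = re.compile(
    r"^\*\s*(?P<version>\d+(?:\.\d+)*)\s*\|\s*(?P<date>\d{4}-\d{2}-\d{2})\s*\|"
    r"\s*(?P<author>[^|]+?)\s*\|\s*(?P<description>.+?)\s*$"
)


def parse_revision_history(header_text: str) -> List[Revision]:
    """Return every header line that matches the assumed revision layout."""
    revisions = []
    for line in header_text.splitlines():
        match = REVISION_LINE.match(line)
        if match:
            revisions.append(Revision(**match.groupdict()))
    return revisions


# Illustrative header; program name, authors and dates are made up.
EXAMPLE_HEADER = """\
* Program: ae_table.sas
* Revision history:
* 1.0 | 2014-03-01 | A. Programmer | Initial production release
* 1.1 | 2014-06-15 | B. Reviewer | Added handling for unscheduled visits
"""
```

Calling `parse_revision_history(EXAMPLE_HEADER)` yields one `Revision` per history line, which could then feed an audit or revision report. A fixed delimiter is only one possible choice; what matters is that every program uses the same one.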
Comments |
---|
Comments help anyone reviewing, modifying or using a program to understand the code quickly. All major data or proc steps should be commented, especially data-specific and complex code. Ideally, comments should be comprehensive and should describe the rationale, not simply the action. For example, instead of simply typing “Access demography data”, describe which data elements you are accessing and why they are needed: “Bringing in DM to get gender and age and subset to include only the intent to treat population”. Comments can also include links to external documentation (requirement specifications, design documents). Programs can also be split into sections using a distinct comment style, e.g. several rows of asterisks. This helps to structure the program and makes it easier for others to get an overview of it. |
Naming Conventions |
---|
All organizations should have standard naming conventions. Program naming conventions should make it possible to identify groups of related programs, such as adverse events tables. Dataset and variable names should describe their content as clearly as possible. Temporary datasets and variables should be named consistently in a way that makes their purpose and role in the program clear. Where possible, organizations should use industry-level standards such as Clinical Data Interchange Standards Consortium (CDISC) standards for permanent datasets and variables. This aids sharing of programs across companies and facilitates development of standard code. Space characters should be avoided in variable, dataset and output file names. |
Coding Conventions |
---|
In order to be efficient and streamline the sharing of program code between programmers, with regulatory agencies, and with external partners or vendors, it is vital for code structure to follow standard conventions. SAS code which follows these conventions is much easier to read, modify, maintain, and correct. These conventions are divided into those which should be considered as required, and those which are merely recommendations to be followed as applicable.
Required conventions
Recommended conventions
|
Log File Checking |
---|
As part of development and validation practices, it is often mandated that the generated log file is checked to ensure that the program has executed as intended. Many companies have their own automatic log-checking utilities to aid in this, and there are many examples of such tools in widely available papers. “ERROR” and “WARNING” messages in logs should normally be avoided. There are sometimes exceptions to this, such as warnings output from statistical models that do not have enough data. Ordinarily, any warnings that are deemed acceptable should be documented. There are also some specific “NOTE”s that can indicate a problem. The common “NOTE”s that should normally be avoided include those relating to “repeats”, “more than one”, “uninitialized” and “referenced”. Also, any user-defined checks that have been added, such as from defensive programming, should be checked for in the log and followed up on. A company-specific naming convention for user-defined checks can aid in this, so that the specific string can be searched for within the log. Examples of such conventions include “ISSUE:”, “USER:”, and “ALERT:”. Avoid user-generated messages labeled “NOTE:”, “WARNING:” or “ERROR:”, as these may make it difficult to find genuine problems when searching the log. |
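The checks described in this section can be sketched as a small log scanner. This is a minimal Python illustration, not one of the company utilities mentioned above. The function name `scan_log`, the three finding labels, and the sample log text are assumptions for the example; the search strings are the ones named in this section.

```python
import re
from typing import List, Tuple

# Message types that should normally not appear in a clean log.
SEVERITY_PATTERN = re.compile(r"^(ERROR|WARNING)\b")
# NOTE phrases named in the guidance as possible signs of a problem.
SUSPECT_NOTE_PATTERN = re.compile(
    r"^NOTE:.*(repeats|more than one|uninitialized|referenced)", re.IGNORECASE
)
# Example prefixes for user-defined checks (company-specific in practice).
USER_CHECK_PATTERN = re.compile(r"^(ISSUE|USER|ALERT):")


def scan_log(log_text: str) -> List[Tuple[int, str, str]]:
    """Return (line number, finding type, text) for each line needing review."""
    findings = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        text = line.strip()
        if SEVERITY_PATTERN.match(text):
            kind = "severity"
        elif SUSPECT_NOTE_PATTERN.match(text):
            kind = "suspect note"
        elif USER_CHECK_PATTERN.match(text):
            kind = "user check"
        else:
            continue
        findings.append((number, kind, text))
    return findings


# Illustrative log text only; dataset and variable names are made up.
SAMPLE_LOG = """\
NOTE: There were 120 observations read from the data set WORK.DM.
NOTE: MERGE statement has more than one data set with repeats of BY values.
WARNING: Variable AGE already exists on file WORK.ADSL.
NOTE: Variable trtsdt is uninitialized.
ISSUE: Unexpected visit value found in WORK.VS.
NOTE: DATA statement used (Total process time):
"""
```

Running `scan_log(SAMPLE_LOG)` flags lines 2 to 5 and leaves the two routine notes alone. Because user-defined checks have their own prefix, they are reported separately from genuine errors and warnings, which is the reason for the naming convention recommended above.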
Portability |
---|
Most organizations now work across multiple platforms, commonly combining Windows and Unix environments. There can be many occasions where code works on one platform and not on another. Portability is about more than working across multiplatform environments; it is also about making programs easier to reuse across projects. Below are some suggestions to address some of the most common impediments to portability.
|
Hard Coding |
---|
Hardcoding is the modification of the value of an item of source data within program code. Hardcoding should be avoided whenever possible in final code, and changes to source data should be made in data entry or capture systems, which provide better compliance with regulations such as FDA 21 CFR Part 11. Hardcoding may be done temporarily to get a program to run despite dirty data or to correct for database inconsistencies. Permanent hardcoding to fix incorrect data values in a final database is strongly discouraged, but if it is unavoidable then it must be approved following a standard process (usually an SOP) and clearly documented using standard comments and PUT statements to the log to show what has been hard coded. |
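To illustrate the documentation requirement, the following minimal Python sketch applies an approved hardcode and produces a log message for every value it changes, in the spirit of the PUT statements mentioned above. The function name `apply_hardcode`, the `HARDCODE:` prefix, and the approval reference format are assumptions for the example; the subject identifiers and values are made up.

```python
from typing import Dict, List


def apply_hardcode(
    records: List[Dict[str, str]],
    subject_id: str,
    variable: str,
    new_value: str,
    approval_ref: str,
) -> List[str]:
    """Overwrite one variable for one subject and report each change made.

    approval_ref is a hypothetical identifier for the approval obtained
    under the organization's standard process.
    """
    messages = []
    for record in records:
        if record["USUBJID"] == subject_id:
            old_value = record.get(variable)
            record[variable] = new_value
            message = (
                f"HARDCODE: {variable} for {subject_id} changed from "
                f"{old_value!r} to {new_value!r} (approval {approval_ref})"
            )
            print(message)  # written to the run log so the change is visible
            messages.append(message)
    return messages
```

For example, `apply_hardcode(records, "S-001", "SEX", "M", "HC-001")` changes only the records for subject `S-001` and returns one message per changed record. An empty result means the hardcode no longer matches any data, which suggests it may be obsolete and should be reviewed.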
Defensive Programming |
---|
Defensive programming is an approach to programming intended to anticipate future changes of the data that might influence the coding algorithms. Ideally programs should be written in such a way that they will continue to work correctly in case of new or unexpected data values which did not exist at the time the code was developed. Analysis dataset and table programs are often developed in the early stages of a project or even when the only available data is test data. In these situations the data often does not contain all possible values of data points such as visits or time points, race values, and questionnaire responses, but the program must be able to handle those values when they do become present in the data at a later point. |
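A minimal Python sketch of this idea follows: known visit values are mapped to an analysis order, and any value that did not exist when the code was written is flagged instead of being silently dropped or mis-mapped. The mapping `VISIT_ORDER`, the function name `map_visits`, and the `ALERT:` prefix are assumptions for the example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical visit values known when the program was developed.
VISIT_ORDER: Dict[str, int] = {"SCREENING": 1, "BASELINE": 2, "WEEK 4": 3}


def map_visits(visits: Iterable[str]) -> Tuple[List[Optional[int]], List[str]]:
    """Map visit names to their order and collect alerts for unknown values."""
    mapped: List[Optional[int]] = []
    alerts: List[str] = []
    for visit in visits:
        key = visit.strip().upper()
        if key in VISIT_ORDER:
            mapped.append(VISIT_ORDER[key])
        else:
            # Keep the record, mark it unmapped, and raise a searchable alert.
            mapped.append(None)
            alerts.append(f"ALERT: unexpected visit value {visit!r}")
    return mapped, alerts
```

With the input `["Screening", "Week 4", "Unscheduled"]`, the first two visits are mapped and the third produces an alert. Keeping the unexpected record with an explicit missing value, rather than letting it fall into a default branch, makes the problem visible both in the output and in the log.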