Style Guide
Hello and welcome to the Longo Lab Informatics style guide. This will provide a general reference for Longo Lab members to generate reproducible code. Open science and all that.
Languages
We end up using quite a few languages in the Longo Lab, but the core of our codebase is centered around R for bioinformatic analysis (with python supporting) and shell scripting to interface with the clusters and document any and all command line entries necessary to reproduce research results. To keep it manageable, we use the following resources for coding style. In addition to basic syntax, each has some resources for checking or modifying code style on the fly.
R and R Notebook
We use R scripts (*.R) when running analyses in batch mode. For interactive analyses, R Markdown notebooks (*.Rmd) provide an easy way to run limited analyses on SCG OnDemand and track your notes on one-off analyses and projects.
For R style, the tidyverse style guide has been extensively updated by Hadley Wickham and offers a consistent solution, with the lintr package for linting code and the styler package for interactive restyling. You can use these in an OnDemand RStudio session to reformat legacy code quickly.
Python
We primarily use python scripts in batch processing, but SCG OnDemand contains the resources for Jupyter notebooks (*.ipynb) to generate interactive analysis sessions.
For python style, the language showed up with PEP 8 baked in, making it pretty easy. There are a number of linters and formatters to check or autoformat your code.
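As a rough illustration of the kind of thing a linter checks, the sketch below flags two PEP 8 issues (lines longer than 79 characters and trailing whitespace) using only the standard library. The function name `check_style` is hypothetical. This is a toy stand-in for a real linter, not a replacement for one.

```python
def check_style(source, max_len=79):
    """Return a list of (line_number, message) tuples for simple PEP 8 issues.

    A toy checker: it only looks at line length and trailing whitespace.
    """
    problems = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if len(line) > max_len:
            problems.append(
                (lineno, f"line too long ({len(line)} > {max_len})")
            )
        if line != line.rstrip():
            problems.append((lineno, "trailing whitespace"))
    return problems


if __name__ == "__main__":
    # Line 1 has a trailing space; line 3 is far longer than 79 characters.
    sample = "x = 1 \ny = 2\nz = " + "1 + " * 30 + "1\n"
    for lineno, message in check_style(sample):
        print(f"{lineno}: {message}")
```

A real linter covers far more rules than this, so treat the sketch only as a picture of what "checking style" means.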
Shell scripts
In general, we use shell scripting not for code development but for sbatch submissions to Slurm and for documenting the commands used in an analysis. Every analysis folder should have a shell script, 00-commands.sh, that records every command used in the analysis, including sbatch submissions.
Style is loosely structured around the Google shell style guide but primarily focused on documentation (see also Headers below).
Linters, stylers & auto-formatting
These tools can make your life much easier by automating the tedious parts. In the same way, generative ML models for code can speed things along if you ask them to clean up your code, but you should still check the output with a linter.
Headers
Another universal policy is that all scripts, be they *.py, *.R, or *.???, should have a script header that describes what the script does. We have plenty of legacy code lying around that even its author couldn’t figure out. Give it a month or two and you too can be the proud owner of a mystery box. So think of this as mandatory. Here are some examples of helpful information to include in your header (templates are available on SCG in the scripts folder):
R script header
#!/usr/bin/env Rscript
#' ---------------------------
#'
#' Script name: 12_sc_wkflow_subclass_de.R
#'
#' Version: 0.1.0
#'
#' Purpose of script: Performing differential expression after integration
#'
#' Author: Robert R Butler III
#'
#' Date Created: 2022-08-29
#'
#' Copyright (c) 2022
#' Email: rrbutler@stanford.edu
#'
#' ---------------------------
#'
#' Notes:
#'
#' This version introduces a breaking change, shifting date prefixes to round
#' numbers
#'
#' Usage:
#' sbatch -J MG --mem=50G -c 8 -t 01:00:00 -p interactive \
#' -o %x/%A_sc_wkflow_subclass_de_%x.log \
#' --wrap "ml R/4.0; Rscript 12_sc_wkflow_subclass_de.R MG R6"
#'
#' or interactive session:
#' sdev -m 50 -c 8 -t 01:00:00 -p interactive
#'
#' ---------------------------
#' load up the packages we will need: (uncomment as required)
library(Seurat)
library(future.apply)
library(patchwork)
library(ggplot2)
library(ggrepel)
library(ggpubr)
library(data.table)
library(stringr)
library(Cairo)
library(dplyr)
library(RColorBrewer)
#' ---------------------------
py script header
#!/usr/bin/env python
#!interpreter [optional-arg]
# -*- coding: utf-8 -*-
"""
Pipeline for generating a gene-set analysis using MAGMA on a set of gene lists.
Runs with a range of annotation windows surrounding the gene, and can
incorporate gene-set covariate files as defined by magma
"""
# Futures
from __future__ import print_function
# […]
# Built-in/Generic Imports
import os
import sys
# […]
# Libs
import logging
import argparse
import datetime
import subprocess as sp
# […]
# Own modules
# […]
# global variables
__author__ = 'Robert R Butler III, William A. Johnson'
__copyright__ = 'Copyright 2023, Longo Lab'
__version__ = '0.0.12'
__maintainer__ = 'Robert R Butler III'
__email__ = 'rrbutler@stanford.edu'
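One practical use for the `__version__` variable in this header is to expose it on the command line. The sketch below is a minimal illustration using the standard `argparse` "version" action. The script name `example_script` and the function `build_parser` are hypothetical and not part of the lab templates.

```python
import argparse

__version__ = '0.0.12'  # same dunder variable as in the header above


def build_parser():
    """Build a parser whose --version flag reports __version__."""
    parser = argparse.ArgumentParser(
        prog='example_script',  # hypothetical script name
        description='Hypothetical script used only to illustrate --version.',
    )
    # The built-in "version" action prints the string and exits.
    parser.add_argument(
        '--version',
        action='version',
        version='%(prog)s ' + __version__,
    )
    return parser
```

With a parser like this, the version recorded in the header and the version the script reports come from the same variable, so they cannot drift apart.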
shell script header
#!/usr/bin/env bash
###################################################################
#Script Name : 01-magma_command_curated.sh
#Description : Runs magma on PREDICT-HD set using 0kb window
#Usage : sbatch 01-magma_command_curated.sh
#Author : Robert R Butler III
#Date Created : 2023-08-11
#Email : rrbutler@stanford.edu
#Copyright (c) 2023
###################################################################
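Since headers are mandatory, a small check can catch scripts that are missing fields. The sketch below is one possible approach, assuming the field names from the shell script header above. The helper `missing_header_fields` and the `REQUIRED_FIELDS` list are hypothetical, and the R and python headers use different field names, so the list would need adjusting for them.

```python
import re

# Field names taken from the shell script header template above.
REQUIRED_FIELDS = ("Script Name", "Description", "Usage", "Author",
                   "Date Created", "Email")


def missing_header_fields(script_text, required=REQUIRED_FIELDS, n_lines=30):
    """Return the required header fields absent from the top of a script.

    Only comment lines (starting with '#') within the first `n_lines` lines
    are searched, and matching is case-insensitive.
    """
    header_lines = [
        line for line in script_text.splitlines()[:n_lines]
        if line.startswith("#")
    ]
    header = "\n".join(header_lines)
    return [
        field for field in required
        if not re.search(re.escape(field) + r"\s*:", header, re.IGNORECASE)
    ]


if __name__ == "__main__":
    example = (
        "#!/usr/bin/env bash\n"
        "#Script Name : 01-magma_command_curated.sh\n"
        "#Description : Runs magma on PREDICT-HD set using 0kb window\n"
        "#Author : Robert R Butler III\n"
    )
    print(missing_header_fields(example))
    # ['Usage', 'Date Created', 'Email']
```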
Code documentation
In addition to headers, code should also be well commented in the manner of each respective language. In particular, function documentation is a must. To support eventual use of functions across multiple scripts, follow the annotation guidelines for Roxygen2 package building in R (also see R style guide above):
R function documentation
# Define functions --------------------
#' For a given column of common names, replace them with ensembl gene ids.
#' Includes a filter for autosomal genes that are not pseudo or small RNAs.
#'
#' @param dt Data table to replace names
#' @param colname Name of the column containing gene symbols
#' @param keep.symbols Boolean, retain the symbols column?
#'
#' @return dt with a GENE column
convert_gene_symbol <- function(dt, colname, keep.symbols = FALSE) {
...
For python, we can stick with relatively simple docstrings for functions and classes:
py function documentation
import pandas as pd


def get_spreadsheet_cols(file_loc, print_cols=False):
    """Gets and prints the spreadsheet's header columns

    Parameters
    ----------
    file_loc : str
        The file location of the spreadsheet
    print_cols : bool, optional
        A flag used to print the columns to the console (default is
        False)

    Returns
    -------
    list
        a list of strings that are the header columns
    """
    file_data = pd.read_excel(file_loc)
    ...
py class documentation
class Animal:
    """
    A class used to represent an Animal
    ...

    Attributes
    ----------
    says_str : str
        a formatted string to print out what the animal says
    name : str
        the name of the animal
    sound : str
        the sound that the animal makes
    num_legs : int
        the number of legs the animal has (default 4)

    Methods
    -------
    says(sound=None)
        Prints the animal's name and what sound it makes
    """

    says_str = "A {name} says {sound}"
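The class example above stops at the first attribute. The sketch below is one possible runnable completion. The `__init__` signature and the body of `says` are assumptions inferred from the docstring, not the original code. The class docstring is shortened here to avoid repeating it.

```python
class Animal:
    """A class used to represent an Animal (docstring shortened here)."""

    says_str = "A {name} says {sound}"

    def __init__(self, name, sound, num_legs=4):
        """
        Parameters
        ----------
        name : str
            The name of the animal
        sound : str
            The sound the animal makes
        num_legs : int, optional
            The number of legs the animal has (default is 4)
        """
        self.name = name
        self.sound = sound
        self.num_legs = num_legs

    def says(self, sound=None):
        """Prints the animal's name and what sound it makes

        Parameters
        ----------
        sound : str, optional
            The sound to print instead of the animal's own sound
            (default is None, which uses self.sound)
        """
        out_sound = self.sound if sound is None else sound
        print(self.says_str.format(name=self.name, sound=out_sound))


if __name__ == "__main__":
    Animal("dog", "woof").says()        # A dog says woof
    Animal("dog", "woof").says("grr")   # A dog says grr
```

Docstrings written this way are what `help(Animal)` displays, which is the main payoff of documenting every class and method.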
Maintaining your code via GitHub
An essential component of your research is notes! In order to take notes on code, you need some manner of version control, and for us that is GitHub. Getting started with GitHub can be daunting, but you can start with some simple practice via their site and some tutorials.
SCG scripts are backed up nightly
A very important reason to keep your analysis on SCG is that all scripts are backed up to our GitHub organization page nightly. You are welcome to make your own GitHub commits as you work to save a full history of your code, and ultimately to break projects out into their own respective GitHub projects for more control.
There are several things about this setup that differ from traditional GitHub projects:
Only code is backed up to the repository
File sizes on GitHub are limited to 100 MiB, and our primary purpose is script maintenance, so anything other than scripts has to be opted in. Every git parent directory (your project directory) should have a .gitignore file like this:
.gitignore
# Ignore everything
*
# But not these files...
!*.sh
!*.pl
!*.py
!*.R
!*.Rmd
!*.ipynb
!*.md
!README.md
!.gitignore
!celltype-gene-database.xlsx
!overall.name_schema.txt
# etc...
# ...even if they are in subdirectories
!*/
.Rproj.user
# But do ignore these files...
deploy.R
Note that celltype-gene-database.xlsx and overall.name_schema.txt are special files that have been opted into backup in this folder, and deploy.R has been opted out of backups.
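To see how the allow-list logic plays out, the sketch below mimics it in a simplified form: each pattern is matched against the file name only, and the last matching pattern decides. Real git matching is more involved, particularly for directories, which is why the actual file needs the `!*/` rule. The function `is_backed_up` and the shortened `PATTERNS` list are hypothetical. On the cluster, `git check-ignore -v <path>` gives the authoritative answer.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A shortened copy of the allow-list patterns shown above.
PATTERNS = [
    "*",            # ignore everything
    "!*.sh",        # ...but keep scripts
    "!*.py",
    "!*.R",
    "!*.Rmd",
    "!.gitignore",
    "!celltype-gene-database.xlsx",  # a data file opted in
    "deploy.R",     # a script opted back out
]


def is_backed_up(path, patterns=PATTERNS):
    """Return True if `path` would be tracked under the simplified rules.

    Simplification: patterns are matched against the file name only, and
    the last matching pattern wins; a leading '!' means "keep".
    """
    name = PurePosixPath(path).name
    keep = True  # files are tracked unless some pattern ignores them
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(name, glob):
            keep = negated
    return keep


if __name__ == "__main__":
    for path in ("analysis/01-run.sh", "data/counts.tsv", "deploy.R"):
        print(path, is_backed_up(path))
```

The order of the patterns matters: `deploy.R` is only excluded because it comes after `!*.R`.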
Code is backed up to the main branch
Traditionally, commits should not be made to the main branch in GitHub. However, the cluster is a shared resource, with multiple users working in the same filesystem. If they each tried to work on a different branch in the same location, they would collide. So this is an imperfect solution, to be sure, but a necessary one.
Strategies to follow for effective project control
These are a couple of things to keep your GitHub on the rails:
Break out your own project
The instructions for doing so are here. You can add it to the nightly backup (by contacting the Senior Research Scientist) or not, but once broken out it is much easier to branch and develop at your own pace.
Version control
Semantic versioning is softly included in each of the script headers above. Use it to gain control of the versions of your scripts on GitHub. We are perpetually stuck in development, so if you ever advance to v1.0.0, that will be quite a feat.
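As a small illustration of how MAJOR.MINOR.PATCH version strings behave, the sketch below parses and bumps them. The helper names `parse_version` and `bump` are hypothetical, and pre-release or build suffixes (such as `-rc1`) are deliberately not handled.

```python
def parse_version(version):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of three ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version, part="patch"):
    """Return `version` with the chosen part incremented.

    Bumping a part resets every part to its right to zero.
    """
    major, minor, patch = parse_version(version)
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


if __name__ == "__main__":
    print(bump("0.0.12"))           # 0.0.13
    print(bump("0.0.12", "minor"))  # 0.1.0
    # Compare as integer tuples: plain string comparison would rank
    # '0.0.9' above '0.0.12'.
    print(parse_version("0.0.12") > parse_version("0.0.9"))  # True
```

Comparing versions as tuples of integers rather than as strings is the detail most worth remembering here.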
Tie your commits to issues and projects
When you make a commit, you can tie it to issues by mentioning the issue number, prefixed with #, in the git commit message. It should go without saying that you always add a meaningful commit message to your commits. Also, you can and should link those repository issues to a respective project. If you forget, you can comment on the desired issue with the git-commit-id (e.g. 356b039).
Note
Not only can you say "updated #3", you can also manage issues and pull requests directly from the command line with git commit -m "Closed #5".
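To make the distinction between mentioning and closing an issue concrete, the sketch below pulls issue numbers out of a commit message. The keyword list follows the closing keywords GitHub documents (close, fix, resolve and their variants). The function `issue_references` is hypothetical, and the pattern is a simplification that only handles a keyword followed by whitespace and the `#` reference.

```python
import re

# Keywords GitHub recognizes for closing an issue from a commit message.
CLOSING_KEYWORDS = ("close", "closes", "closed", "fix", "fixes", "fixed",
                    "resolve", "resolves", "resolved")

ISSUE_REF = re.compile(r"#(\d+)")
CLOSING_REF = re.compile(
    r"\b(?:" + "|".join(CLOSING_KEYWORDS) + r")\s+#(\d+)", re.IGNORECASE
)


def issue_references(message):
    """Return (mentioned, closed) issue numbers found in a commit message."""
    mentioned = sorted({int(num) for num in ISSUE_REF.findall(message)})
    closed = sorted({int(num) for num in CLOSING_REF.findall(message)})
    return mentioned, closed


if __name__ == "__main__":
    print(issue_references("updated #3"))        # ([3], [])
    print(issue_references("Closed #5"))         # ([5], [5])
    print(issue_references("Fixes #2, see #7"))  # ([2, 7], [2])
```

A plain mention such as "updated #3" only links the commit to the issue, while a closing keyword also closes it once the commit reaches the default branch.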
USE PROJECTS!
Having a plan and keeping track of it helps a great deal once you have multiple branches, multiple issues, and multiple pull requests. See Project Management.
Etc
For any additional issues that do arise, don’t forget to ask for help! Troubleshooting is the name of the game in bioinformatics.