Data Science Tutorials

A collection of data science, programming, and data visualization tutorials I have found on the internet. Sources referenced.

A collection of resources and links

While everyone is transitioning to working from home in order to practice social distancing during the SARS-CoV-2 outbreak, lots of scientists are trying to find ways to stay engaged despite not being able to perform experiments. A good use of this time (if you have the mental bandwidth!) is to learn or brush up on coding and data science fundamentals. I’ve found a number of online tutorials and resources that I want to collect in one place to share with others. These are thrown together, so the order may not make sense everywhere.

The Command Line

The most important thing I learned in my computational biology experience is how to effectively use the command line interface, or CLI, and a shell interpreter, such as Bash. Not every problem requires programming. Some problems are best, and most efficiently, solved using software tools run from the command line or Bash scripting. Here are a few resources that I would recommend for learning about the command line and using bash:

I would also recommend learning a text editor that can be used through a terminal. My personal favorite is Vim; I even wrote a short blog post about it last year. Other CLI text editors include nano and Emacs. At least one of these usually comes preinstalled on Unix-like operating systems such as Linux or macOS, although exactly which ones are available varies by distribution and version.

I also want to mention that learning version control software is useful for computational biologists. Git is useful for collaborative projects, but it is just as useful for keeping track of your own code! Here is a simple guide for getting started with Git.

Learning R

R is a great language to learn for statistical analyses and data visualization. A good first step in learning to use R is to download RStudio for desktop. RStudio is an integrated development environment, or IDE, for the language R. IDEs are useful for writing, reading, and debugging code. For an introduction to R, this tutorial from swirl lets you learn interactively at your own pace. Here is a text-based tutorial for learning R from datamentor.

Within RStudio, you can create documents of your work using RMarkdown. These are useful for reproducibility, transparency, and keeping track of your own work. I found a tutorial for using RMarkdown from the Coding Club. RMarkdown also supports LaTeX for typesetting math; here is a guide about mathematics in RMarkdown, and there are many cheatsheets available online, too.

For more intermediate users, I would recommend diving into the Tidyverse! The tidyverse is a collection of R packages designed specifically for data science. All packages work well together and share not only an underlying design philosophy, but also grammar and data structures. You may already be familiar with some of the packages included in the tidyverse: ggplot2 and dplyr. Easily install the complete tidyverse with: install.packages("tidyverse").

R is great for data visualization (ggplot2!). Selva Prabhakaran has created a nice tutorial for R and data visualization in R using ggplot2. His site also goes through many other topics, including linear regression, model selection, and time series data analysis. The Cookbook for R section on graphs is one of my most used resources for plotting in R.

Claus O. Wilke wrote an entire book called Fundamentals of Data Visualization using RMarkdown, which he made open access as well as selling hard copies. This book is a great resource for learning about the basics of data visualization and how to avoid common pitfalls when doing things like visualizing proportions or choosing colors.

Learning Python

Codecademy has courses for learning Python that include the basics, analyzing data, data visualization, and getting started with machine learning in Python.

There are a number of Python libraries that would be useful to learn as a computational biologist. Here, I will list a few topics and related tutorials.

There are a number of IDEs out there, but Jupyter notebook seems to be the most popular for Python right now. JupyterLab is a web-based IDE for Jupyter notebooks. In order to install Jupyter on your computer, you will need conda or pip, which may require some basic Python knowledge.

Anaconda is a Python distribution built around conda, a package and environment manager that is, in theory, language agnostic (whereas pip, for example, is just a package manager for Python). Environments are useful for creating a virtual sandbox for your project, where you can keep track of software versions and installations. They also make reproducing an analysis or pipeline easier. I found this blog post on Towards Data Science to be especially helpful for getting started with using conda environments.
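To make the "keep track of software versions" idea concrete, here is a minimal Python sketch that records which versions are installed in whatever environment is currently active. The function name `snapshot_environment` and the example package names are my own choices for illustration; this is not a replacement for conda's or pip's own export tools, just a quick way to see what an environment contains.

```python
import sys
from importlib import metadata


def snapshot_environment(package_names):
    """Return the Python version and the installed version of each package.

    Packages that are not installed in the active environment map to None.
    """
    snapshot = {"python": sys.version.split()[0]}
    for name in package_names:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = None
    return snapshot


# The package names below are only examples; list whatever your project uses.
for name, version in snapshot_environment(["numpy", "pandas"]).items():
    print(f"{name}: {version}")
```

Running this inside two different environments should show different versions (or None for packages that are missing), which is exactly the kind of difference that can make an analysis hard to reproduce.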

Data Science

The RafaLab has a number of excellent teaching materials for using R for data science. They have lessons for many topics including, but not limited to:

  • R basics
  • Data visualization
  • Tidyverse
  • Machine learning in R
  • R for Life Sciences:
    • High dimensional data analysis
    • Introduction to linear models and matrix algebra

If you are interested in improving your statistical abilities, here is a Coursera course called “Improving your statistical inferences”.

The blog Eight to Late has a number of “gentle introduction” blog posts for various topics in data science, including one for linear and logistic regression. Linear and logistic regression are important topics to understand, even if you do not intend to dive into deep learning and artificial intelligence.
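For a feel of what simple linear regression actually computes, here is a small dependency-free Python sketch of an ordinary least squares line fit. The function name `fit_line` and the data points are made up for illustration; in a real analysis you would more likely reach for a statistics library, and this sketch does not cover logistic regression.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of co-deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up measurements that lie roughly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The fitted slope and intercept land close to, but not exactly on, 2 and 1, because the made-up points include a little noise.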

Learning a database language can be useful for data science. The Knight Lab has created a fun “murder mystery” game to learn SQL. They include an introductory lesson for beginners. Once you have completed the introduction (or if you already know SQL), you can jump into the murder mystery.
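If you want to try a few SQL statements without installing a database server, Python's built-in sqlite3 module is one option. The sketch below is only an illustration: the `person` table, its columns, and its rows are invented here and are not the schema used by the Knight Lab game.

```python
import sqlite3

# An in-memory database disappears when the connection is closed.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for this example.
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO person (name, city) VALUES (?, ?)",
    [("Ada", "Boston"), ("Grace", "Chicago"), ("Rosalind", "Boston")],
)

# WHERE filters rows and ORDER BY sorts the result; the ? placeholder
# keeps the value separate from the SQL text.
rows = conn.execute(
    "SELECT name FROM person WHERE city = ? ORDER BY name", ("Boston",)
).fetchall()
names = [row[0] for row in rows]
print(names)

conn.close()
```

The same SELECT, WHERE, and ORDER BY pattern is the core of most beginner SQL exercises, so it is a reasonable thing to practice first.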

Chris Albon is a data scientist who makes very easy-to-follow notes on many topics, including machine learning and Python, but also computer science, AWS, Linux, and regular expressions. His website is loaded with information and step-by-step guides, and I highly recommend spending some time browsing it.

Genomic Data Science

Genomic data science is a specialized field of data science that deals with next generation sequencing data. This Coursera course on Genomic Data Science from Johns Hopkins University covers topics from how genomic data are generated to the fundamentals of data analysis. Skills and topics included in this course:

  • Next gen sequencing and genomic technologies
  • Genome analysis; DNA, RNA, and epigenetics
  • R: Bioconductor and R programming
  • Python: Biopython and python programming
  • Galaxy

Bioconductor has also created many lessons for learning about their various tools.

High-dimensional data analysis is becoming a necessary skill for computational biologists. High-dimensional data are basically any data set (think of a table of data with columns and rows, where rows are samples and columns are features of the data, e.g. gene expression values or taxa abundances) where there are many, many more columns than rows. You might also see this written as p ≫ n (where n = number of samples and p = number of genomic features). The RafaLab resources that I mention above have lessons on this. There is also an online course from Harvard that covers topics in high-dimensional data analysis such as dealing with batch effects, dimension reduction, and PCA.
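One reason having many more features than samples needs special care is that, with enough features, some will look associated with an outcome purely by chance. The following Python sketch illustrates this with simulated noise only; the sample and feature counts, the seed, and the helper name `pearson` are arbitrary choices for the example, not values from any real data set.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples, n_features = 10, 1000  # far more features than samples

# The outcome and every feature are independent random noise.
outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
features = [
    [rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)
]

strongest = max(abs(pearson(column, outcome)) for column in features)
print(f"strongest |r| among {n_features} noise features: {strongest:.2f}")
```

Even though nothing here is truly related to the outcome, the strongest correlation comes out large, which is why multiple-testing corrections and dimension reduction come up so often with this kind of data.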

Dimensionality reduction is a topic that will touch most of our computational biology projects at some point. Anyone who has done a PCA has engaged in dimensionality reduction. Susan Holmes and Lan Huong Nguyen wrote a paper called “Ten quick tips for effective dimensionality reduction” that I would recommend reading.
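To show what a PCA is doing under the hood, here is a minimal, dependency-free Python sketch that finds the first principal component of a handful of 2D points and projects them onto it, reducing two dimensions to one. It uses power iteration on the covariance matrix, which is just one of several ways to compute a PCA; the function name and the data are made up for illustration, and for real work you would normally use an established implementation such as scikit-learn's PCA.

```python
import math


def first_principal_component(points, iterations=50):
    """Return the direction of greatest variance and the 1D scores.

    A sketch for 2D points only. It assumes the points are not all
    identical and that the starting vector (1, 1) is not orthogonal
    to the leading direction.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[cxx, cxy], [cxy, cyy]].
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)

    # Power iteration: repeatedly multiply by the matrix and renormalize.
    vx, vy = 1.0, 1.0
    for _ in range(iterations):
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm

    # Project each centered point onto the direction: 2D becomes 1D.
    scores = [x * vx + y * vy for x, y in centered]
    return (vx, vy), scores


# Made-up points that lie close to the line y = 2x.
points = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2), (3.0, 5.8), (4.0, 8.0)]
direction, scores = first_principal_component(points)
print("direction:", direction)
print("scores:", scores)
```

Because the points sit near a line, a single coordinate along that line keeps almost all of the variation, which is the basic idea behind using PCA to reduce dimensions.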

For beginning and more advanced Python users, Rosalind has a bunch of bioinformatics challenges to help you learn about using Python for bioinformatics through problem solving.

Notes…

I will continue to update this post as I find more useful resources.