Mastering Essential Skills: Incorporating Software Development Principles in Statistical Practice

New Zealand Research Software Engineering Conference 2023

Olivia Angelin-Bonnet

The New Zealand Institute for Plant and Food Research Limited

18 September 2023

What I do as a statistician

Software engineering skills I wish I had learned about

What I learned at uni:

  • statistics

  • data analysis

  • programming in R and Python

What I wish I had learned:

  • Version control

  • Managing your computational environment

  • Documentation

  • Unit testing

  • Workflow management

Why do we need software engineering skills?

Sustainability

Mind Vectors by Vecteezy

Open science

From University of Lucerne website. This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

Version control

This illustration is created by Scriberia with The Turing Way community.
Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807

Code versioning with Git

   

  • Access to the history of the project

  • Easy collaboration on code

  • Standardised way of sharing and publishing code

   

Data versioning

  • Data are not static
  • Link results to versions of the data
  • Versioning raw and processed data
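One lightweight way to link results to a version of the data is to store a fingerprint (hash) of the data file next to each result. The sketch below is only an illustration: the file names, the `yields.csv` contents and the helper functions are hypothetical, and dedicated data versioning tools handle this more robustly.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def save_result_with_provenance(result: dict, data_path: Path, out_path: Path) -> dict:
    """Write a result together with the fingerprint of the data it came from."""
    record = {
        "result": result,
        "data_file": data_path.name,
        "data_sha256": file_sha256(data_path),
    }
    out_path.write_text(json.dumps(record, indent=2))
    return record


# Hypothetical example: a tiny dataset written to a temporary folder.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "yields.csv"
data_file.write_text("plot,yield\nA,4.2\nB,5.1\n")

record_v1 = save_result_with_provenance(
    {"mean_yield": 4.65}, data_file, workdir / "result.json"
)

# The data are not static: after a correction, the fingerprint changes too,
# so the stored result can be recognised as belonging to an older version.
data_file.write_text("plot,yield\nA,4.2\nB,5.3\n")
hash_v2 = file_sha256(data_file)
print("result is stale:", record_v1["data_sha256"] != hash_v2)
```

Comparing the stored fingerprint with the current one shows whether a result still matches the data on disk.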


Managing your computational environment

Illustration by Dmitriy Zub

Computational environment matters

Dependency management tools

 

Use dependency management tools like renv (R) or conda (Python) in your work

1. Record dependencies

2. Isolate project library
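To show what "recording dependencies" means in practice, here is a minimal sketch using only the Python standard library. It is not how renv or conda work internally; the function names `snapshot_environment` and `to_requirements` are made up for this example, and real tools also take care of isolating and restoring the project library.

```python
import sys
from importlib import metadata


def snapshot_environment() -> dict:
    """Record the Python version and the version of every installed package."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "packages": dict(sorted(packages.items())),
    }


def to_requirements(snapshot: dict) -> str:
    """Format the snapshot as pinned 'name==version' lines."""
    return "\n".join(
        f"{name}=={version}" for name, version in snapshot["packages"].items()
    )


snapshot = snapshot_environment()
print(f"Python {snapshot['python']}, {len(snapshot['packages'])} packages recorded")
```

Saving the output of `to_requirements(snapshot)` alongside the analysis gives a record of the exact versions that produced the results.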


Documentation

Write it down!

Literate programming

  • Concept: mix code and plain text

  • Record reasoning, notes and interpretation alongside source code and results

  • Reduce copy-pasting for reports and presentations

(Unit) testing

Testing to catch issues

Write formal tests for:

  • data (dimensions, range of values, etc.)

  • code (correct output type, handles errors)

  • analysis steps (sensible results, returns known answer)
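The three kinds of tests above could look like the sketch below. The `heights` dataset and the `standardise` function are hypothetical stand-ins for real data and analysis code; plain `assert` statements are used so the example runs on its own, and a test runner such as pytest would pick up the `test_` functions automatically.

```python
import math
import statistics

# Hypothetical dataset: plant heights in cm.
heights = [12.1, 15.3, 14.8, 13.9, 16.2]


def standardise(values):
    """Centre values on 0 and scale them to unit standard deviation."""
    if len(values) < 2:
        raise ValueError("need at least two values to standardise")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def test_data():
    # Data: expected dimensions and plausible range of values.
    assert len(heights) == 5
    assert all(0 < h < 100 for h in heights)


def test_code():
    # Code: correct output type and length, and bad input is rejected.
    result = standardise(heights)
    assert isinstance(result, list) and len(result) == len(heights)
    try:
        standardise([1.0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError for a single value")


def test_analysis():
    # Analysis step: returns the known answer on a case solved by hand.
    z = standardise([1.0, 2.0, 3.0])
    assert all(math.isclose(a, b) for a, b in zip(z, [-1.0, 0.0, 1.0]))


for test in (test_data, test_code, test_analysis):
    test()
print("all tests passed")
```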

Using simulations to test statistical methods

  • Test models and algorithms with simulated data

  • Is the model appropriate to answer the research question?

  • How do assumption violations affect the results?
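As a small illustration of these questions, the sketch below simulates data with a known mean and checks how often a normal-approximation 95% confidence interval contains it, first when the assumptions roughly hold and then when they are violated (strongly skewed data, small sample). The distributions, sample sizes and number of simulations are arbitrary choices made for this example.

```python
import math
import random
import statistics

Z_95 = statistics.NormalDist().inv_cdf(0.975)  # about 1.96


def mean_ci(sample):
    """Normal-approximation 95% confidence interval for the mean."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - Z_95 * se, mean + Z_95 * se


def coverage(draw, true_mean, n, n_sims, seed):
    """Proportion of simulated datasets whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [draw(rng) for _ in range(n)]
        low, high = mean_ci(sample)
        hits += low <= true_mean <= high
    return hits / n_sims


# Assumptions roughly met: normal data, moderate sample size.
normal_cov = coverage(
    lambda rng: rng.gauss(10.0, 2.0), true_mean=10.0, n=50, n_sims=2000, seed=1
)

# Assumptions violated: strongly skewed (log-normal) data, small sample.
# The mean of a log-normal(mu, sigma) variable is exp(mu + sigma**2 / 2).
skewed_cov = coverage(
    lambda rng: rng.lognormvariate(0.0, 1.5),
    true_mean=math.exp(1.5**2 / 2),
    n=10,
    n_sims=2000,
    seed=1,
)

print(f"coverage, normal data: {normal_cov:.3f}")
print(f"coverage, skewed data: {skewed_cov:.3f}")
```

Because the truth is known in a simulation, a coverage well below the nominal 95% directly shows the cost of the violated assumption.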

 

Workflow management

A real-life example of a complex analysis

  • Multiple scripts, folders, datasets

  • Interdependent analyses

  • What to do when one part changes?

Workflow management tools

Turn your analysis into a pipeline, i.e. a series of steps linked through their inputs and outputs, which are executed in the correct order
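The idea can be sketched in a few lines: each step declares which other steps it needs, and the execution order is derived from those links rather than written by hand. The step names and functions below are hypothetical, and dedicated workflow management tools add much more on top of this (for example, skipping steps whose inputs have not changed).

```python
from graphlib import TopologicalSorter


def load_data(_inputs):
    return [4.2, 5.1, 3.8, 6.0]


def clean_data(inputs):
    # Keep only values above a (hypothetical) quality threshold.
    return [x for x in inputs["load_data"] if x > 4.0]


def fit_model(inputs):
    values = inputs["clean_data"]
    return sum(values) / len(values)


def make_report(inputs):
    return f"n={len(inputs['clean_data'])}, mean={inputs['fit_model']:.2f}"


# Each step: (function, names of the steps whose outputs it needs).
# The steps are deliberately listed out of order.
steps = {
    "make_report": (make_report, ["clean_data", "fit_model"]),
    "fit_model": (fit_model, ["clean_data"]),
    "clean_data": (clean_data, ["load_data"]),
    "load_data": (load_data, []),
}


def run_pipeline(steps):
    """Run every step after the steps it depends on; return order and outputs."""
    graph = {name: deps for name, (_, deps) in steps.items()}
    order = list(TopologicalSorter(graph).static_order())
    outputs = {}
    for name in order:
        func, deps = steps[name]
        outputs[name] = func({dep: outputs[dep] for dep in deps})
    return order, outputs


order, outputs = run_pipeline(steps)
print(" -> ".join(order))
print(outputs["make_report"])
```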


Conclusions

  • Statisticians and data scientists need more than statistical skills

  • Software engineering practices are crucial for the sustainability of our work and for collaboration

  • These skills should be taught in statistics degrees (but there are lots of resources out there!)

To go further

The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.

Thank you for your attention!

olivia.angelin-bonnet@plantandfood.co.nz

References

Herndon, T., Ash, M., & Pollin, R. (2014). Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff. Cambridge Journal of Economics, 38(2), 257–279. https://doi.org/10.1093/cje/bet075
Miller, G. (2006). A scientist’s nightmare: Software problem leads to five retractions. Science, 314(5807), 1856–1857. https://doi.org/10.1126/science.314.5807.1856
Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102. https://doi.org/10.1002/sim.8086