An important part of any project is how you manage the source code. In this module, you’ll learn about the directories that hold R and Python package source code and some general tips for organizing source code functionality. We’ll also add some new functionality to our package to demonstrate these ideas.

Organization

Organizing your source code provides many benefits such as:

  • making the intentions of the code obvious,
  • making it easy for others to contribute,
  • making it easy for you to maintain,
  • making it easy to debug,
  • making it easy to expand.

There are very few strict requirements for how you organize your code, but what follows are general best practices.

Individual units

“Functions should do one thing. They should do it well. They should do it only.” - Robert Martin

Individual units of functionality (e.g., functions, methods, classes) should be small. How small? That’s the magic question 🤷. Individual units of functionality should not require significant scrolling in your editor. Their intent and the functionality they provide should be clear. And they should only “do one thing”.

One way to know if a function is doing more than one thing is if you can extract another function from it with a name that is not merely a restatement of its implementation.
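
As a toy sketch of this idea (in Python, with made-up names): a report_mean() that both validates its input and computes the mean is doing two things, and the validation can be extracted into a function whose name describes intent rather than implementation:

# before: one function doing two things (validating and computing)
def report_mean(x):
    if not all(isinstance(i, (int, float)) for i in x):
        raise TypeError("`x` must contain numeric values.")
    return sum(x) / len(x)

# after: each function does one thing and is named for its intent
def validate_numeric(x):
    if not all(isinstance(i, (int, float)) for i in x):
        raise TypeError("`x` must contain numeric values.")

def report_mean(x):
    validate_numeric(x)
    return sum(x) / len(x)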

So far our package has only one simple function; however, we typically build our packages to be more comprehensive and complex. Organizing this expansion is important. There are two main ways our code expands:

  1. breadth: we add more functionality without necessarily building onto existing functionality,
  2. aggregation: we continue building onto existing functionality to create layers of abstraction.

Expanding code breadth

Expanding code breadth is typically the easier of the two to maintain. For our example package, which currently has a my_mean() function, this could include adding new functions such as my_median() and my_mode(). When expanding code breadth, the main thing to consider is where to put the new functionality.

Grouping like functionality into one file is good. For example, since my_mean(), my_median() and my_mode() are all measures of central tendency, we may put them in the same file, while functions with other purposes (e.g., measures of deviation) may go into a different file.
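
As a hypothetical sketch of that grouping (shown as Python stubs in the same spirit as the skeleton below; my_variance() is a made-up placeholder):

# central_tendencies.py
def my_mean(x):
    pass

def my_median(x):
    pass

def my_mode(x):
    pass

# deviations.py
def my_sd(x):
    pass

def my_variance(x):
    pass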

While you’re free to arrange functions into files as you wish, the two extremes are bad: don’t put all functions into one file and don’t put each function into its own separate file. (It’s OK if some files only contain one function, particularly if the function is large or has a lot of documentation.)

Aggregating functionality

Often, we write code that builds on top of each other. This is known as creating higher levels of abstraction. For example, say we wanted to create a function that computes the z-score, which uses the mean and standard deviations (\(z=\frac{x - \mu}{\sigma}\)). This is a higher level of abstraction and we should write our code in such a way that:

  1. our code reads like a top-down narrative,
  2. functionality is abstracted away and grouped at similar levels of abstraction.

If your code still fits into one file, we want the code to read like a top-down narrative where the highest level of abstraction is exposed first, followed by the next level of abstraction. This means your most abstracted function should be at the top of the .R or .py script, and as you scroll down, the lower-level functions that provide supporting functionality follow. For example, a simple file that holds functionality to compute the z-score would look like:

# highest abstracted level
def z_score():
  pass
  
# next layer of abstraction
def my_mean():
  pass
  
def my_sd():
  pass
  
# lowest level of abstraction
def validate_input():
  pass

If your code grows significantly, you want to separate functionality into separate files but still group functions based on similar levels of abstraction.
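
For example, a hypothetical Python package split by abstraction level (the file names here are illustrative only) might be laid out like:

.
└── src/pkg
    ├── scores.py         # highest level of abstraction (e.g. z_score)
    ├── summary_stats.py  # supporting computations (e.g. my_mean, my_sd)
    └── validation.py     # lowest-level helpers (e.g. input validation)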

Naming is important

Naming is important, but as our source code grows it becomes even more so. Here are some good naming tips:

  • If a file contains a single function, give the file the same name as the function.
  • If a file contains multiple related functions, give it a concise, but descriptive name. For example, we could group my_mean(), my_median() and my_mode() together into a file named central_tendencies.
  • Sometimes you have many helper functions in your source code that all support one primary user-facing API function or class. Put the user-facing API function/class into a file named main or main_api.
  • Many times people create a general utils file to hold general utility functions. This is not advised. Always try to find a proper home and name for all supporting functions; there are good write-ups online explaining why.
  • Deprecated functions should live in a file or directory that makes it obvious they are deprecated (e.g., prefix the file name with deprec-).

External vs internal

When developing packages, we often build functions with two main purposes:

  1. External use: functions that are designed to be used by the end-user. This is the primary purpose of writing a package, to make some functionality easier for an end-user.
  2. Internal use: functions that are designed to help you, the developer, write efficient and effective code. Internal functions are usually not intended for external use by end-users.

With both R and Python you have the ability to make your functions either externally or internally focused. In the sections that follow we will illustrate how, but it is important that you apply the same rules to both external and internal functions: document and test all your functions. Although not a requirement, this will make your life, and the lives of other developers who want to contribute, much easier.

R source code example

For an R package, all source code goes in the R/ directory and you cannot have subdirectories¹.

Let’s add some functionality to our package to illustrate some points from our previous discussion:

  • add an internally-focused function that validates the user inputs,
  • add breadth to our package by adding additional summary statistics computations,
  • add aggregation to our package by adding a z-score computation that leverages other functions within our package.

To simplify, we’ll show the final functions and tests added rather than illustrate every step of the test-driven development process.

Setup

Before we start adding new functionality, let’s make sure the current code base is ready. Open up your R project, switch to the develop branch and make sure it is current with your remote repo:

git checkout develop
git pull

Now let’s create a new branch to add this module’s new functionality. Usually you name the branch after the new functionality that you’re adding or a GitHub issue that you are addressing. In our case we can name the branch “ch6” since it’s related to this module.

git checkout -b ch6

Internal function

First, we’ll add an internally-focused function that validates the user inputs. Since we may continue to expand our package and add more validation procedures, we typically create a validation.R file to hold these functions.

Go ahead and create the test and .R file:

usethis::use_test("validation")
usethis::use_r("validation")

Place the following in the test-validation.R test file:

test_that("inputs are a numeric vector", {
  expect_error(validate_numeric_vector("a"))
  expect_error(validate_numeric_vector(factor(c(1, 2, 3))))
  expect_error(validate_numeric_vector(list(1, 2, 3)))

  expect_silent(validate_numeric_vector(c(1, 2, 3)))
  expect_silent(validate_numeric_vector(c(TRUE, FALSE)))
})

and the following in the validation.R source code file:

#' Validate numeric vector input
#'
#' @description
#' Checks that an input is a vector that contains numeric values or
#' logical values that can be coerced to numeric values.
#'
#' @param x A numeric or logical vector.
#'
#' @return
#' Raises an exception if the input is not a numeric or logical vector;
#' otherwise returns silently.
#'
#' @examples
#' x <- 1:10
#' myfirstpkg:::validate_numeric_vector(x)
#'
#' @keywords internal
validate_numeric_vector <- function(x) {
  stopifnot(is.atomic(x), is.numeric(x) || is.logical(x))
}

Three important items to note in the above:

  1. We used the @keywords internal tag instead of @export as we did in the workflow module. Only functions that are documented with @export are made explicitly visible to the end user. Using @keywords internal instead of @export signals that this function is for internal use only. Internal functions are still accessible to end users but they must use the triple ::: syntax - myfirstpkg:::validate_numeric_vector().
  2. When including examples for internal functions you need to use the triple ::: syntax - myfirstpkg:::validate_numeric_vector(); otherwise you will get an error when you run devtools::check().
  3. Even though this is an internal function, we still document it as normal. This isn’t required but it helps us and contributing developers understand the purpose of the function.

Remember, as we add new code we always want to be running the tests.

  1. load new functionality: devtools::load_all() or Cmd/Ctrl + Shift + L
  2. test new functionality: devtools::test() or Cmd/Ctrl + Shift + T
  3. document new functionality: devtools::document() or Cmd/Ctrl + Shift + D

Now before adding any new functionality, let’s add this validation function to our existing my_mean() function. Your my_mean() should look like:

my_mean <- function(x) {
  validate_numeric_vector(x)
  total <- sum(x)
  units <- length(x)
  return(total / units)
}

Adding breadth

Next, let’s add a new summary statistic to our collection. For now, we’ll store this summary statistic in the same file, but in the future, if this file becomes too large, we may refactor and split up this centralized organization. Let’s rename the original mean.R file to summary-stats.R and also rename the associated test file to test-summary-stats.R.

It is common to rename files and functions as you begin developing a package, since you are still feeling out what the best design and organization will be, but as your package matures this will happen less frequently.

Now we’ll add a function that computes the standard deviation. Add the following to the test-summary-stats.R file:

test_that("standard deviation of simple vector computes accurately", {
  x <- 1:3
  expect_equal(my_sd(x), 1)
})

and add the following to the summary-stats.R file in the R/ directory:

#' Standard deviation of a vector
#'
#' @description
#' Computes standard deviation of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @details
#' The denominator n - 1 is used which gives an unbiased estimator of the
#' (co)variance for i.i.d. observations.
#'
#' @return
#' The standard deviation of the values in x returned as a numeric vector of
#' length one.
#'
#' @examples
#' x <- 1:10
#' my_sd(x)
#'
#' @export
my_sd <- function(x) {
  validate_numeric_vector(x)
  squared_diff <- (x - my_mean(x))^2
  total <- sum(squared_diff)
  units <- length(x) - 1
  return(sqrt(total / units))
}

Adding aggregation

Last, we’ll add a new summary statistic, the z-score, that leverages the my_mean() and my_sd() functions. Since this adds a new level of abstraction, we’ll place it at the top of our R/summary-stats.R file.

Add the following to the test-summary-stats.R file:

test_that("z-score of simple vector computes accurately", {
  x <- 1:3
  expected <- c(-1, 0, 1)
  expect_equal(z_score(x), expected)
})

and add the following to the summary-stats.R file in the R/ directory:

#' Z-score of a vector
#'
#' @description
#' Computes z-score of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @return
#' The z-score of each value in x returned as a numeric vector with the
#' same length as the input x vector.
#'
#' @examples
#' x <- 1:3
#' z_score(x)
#'
#' @export
z_score <- function(x) {
  return((x - my_mean(x)) / my_sd(x))
}

Our summary-stats.R file should now include three functions in a top-down approach:

#' Z-score of a vector
#'
#' @description
#' Computes z-score of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @return
#' The z-score of each value in x returned as a numeric vector with the
#' same length as the input x vector.
#'
#' @examples
#' x <- 1:3
#' z_score(x)
#'
#' @export
z_score <- function(x) {
  return((x - my_mean(x)) / my_sd(x))
}

#' Standard deviation of a vector
#'
#' @description
#' Computes standard deviation of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @details
#' The denominator n - 1 is used which gives an unbiased estimator of the
#' (co)variance for i.i.d. observations.
#'
#' @return
#' The standard deviation of the values in x returned as a numeric vector of
#' length one.
#'
#' @examples
#' x <- 1:10
#' my_sd(x)
#'
#' @export
my_sd <- function(x) {
  validate_numeric_vector(x)
  squared_diff <- (x - my_mean(x))^2
  total <- sum(squared_diff)
  units <- length(x) - 1
  return(sqrt(total / units))
}

#' Mean of a vector
#'
#' @description
#' Computes arithmetic mean of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @return
#' The arithmetic mean of the values in x returned as a numeric vector of length one.
#'
#' @examples
#' x <- 1:10
#' my_mean(x)
#'
#' @export
my_mean <- function(x) {
  validate_numeric_vector(x)
  total <- sum(x)
  units <- length(x)
  return(total / units)
}

And our test-summary-stats.R file should look like:

test_that("z-score of simple vector computes accurately", {
  x <- 1:3
  expected <- c(-1, 0, 1)
  expect_equal(z_score(x), expected)
})

test_that("standard deviation of simple vector computes accurately", {
  x <- 1:3
  expect_equal(my_sd(x), 1)
})

test_that("mean of simple vector computes accurately", {
  x <- 0:10
  expect_equal(my_mean(x), 5)
})

Final checks & version control

Now before we commit our changes to git, let’s make sure everything is loaded and documented and passes our tests and checks:

  1. document package functionality: devtools::document() or Cmd/Ctrl + Shift + D
  2. load package functionality: devtools::load_all() or Cmd/Ctrl + Shift + L
  3. test package functionality: devtools::test() or Cmd/Ctrl + Shift + T
  4. R CMD check package functionality: devtools::check() or Cmd/Ctrl + Shift + E

As long as everything is passing, we can now commit our changes, push to GitHub and then open a pull request to incorporate the changes from our current working branch into the develop branch.

git add -A
git commit -m 'feat: add new validation & summary stats
> Add validation function to ensure proper numeric vector input.
> Add summary stat functions to compute standard deviation and z-score.'
git push --set-upstream origin ch6

Python source code example

For a Python package, all source code goes in the src/pkgname/ directory. Unlike R packages, you can have subdirectories and submodules in Python packages. However, for simplicity, we’ll keep all source code at the src/pkgname/ directory level.
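
If you did use submodules, a hypothetical layout (the subdirectory and file names are illustrative only; we will not use this structure here) could look like:

.
└── src/myfirstpypkg
    ├── __init__.py
    ├── stats
    │   ├── __init__.py
    │   └── summary_stats.py
    └── validation
        ├── __init__.py
        └── numeric.py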

Let’s add some functionality to our package to illustrate some points from our previous discussion. We’ll take the same approach as in the R example section and:

  • add an internally-focused function that validates the user inputs,
  • add breadth to our package by adding additional summary statistics computations,
  • add aggregation to our package by adding a z-score computation that leverages other functions within our package.

To simplify, we’ll show the final functions and tests added rather than illustrate every step of the test-driven development process.

Setup

Before we start adding new functionality, let’s make sure the current code base is ready. Open up your Python project, activate your virtual environment, switch to the develop branch and make sure it is current with your remote repo:

git checkout develop
git pull

Now let’s create a new branch to add this module’s new functionality. Usually you name the branch after the new functionality that you’re adding or a GitHub issue that you are addressing. In our case we can name the branch “ch6” since it’s related to this module.

git checkout -b ch6

Internal function

First, we’ll add an internally-focused function that validates the user inputs. Since we may continue to expand our package and add more validation procedures, we typically create a validation.py file to hold these functions.

Go ahead and create the test and .py file:

touch tests/test_validation.py
touch src/myfirstpypkg/validation.py

Place the following in the test_validation.py test file:

from myfirstpypkg.validation import _validate_numeric_sequence
import pytest


def test_validate_sequence_numeric():
    assert _validate_numeric_sequence(range(10)) is None
    with pytest.raises(TypeError):
        _validate_numeric_sequence(['a', 'b', 'c'])

def test_validate_sequence_type():
    with pytest.raises(TypeError):
        _validate_numeric_sequence({})

and the following in the validation.py source code file:

from collections.abc import Sequence

def _validate_numeric_sequence(x):
    """
    Validate numeric sequence input

    Checks that an input is a sequence that contains numeric values or
    logical values that can be coerced to numeric values.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    None

    Raises
    ------
    TypeError
        If `x` is not a sequence of type float, int or bool.

    Examples
    --------
    x = range(10)
    _validate_numeric_sequence(x)
    """
    if not isinstance(x, Sequence) or not all(isinstance(i, (float, int, bool)) for i in x):
        raise TypeError("`x` must be a sequence of type float, int, or bool.")

Two important items to note in the above:

  1. We start the function with an underscore _validate.... In Python, we cannot hide internal functions from end-users; however, it is Pythonic to start all functions designed for internal use with an underscore.
  2. Even though this is an internal function, we still document it as normal. This isn’t required but it helps us and contributing developers understand the purpose of the function.

Remember, as we add new code we always want to be running the tests with pytest.

Now before adding any new functionality, let’s add this validation function to our existing my_mean() function. Your mean.py file should look like below. Note that we use a relative import to import _validate_numeric_sequence from the validation.py module.

from .validation import _validate_numeric_sequence


def my_mean(x):
    """
    Mean of a vector

    Computes arithmetic mean of a vector with numeric or logical values.

    Parameters
    ----------
    x
        A numeric or logical list.

    Returns
    -------
    The arithmetic mean of the values in x returned as a single numeric value.

    Examples
    --------
    >>> x = range(0, 11)
    >>> my_mean(x)
    5.0
    """
    _validate_numeric_sequence(x)
    total = sum(x)
    units = len(x)
    return total / units

Adding breadth

Next, let’s add a new summary statistic to our collection. For now, we’ll store this summary statistic in the same file, but in the future, if this file becomes too large, we may refactor and split up this centralized organization. Let’s rename the original mean.py file to summary_stats.py and also rename the associated test file to test_summary_stats.py.

It is common to rename files and functions as you begin developing a package, since you are still feeling out what the best design and organization will be, but as your package matures this will happen less frequently.

Now we’ll add a function that computes the standard deviation. Update your test_summary_stats.py file to look like:

from myfirstpypkg.summary_stats import my_mean
from myfirstpypkg.summary_stats import my_sd


def test_my_sd():
    x = [1, 2, 3]
    assert my_sd(x) == 1

def test_my_mean():
    x = range(0, 11)
    assert my_mean(x) == 5

and add the following to the summary_stats.py file in the src/myfirstpypkg/ directory:

from math import sqrt


def my_sd(x):
    """
    Standard deviation of a sequence

    Computes standard deviation of a vector with numeric or logical values. The
    denominator `units = len(x) - 1` is used which gives an unbiased estimator
    of the (co)variance for i.i.d. observations.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The standard deviation of the values in x returned as a single numeric value.

    Examples
    --------
    >>> x = [1, 2, 3]
    >>> my_sd(x)
    1.0
    """
    _validate_numeric_sequence(x)
    mu = my_mean(x)
    squared_diff = [(i - mu)**2 for i in x]
    total = sum(squared_diff)
    units = len(x) - 1
    return sqrt(total / units)

Adding aggregation

Last, we’ll add a new summary statistic, the z-score, that leverages the my_mean() and my_sd() functions. Since this adds a new level of abstraction, we’ll place it at the top of our src/myfirstpypkg/summary_stats.py file.

Add a z-score test to the test_summary_stats.py file:

def test_z_score():
    x = [1, 2, 3]
    expected = [-1, 0, 1]
    assert z_score(x) == expected

and add the following to the summary_stats.py file in the src/myfirstpypkg/ directory:

def z_score(x):
    """
    Z-score of a sequence

    Computes the z-score of a sequence with numeric or logical values.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The z-score for each value in x as a list.

    Examples
    --------
    >>> x = [1, 2, 3]
    >>> z_score(x)
    [-1.0, 0.0, 1.0]
    """
    mu = my_mean(x)
    sd = my_sd(x)
    return [((i - mu) / sd) for i in x]

Our summary_stats.py file should now include three functions in a top-down approach:

from .validation import _validate_numeric_sequence
from math import sqrt


def z_score(x):
    """
    Z-score of a sequence

    Computes the z-score of a sequence with numeric or logical values.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The z-score for each value in x as a list.

    Examples
    --------
    >>> x = [1, 2, 3]
    >>> z_score(x)
    [-1.0, 0.0, 1.0]
    """
    mu = my_mean(x)
    sd = my_sd(x)
    return [((i - mu) / sd) for i in x]

def my_sd(x):
    """
    Standard deviation of a sequence

    Computes standard deviation of a sequence with numeric or logical values.
    The denominator `units = len(x) - 1` is used which gives an unbiased
    estimator of the (co)variance for i.i.d. observations.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The standard deviation of the values in x returned as a single numeric value.

    Examples
    --------
    >>> x = [1, 2, 3]
    >>> my_sd(x)
    1.0
    """
    _validate_numeric_sequence(x)
    mu = my_mean(x)
    squared_diff = [(i - mu)**2 for i in x]
    total = sum(squared_diff)
    units = len(x) - 1
    return sqrt(total / units)

def my_mean(x):
    """
    Mean of a sequence

    Computes arithmetic mean of a sequence with numeric or logical values.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The arithmetic mean of the values in x returned as a single numeric value.

    Examples
    --------
    >>> x = range(0, 11)
    >>> my_mean(x)
    5.0
    """
    _validate_numeric_sequence(x)
    total = sum(x)
    units = len(x)
    return total / units

And our test_summary_stats.py file should look like:

from myfirstpypkg.summary_stats import my_mean
from myfirstpypkg.summary_stats import my_sd
from myfirstpypkg.summary_stats import z_score


def test_z_score():
    x = [1, 2, 3]
    expected = [-1, 0, 1]
    assert z_score(x) == expected

def test_my_sd():
    x = [1, 2, 3]
    assert my_sd(x) == 1

def test_my_mean():
    x = range(0, 11)
    assert my_mean(x) == 5

Simplify user access

With Python packages, users access objects with a module namespace approach. This means, since our z_score() function is located at:

.
└── src/myfirstpypkg
    └── summary_stats.py

When they use our package, they would have to access this function with one of the following:

import myfirstpypkg

myfirstpypkg.summary_stats.z_score(x)

or

from myfirstpypkg.summary_stats import z_score

z_score(x)

We can simplify this for the users by exporting the z_score() function so the user can simply do:

from myfirstpypkg import z_score

z_score(x)

To allow for this you simply add the following to the src/myfirstpypkg/__init__.py file:

from .summary_stats import z_score
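
If you also wanted my_mean() and my_sd() available at the top level, you could extend __init__.py in the same way. This is optional and purely a design choice; the __all__ list shown below is an assumption on our part, not something the package requires:

# src/myfirstpypkg/__init__.py
from .summary_stats import z_score, my_sd, my_mean

# optional: defines what `from myfirstpypkg import *` exposes
__all__ = ["z_score", "my_sd", "my_mean"]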

Final checks & version control

Now before we commit our changes to git, let’s make sure everything still loads appropriately and passes all tests:

pip install -e .
pytest

As long as everything is passing, we can now commit our changes, push to GitHub and then open a pull request to incorporate the changes from our current working branch into the develop branch.

git add -A
git commit -m 'feat: add new validation & summary stats
> Add validation function to ensure proper numeric sequence input.
> Add summary stat functions to compute standard deviation and z-score.'
git push --set-upstream origin ch6

Exercises

  1. With the R and Python packages you created in the portfolio builder, work through the R and Python examples above to add new functionality to your package.

  2. Fork one of your classmates’ packages and:
    • create a new branch,
    • add a new function (and associated unit test) that is distinct from the existing functions,
    • add a new function (and associated unit test) that builds onto one of their existing functions,
    • create a pull request to add this new functionality to your classmate’s develop branch.




  1. Subdirectories are actually allowed but they can only be for OS-specific purposes and named unix/ or windows/. See this Stack Overflow question for details.