An important principle in any project is how to manage the source code. In this module, you’ll learn about the directories that hold the R and Python package source code and some general tips for organizing source code functionality. We’ll also add some new functionality to our package to demonstrate these ideas.
Organizing your source code provides many benefits such as:
There are very few strict requirements in how you organize your code but what follows are general best practices.
“Functions should do one thing. They should do it well. They should do it only.” - Robert Martin
Individual units of functionality (i.e. functions, methods, classes) should be small. How small? That’s the magic question 🤷. Individual units of functionality should not require significant scrolling in your editor. They should be very clear regarding the intention and functionality they are providing. And they should only “do one thing”.
One way to know if a function is doing more than one thing is if you can extract another function from it with a name that is not merely a restatement of its implementation.
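As a rough illustration of this extraction test, the sketch below starts with a function that both cleans and summarizes its input, then pulls the cleaning step out into its own function. The names `mean_of_clean_values`, `drop_missing` and `mean_of_values` are hypothetical and are not part of the example package.

```python
def mean_of_clean_values(values):
    """Does two things: drops missing values, then averages what is left."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)


# The extracted helper has a name that describes intent (what it is for),
# not implementation (how it works), which hints it was a separate job.
def drop_missing(values):
    """Return the values with None entries removed."""
    return [v for v in values if v is not None]


def mean_of_values(values):
    """Average a list of numbers; assumes the list is not empty."""
    return sum(values) / len(values)


raw = [1, None, 2, 3]
result = mean_of_values(drop_missing(raw))
```

Each of the two smaller functions can now be documented, tested and reused on its own.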
So far our package has only one simple function; however, we typically build our packages to be more comprehensive and complex. Organizing this expansion is important. There are two main ways our code expands:
Expanding code breadth is easier to maintain. For our example package that currently has a my_mean() function, this could include adding new functions such as my_median() and my_mode(). When expanding code breadth the main thing you should consider is where to put the new functionality.
Grouping like functionality into one file is good. For example, since my_mean(), my_median() and my_mode() are all measures of central tendency, we may put them in the same file, while functions with other purposes (e.g. deviations) may go into a different file.
While you’re free to arrange functions into files as you wish, the two extremes are bad: don’t put all functions into one file and don’t put each function into its own separate file. (It’s OK if some files only contain one function, particularly if the function is large or has a lot of documentation.)
Often, we write code that builds on top of each other. This is known as creating higher levels of abstraction. For example, say we wanted to create a function that computes the z-score, which uses the mean and standard deviations (\(z=\frac{x - \mu}{\sigma}\)). This is a higher level of abstraction and we should write our code in such a way that:
If your code still fits into one file, we want the code to read like a top-down narrative where the highest level of abstraction is exposed first, followed by the next level of abstraction. This means your most abstract function should be at the top of the .R or .py script, and as you scroll down, the lower-level functions that support it should be listed. For example, a simple file that holds functionality to compute the z-score would look like:
# highest level of abstraction
def z_score():
    pass

# next layer of abstraction
def my_mean():
    pass

def my_sd():
    pass

# lowest level of abstraction
def validate_input():
    pass
If your code grows significantly, you will want to split functionality into separate files but still group functions based on similar levels of abstraction.
Naming is important, and it becomes even more so as our source code grows. Here are some good naming tips:
my_mean(), my_median() and my_mode() together into a file named central_tendencies.
main or main_api.
utils file to hold general utility functions. This is not advised. Always try to find a proper home and name for all supporting functions. This is a great read explaining why.
deprec-).
When developing packages, we often build functions with two main purposes:
With both R and Python you have the ability to make your functions either externally or internally focused. In the sections that follow we will illustrate how, but it is important that you apply the same rules to both external and internal functions: document and test all your functions. Although not a requirement, this will make your life, and the lives of other developers who want to contribute, much easier.
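A minimal sketch of the two kinds of function, written in Python for brevity. Both names are hypothetical and not part of the example package; the leading underscore is the usual Python convention for marking a function as internal. Note that the internal helper is documented just like the public one.

```python
def _check_not_empty(x):
    """Internal helper (hypothetical): raise if the sequence is empty."""
    if len(x) == 0:
        raise ValueError("`x` must contain at least one value.")


def value_range(x):
    """Public function (hypothetical): largest value minus smallest value."""
    _check_not_empty(x)
    return max(x) - min(x)


spread = value_range([3, 9, 4])
```

Because the helper has its own docstring and a single job, it can be tested directly instead of only through the public function.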
For an R package, all source code goes in the R/ directory and you cannot have subdirectories¹.
Let’s add some functionality to our package to illustrate some points from our previous discussion:
To simplify, we’ll show the final functions and tests added rather than illustrate every step of the test-driven development process.
Before we start adding new functionality, let’s make sure the current code base is ready. Open up your R project, switch to the develop branch and make sure it is current with your remote repo:
git checkout develop
git pull
Now let’s create a new branch to add this module’s new functionality. Usually you name the branch after the new functionality that you’re adding or a Github issue that you are addressing. In our case we can name the branch “ch6” since it’s related to this module.
git checkout -b ch6
First, we’ll add an internally-focused function that validates the user inputs. Since we may continue to expand our package and add more validation procedures, we typically create a validation.R file to hold these functions.
Go ahead and create the test and .R file:
usethis::use_test("validation")
usethis::use_r("validation")
Place the following in the test-validation.R test file:
test_that("inputs are a numeric vector", {
  expect_error(validate_numeric_vector("a"))
  expect_error(validate_numeric_vector(factor(c(1, 2, 3))))
  expect_error(validate_numeric_vector(list(1, 2, 3)))
  expect_silent(validate_numeric_vector(c(1, 2, 3)))
  expect_silent(validate_numeric_vector(c(TRUE, FALSE)))
})
and the following in the validation.R source code file:
#' Validate numeric vector input
#'
#' @description
#' Checks that an input is a vector that contains numeric values or
#' logical values that can be coerced to numeric values.
#'
#' @param x A numeric or logical vector.
#'
#' @return
#' Raises an error if input is not a numeric or logical vector; otherwise
#' provides a silent return.
#'
#' @examples
#' x <- 1:10
#' myfirstpkg:::validate_numeric_vector(x)
#'
#' @keywords internal
validate_numeric_vector <- function(x) {
  # Must be an atomic vector whose type is numeric or logical
  stopifnot(is.atomic(x), is.numeric(x) || is.logical(x))
}
Three important items to note in the above:
We used the @keywords internal tag instead of @export as we did in the workflow module. Only functions that are documented with @export are made explicitly visible to the end user. Using @keywords internal instead of @export signals that this function is for internal use only. Internal functions are still accessible to end users but they must use the triple-colon ::: syntax: myfirstpkg:::validate_numeric_vector().
The example for an internal function must also use the ::: syntax (myfirstpkg:::validate_numeric_vector()); otherwise you will get an error when you run devtools::check().
Remember, as we add new code we always want to be running the tests.
devtools::load_all() or Cmd/Ctrl + Shift + L
devtools::test() or Cmd/Ctrl + Shift + T
devtools::document() or Cmd/Ctrl + Shift + D
Now before adding any new functionality, let’s add this validation function to our existing my_mean() function. Your my_mean() should look like:
my_mean <- function(x) {
  validate_numeric_vector(x)
  total <- sum(x)
  units <- length(x)
  return(total / units)
}
Next, let’s add a new summary statistic to our collection. For now, we’ll store this summary statistic in the same file but in the future if this file became too large we may look to refactor and split up the centralized organization. Let’s rename the original mean.R file to summary-stats.R and also rename the associated test file to test-summary-stats.R.
It is common to rename files and functions as you begin developing a package since you are feeling out what the best design and organization will be but as your package matures this will happen less frequently.
Now we’ll add a function that computes the standard deviation. Add the following to the test-summary-stats.R file:
test_that("standard deviation of simple vector computes accurately", {
  x <- 1:3
  expect_equal(my_sd(x), 1)
})
and add the following to the summary-stats.R file in the R/ directory:
#' Standard deviation of a vector
#'
#' @description
#' Computes standard deviation of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @details
#' The denominator n - 1 is used which gives an unbiased estimator of the
#' (co)variance for i.i.d. observations.
#'
#' @return
#' The standard deviation of the values in x returned as a numeric vector of
#' length one.
#'
#' @examples
#' x <- 1:10
#' my_sd(x)
#'
#' @export
my_sd <- function(x) {
  validate_numeric_vector(x)
  squared_diff <- (x - my_mean(x))^2
  total <- sum(squared_diff)
  units <- length(x) - 1
  return(sqrt(total / units))
}
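The expected value of 1 in the standard deviation test can be checked by hand. For x = (1, 2, 3), with \(\mu\) and \(\sigma\) standing here for the sample mean and the sample standard deviation (n - 1 denominator):

```latex
\[
\mu = \frac{1 + 2 + 3}{3} = 2, \qquad
\sigma = \sqrt{\frac{(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2}{3 - 1}}
       = \sqrt{\frac{2}{2}} = 1
\]
```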
Last, we’ll add a new summary statistic, the z-score, that leverages the my_mean() and my_sd() functions. Since this adds a higher level of abstraction, we’ll place it at the top of our R/summary-stats.R file.
Add the following to the test-summary-stats.R file:
test_that("z-score of simple vector computes accurately", {
  x <- 1:3
  expected <- c(-1, 0, 1)
  expect_equal(z_score(x), expected)
})
and add the following to the summary-stats.R file in the R/ directory:
#' Z-score of a vector
#'
#' @description
#' Computes z-score of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @return
#' The z-score of each value in x returned as a numeric vector with the
#' same length as the input x.
#'
#' @examples
#' x <- 1:3
#' z_score(x)
#'
#' @export
z_score <- function(x) {
  return((x - my_mean(x)) / my_sd(x))
}
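The expected vector in the z-score test follows directly from the formula. For x = (1, 2, 3) the sample mean is \(\mu = 2\) and the sample standard deviation is \(\sigma = 1\), so:

```latex
\[
z = \frac{x - \mu}{\sigma}
\quad\Rightarrow\quad
z_1 = \frac{1 - 2}{1} = -1, \qquad
z_2 = \frac{2 - 2}{1} = 0, \qquad
z_3 = \frac{3 - 2}{1} = 1
\]
```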
Our summary-stats.R file should now include three functions in a top-down approach:
#' Z-score of a vector
#'
#' @description
#' Computes z-score of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @return
#' The z-score of each value in x returned as a numeric vector with the
#' same length as the input x.
#'
#' @examples
#' x <- 1:3
#' z_score(x)
#'
#' @export
z_score <- function(x) {
  return((x - my_mean(x)) / my_sd(x))
}
#' Standard deviation of a vector
#'
#' @description
#' Computes standard deviation of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @details
#' The denominator n - 1 is used which gives an unbiased estimator of the
#' (co)variance for i.i.d. observations.
#'
#' @return
#' The standard deviation of the values in x returned as a numeric vector of
#' length one.
#'
#' @examples
#' x <- 1:10
#' my_sd(x)
#'
#' @export
my_sd <- function(x) {
  validate_numeric_vector(x)
  squared_diff <- (x - my_mean(x))^2
  total <- sum(squared_diff)
  units <- length(x) - 1
  return(sqrt(total / units))
}
#' Mean of a vector
#'
#' @description
#' Computes arithmetic mean of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @return
#' The arithmetic mean of the values in x returned as a numeric vector of length one.
#'
#' @examples
#' x <- 1:10
#' my_mean(x)
#'
#' @export
my_mean <- function(x) {
  validate_numeric_vector(x)
  total <- sum(x)
  units <- length(x)
  return(total / units)
}
And our test-summary-stats.R file should look like:
test_that("z-score of simple vector computes accurately", {
  x <- 1:3
  expected <- c(-1, 0, 1)
  expect_equal(z_score(x), expected)
})

test_that("standard deviation of simple vector computes accurately", {
  x <- 1:3
  expect_equal(my_sd(x), 1)
})

test_that("mean of simple vector computes accurately", {
  x <- 0:10
  expect_equal(my_mean(x), 5)
})
Now before we commit our changes to git, let’s make sure everything is loaded and documented and passes our tests and checks:
devtools::document() or Cmd/Ctrl + Shift + D
devtools::load_all() or Cmd/Ctrl + Shift + L
devtools::test() or Cmd/Ctrl + Shift + T
devtools::check() or Cmd/Ctrl + Shift + E
As long as everything is passing, we can now commit our changes, push to GitHub and then do a pull request to incorporate our changes from our current working branch to the develop branch.
git add -A
git commit -m 'feat: add new validation & summary stats
> Add validation function to ensure proper numeric vector input.
> Add summary stat functions to compute standard deviation and z-score.'
git push --set-upstream origin ch6
For a Python package, all source code goes in the src/pkgname/ directory. Unlike R packages, you can have subdirectories and submodules in Python packages. However, for simplicity, we’ll just include all source code at the src/pkgname/ directory level.
Let’s add some functionality to our package to illustrate some points from our previous discussion. We’ll take the same approach as in the R example section and:
To simplify, we’ll show the final functions and tests added rather than illustrate every step of the test-driven development process.
Before we start adding new functionality, let’s make sure the current code base is ready. Open up your Python project, activate your virtual environment, switch to the develop branch and make sure it is current with your remote repo:
git checkout develop
git pull
Now let’s create a new branch to add this module’s new functionality. Usually you name the branch after the new functionality that you’re adding or a Github issue that you are addressing. In our case we can name the branch “ch6” since it’s related to this module.
git checkout -b ch6
First, we’ll add an internally-focused function that validates the user inputs. Since we may continue to expand our package and add more validation procedures, we typically create a validation.py file to hold these functions.
Go ahead and create the test and .py file:
touch tests/test_validation.py
touch src/myfirstpypkg/validation.py
Place the following in the test_validation.py test file:
import pytest

from myfirstpypkg.validation import _validate_numeric_sequence


def test_validate_sequence_numeric():
    assert _validate_numeric_sequence(range(10)) is None
    with pytest.raises(TypeError):
        _validate_numeric_sequence(['a', 'b', 'c'])


def test_validate_sequence_type():
    with pytest.raises(TypeError):
        _validate_numeric_sequence({})
and the following in the validation.py source code file:
from collections.abc import Sequence


def _validate_numeric_sequence(x):
    """
    Validate numeric sequence input

    Checks that an input is a sequence that contains numeric values or
    logical values that can be coerced to numeric values.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    None

    Raises
    ------
    TypeError
        If `x` is not a sequence of type float, int or bool.

    Examples
    --------
    >>> x = range(10)
    >>> _validate_numeric_sequence(x)
    """
    # Strings are sequences too, so they are rejected explicitly.
    is_sequence = isinstance(x, Sequence) and not isinstance(x, (str, bytes))
    if not is_sequence or not all(isinstance(i, (float, int, bool)) for i in x):
        raise TypeError("`x` must be a sequence of type float, int, or bool.")
Two important items to note in the above:
The function name starts with an underscore (_validate...). In Python, we cannot hide internal functions from end users; however, it is Pythonic to start the names of all functions designed for internal use with an underscore.
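Two details of Python's type system are worth knowing when validating input like this: strings count as sequences, and `bool` is a subclass of `int`. The sketch below only inspects built-in types; the variable names are illustrative.

```python
from collections.abc import Sequence

# Strings are sequences too, so a Sequence check alone would accept them.
string_is_sequence = isinstance("abc", Sequence)

# range objects are registered as sequences; dicts are not.
range_is_sequence = isinstance(range(3), Sequence)
dict_is_sequence = isinstance({}, Sequence)

# bool is a subclass of int, so True and False pass an int check
# and behave as 1 and 0 in arithmetic.
bool_is_int = isinstance(True, int)
total = sum([True, False, True])
```

This is why a validator that should accept numeric or logical sequences usually needs to look at the elements as well as the container.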
Remember, as we add new code we always want to be running the tests with pytest.
Now before adding any new functionality, let’s add this validation function to our existing my_mean() function. Your mean.py file should look like below. Note that we use a relative import to import _validate_numeric_sequence from the validation.py module.
from .validation import _validate_numeric_sequence


def my_mean(x):
    """
    Mean of a sequence

    Computes arithmetic mean of a sequence with numeric or logical values.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The arithmetic mean of the values in x returned as a single numeric value.

    Examples
    --------
    >>> x = range(0, 11)
    >>> my_mean(x)
    5.0
    """
    _validate_numeric_sequence(x)
    total = sum(x)
    units = len(x)
    return total / units
Next, let’s add a new summary statistic to our collection. For now, we’ll store this summary statistic in the same file but in the future if this file became too large we may look to refactor and split up the centralized organization. Let’s rename the original mean.py file to summary_stats.py and also rename the associated test file to test_summary_stats.py.
It is common to rename files and functions as you begin developing a package since you are feeling out what the best design and organization will be but as your package matures this will happen less frequently.
Now we’ll add a function that computes the standard deviation. Update your test_summary_stats.py file to look like:
from myfirstpypkg.summary_stats import my_mean
from myfirstpypkg.summary_stats import my_sd


def test_my_sd():
    x = [1, 2, 3]
    assert my_sd(x) == 1


def test_my_mean():
    x = range(0, 11)
    assert my_mean(x) == 5
and add the following to the summary_stats.py file in the src/myfirstpypkg/ directory:
from math import sqrt


def my_sd(x):
    """
    Standard deviation of a sequence

    Computes standard deviation of a sequence with numeric or logical values.
    The denominator `units = len(x) - 1` is used which gives an unbiased
    estimator of the (co)variance for i.i.d. observations.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The standard deviation of the values in x returned as a single numeric value.

    Examples
    --------
    >>> x = [1, 2, 3]
    >>> my_sd(x)
    1.0
    """
    _validate_numeric_sequence(x)
    mu = my_mean(x)
    squared_diff = [(i - mu)**2 for i in x]
    total = sum(squared_diff)
    units = len(x) - 1
    return sqrt(total / units)
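One way to gain extra confidence in a hand-written statistic is to compare it with a trusted reference. The sketch below does this against `statistics.stdev` from the standard library, which also uses the n - 1 denominator. `sample_sd` is a stand-in that repeats the same arithmetic without the validation step, so the block runs on its own; it is an assumption for illustration, not the package function.

```python
from math import isclose, sqrt
from statistics import stdev


def sample_sd(x):
    """Stand-in for my_sd() without the validation step (assumption)."""
    mu = sum(x) / len(x)
    return sqrt(sum((i - mu) ** 2 for i in x) / (len(x) - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]
ours = sample_sd(data)
reference = stdev(data)  # sample standard deviation, n - 1 denominator

# Compare with a tolerance rather than ==, since both are floats.
matches = isclose(ours, reference)
```

A comparison like this could live in the test file alongside the hand-computed case.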
Last, we’ll add a new summary statistic, the z-score, that leverages the my_mean() and my_sd() functions. Since this adds a higher level of abstraction, we’ll place it at the top of our src/myfirstpypkg/summary_stats.py file.
Add a z-score test to the test_summary_stats.py file:
from myfirstpypkg.summary_stats import z_score


def test_z_score():
    x = [1, 2, 3]
    expected = [-1, 0, 1]
    assert z_score(x) == expected
and add the following to the summary_stats.py file in the src/myfirstpypkg/ directory:
def z_score(x):
    """
    Z-score of a sequence

    Computes the z-score of a sequence with numeric or logical values.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The z-score for each value in x as a list.

    Examples
    --------
    >>> x = [1, 2, 3]
    >>> z_score(x)
    [-1.0, 0.0, 1.0]
    """
    mu = my_mean(x)
    sd = my_sd(x)
    return [((i - mu) / sd) for i in x]
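A useful property to test, beyond a single hand-computed case, is that standardized values have a mean of 0 and a sample standard deviation of 1. The sketch below checks this; `z_scores` is a stand-in that inlines the mean and standard deviation so the block is self-contained, and is not the package function itself.

```python
from math import isclose, sqrt


def z_scores(x):
    """Stand-in for z_score() with mean and sd inlined (assumption)."""
    mu = sum(x) / len(x)
    sd = sqrt(sum((i - mu) ** 2 for i in x) / (len(x) - 1))
    return [(i - mu) / sd for i in x]


z = z_scores([3, 7, 8, 12, 20])

# Recompute the mean and sample sd of the standardized values.
z_mean = sum(z) / len(z)
z_sd = sqrt(sum((i - z_mean) ** 2 for i in z) / (len(z) - 1))

mean_is_zero = isclose(z_mean, 0.0, abs_tol=1e-9)
sd_is_one = isclose(z_sd, 1.0, rel_tol=1e-9)
```

Note the `abs_tol` when comparing against zero: a relative tolerance alone never treats a tiny non-zero float as equal to 0.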
Our summary_stats.py file should now include three functions in a top-down approach:
from math import sqrt

from .validation import _validate_numeric_sequence


def z_score(x):
    """
    Z-score of a sequence

    Computes the z-score of a sequence with numeric or logical values.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The z-score for each value in x as a list.

    Examples
    --------
    >>> x = [1, 2, 3]
    >>> z_score(x)
    [-1.0, 0.0, 1.0]
    """
    mu = my_mean(x)
    sd = my_sd(x)
    return [((i - mu) / sd) for i in x]


def my_sd(x):
    """
    Standard deviation of a sequence

    Computes standard deviation of a sequence with numeric or logical values.
    The denominator `units = len(x) - 1` is used which gives an unbiased
    estimator of the (co)variance for i.i.d. observations.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The standard deviation of the values in x returned as a single numeric value.

    Examples
    --------
    >>> x = [1, 2, 3]
    >>> my_sd(x)
    1.0
    """
    _validate_numeric_sequence(x)
    mu = my_mean(x)
    squared_diff = [(i - mu)**2 for i in x]
    total = sum(squared_diff)
    units = len(x) - 1
    return sqrt(total / units)


def my_mean(x):
    """
    Mean of a sequence

    Computes arithmetic mean of a sequence with numeric or logical values.

    Parameters
    ----------
    x
        A numeric or logical sequence.

    Returns
    -------
    The arithmetic mean of the values in x returned as a single numeric value.

    Examples
    --------
    >>> x = range(0, 11)
    >>> my_mean(x)
    5.0
    """
    _validate_numeric_sequence(x)
    total = sum(x)
    units = len(x)
    return total / units
And our test_summary_stats.py file should look like:
from myfirstpypkg.summary_stats import my_mean
from myfirstpypkg.summary_stats import my_sd
from myfirstpypkg.summary_stats import z_score


def test_z_score():
    x = [1, 2, 3]
    expected = [-1, 0, 1]
    assert z_score(x) == expected


def test_my_sd():
    x = [1, 2, 3]
    assert my_sd(x) == 1


def test_my_mean():
    x = range(0, 11)
    assert my_mean(x) == 5
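The tests above compare floats with `==`, which works here because the example values come out exact. For less tidy inputs a tolerance-based comparison is safer. The sketch below uses `math.isclose` from the standard library; `all_close` is a hypothetical helper, and pytest's own `pytest.approx` serves a similar purpose inside a test suite.

```python
from math import isclose


def all_close(actual, expected, rel_tol=1e-9, abs_tol=1e-12):
    """Hypothetical helper: element-wise comparison with a tolerance."""
    return len(actual) == len(expected) and all(
        isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )


# Exact equality can fail for values that are equal on paper.
exact_match = [0.1 + 0.2] == [0.3]
tolerant_match = all_close([0.1 + 0.2], [0.3])
```

The small `abs_tol` matters for z-scores in particular, since an expected value of 0 can never be matched by a relative tolerance alone.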
With Python packages, users access objects with a module namespace approach. This means, since our z_score() function is located at:
.
└── src/myfirstpypkg
└── summary_stats.py
When they use our package they would have to access this function with one of the following:
import myfirstpypkg
myfirstpypkg.summary_stats.z_score(x)
or
from myfirstpypkg.summary_stats import z_score
z_score(x)
We can simplify this for the users by exporting the z_score() function so the user can simply do:
from myfirstpypkg import z_score
z_score(x)
To allow for this you simply add the following to the src/myfirstpypkg/__init__.py file:
from .summary_stats import z_score
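To see what this re-export does, the sketch below builds a throwaway package in a temporary directory and imports it. The package name `demo_pkg` and its placeholder `z_score` body are hypothetical; only the `__init__.py` line mirrors the one shown above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; "demo_pkg" is a hypothetical name.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "summary_stats.py").write_text(
    "def z_score(x):\n"
    "    return 'called z_score'\n"
)
# The re-export: the same kind of line the text adds to __init__.py.
(pkg / "__init__.py").write_text("from .summary_stats import z_score\n")

# Make the temporary directory importable, then import the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

# Both access paths resolve to the same function object.
same_object = demo_pkg.z_score is demo_pkg.summary_stats.z_score
```

The longer `demo_pkg.summary_stats.z_score` path keeps working, so adding a re-export does not break existing user code.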
Now before we commit our changes to git, let’s make sure everything still loads appropriately and passes all tests:
pip install -e .
pytest
As long as everything is passing, we can now commit our changes, push to Github and then do a pull request to incorporate our changes from our current working branch to the develop branch.
git add -A
git commit -m 'feat: add new validation & summary stats
> Add validation function to ensure proper numeric sequence input.
> Add summary stat functions to compute standard deviation and z-score.'
git push --set-upstream origin ch6
With the R and Python packages you created in the portfolio builder, work through the R and Python examples above to add new functionality to your package.
Subdirectories are actually allowed but they can only be for OS-specific purposes and named unix/ or windows/. See this Stack Overflow question for details.