As data scientists, we get used to the common workflows for exploring and modeling data. However, the package development workflow often differs from how we are used to working. This chapter is designed to get you started on a small package so you can experience the typical workflow. First, we’ll discuss the common steps and then we will work through them for both an R and a Python version of the package.
Once you have determined the need to create a package, the first things you need to do are identify a package name, create the basic package structure, and connect it to a version control system of interest (e.g. GitHub).
Naming our package is important. There are certain requirements we need to adhere to but, also, the name you choose should be easy to remember and follow the respective language’s idiomatic approach to naming. Moreover, the name you choose should not already exist.
R and Python have slightly different requirements regarding acceptable names. In both languages, names are built from letters and numbers and cannot start with a number. In R your name can also contain a . but not a - or _ while in Python your name can contain all three. In both languages you can combine upper and lowercase letters in the name.
However, our advice is to keep the name short, all lowercase, and with no separator when possible. Examples of good approaches to naming include:
Once you’ve come up with a name or two you probably want to make sure that the name is not already being used. The package managers (CRAN & PyPI) do not allow duplicate names so your name must be unique. Even if you don’t intend to share your package publicly, you should avoid duplicate names if possible. Both R and Python have tools that make this easy to do and we’ll cover them in the hands-on sections of this chapter.
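The naming rules described above can be captured in a few lines of code. The following is a minimal sketch rather than an official validator: the names is_valid_name and follows_advice and the two regular expressions are assumptions that encode only the rules stated in this section, and the checks actually performed by CRAN and PyPI are stricter in places.

```python
import re

# Hypothetical patterns that encode only the rules described in this section.
# R: letters, numbers and "."; Python: letters, numbers, ".", "-" and "_".
# In both cases the name may not start with a number (here: it must start with a letter).
NAME_PATTERNS = {
    "r": re.compile(r"[A-Za-z][A-Za-z0-9.]*"),
    "python": re.compile(r"[A-Za-z][A-Za-z0-9._-]*"),
}


def is_valid_name(name, language):
    """Return True if `name` satisfies the simplified naming rules for `language`."""
    pattern = NAME_PATTERNS[language.lower()]
    return pattern.fullmatch(name) is not None


def follows_advice(name):
    """Return True if `name` is all lowercase with no separators, as advised above."""
    return name.isalnum() and name == name.lower()
```

For example, is_valid_name("my_pkg", "r") is False while is_valid_name("my_pkg", "python") is True, and a name such as "myfirstpkg" passes both checks in either language.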
Once you have a name it’s time to create the package. This includes determining where the package will live on your operating system and creating the bare bones framework of the package.
When creating a package, the source code location (path) refers to where the source lives, not where the installed form lives. Recall in the introduction notebook that installed packages all live within a library directory. But when developing, you will keep the source code somewhere else.
Where should you keep source packages? The main principle is that this location should be distinct from where installed packages live. In the absence of external considerations, a typical user should designate a directory inside their home directory for R and Python (source) packages. This may be in directories such as ~/Desktop/Packages/ or /r/packages and /python/packages. Some of us use one directory for this, others divide source packages among a few directories. This probably reflects that we are primarily tool-builders. An academic researcher might organize their files around individual publications, whereas a data scientist might organize around data products and reports. There is no particular technical or traditional reason for one specific approach. As long as you keep a clear distinction between source and installed packages, just pick a strategy that works within your overall system for file organization, and use it consistently.
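To make the distinction between source and installed locations concrete, here is a small sketch that asks the running Python interpreter where it installs packages and checks that a candidate source directory is not inside that library. The helper names (installed_library_dir, is_inside, is_good_source_location) are hypothetical and only illustrate the principle; in R, the analogous library locations are reported by .libPaths().

```python
import sysconfig
from pathlib import Path


def installed_library_dir():
    """Directory where the current interpreter installs pure-Python packages."""
    return Path(sysconfig.get_paths()["purelib"]).resolve()


def is_inside(candidate, parent):
    """Return True if `candidate` is `parent` itself or lives somewhere below it."""
    candidate = Path(candidate).expanduser().resolve()
    parent = Path(parent).expanduser().resolve()
    try:
        candidate.relative_to(parent)
        return True
    except ValueError:
        return False


def is_good_source_location(candidate):
    """A source directory should be kept out of the installed-package library."""
    return not is_inside(candidate, installed_library_dir())
```

Calling is_good_source_location with the directory you plan to use is a quick sanity check before creating the package structure there.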
Once you have a location to hold the source code, both R and Python have tools that help to automate the creation of a basic package structure. We’ll use these tools in the hands-on sections of this chapter.
Most package development never resides solely on one operating system nor with only one developer. So not only should we be using git for version control, we should be using a hosting platform such as GitHub, GitLab, Azure Repos, or the like. So the first thing we need to do is connect our local git repository to a remote repository so we can push, pull, and merge updates we make to the package.
Once we have established our remote repository, we should set up a proper branching method for our work. We advise a Git flow branching strategy where:
In this branching approach, we will make updates to our package in a supporting branch, then do a pull request to merge these updates into the develop branch. Once the develop branch has enough updates to warrant a new release, we merge develop into master with a pull request.
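The local side of this branching approach can be sketched as a short list of git commands. The helper names, the default branch and remote names, and the placeholder commit message below are assumptions made for illustration; the pull request itself is opened on the hosting platform and is not part of the sketch.

```python
def supporting_branch_commands(branch, base="develop", remote="origin"):
    """Return the local git commands for starting and publishing a supporting branch."""
    return [
        ["git", "checkout", base],  # start from the develop branch
        ["git", "checkout", "-b", branch],  # create and switch to the supporting branch
        ["git", "add", "-A"],  # stage the changes made on the branch
        ["git", "commit", "-m", f"work on {branch}"],  # placeholder commit message
        ["git", "push", "--set-upstream", remote, branch],  # publish for a pull request
    ]


def pull_request_base(branch):
    """Which branch a pull request should target under the strategy described above."""
    return "master" if branch == "develop" else "develop"


if __name__ == "__main__":
    for command in supporting_branch_commands("add_mean"):
        print(" ".join(command))
    print("pull request base:", pull_request_base("add_mean"))
```

Each command list could be passed to subprocess.run(command, check=True) from inside a repository, although in practice these steps are usually typed at the terminal.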
This may seem like a lot to digest and it may seem like overkill when working on our own small prototype packages. However, this is how larger projects should be organized so adopting this approach now will make it easier when you start working on larger, group projects/packages. If you are unfamiliar with git and branching then we recommend these tutorials to get started:
In the hands-on sections of this chapter we’ll walk through setting up this git flow branching strategy.
By default, every project on your system will use the same library directory to store and retrieve packages. At first glance, this may not seem like a big deal but it does matter when different projects require different versions of dependency packages.
Consider the following scenario where you have two projects: Project A and Project B, both of which have a dependency on the same library, somepkg. The problem becomes apparent when we start requiring different versions of somepkg. Maybe Project A needs v1.0.0, while Project B requires the newer v2.0.0, for example.
This is a real problem for both R and Python since they can’t differentiate between versions in the library directory. Both v1.0.0 and v2.0.0 would need to reside in the same directory under the same name, and since packages are stored according to just their name, there is no differentiation between versions. Thus, both Project A and Project B would be required to use the same version, which is unacceptable in many cases.
This is where virtual environments come into play. Virtual environments create an isolated environment for our project. This means that each project can have its own dependencies, regardless of what dependencies every other project has. This creates better reliability during the development process.
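The Project A and Project B scenario can be mimicked with plain dictionaries. This is only a toy model under the assumption that a library is a mapping from package name to installed version; the install helper is hypothetical and does not reflect how either language’s installer is implemented.

```python
def install(library, name, version):
    """Install `name` into `library`; packages are stored by name only."""
    library[name] = version
    return library


# One shared library for every project: the second install overwrites the first.
shared_library = {}
install(shared_library, "somepkg", "1.0.0")  # needed by Project A
install(shared_library, "somepkg", "2.0.0")  # needed by Project B

# One isolated library per project: each keeps the version it needs.
environments = {"project_a": {}, "project_b": {}}
install(environments["project_a"], "somepkg", "1.0.0")
install(environments["project_b"], "somepkg", "2.0.0")

print(shared_library)  # {'somepkg': '2.0.0'}
print(environments)  # {'project_a': {'somepkg': '1.0.0'}, 'project_b': {'somepkg': '2.0.0'}}
```

In the shared case Project A silently ends up with v2.0.0, which is the failure mode that per-project environments avoid.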
In the hands-on sections of this chapter, we’ll set up R and Python virtual environments. We consider it a best practice to always do your development work in virtual environments.
To learn more about virtual environments we suggest starting with the following tutorials:
So we finally have our project set-up ready, now it’s time to start adding some functionality. When adding new functionality or fixing a bug, it’s important to approach it strategically. Rather than just adding or changing code, we should do it in a way that we know whether our code is successful or not. To accomplish this, we should use a test-driven development (TDD) approach.
The general idea behind TDD is to follow this basic approach:
This may seem like a lot of steps to remember but this process is rather quick and each step takes very small incremental steps forward. After implementing a few features in this manner it quickly becomes a natural approach to software development.
If you are interested in learning more about the TDD philosophy we recommend the following books:
Once your new functionality is successfully passing tests and has been refactored, you should make sure any necessary documentation is added or updated. This will include function specific documentation, module level documentation, any code in example notebooks or vignettes and the like.
Lastly, we need to save our work to our remote repository. This typically includes staging, committing and pushing our changes. Depending on the stage of our work we may be ready to do a pull request into the main development branch.
Okay, so this may seem like a lot of steps to keep track of. Often, this feels more convoluted on paper than when you are actually implementing so let’s start working with an example package to go through these steps.
Let’s work through a simple example to illustrate how to perform the steps discussed above for an R package. We’ll create a package that has a single function. To simplify this process we’ll just reimplement functionality that already exists in R: calculating the mean of a vector.
Although we are not going to publish this package we should still check to make sure there are no other packages that have the same name. For this example, I’m going to use the name “myfirstpkg”. We can use available::available() to check the availability of the name on CRAN, Bioconductor (another R package manager that is focused on biostats packages), and GitHub. It will even provide you hyperlinks to common websites that will let you know if this name is an existing abbreviation, a Wikipedia topic, or an Urban Dictionary entry, along with any unexpected sentiment.
In this case, we see that there are no conflicting packages on CRAN or Bioconductor but there are on GitHub. This is ok in this case as I don’t expect to import and use other people’s first packages! So let’s proceed.
available::available("myfirstpkg", browse = FALSE)
## ── myfirstpkg ────────────────────────────────────────────────────────
## Name valid: ✔
## Available on CRAN: ✔
## Available on Bioconductor: ✔
## Available on GitHub: ✖
## Abbreviations: http://www.abbreviations.com/myfirstpkg
## Wikipedia: https://en.wikipedia.org/wiki/myfirstpkg
## Wiktionary: https://en.wiktionary.org/wiki/myfirstpkg
## Urban Dictionary:
## Not found.
## Sentiment:???
Now let’s create the initial package structure. There are several ways to do this but in this case I’ve set up a pre-built template that helps automate the initial structure. Open up your terminal and run:
pip install cookiecutter
Now go to the location that you want the package source code to live (i.e. a directory on your Desktop versus in a Packages subdirectory). In my case, I am going to keep this in a misk subdirectory:
cd ~/Desktop/Workspace/Projects/misk
Next, we’ll run the following command. This uses cookiecutter to create our simplified pre-built package.
cookiecutter https://github.com/misk-data-science/package-template
When you run this, you will be asked a series of questions such as your name, email, and package specific questions. For my package, I provided the following command responses:
first_name [ex: John]: Brad
last_name [ex: Smith]: Boehmke
email [first.last@example.com]: bradleyboehmke@gmail.com
github_username [bradleyboehmke]: bradleyboehmke
package_language [r or python]: r
package_name [awesome]: myfirstpkg
package_short_description [short one-liner, ex: My first package]: My first package
version [0.1.0]:
url [https://github.com/bradleyboehmke/myfirstpkg]:
Select open_source_license:
1 - MIT license
2 - BSD license
3 - ISC license
4 - Apache Software License 2.0
5 - GNU General Public License v3
6 - Not open source
Choose from 1, 2, 3, 4, 5, 6 [1]: 5
The values in the brackets are the default values if you do not supply any input. For example, note how I did not enter any values for the version number or the URL. This is because 0.1.0 is nearly always a good first version number to start with. The Github URL is automatically generated so unless you choose to use a different remote the default should be a good choice.
You should now have a directory with your package name in your current directory. If you use the shell command ls you should see myfirstpkg listed. Here is what my directory looks like:
ls
ds-packages misk-dl misk-homl myfirstpkg package-template
The first thing we want to do is create a remote to push our source code to. In our case, we’ll use GitHub. First, create a new repository on GitHub.
To avoid errors, do not initialize the new repository with README, license, or gitignore files since our local directory already contains these files.
Now go back to your terminal and cd into the source code directory.
cd myfirstpkg
Next, we initialize the local package directory as a git repo and add the remote URL. This allows changes to be tracked and when we push and pull our changes it will push and pull from the URL location. We can always verify that our remote was added with git remote -v:
git init .
git remote add origin https://github.com/bradleyboehmke/myfirstpkg
# verify remotes were added
git remote -v
origin https://github.com/bradleyboehmke/myfirstpkg (fetch)
origin https://github.com/bradleyboehmke/myfirstpkg (push)
Now let’s create our first commit so that we have a baseline (although empty) package.
git add -A
git commit -m "create initial package structure"
git push --set-upstream origin master
Now if you look at your Github repo you will see the master branch has your current directory contents.
Recall from the version control section that we prefer to do development work with a support » develop » master branch framework. This has us doing work in a support branch and not the master or develop branch. Right now we only have a master branch established so let’s create a develop branch with git checkout -b develop and push the contents of the develop branch (which are the same as the master) to GitHub.
git checkout -b develop
git push --set-upstream origin develop
Now if you look at Github you’ll notice that we have two branches in our repo:
Whew 😥! We finally have our version control set up. That was a bit of work but you only need to do that once at package creation time.
Before we add any new functionality to our package, let’s create a virtual environment so we can keep all our package dependencies isolated to the location we are working in. Go ahead and open up the R project in the package source code location. You can do that by clicking on the myfirstpkg.Rproj file or from the command line with:
open myfirstpkg.Rproj
This will open the package project within RStudio, which is where we’ll do the majority of our work from here on out. We’ll use the renv package to create a virtual environment. Using bare = TRUE will create an empty project library (with the exception of renv itself). Running renv::init() uses the default bare = FALSE, which performs an automated search throughout your directory to identify required packages and automatically installs them into your environment.
# install.packages("renv") # install renv if necessary
renv::init(bare = TRUE)
This will add some hidden files and a renv/ directory to your repo. If you look inside the renv/ directory you will see that only one package exists, the renv package. The first thing we want to do is make sure we have the devtools package installed. This is the primary dev package required for package development. Running the following will install devtools into our virtual environment.
If you previously installed a package that meets the version number requested, renv is smart and does a linked cache rather than install the package completely. This means if you use the latest version of devtools in many projects you won’t be needlessly installing and storing multiples of the same version.
install.packages("devtools")
Now let’s install all the initial packages that we will need. The DESCRIPTION file always contains a package’s dependencies. By running the following, we will install all these dependencies into our environment.
devtools::install_deps()
The last thing we need to do is run renv::snapshot(). This saves the state of our project dependencies to a renv.lock file. That is, renv.lock holds all installed dependency packages required by your project. That way, anyone can recreate your virtual environment by running renv::restore().
renv::snapshot()
Anytime you add a new dependency package make sure you run renv::snapshot() to update the renv.lock project dependency list.
Now that our virtual environment is set up, let’s add some functionality to our package. The first thing we’ll add is a function that computes the mean of a vector. Recall that when we add new functionality we:
So first, we’ll create a new branch, which we can just call add_mean:
git checkout -b add_mean
Now let’s create two new files:
usethis::use_r("mean"): creates a new mean.R file within the R/ directory. This is where our function will be written.
usethis::use_test("mean"): creates a new test-mean.R file within the tests/testthat/ directory. This is where the tests that ensure our function is working correctly will go.
usethis::use_r("mean")
usethis::use_test("mean")
Later chapters will discuss ways to organize your source code within the R/ directory and your associated tests. For now, we’ll just create a new script for both.
Now open up the R/mean.R file and insert a shell function. Since this function is empty it simply returns NULL but it will allow us to create and run a failing first test.
my_mean <- function(x) {
}
Now open the tests/testthat/test-mean.R file and let’s create our first test. We can always add more tests later but for the first test we just want to check for the most basic functionality. Consequently, this test simply checks that the mean of a vector containing the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 is 5.
test_that("mean of simple vectors compute accurately", {
  x <- 0:10
  expect_equal(my_mean(x), 5)
})
To run our test, we need to first load our package’s function (just my_mean for now) and then execute the test. You can do this from the RStudio console with:
devtools::load_all()
devtools::test()
## ✓ | OK F W S | Context
## x | 0 1 | mean
## ─────────────────────────────────────────────────────────────────────────────────────
## test-mean.R:4: failure: mean of simple vectors compute accurately
## my_mean(x) not equal to 5.
## target is NULL, current is numeric
## ─────────────────────────────────────────────────────────────────────────────────────
##
## ══ Results ══════════════════════════════════════════════════════════════════════════
## OK: 0
## Failed: 1
## Warnings: 0
## Skipped: 0
You will find yourself doing this pattern often. The shortcut for devtools::load_all() is Ctrl + Shift + L (Windows & Linux) or Cmd + Shift + L (macOS) and the shortcut for devtools::test() is Ctrl/Cmd + Shift + T.
As you see above, our test is failing. So let’s start adding code to make this test pass. We know that the mean is computed as \(m = \frac{\text{sum of terms}}{\text{number of terms}}\). Let’s update our function to account for that:
my_mean <- function(x) {
  total <- sum(x)
  units <- length(x)
  return(total / units)
}
Now running our test again we get success!
devtools::test()
## ✓ | OK F W S | Context
## ✓ | 1 | mean
##
## ══ Results ══════════════════════════════════════════════════════════════════════════
## OK: 1
## Failed: 0
## Warnings: 0
## Skipped: 0
We likely want to iterate like this to make our function more robust. For example, what happens if a user passes non-numeric values or a non-vector data structure to our function? Or what happens if our vector contains NA, NaN, or Inf values? For each scenario like this we want to create a test first, make sure it fails, and then add the new functionality to our code to make our tests pass. We’ll explore this process more in the notebook on tests.
We have empirical evidence that my_mean() works. But how can we be sure that all the moving parts of our package still work? This may seem silly to check, after such a small addition, but it’s good to establish the habit of checking this often. R CMD check, executed in the shell, is the gold standard for checking that an R package is in full working order. devtools::check() is a convenient way to run this without leaving your R session.
A shortcut for devtools::check() is Ctrl/Cmd + Shift + E.
Also, note that this check produces rather voluminous output. The last output will list the key results and, if there are errors or warnings, it will typically point you in the right direction to resolve them.
devtools::check()
## ── Building ────────────────────────────────────────────────────────────────── myfirstpkg ──
## Setting env vars:
## ● CFLAGS : -Wall -pedantic -fdiagnostics-color=always
## ● CXXFLAGS : -Wall -pedantic -fdiagnostics-color=always
## ● CXX11FLAGS: -Wall -pedantic -fdiagnostics-color=always
## ────────────────────────────────────────────────────────────────────────────────────────────
## ✓ checking for file ‘/Users/b294776/Desktop/Workspace/Projects/misk/myfirstpkg/DESCRIPTION’ ...
## ...
## ...
## ── R CMD check results ─────────────────────────────────────────────── myfirstpkg 0.1.0 ────
## Duration: 11.4s
##
## 0 errors ✓ | 0 warnings ✓ | 0 notes ✓
We now have new working functionality in our package, so the last thing we want to do is properly document our function. We do this with roxygen2, which was installed in our environment when we ran devtools::install_deps(). We can add object-level documentation to our my_mean() function like this:
#' Mean of a vector
#'
#' @description
#' Computes arithmetic mean of a vector with numeric or logical values.
#'
#' @param x A numeric or logical vector.
#'
#' @return
#' The arithmetic mean of the values in x returned as a numeric vector of length one.
#'
#' @examples
#' x <- 1:10
#' my_mean(x)
#'
#' @export
my_mean <- function(x) {
  total <- sum(x)
  units <- length(x)
  return(total / units)
}
After adding the documentation we can run devtools::document() (or Cmd/Ctrl + Shift + D):
devtools::document()
## Updating myfirstpkg documentation
## Loading myfirstpkg
## Writing my_mean.Rd
## Writing NAMESPACE
## Documentation completed
Now you can run ?my_mean and you’ll see the help documentation show up in the RStudio window:
After adding the documentation it’s best to re-run the tests and R CMD check to ensure nothing unexpected happened while adding the documentation. Later chapters will go into the details of roxygen documentation along with other documentation we should be updating along the way (e.g. NEWS.md).
We now have the new functionality and documentation in place. Now we need to commit our changes, push them to Github and do a pull request to merge into the development branch of the package.
Learn about writing clear and effective git messages here.
git add -A
git commit -m "feat: add my_mean to compute arithmetic mean"
git push --set-upstream origin add_mean
Once the changes have been pushed to Github you will notice the updated branch changes were successfully pushed. Next, select the “Compare & pull request” button next to the new changes:
Be sure to choose the “base: develop” option for the pull request. This will signal that we want to merge our changes in the feature branch with the develop branch:
Next we add a good summary of the pull request that signals what we added/changed and that we ran tests and checks successfully. After you add the message, go ahead and tag a friend or colleague to review your pull request. With more formalized projects we typically require that 1-2 folks have reviewed code changes in pull requests.
Once all reviewers have successfully signed off on the pull request go ahead and merge it into develop.
If pull requests are new to you, read more about them here:
Now we’ll work through a simple example to illustrate how to perform the basic workflow steps for a Python package. As in the previous section, we’ll create a package that has a single function: calculating the mean of a vector.
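Although we are not going to publish this package we should still check to make sure there are no other packages that have the same name. For this example, I’m going to use the name “myfirstpypkg”. We can use pip search to check the availability of the name on PyPI. Note that PyPI has since disabled the search endpoint that pip search relies on, so the command may return an error on current installations; searching for the name on the PyPI website is a simple fallback.
If you followed along in the last section’s R example then make sure you use a different name.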
In this case, we see that there are no conflicting packages on PyPI! So let’s proceed.
pip search myfirstpypkg
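As another way to check a name, the sketch below asks PyPI’s JSON endpoint whether a project with that name exists. It is a hedged illustration: the helper names are hypothetical, the URL pattern and the convention that a missing project produces a 404 response are assumptions about PyPI’s JSON API, and the opener argument exists only so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def pypi_json_url(name):
    """URL of the PyPI JSON endpoint for a project name (assumed URL pattern)."""
    return f"https://pypi.org/pypi/{name}/json"


def is_name_taken(name, opener=urllib.request.urlopen):
    """Return True if PyPI answers for `name`, False if it responds with 404.

    `opener` is injectable so the logic can be tested without a network call.
    """
    try:
        with opener(pypi_json_url(name)) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # any other HTTP error is left for the caller to handle
```

With network access, is_name_taken("myfirstpypkg") performs the actual lookup; a False result suggests the name is free on PyPI at that moment.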
Now let’s create the initial package structure. There are several ways to do this but, similar to the R example, we’ll use a pre-built template that helps automate the initial structure. Open up your terminal. If you didn’t follow along in the last section then run the following to install cookiecutter:
pip install cookiecutter
Now go to the location that you want the package source code to live (i.e. a directory on your Desktop versus in a Packages subdirectory). In my case, I am going to keep this in a misk subdirectory:
cd ~/Desktop/Workspace/Projects/misk
Next, run the following command:
cookiecutter https://github.com/misk-data-science/package-template
When you run this, you will be asked a series of questions such as your name, email, and package specific questions. For my package, I provided the following command responses:
first_name [ex: John]: Brad
last_name [ex: Smith]: Boehmke
email [first.last@example.com]: bradleyboehmke@gmail.com
github_username [bradleyboehmke]: bradleyboehmke
package_language [r or python]: python
package_name [awesome]: myfirstpypkg
package_short_description [short one-liner, ex: My first package]: My first package
version [0.1.0]:
url [https://github.com/bradleyboehmke/myfirstpypkg]:
Select open_source_license:
1 - MIT license
2 - BSD license
3 - ISC license
4 - Apache Software License 2.0
5 - GNU General Public License v3
6 - Not open source
Choose from 1, 2, 3, 4, 5, 6 [1]: 5
The values in the brackets are the default values if you do not supply any input. For example, note how I did not enter any values for the version number or the URL. This is because 0.1.0 is nearly always a good first version number to start with. The Github URL is automatically generated so unless you choose to use a different remote the default should be a good choice.
You should now have a directory with your package name in your current directory. If you use the shell command ls you should see myfirstpypkg listed. Here is what my directory looks like:
ls
ds-packages misk-dl misk-homl myfirstpkg myfirstpypkg package-template
The first thing we want to do is create a remote to push our source code to. In our case, we’ll use GitHub. First, create a new repository on GitHub.
To avoid errors, do not initialize the new repository with README, license, or gitignore files since our local directory already contains these files.
Now go back to your terminal and cd into the source code directory.
cd myfirstpypkg
Next, we initialize the local package directory as a git repo and add the remote URL. This allows changes to be tracked and when we push and pull our changes it will push and pull from the URL location. We can always verify that our remote was added with git remote -v:
git init .
git remote add origin https://github.com/bradleyboehmke/myfirstpypkg
# verify remotes were added
git remote -v
origin https://github.com/bradleyboehmke/myfirstpypkg (fetch)
origin https://github.com/bradleyboehmke/myfirstpypkg (push)
Now let’s create our first commit so that we have a baseline (although empty) package.
git add -A
git commit -m "create initial package structure"
git push --set-upstream origin master
Now if you look at your Github repo you will see the master branch has your current directory contents.
Recall from the version control section that we prefer to do development work with a support » develop » master branch framework. This has us doing work in a support branch and not the master or develop branch. Right now we only have a master branch established so let’s create a develop branch with git checkout -b develop and push the contents of the develop branch (which are the same as the master) to GitHub.
git checkout -b develop
git push --set-upstream origin develop
Now if you look at Github you’ll notice that we have two branches in our repo:
Whew 😥! We finally have our version control set up. That was a bit of work but you only need to do that once at package creation time.
Before we add any new functionality to our package, let’s create a virtual environment so we can keep all our package dependencies isolated to the location we are working in. Go ahead and open up the Python project in your favorite editor. I will be using VS Code and I can open up the project with:
This assumes that you have already set the project directory as the working directory.
code .
This will open the project, which is where we’ll do the majority of our work from here on out. We’ll use the venv module to create a virtual environment. Run the following in your terminal to create and activate the virtual environment:
python -m venv venv
source venv/bin/activate
This will add a venv/ directory to your repo. If you look inside the venv/ directory you will see a lib/ directory that contains a couple of basic packages.
Now let’s install all the initial packages that we will need. The setup.py file contains package dependencies. By running the following, we will install all these dependencies into our environment. The -e flag installs our empty package in an editable fashion; this means that as we make updates to the package we won’t need to reinstall along the way. The ".[dev]" argument installs all dependencies required for development. We’ll cover this in the chapter on package metadata.
pip install -e ".[dev]"
Whenever you are done working in your virtual environment you can run deactivate to exit out of your virtual environment.
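If you are ever unsure whether the environment is active, the interpreter itself can tell you. The following is a small sketch with hypothetical helper names; it relies on the fact that inside an environment created by venv, sys.prefix differs from sys.base_prefix.

```python
import sys
import sysconfig


def in_virtual_environment():
    """Return True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


def package_install_dir():
    """Directory where pure-Python packages are installed for this interpreter."""
    return sysconfig.get_paths()["purelib"]


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
    print("packages install into:", package_install_dir())
```

Running this before and after source venv/bin/activate should show the install directory switching to a path inside venv/.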
Now that our virtual environment is set up, let’s add some functionality to our package. The first thing we’ll add is a function that computes the mean of a vector. Recall that when we add new functionality we:
So first, we’ll create a new branch, which we can just call add_mean:
git checkout -b add_mean
Now let’s create two new files:
touch src/myfirstpypkg/mean.py: creates a new mean.py file within the source code directory. This is where our function will be written.
touch tests/test_mean.py: creates a new test_mean.py file within the tests/ directory. This is where the tests that ensure our function is working correctly will go.
touch src/myfirstpypkg/mean.py
touch tests/test_mean.py
Later chapters will discuss ways to organize your source code within the src/ directory and your associated tests. For now, we’ll just create a new script for both.
Now open up the src/myfirstpypkg/mean.py file and insert a shell function. pass is a null operation: when it’s executed, nothing happens and consequently the function will return None. This allows us to create and run a failing first test.
def my_mean(x):
    pass
Now open the tests/test_mean.py file and let’s create our first test. We can always add more tests later but for the first test we just want to check for the most basic functionality. Consequently, this test simply checks that the mean of the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 is 5.
from myfirstpypkg.mean import my_mean

def test_my_mean():
    x = range(0, 11)
    assert my_mean(x) == 5
Now we can run our tests by executing pytest at the command line:
pytest
===================================== test session starts ======================================
platform darwin -- Python 3.7.3, pytest-5.4.2, py-1.8.1, pluggy-0.13.1
rootdir: /Users/b294776/Desktop/Workspace/Projects/misk/myfirstpypkg, inifile: setup.cfg, testpaths: tests/
collected 1 item
tests/test_mean.py F [100%]
=========================================== FAILURES ===========================================
_________________________________________ test_my_mean _________________________________________
def test_my_mean():
x = range(0, 11)
> assert my_mean(x) == 5
E assert None == 5
E + where None = my_mean(range(0, 11))
tests/test_mean.py:5: AssertionError
=================================== short test summary info ====================================
FAILED tests/test_mean.py::test_my_mean - assert None == 5
====================================== 1 failed in 0.09s =======================================
As you see above, our test is failing. So let’s start adding code to make this test pass. We know that the mean is computed as \(m = \frac{\text{sum of terms}}{\text{number of terms}}\). Let’s update our function to account for that:
def my_mean(x):
    total = sum(x)
    units = len(x)
    return total / units
Now running our test again we get success!
pytest
========================================= test session starts =========================================
platform darwin -- Python 3.7.3, pytest-5.4.2, py-1.8.1, pluggy-0.13.1
rootdir: /Users/b294776/Desktop/Workspace/Projects/misk/myfirstpypkg, inifile: setup.cfg, testpaths: tests/
collected 1 item
tests/test_mean.py . [100%]
========================================== 1 passed in 0.01s ==========================================
We likely want to iterate like this to make our function more robust. For example, what happens if a user passes non-numeric values or a non-list data structure to our function? Or what happens if our list contains None, np.nan, or some other missing value representation? For each scenario like this we want to create a test first, make sure it fails, and then add the new functionality to our code to make our tests pass. We’ll explore this process more in the chapter on testing.
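As an illustration of one such iteration, the sketch below shows where a couple of test-first cycles might end up. The behaviour chosen here (skipping None and NaN, raising ValueError on empty input) is an assumption made for the example rather than a requirement from the text; in practice each test would be written and seen to fail before the corresponding code is added.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing (an illustrative choice)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def my_mean(x):
    """Arithmetic mean that skips missing values and rejects empty input."""
    values = [v for v in x if not _is_missing(v)]
    if not values:
        raise ValueError("my_mean() needs at least one non-missing value")
    return sum(values) / len(values)


def test_basic_mean():
    assert my_mean(range(0, 11)) == 5


def test_missing_values_are_skipped():
    assert my_mean([1, None, 3, float("nan")]) == 2


def test_empty_input_raises():
    try:
        my_mean([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

Because the test functions follow the test_ naming convention, pytest would collect them if they lived in tests/test_mean.py.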
We have empirical evidence that my_mean() works so now let’s talk about documentation.
We now have new working functionality in our package, so the last thing we want to do is properly document our function. We do this with docstrings. We can add object-level documentation to our my_mean() function like this:
def my_mean(x):
    """
    Mean of a vector

    Computes arithmetic mean of a vector with numeric or logical values.

    Parameters
    ----------
    x
        A numeric or logical list.

    Returns
    -------
    The arithmetic mean of the values in x, returned as a single number.

    Examples
    --------
    >>> x = range(0, 11)
    >>> my_mean(x)
    5.0
    """
    total = sum(x)
    units = len(x)
    return total / units
After adding the documentation we can now get various help documentation on the function. For example, hovering over the function in my editor shows the following:
After adding the documentation it’s best to re-run the tests to ensure nothing unexpected happened while adding the documentation. Later chapters will go into the details of docstrings along with other documentation we should be updating along the way (e.g. module level docstrings, CHANGELOG.md).
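One way to keep docstring examples honest is to execute them. The sketch below uses the standard library doctest module to run the examples in a function’s docstring; the shortened docstring and the run_docstring_examples helper are assumptions for illustration and are not part of the package template.

```python
import doctest


def my_mean(x):
    """
    Mean of a sequence of numbers.

    Examples
    --------
    >>> x = range(0, 11)
    >>> my_mean(x)
    5.0
    """
    total = sum(x)
    units = len(x)
    return total / units


def run_docstring_examples(func):
    """Run the examples in `func`'s docstring and return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Supply the function itself as the only global the examples need.
    for test in finder.find(func, func.__name__, globs={func.__name__: func}):
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted


if __name__ == "__main__":
    print(run_docstring_examples(my_mean))
```

The same docstring is what help(my_mean) displays in an interactive session, so a passing doctest also means the example readers see matches the function’s actual behaviour.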
We now have the new functionality and documentation in place. Now we need to commit our changes, push them to Github and do a pull request to merge into the development branch of the package.
Learn about writing clear and effective git messages here.
git add -A
git commit -m "feat: add my_mean to compute arithmetic mean"
git push --set-upstream origin add_mean
Once the changes have been pushed to Github you will notice the updated branch changes were successfully pushed. Next, select the “Compare & pull request” button next to the new changes:
Be sure to choose the “base: develop” option for the pull request. This will signal that we want to merge our changes in the feature branch with the develop branch:
Next we add a good summary of the pull request that signals what we added/changed and that we ran tests and checks successfully. After you add the message, go ahead and tag a friend or colleague to review your pull request. With more formalized projects we typically require that 1-2 folks have reviewed code changes in pull requests.
Once all reviewers have successfully signed off on the pull request go ahead and merge it into develop.
If pull requests are new to you, read more about them here:
Pick at least one exercise for R and one for Python to complete. Write up a summary of the findings and share with a colleague: