The following dictionary accompanies the publication of the upcoming O’Reilly title Python & R for the Modern Data Scientist by Rick J Scavetta & Boyan Angelov. It is meant to be used as a quick reference for translating commands between Python & R. Visit the book’s repo for access to other resources.

Corrections & additions are welcome. Please contact Rick or open an issue on the repo. For a downloadable summary, please visit the book’s website at https://moderndata.design/

Aside from some command line expressions to be entered in the terminal, which are explicitly noted, expressions are in Python (right side) or R (left side).

1 Package management

1.1 Installing a single package

install.packages("tidyverse")

# Command line
pip install pandas 

1.2 Installing specific package versions

devtools::install_version(
  "ggmap", 
  version = "3.5.2"
  )

# Command line
pip install pandas==1.1.0 

1.3 Installing multiple packages

install.packages(c("sf", "ggmap"))

# Command line
pip install pandas scikit-learn seaborn

Write a list of all packages (and versions) in use to requirements.txt.

# Command line
pip freeze > requirements.txt

Use requirements.txt as input to install packages in a new environment.

# Command line
pip install -r requirements.txt
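
Installed versions can also be checked from within Python, which is handy when comparing an environment against requirements.txt. The following is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); the helper name installed_version is hypothetical and only used for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if pandas is installed, otherwise None
print(installed_version("pandas"))
```

Returning None rather than raising keeps the check usable in a loop over many package names.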

1.4 Loading Packages

# Multiple calls to library()
library(MASS)
library(nlme)
library(psych)
library(sf)

# Install if not already available:
if (!require(readr)) {
  install.packages("readr")
  library(readr)
  }

# Check, install if necessary, and load single or multiple packages:
pacman::p_load(MASS, nlme, psych, sf)

# Full package
import math
from sklearn import * # Less recommended, see below

# Full package with alias
import pandas as pd

# Module
from sklearn import datasets

# Module with alias
import statsmodels.api as sm

# Function
from statsmodels.formula.api import ols # For ordinary least squares regression

2 Assign Operators

Definitions:

  • RHS: Right-hand side
  • LHS: Left-hand side

2.1 Typical

Operator Direction Environment Name Comment
<- RHS to LHS Current Assignment Operator (leftwards) Preferred: Common and unambiguous.
= RHS to LHS Current Assignment Operator (leftwards) Less preferred. Common but easily confused with == (equivalency) and = (assign to function argument). No corollary super assignment.
-> LHS to RHS Current Assignment operator (rightwards) Less preferred. Uncommon, easily overlooked and unexpected. Often used at the end of a long dplyr/tidyverse chain of functions; choose %<>% instead.

Operator Direction Environment Name Comment
= RHS to LHS Current Simple assignment operator Preferred. Use following environment scoping rules.

2.2 Super assignment

Operator Direction Environment Name Comment
<<- RHS to LHS Parent Super assignment operator (leftwards) Common. Use following environment scoping rules.
->> LHS to RHS Parent Super assignment operator (rightwards) Less common.

2.3 Special cases & incrementals

n.b. How thorough do we want to be here?

These operators are particularly preferred when using a dplyr/tidyverse chain of functions.

Operator Direction Environment Name Comment
%>% LHS to RHS Current Pipe Assign to the first argument of the downstream function, magrittr package.
%$% LHS to RHS Current Exposition pipe Expose the named elements to the downstream function, magrittr package.
%<>% RHS to LHS Current Assignment pipe Assign to the first argument of the downstream function and assign output in situ, magrittr package.
%<-% RHS to LHS Current Multiple assign Assign to multiple objects, zeallot package.

Operator Direction Environment Name Comment
+= RHS to LHS Current Increment assignment Adds a value to the variable and assigns the result to that variable.
-= RHS to LHS Current Decrement assignment Subtracts a value from the variable and assigns the result to that variable.
*= RHS to LHS Current Multiplication assignment Multiplies the variable by a value and assigns the result to that variable.
/= RHS to LHS Current Division assignment Divides the variable by a value and assigns the result to that variable.
**= RHS to LHS Current Power assignment Raises the variable to a specified power and assigns the result to the variable.
%= RHS to LHS Current Modulus assignment Computes the modulus of the variable and a value and assigns the result to that variable.
//= RHS to LHS Current Floor division assignment Floor divides the variable by a value and assigns the result to that variable.
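
The Python incremental operators in the table above can be traced on a single variable. This is only an illustrative sketch; the starting value is arbitrary. Tuple unpacking is included as a rough counterpart to the zeallot multiple-assign operator.

```python
x = 10
x += 5     # 15
x -= 3     # 12
x *= 2     # 24
x /= 4     # 6.0 (true division always returns a float)
x **= 2    # 36.0
x %= 7     # 1.0
x //= 1    # 1.0
print(x)

# Multiple assignment via tuple unpacking
a, b = 1, 2
print(a, b)
```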

3 Types

The four most common user-defined types.

Type Data frame shorthand Tibble shorthand Description Example
Logical logi <lgl> Binary data TRUE/FALSE, T/F, 1/0
Integer int <int> Whole numbers from [-\(\infty\), \(\infty\)] 7, 9, 2, -4
Double num <dbl> Real numbers from [-\(\infty\), \(\infty\)] 3.14, 2.78, 6.45
Character chr <chr> All alpha-numeric characters, including white spaces "Apple", "Dog",

Type Base shorthand Pandas shorthand Description Example
Boolean bool bool Binary data True/False
Integer int int Whole numbers from [-\(\infty\), \(\infty\)] 7, 9, 2, -4
Float float float Real numbers from [-\(\infty\), \(\infty\)] 3.14, 2.78, 6.45
String str object All alpha-numeric characters, including white spaces 'Apple', 'Dog',
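
The base Python shorthands in the table are what type() reports. A short sketch, using example values from the table:

```python
values = [True, 7, 3.14, 'Apple']

# type(...).__name__ gives the base shorthand for each value
type_names = [type(v).__name__ for v in values]
print(type_names)   # ['bool', 'int', 'float', 'str']

# bool is a subclass of int, so True behaves like 1 in arithmetic
print(isinstance(True, int))   # True
print(True + True)             # 2
```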

4 Arithmetic Operators

Description R Operator Python Operator
Addition + +
Subtraction - -
Multiplication * *
Division (float) / /
Exponentiation ^ or ** **
Integer Division (floor) %/% //
Modulus %% %
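
A few of these operators behave in ways that are easy to overlook, especially with negative operands. The sketch below shows the Python side; note that ^ is bitwise XOR in Python, not exponentiation.

```python
true_div = 7 / 2       # 3.5
floor_div = 7 // 2     # 3
neg_floor = -7 // 2    # -4, rounds towards negative infinity
modulus = 7 % 2        # 1
neg_mod = -7 % 2       # 1, the result takes the sign of the divisor
power = 2 ** 10        # 1024
xor = 2 ^ 10           # 8, bitwise XOR rather than a power

print(true_div, floor_div, neg_floor, modulus, neg_mod, power, xor)
```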

5 Attributes

n.b. Focus more on listing than setting new attributes.

# List attributes
attributes(df)

# Accessor functions
dim(df)
names(df)
class(df)
comment(df)

# Add comment
comment(df) <- "new info"

# Add custom attribute
attr(df, "custom") <- "alt info"
attributes(df)$custom

# Definition of a class
class Food:
    name = 'toast'

# An instance of a class    
breakfast = Food()

# An attribute of the class
# inherited by the instance
breakfast.name

# Setting an attribute
breakfast.name = 'muesli'
# setattr(breakfast, 'name', 'muesli')
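
For listing attributes, roughly what attributes() does in R, Python offers vars(), getattr() and hasattr(). The sketch below extends the Food class with a hypothetical calories instance attribute purely for illustration.

```python
class Food:
    name = 'toast'                   # class attribute, shared by instances

    def __init__(self, calories):
        self.calories = calories     # instance attribute (hypothetical)


breakfast = Food(250)

# Instance attributes only, returned as a dict
print(vars(breakfast))                      # {'calories': 250}

# Look up an attribute by name, with an optional default
print(getattr(breakfast, 'name'))           # toast
print(getattr(breakfast, 'price', None))    # None

# Check for, then add, a custom attribute
print(hasattr(breakfast, 'comment'))        # False
setattr(breakfast, 'comment', 'new info')
print(vars(breakfast))    # {'calories': 250, 'comment': 'new info'}
```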

6 Keywords

?reserved
if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_, ... (..1, ..2, etc.) 

# py Keywords
import keyword
print(keyword.kwlist)
## ['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']

7 Functions & Methods

# Basic definition
myFunc <- function (x, ...) {
  x * 10
}

myFunc(4)
## [1] 40
# Multiple unnamed arguments
myFunc <- function (...) {
  sum(...)
}

myFunc(100,40,60)
## [1] 200

# Simple definition
def my_func(x):
    return x * 10

my_func(4)
## 40

# Multiple unnamed (positional) arguments, collected as a tuple
def my_func(*x):
    return x[2]

my_func(100, 40, 60)
## 60

# Multiple keyword arguments, collected as a dict
def my_func(**numb):
    print("x: ", numb["x"])
    print("y: ", numb["y"])

my_func(x = 40, y = 100)
## x:  40
## y:  100
# Using docstrings
def my_func(**numb):
    """An example function
    that takes multiple unknown arguments.
    """
    print("x: ", numb["x"])
    print("y: ", numb["y"])

# Access docstrings with the __doc__ dunder attribute
my_func.__doc__
## 'An example function
##  that takes multiple unknown arguments.'
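
The argument styles above can be combined in one signature. A minimal sketch; the function name describe and the parameter scale are hypothetical.

```python
def describe(x, *args, scale=10, **kwargs):
    """Return the scaled first argument plus any extra arguments received."""
    return x * scale, args, kwargs


# Extra positional arguments land in args, extra keyword arguments in kwargs
result = describe(4, 100, 40, scale=2, unit='cm')
print(result)        # (8, (100, 40), {'unit': 'cm'})

# With only the required argument, the default scale applies
print(describe(4))   # (40, (), {})
```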

8 Style and Naming conventions

Style in R is generally more loosely defined than in Python. Nonetheless, see the Advanced R Style guide or Google’s R Style guide for suggestions.

Indentation and Spacing:

White space is generally for style and inconsequential to execution. Add a space around operators, and use a tab to indent on successive lines of long commands.

Naming in a script:

The trend is currently towards lowercase snake case: underscores ("_") between words and only lower case letters.

my_data <- 1:6

See the PEP 8 style guide.

Indentation and Spacing:

White space, in particular indentation, is part of Python execution. Use 4 spaces instead of tabs (this can be set in your text editor).

Naming in a script:

Type Style Example
Functions & variables lowercase snake case func, my_func, var, my_var, x

When defining classes:

Type Style Example
Class Capitalized camel case Recipe, MyClass
Method lowercase snake case class_method, method
Constant Full uppercase snake case CONS, MY_CONS, LONG_NAME_CONSTANT

In packages:

Type Style Example
Packages & module lowercase snake case mypackage, module.py, my_module.py

Naming conventions with _:

Naming Meaning
_var A convention used to show that a variable is meant for internal use within a function or method
var_ A convention used to avoid naming conflicts with Python keywords
__var Triggers name mangling when used in a class context to prevent inheritance collisions. Enforced by the Python interpreter
__var__ Dunder (“double underscore”) variables. Special methods defined by the Python language. Avoid this naming scheme for your own attributes
_ Naming a temporary or insignificant variable, e.g. in a for loop
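
The underscore conventions can be seen in action in a short sketch. The class and attribute names are hypothetical.

```python
class Recipe:
    def __init__(self):
        self._notes = 'internal'      # convention only, still accessible
        self.__secret = 'mangled'     # stored as _Recipe__secret


r = Recipe()
print(r._notes)                  # internal
print(r._Recipe__secret)         # mangled
print(hasattr(r, '__secret'))    # False, the unmangled name does not exist

# A trailing underscore avoids a clash with the keyword `class`
class_ = 'dessert'

# A lone underscore marks a loop variable that is never used
count = 0
for _ in range(3):
    count += 1
print(count)                     # 3
```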

9 Analogous Data Storage Objects

R structure Analogous Python Structure(s)
Vector (1-dimensional homogeneous) ndarray, but also scalars, Homogeneous list & tuple
Vector, matrix or array (homogeneous) NumPy n-dimensional Array (ndarray)
Unnamed list (heterogeneous) list
Named list (heterogeneous) Dictionary dict (insertion-ordered since Python 3.7, but without positional indexing)
Environment (named, but unordered elements) Dictionary dict
Variable/column in a data.frame. Pandas Series (pd.Series)
2-dimensional data.frame Pandas DataFrame (pd.DataFrame)

Python Structure Analogous R Structure(s)
scalar 1-element long vector
list (homogeneous) Vector, but as if lacking vectorization
list (heterogeneous) Unnamed list
tuple (immutable, homogeneous) Vector, list as separated output from a function.
tuple (immutable, heterogeneous) Vector, list as separated output from a function.
Dictionary dict, a collection of key-value pairs Named list or, better, environment.
NumPy n-dimensional Array (ndarray) Vector, matrix or array.
Pandas Series (pd.Series) Vector, variable/column in a data.frame.
Pandas DataFrame (pd.DataFrame) 2-dimensional data.frame.
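
The remark that a Python list behaves like a vector "lacking vectorization" can be made concrete. A small sketch using the distances that appear later in this section; the tuple is a made-up example.

```python
dist = [584, 1054, 653]

# Multiplying a list repeats it instead of scaling each element
repeated = dist * 2
print(repeated)    # [584, 1054, 653, 584, 1054, 653]

# Element-wise arithmetic needs an explicit loop or comprehension
doubled = [d * 2 for d in dist]
print(doubled)     # [1168, 2108, 1306]

# Tuples are immutable: item assignment raises a TypeError
pair = (584, 'Munich')
try:
    pair[0] = 0
    mutated = True
except TypeError:
    mutated = False
print(mutated)     # False
```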

9.1 One-dimensional, Homogeneous

# Vectors
cities_R <- c("Munich", "Paris", "Amsterdam")
dist_R <- c(584, 1054, 653)

# Lists
cities = ['Munich', 'Paris', 'Amsterdam']
dist = [584, 1054, 653]

9.2 One-dimensional, Heterogeneous

Key-value pairs: lists in R, dictionaries in Python.

# A list of data frames
cities_list <- list(Munich = data.frame(dist = 584,
                                        pop = 1484226,
                                        area = 310.43,
                                        country = "DE"),
                    Paris = data.frame(dist = 1054,
                                       pop = 2175601,
                                       area = 105.4,
                                       country = "FR"),
                    Amsterdam = data.frame(dist = 653,
                                           pop = 1558755,
                                           area = 219.32,
                                           country = "NL"))
# As a list object
cities_list[1] 
## $Munich
##   dist     pop   area country
## 1  584 1484226 310.43      DE
cities_list["Munich"]
## $Munich
##   dist     pop   area country
## 1  584 1484226 310.43      DE
# As a data.frame object
cities_list[[1]] 
##   dist     pop   area country
## 1  584 1484226 310.43      DE
cities_list$Munich
##   dist     pop   area country
## 1  584 1484226 310.43      DE
# A list of heterogeneous data
lm_list <- lm(weight ~ group, data = PlantGrowth)

# length(lm_list)
# names(lm_list)

# lists
city_l = ['Munich', 'Paris', 'Amsterdam']

dist_l = [584, 1054, 653]

pop_l = [1484226, 2175601, 1558755]

area_l = [310.43, 105.4, 219.32] 

country_l = ['DE', 'FR', 'NL']
import numpy as np

# Numpy arrays
city_a = np.array(['Munich', 'Paris', 'Amsterdam'])
city_a
## array(['Munich', 'Paris', 'Amsterdam'], dtype='<U9')
pop_a = np.array([1484226, 2175601, 1558755])
pop_a
## array([1484226, 2175601, 1558755])


# Dictionaries
yy = {'city': ['Munich', 'Paris', 'Amsterdam'], 
      'dist': [584, 1054, 653],
      'pop': [1484226, 2175601, 1558755],
      'area': [310.43, 105.4, 219.32], 
      'country': ['DE', 'FR', 'NL']}
yy
## {'city': ['Munich', 'Paris', 'Amsterdam'], 'dist': [584, 1054, 653], 'pop': [1484226, 2175601, 1558755], 'area': [310.43, 105.4, 219.32], 'country': ['DE', 'FR', 'NL']}
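
A closer counterpart to the named R list cities_list is a dictionary keyed by city. The sketch below reuses the same figures; the name cities_dict is hypothetical. Unlike an R list, a dict is accessed by key only, not by position.

```python
cities_dict = {
    'Munich': {'dist': 584, 'pop': 1484226, 'area': 310.43, 'country': 'DE'},
    'Paris': {'dist': 1054, 'pop': 2175601, 'area': 105.4, 'country': 'FR'},
    'Amsterdam': {'dist': 653, 'pop': 1558755, 'area': 219.32, 'country': 'NL'},
}

# Access by key, similar to cities_list$Munich or cities_list[["Munich"]]
print(cities_dict['Munich'])

# Nested access
print(cities_dict['Paris']['pop'])    # 2175601

# Keys, similar to names(cities_list)
print(list(cities_dict))              # ['Munich', 'Paris', 'Amsterdam']

# .get() returns None for a missing key instead of raising KeyError
print(cities_dict.get('Berlin'))      # None
```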

9.3 Data frames

# class data.frame from vectors
cities_df <- data.frame(city = c("Munich", "Paris", "Amsterdam"),
                    dist = c(584, 1054, 653),
                    pop = c(1484226, 2175601, 1558755),
                    area = c(310.43, 105.4, 219.32), 
                    country = c("DE", "FR", "NL"))

cities_df
##        city dist     pop   area country
## 1    Munich  584 1484226 310.43      DE
## 2     Paris 1054 2175601 105.40      FR
## 3 Amsterdam  653 1558755 219.32      NL

# class pandas.DataFrame
import pandas as pd

# From scratch

# From a dictionary, yy
yy_df = pd.DataFrame(yy)
yy_df
##         city  dist      pop    area country
## 0     Munich   584  1484226  310.43      DE
## 1      Paris  1054  2175601  105.40      FR
## 2  Amsterdam   653  1558755  219.32      NL

# From lists
# Column names
list_names = ['city', 'dist', 'pop', 'area', 'country']

# Columns are a list of lists
list_cols = [city_l, dist_l, pop_l, area_l, country_l]
list_cols
## [['Munich', 'Paris', 'Amsterdam'], [584, 1054, 653], [1484226, 2175601, 1558755], [310.43, 105.4, 219.32], ['DE', 'FR', 'NL']]

# A zipped list of (name, column) tuples
# The name comes first so that it can serve as the dictionary key
zip_list = list(zip(list_names, list_cols))
zip_list
## [('city', ['Munich', 'Paris', 'Amsterdam']), ('dist', [584, 1054, 653]), ('pop', [1484226, 2175601, 1558755]), ('area', [310.43, 105.4, 219.32]), ('country', ['DE', 'FR', 'NL'])]

# Convert to a dictionary, then to a DataFrame
zip_dict = dict(zip_list)
zip_df = pd.DataFrame(zip_dict)
zip_df

Easier: build the DataFrame from a list of rows.

# Initialize a list of lists, one inner list per row
list_rows = [['Munich', 584, 1484226, 310.43, 'DE'],
             ['Paris', 1054, 2175601, 105.40, 'FR'],
             ['Amsterdam', 653, 1558755, 219.32, 'NL']]

# Create the pandas DataFrame, reusing list_names as column labels
df = pd.DataFrame(list_rows, columns = list_names)

# Print the DataFrame
df
##         city  dist      pop    area country
## 0     Munich   584  1484226  310.43      DE
## 1      Paris  1054  2175601  105.40      FR
## 2  Amsterdam   653  1558755  219.32      NL

9.4 Multi-dimensional arrays

# array
arr_r <- array(c(1:4,
                 seq(10, 40, 10),
                 seq(100, 400, 100)), 
               dim = c(2,2,3) )

arr_r
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]   10   30
## [2,]   20   40
## 
## , , 3
## 
##      [,1] [,2]
## [1,]  100  300
## [2,]  200  400
rowSums(arr_r, dims = 2)
##      [,1] [,2]
## [1,]  111  333
## [2,]  222  444
rowSums(arr_r, dims = 1)
## [1] 444 666
colSums(arr_r, dims = 1)
##      [,1] [,2] [,3]
## [1,]    3   30  300
## [2,]    7   70  700
colSums(arr_r, dims = 2)
## [1]   10  100 1000



arr = np.array([[[ 1,  2], 
                 [ 3,  4]],
                [[ 10, 20],
                 [30, 40]],
                [[100, 200],
                 [300, 400]]])
arr
## array([[[  1,   2],
##         [  3,   4]],
## 
##        [[ 10,  20],
##         [ 30,  40]],
## 
##        [[100, 200],
##         [300, 400]]])
arr.sum(axis=0)
## array([[111, 222],
##        [333, 444]])
arr.sum(axis=1)
## array([[  4,   6],
##        [ 40,  60],
##        [400, 600]])
arr.sum(axis=2)
## array([[  3,   7],
##        [ 30,  70],
##        [300, 700]])

10 Writing Functions

n.b. Move up to the main functions section and include lambda and anonymous functions here, and possibly map().

mathFun_R <- function(x,y) {
  c(x + y, x - y)
}

# explicit return
mathFun_R <- function(x,y) {
  a <- c(x + y, x - y)
  return(a)
}


def mathFun_Py(x, y):
    """Return the sum and the difference of two numbers."""
    result = (x + y, x - y)

    return result
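
Anonymous functions and map() fit naturally here. The following is a small sketch; math_fun is a hypothetical snake case variant of the function above, and times_ten is likewise only illustrative.

```python
def math_fun(x, y):
    """Return the sum and the difference of two numbers as a tuple."""
    return x + y, x - y


# A returned tuple can be unpacked into separate names
total, diff = math_fun(10, 4)
print(total, diff)     # 14 6

# An anonymous (lambda) function bound to a name
times_ten = lambda x: x * 10
print(times_ten(4))    # 40

# map() applies a function to each element; list() materializes the result
mapped = list(map(times_ten, [1, 2, 3]))
print(mapped)          # [10, 20, 30]

# The equivalent list comprehension is often preferred for readability
comprehended = [x * 10 for x in [1, 2, 3]]
print(comprehended)    # [10, 20, 30]
```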

11 Logical Expressions

11.1 Relational operators

Description R Operator Python Operator
Equivalency == ==
Non-equivalency != !=
Greater-than (or equal to) > (>=) > (>=)
Lesser-than (or equal to) < (<=) < (<=)
Negation !x not x

xx <- 1:10

xx == 6
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
xx != 6
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
xx >= 6
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
xx < 6
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

# Rel Op in Python
a = np.array([23, 6, 7, 9, 12])
a > 10
## array([ True, False, False, False,  True])

11.2 Logical operators

Description R Operator Python Operator
AND &, && &, and
OR |, || |, or
WITHIN y %in% x in, not in
identity identical() is, is not

xx <- 1:6

# tails of a distribution
xx < 3 | xx > 4 
## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE
# Range in a distribution
xx >= 3 & xx <= 4 
## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE


# Logical operators in Python
x = list(range(6))

# Tails of a distribution
# `or` compares single values, so test each element in turn
[i for i in x if i < 3 or i > 4]
## [0, 1, 2, 5]

# Range in a distribution
# `and` also compares single values
[i for i in x if i >= 3 and i <= 4]
## [3, 4]
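
On NumPy arrays, the element-wise operators & and | from the table are the closer match to R's vectorized & and |. Writing `x < 3 or x > 4` on an array raises a ValueError, because the truth value of a multi-element array is ambiguous. A minimal sketch, assuming NumPy is installed; the parentheses are required because & and | bind more tightly than the comparisons.

```python
import numpy as np

x = np.arange(6)

# Tails of a distribution: element-wise OR
tails = x[(x < 3) | (x > 4)]

# Range in a distribution: element-wise AND
middle = x[(x >= 3) & (x <= 4)]

print(tails.tolist())     # [0, 1, 2, 5]
print(middle.tolist())    # [3, 4]
```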

11.3 Identity

n.b. add python a.any() and a.all() and R any() and all().

x <- c("Caracas", "Bogotá", "Quito")
y <- c("Bern", "Berlin", "Brussels")
z <- c("Caracas", "Bogotá", "Quito")

# Are the objects identical?
identical(x, y)
## [1] FALSE
identical(x, z)
## [1] TRUE
# Is any TRUE
any(x == "Quito")
## [1] TRUE
# Are all TRUE (str_detect() is from the stringr package)
all(stringr::str_detect(y, "^B"))
## [1] TRUE

x = ['Caracas', 'Bogotá', 'Quito']
y = ['Bern', 'Berlin', 'Brussels']
z = ['Caracas', 'Bogotá', 'Quito']

x == y
## False
x == z
## True
# Is any True
import numpy as np
x = np.array(x)
np.any(x == "Caracas")
## True
# Are all True

np.all(x == "Caracas")
## False
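
The same checks can be written without NumPy, using the built-in any() and all() and the membership operators. The sketch also contrasts equality (==) with identity (is), which are easy to confuse.

```python
x = ['Caracas', 'Bogotá', 'Quito']
y = ['Bern', 'Berlin', 'Brussels']
z = ['Caracas', 'Bogotá', 'Quito']

# Equal contents, but two distinct objects in memory
print(x == z)    # True
print(x is z)    # False

# Is any element equal to 'Quito'?
any_quito = any(city == 'Quito' for city in x)
print(any_quito)     # True

# Do all elements start with 'B'?
all_b = all(city.startswith('B') for city in y)
print(all_b)         # True

# Membership tests
print('Quito' in x)      # True
print('Bern' not in x)   # True
```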

12 Indexing

Dimensions Use Description
1 x[index] Isolate contents, keep container
1 x[-index] Remove the indexed item(s), keep container
1 x[[index]] Extract one content, discard container
2 x[row_index, col_index] Isolate contents, keep container
2 x[col_index] Short-cut for columns
2 x[[index]] Extract one content, discard container
n x[row_index, col_index, dim_index] Isolate contents, keep container

Where index, row_index, col_index and dim_index are vectors of type integer, character or logical.

Dimensions Use Description
1 x[index] Isolate contents, keep container
1 x[-index] Isolate contents from reverse direction, keep container
1 x[index_1:index_2] Slice
1 x[index_1:index_2:stride] Slice with an interval
1 x[index_1:index_2:-1] Slice with reversal
2 x.loc[index_1:index_2] Label-based selection (end label included)
2 x.iloc[index_1:index_2:stride] Position-based selection (end position excluded)

12.1 1-dimensional

xx <- LETTERS[6:16]
xx[4]
## [1] "I"
xx[[4]]
## [1] "I"
cities_list[2]
## $Paris
##   dist     pop  area country
## 1 1054 2175601 105.4      FR

cities = ['Toronto', 'Santiago', 'Berlin', 'Singapore', 'Kampala', 'New Delhi']

cities[0]
## 'Toronto'
cities[-1]
## 'New Delhi'
cities[1:2]
## ['Santiago']
cities[:2]
## ['Toronto', 'Santiago']
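
The slice forms with a stride or a reversal, listed in the Python indexing table, are sketched below on the same list. Indexing starts at 0 and the end of a slice is excluded.

```python
cities = ['Toronto', 'Santiago', 'Berlin', 'Singapore', 'Kampala', 'New Delhi']

# Slice: positions 1 up to, but not including, 4
middle = cities[1:4]
print(middle)        # ['Santiago', 'Berlin', 'Singapore']

# Slice with an interval: every second element
every_second = cities[::2]
print(every_second)  # ['Toronto', 'Berlin', 'Kampala']

# Slice with reversal: a negative stride walks backwards
backwards = cities[4:1:-1]
print(backwards)     # ['Kampala', 'Singapore', 'Berlin']

# Negative start: the last two elements
last_two = cities[-2:]
print(last_two)      # ['Kampala', 'New Delhi']
```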

12.2 2-dimensional

# class data.frame from vectors
cities_df <- data.frame(city = c("Munich", "Paris", "Amsterdam"),
                    dist = c(584, 1054, 653),
                    pop = c(1484226, 2175601, 1558755),
                    area = c(310.43, 105.4, 219.32), 
                    country = c("DE", "FR", "NL"))

cities_df[2] # Data.frame
##   dist
## 1  584
## 2 1054
## 3  653
cities_df[,2]  # vector
## [1]  584 1054  653
cities_df[[2]] # vector
## [1]  584 1054  653
cities_df[2:3] # Data.frame
##   dist     pop
## 1  584 1484226
## 2 1054 2175601
## 3  653 1558755
cities_df[,2:3]  # dataframe
##   dist     pop
## 1  584 1484226
## 2 1054 2175601
## 3  653 1558755
# tibble() requires the tibble package
library(tibble)
cities_tbl <- tibble(city = c("Munich", "Paris", "Amsterdam"),
                     dist = c(584, 1054, 653),
                     pop = c(1484226, 2175601, 1558755),
                     area = c(310.43, 105.4, 219.32), 
                     country = c("DE", "FR", "NL"))


cities_tbl[2]  # data frame
## # A tibble: 3 x 1
##    dist
##   <dbl>
## 1   584
## 2  1054
## 3   653
cities_tbl[,2]  # data frame
## # A tibble: 3 x 1
##    dist
##   <dbl>
## 1   584
## 2  1054
## 3   653
cities_tbl[[2]] # vector
## [1]  584 1054  653
cities_tbl[2:3]  # data frame
## # A tibble: 3 x 2
##    dist     pop
##   <dbl>   <dbl>
## 1   584 1484226
## 2  1054 2175601
## 3   653 1558755
cities_tbl[,2:3]  # data frame
## # A tibble: 3 x 2
##    dist     pop
##   <dbl>   <dbl>
## 1   584 1484226
## 2  1054 2175601
## 3   653 1558755

df
##         city  dist      pop    area country
## 0     Munich   584  1484226  310.43      DE
## 1      Paris  1054  2175601  105.40      FR
## 2  Amsterdam   653  1558755  219.32      NL
df[1:]
##         city  dist      pop    area country
## 1      Paris  1054  2175601  105.40      FR
## 2  Amsterdam   653  1558755  219.32      NL
# position
df.iloc[0, 1]
## 584

df.iat[0, 1]
## 584

# label
df.loc[1:,  'city']
## 1        Paris
## 2    Amsterdam
## Name: city, dtype: object
data = {'Country': ['Belgium',  'India',  'Brazil'],
        'Capital': ['Brussels',  'New Delhi',  'Brasilia'],
        'Population': [11190846, 1303171035, 207847528]}

df_2 = pd.DataFrame(data,columns=['Country',  'Capital',  'Population'])

df_2
##    Country    Capital  Population
## 0  Belgium   Brussels    11190846
## 1    India  New Delhi  1303171035
## 2   Brazil   Brasilia   207847528

12.3 n-dimensional

cities_array <- c(1:16)
dim(cities_array) <- c(4,2,2)
cities_array
## , , 1
## 
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    9   13
## [2,]   10   14
## [3,]   11   15
## [4,]   12   16
cities_array[1,2,2]
## [1] 13
cities_array[1,2,]
## [1]  5 13
cities_array[,2,1]
## [1] 5 6 7 8

# Python n-dimensional indexing
arr
## array([[[  1,   2],
##         [  3,   4]],
## 
##        [[ 10,  20],
##         [ 30,  40]],
## 
##        [[100, 200],
##         [300, 400]]])
arr[1,1,1]
## 40
arr[:,1,1]
## array([  4,  40, 400])
arr[1,:,1]
## array([20, 40])
arr[1,1,:]
## array([30, 40])