The following dictionary accompanies the publication of the upcoming O’Reilly title Python & R for the Modern Data Scientist by Rick J Scavetta & Boyan Angelov. It is meant to be used as a quick reference for translating commands between Python & R. Visit the book’s repo for access to other resources.
Corrections & additions are welcome. Please contact Rick or open an issue on the repo. For a downloadable summary, please visit the book’s website at https://moderndata.design/
Aside from some command line expressions to be entered in the terminal, which are explicitly noted, expressions are in Python (right side) or R (left side).
install.packages("tidyverse")
# Command line
pip install pandas
devtools::install_version(
"ggmap",
version = "3.5.2"
)
# Command line
pip install pandas==1.1.0
install.packages(c("sf", "ggmap"))
# Command line
pip install pandas scikit-learn seaborn
Write a list of all packages (and versions) in use to requirements.txt.
# Command line
pip freeze > requirements.txt
Use requirements.txt as input to install packages in a new environment.
# Command line
pip install -r requirements.txt
# Multiple calls to library()
library(MASS)
library(nlme)
library(psych)
library(sf)
# Install if not already available:
if (!require(readr)) {
install.packages("readr")
library(readr)
}
# Check, install if necessary, and load single or multiple packages:
pacman::p_load(MASS, nlme, psych, sf)
# Full package
import math
from sklearn import * # Less recommended, see below
# Full package with alias
import pandas as pd
# Module
from sklearn import datasets
# Module with alias
import statsmodels.api as sm
# Function
from statsmodels.formula.api import ols # For ordinary least squares regression
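The import styles above differ only in the name you end up typing. As a minimal sketch, using the standard-library math module so that nothing needs to be installed, the three calls below reach the same function under different names:

```python
# Three ways to reach the same function in the standard-library math module
import math              # full package: call as math.sqrt()
import math as m         # full package with an alias: call as m.sqrt()
from math import sqrt    # single function: call as sqrt()

root_full = math.sqrt(16)
root_alias = m.sqrt(16)
root_direct = sqrt(16)

# All three names refer to one and the same function object
same_function = math.sqrt is m.sqrt is sqrt

print(root_full, root_alias, root_direct, same_function)
```

Keeping the package name or alias in front of the call makes it obvious where a function comes from, which is one reason `from package import *` is less recommended.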
Definitions:
Operator | Direction | Environment | Name | Comment |
---|---|---|---|---|
<- | RHS to LHS | Current | Assignment operator (leftwards) | Preferred: common and unambiguous. |
= | RHS to LHS | Current | Assignment operator (leftwards) | Less preferred. Common, but easily confused with == (equivalency) and = (assign to function argument). No corollary super assignment. |
-> | LHS to RHS | Current | Assignment operator (rightwards) | Less preferred. Uncommon, easily overlooked and unexpected. Often used at the end of a long dplyr/tidyverse chain of functions; choose %<>% instead. |
Operator | Direction | Environment | Name | Comment |
---|---|---|---|---|
= | RHS to LHS | Current | Simple assignment operator | Preferred. Use following environment scoping rules. |
Operator | Direction | Environment | Name | Comment |
---|---|---|---|---|
<<- | RHS to LHS | Parent | Super assignment operator (leftwards) | Common. Use following environment scoping rules. |
->> | LHS to RHS | Parent | Super assignment operator (rightwards) | Less common. |
n.b. How thorough do we want to be here?
These operators are particularly preferred when using a dplyr/tidyverse chain of functions.
Operator | Direction | Environment | Name | Comment |
---|---|---|---|---|
%>% | LHS to RHS | Current | Pipe | Assign to the first argument of the downstream function, magrittr package. |
%$% | LHS to RHS | Current | Exposition pipe | Expose the named elements to the downstream function, magrittr package. |
%<>% | RHS to LHS | Current | Assignment pipe | Assign to the first argument of the downstream function and assign output in situ, magrittr package. |
%<-% | RHS to LHS | Current | Multiple assign | Assign to multiple objects, zeallot package. |
Operator | Direction | Environment | Name | Comment |
---|---|---|---|---|
+= | RHS to LHS | Current | Increment assignment | Adds a value to the variable and assigns the result to that variable. |
-= | RHS to LHS | Current | Decrement assignment | Subtracts a value from the variable and assigns the result to that variable. |
*= | RHS to LHS | Current | Multiplication assignment | Multiplies the variable by a value and assigns the result to that variable. |
/= | RHS to LHS | Current | Division assignment | Divides the variable by a value and assigns the result to that variable. |
**= | RHS to LHS | Current | Power assignment | Raises the variable to a specified power and assigns the result to the variable. |
%= | RHS to LHS | Current | Modulus assignment | Computes the modulus of the variable and a value and assigns the result to that variable. |
//= | RHS to LHS | Current | Floor division assignment | Floor divides the variable by a value and assigns the result to that variable. |
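The augmented assignment operators in the table can be followed step by step. The sketch below applies each one in turn to a single variable; the name `total` and the list `steps` are only illustrative.

```python
# Apply each augmented assignment operator in turn and record the result
steps = []

total = 10
total += 5      # 10 + 5
steps.append(total)
total -= 3      # 15 - 3
steps.append(total)
total *= 2      # 12 * 2
steps.append(total)
total **= 2     # 24 squared
steps.append(total)
total %= 100    # remainder of 576 / 100
steps.append(total)
total //= 5     # floor of 76 / 5
steps.append(total)
total /= 4      # true division always returns a float
steps.append(total)

print(steps)    # [15, 12, 24, 576, 76, 15, 3.75]
```

Base R has no direct counterpart to these operators; the usual form there is the explicit `x <- x + 5`.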
The four most common user-defined types.
Type | Data frame shorthand | Tibble shorthand | Description | Example |
---|---|---|---|---|
Logical | logi | <lgl> | Binary data | TRUE/FALSE, T/F, 1/0 |
Integer | int | <int> | Whole numbers from [-\(\infty\), \(\infty\)] | 7, 9, 2, -4 |
Double | num | <dbl> | Real numbers from [-\(\infty\), \(\infty\)] | 3.14, 2.78, 6.45 |
Character | chr | <chr> | All alpha-numeric characters, including white spaces | "Apple", "Dog" |
Type | Base shorthand | Pandas shorthand | Description | Example |
---|---|---|---|---|
Boolean | bool | bool | Binary data | True/False |
Integer | int | int | Whole numbers from [-\(\infty\), \(\infty\)] | 7, 9, 2, -4 |
Float | float | float | Real numbers from [-\(\infty\), \(\infty\)] | 3.14, 2.78, 6.45 |
String | str | object | All alpha-numeric characters, including white spaces | 'Apple', 'Dog' |
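The base shorthand in the Python table is what `type()` reports. A minimal sketch, using example values taken from the table (the variable names are only illustrative):

```python
# One example value per type from the table above
examples = [True, 7, 3.14, 'Apple']

# type(value).__name__ gives the base shorthand as a string
type_names = [type(value).__name__ for value in examples]
print(type_names)   # ['bool', 'int', 'float', 'str']

# bool is a subclass of int, so True counts as 1 in arithmetic,
# much like TRUE does in R
true_is_int = isinstance(True, int)
true_sum = True + True
print(true_is_int, true_sum)
```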
Description | R Operator | Python Operator |
---|---|---|
Addition | + | + |
Subtraction | - | - |
Multiplication | * | * |
Division (float) | / | / |
Exponentiation | ^ or ** | ** |
Integer Division (floor) | %/% | // |
Modulus | %% | % |
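Two rows of the table tend to surprise R users in Python. Exponentiation is only `**`, because `^` is bitwise XOR in Python, and floor division rounds towards negative infinity, so the modulus takes the sign of the divisor. To my knowledge R's `%/%` and `%%` follow the same convention for these inputs. A short sketch:

```python
# Division, floor division and modulus on a positive value
quotient = 7 / 2        # 3.5
floor_pos = 7 // 2      # 3
mod_pos = 7 % 2         # 1

# Floor division rounds towards negative infinity,
# so the modulus takes the sign of the divisor
floor_neg = -7 // 2     # -4
mod_neg = -7 % 2        # 1

# ** is exponentiation; ^ is bitwise XOR, not a power operator
power = 2 ** 3          # 8
xor = 2 ^ 3             # 1

print(quotient, floor_pos, mod_pos, floor_neg, mod_neg, power, xor)
```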
n.b. Focus more on listing than setting new attributes.
# List attributes
attributes(df)
# Accessor functions
dim(df)
names(df)
class(df)
comment(df)
# Add comment
comment(df) <- "new info"
# Add custom attribute
attr(df, "custom") <- "alt info"
attributes(df)$custom
# Definition of a class
class Food:
    name = 'toast'
# An instance of a class
breakfast = Food()
# An attribute of the class
# inherited by the instance
breakfast.name
# Setting an attribute
breakfast.name = 'muesli'
# setattr(breakfast, 'name', 'muesli')
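To list attributes in Python, roughly what `attributes()` does in R, the built-ins `vars()`, `getattr()`, `setattr()` and `hasattr()` are the usual tools. The sketch below reuses the `Food` class from above; the attribute name `price` is hypothetical and only there to show a missing attribute.

```python
class Food:
    name = 'toast'      # class attribute, shared by all instances

breakfast = Food()

# Read an attribute by name; the instance inherits it from the class
inherited = getattr(breakfast, 'name')

# vars() lists what is stored on the instance itself: nothing yet
instance_before = dict(vars(breakfast))

# Setting the attribute stores a value on the instance only
setattr(breakfast, 'name', 'muesli')
instance_after = dict(vars(breakfast))

# The class attribute is unchanged
class_value = Food.name

# hasattr() checks for an attribute without raising an error
has_price = hasattr(breakfast, 'price')

print(inherited, instance_before, instance_after, class_value, has_price)
```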
?reserved
if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_, ... (..1, ..2, etc.)
# py Keywords
import keyword
print(keyword.kwlist)
## ['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']
# Basic definition
myFunc <- function (x, ...) {
x * 10
}
myFunc(4)
## [1] 40
# Multiple unnamed arguments
myFunc <- function (...) {
sum(...)
}
myFunc(100,40,60)
## [1] 200
# Simple definition
def my_func(x):
    return x * 10
my_func(4)
## 40
# Multiple unnamed (positional) arguments, collected as a tuple
def my_func(*x):
    return x[2]
my_func(100, 40, 60)
## 60
# Multiple named (keyword) arguments, collected as a dict
def my_func(**numb):
    print("x:", numb["x"])
    print("y:", numb["y"])
my_func(x = 40, y = 100)
## x: 40
## y: 100
# Using doc strings
def my_func(**numb):
    """An example function
    that takes multiple unknown arguments.
    """
    print("x:", numb["x"])
    print("y:", numb["y"])
# Access doc strings with dunder
my_func.__doc__
## 'An example function
## that takes multiple unknown arguments.'
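The `*` and `**` forms can be combined in one definition, and the same symbols unpack a list or dict at the call site. This is a minimal sketch; the function `describe` and its parameter `scale` are hypothetical names chosen for illustration.

```python
def describe(first, *args, scale=1, **kwargs):
    """Return how each argument was received."""
    return {'first': first * scale, 'args': args, 'kwargs': kwargs}

# 4 is bound to first, the other positional values land in args,
# scale is matched by name and the remaining names land in kwargs
result = describe(4, 100, 40, scale=10, x=40, y=100)
print(result)

# * unpacks a list into positional arguments,
# ** unpacks a dict into keyword arguments
values = [4, 100, 40]
options = {'scale': 10, 'x': 40}
unpacked = describe(*values, **options)
print(unpacked)
```

The closest R analogue is `...`, which captures both unnamed and named extra arguments in a single construct.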
Style in R is generally more loosely defined than in Python. Nonetheless, see the Advanced R Style guide or Google’s R Style guide for suggestions.
Indentation and Spacing:
White space is generally for style and inconsequential to execution. Add a space around operators, and use a tab to indent on successive lines of long commands.
Naming in a script:
The trend is currently towards lowercase snake case: underscores ("_") between words and only lower case letters.
my_data <- 1:6
See PEP8 style guide.
Indentation and Spacing:
White space, in particular indentation, is part of Python execution. Use 4 spaces instead of a tab (this can be set in your text editor).
Naming in a script:
Type | Style | Example |
---|---|---|
Functions & variables | lowercase snake case | func , my_func , var , my_var , x |
When defining classes:
Type | Style | Example |
---|---|---|
Class | Capitalized camel case | Recipe , MyClass |
Method | lowercase snake case | class_method , method |
Constant | Full uppercase snake case | CONS , MY_CONS , LONG_NAME_CONSTANT |
In packages:
Type | Style | Example |
---|---|---|
Packages & module | lowercase snake case | mypackage , module.py , my_module.py |
Naming conventions with _:
Naming | Meaning |
---|---|
_var | A convention used to show that a variable is meant for internal use within a function or method |
var_ | A convention used to avoid naming conflicts with Python keywords |
__var | Triggers name mangling when used in a class context to prevent inheritance collisions. Enforced by the Python interpreter |
__var__ | Dunder (“double underscore”) variables. Special methods defined by the Python language. Avoid this naming scheme for your own attributes |
_ | Naming a temporary or insignificant variable, e.g. in a for loop |
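The underscore conventions in the table are easiest to see side by side. In this sketch the class `Recipe` and its attribute names are hypothetical; only the double leading underscore is enforced by the interpreter, the rest are conventions.

```python
import keyword

class Recipe:
    def __init__(self):
        self._notes = 'internal'      # single leading underscore: convention only
        self.__secret = 'mangled'     # double leading underscore: name mangling

recipe = Recipe()

# The mangled attribute is stored under _Recipe__secret
mangled_names = [name for name in vars(recipe) if 'secret' in name]

# A trailing underscore avoids a clash with the keyword `class`
is_reserved = keyword.iskeyword('class')
class_ = 'Dessert'

# A lone underscore marks a loop variable that is never used
count = 0
for _ in range(3):
    count += 1

print(mangled_names, is_reserved, class_, count)
```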
R structure | Analogous Python Structure(s) |
---|---|
Vector (1-dimensional homogeneous) | ndarray, but also scalars, homogeneous list & tuple |
Vector, matrix or array (homogeneous) | NumPy n-dimensional array (ndarray) |
Unnamed list (heterogeneous) | list |
Named list (heterogeneous) | Dictionary dict, but without access by position (insertion order is kept since Python 3.7) |
Environment (named, but unordered elements) | Dictionary dict |
Variable/column in a data.frame | Pandas Series (pd.Series) |
2-dimensional data.frame | Pandas DataFrame (pd.DataFrame) |
Python Structure | Analogous R Structure(s) |
---|---|
scalar | 1-element long vector |
list (homogeneous) | Vector, but as if lacking vectorization |
list (heterogeneous) | Unnamed list |
tuple (immutable, homogeneous) | Vector, list as separated output from a function |
tuple (immutable, heterogeneous) | Vector, list as separated output from a function |
Dictionary dict, a key-value pair | Named list or, better, environment |
NumPy n-dimensional array (ndarray) | Vector, matrix or array |
Pandas Series (pd.Series) | Vector, variable/column in a data.frame |
Pandas DataFrame (pd.DataFrame) | 2-dimensional data.frame |
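A few of the caveats in these tables can be checked directly in base Python. The sketch below shows what "lacking vectorization" means for a list, that a tuple is immutable, and that a dict keeps insertion order; all names are illustrative.

```python
dist = [584, 1054, 653]

# A list is not vectorized: * repeats the list instead of multiplying
repeated = dist * 2

# Element-wise arithmetic needs a comprehension (or a NumPy array)
doubled = [d * 2 for d in dist]

# A tuple is immutable: item assignment raises a TypeError
point = (584, 1054)
try:
    point[0] = 0
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# A dict is accessed by key and keeps insertion order (Python 3.7+)
city = {'name': 'Munich', 'dist': 584}
keys_in_order = list(city)

print(repeated, doubled, tuple_is_mutable, keys_in_order)
```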
# Vectors
cities_R <- c("Munich", "Paris", "Amsterdam")
dist_R <- c(584, 1054, 653)
# Lists
cities = ['Munich', 'Paris', 'Amsterdam']
dist = [584, 1054, 653]
Key-value pairs. Lists in R. Dictionaries in Python
# A list of data frames
cities_list <- list(Munich = data.frame(dist = 584,
pop = 1484226,
area = 310.43,
country = "DE"),
Paris = data.frame(dist = 1054,
pop = 2175601,
area = 105.4,
country = "FR"),
Amsterdam = data.frame(dist = 653,
pop = 1558755,
area = 219.32,
country = "NL"))
# As a list object
cities_list[1]
## $Munich
## dist pop area country
## 1 584 1484226 310.43 DE
cities_list["Munich"]
## $Munich
## dist pop area country
## 1 584 1484226 310.43 DE
# As a data.frame object
cities_list[[1]]
## dist pop area country
## 1 584 1484226 310.43 DE
cities_list$Munich
## dist pop area country
## 1 584 1484226 310.43 DE
# A list of heterogeneous data
lm_list <- lm(weight ~ group, data = PlantGrowth)
# length(lm_list)
# names(lm_list)
# lists
city_l = ['Munich', 'Paris', 'Amsterdam']
dist_l = [584, 1054, 653]
pop_l = [1484226, 2175601, 1558755]
area_l = [310.43, 105.4, 219.32]
country_l = ['DE', 'FR', 'NL']
import numpy as np
# Numpy arrays
city_a = np.array(['Munich', 'Paris', 'Amsterdam'])
city_a
## array(['Munich', 'Paris', 'Amsterdam'], dtype='<U9')
pop_a = np.array([1484226, 2175601, 1558755])
pop_a
## array([1484226, 2175601, 1558755])
# Dictionaries
yy = {'city': ['Munich', 'Paris', 'Amsterdam'],
'dist': [584, 1054, 653],
'pop': [1484226, 2175601, 1558755],
'area': [310.43, 105.4, 219.32],
'country': ['DE', 'FR', 'NL']}
yy
## {'city': ['Munich', 'Paris', 'Amsterdam'], 'dist': [584, 1054, 653], 'pop': [1484226, 2175601, 1558755], 'area': [310.43, 105.4, 219.32], 'country': ['DE', 'FR', 'NL']}
# class data.frame from vectors
cities_df <- data.frame(city = c("Munich", "Paris", "Amsterdam"),
dist = c(584, 1054, 653),
pop = c(1484226, 2175601, 1558755),
area = c(310.43, 105.4, 219.32),
country = c("DE", "FR", "NL"))
cities_df
## city dist pop area country
## 1 Munich 584 1484226 310.43 DE
## 2 Paris 1054 2175601 105.40 FR
## 3 Amsterdam 653 1558755 219.32 NL
# class pandas.DataFrame
import pandas as pd
# From scratch
# From a dictionary, yy
yy_df = pd.DataFrame(yy)
yy_df
## city dist pop area country
## 0 Munich 584 1484226 310.43 DE
## 1 Paris 1054 2175601 105.40 FR
## 2 Amsterdam 653 1558755 219.32 NL
From lists
# From lists
# names
list_names = ['city', 'dist', 'pop', 'area', 'country']
# columns are a list of lists
list_cols = [city_l, dist_l, pop_l, area_l, country_l]
list_cols
## [['Munich', 'Paris', 'Amsterdam'], [584, 1054, 653], [1484226, 2175601, 1558755], [310.43, 105.4, 219.32], ['DE', 'FR', 'NL']]
# A zipped list of (name, column) tuples
# Names come first so that they can serve as dictionary keys
zip_list = list(zip(list_names, list_cols))
zip_list
## [('city', ['Munich', 'Paris', 'Amsterdam']), ('dist', [584, 1054, 653]), ('pop', [1484226, 2175601, 1558755]), ('area', [310.43, 105.4, 219.32]), ('country', ['DE', 'FR', 'NL'])]
zip_dict = dict(zip_list)
zip_df = pd.DataFrame(zip_dict)
zip_df
Easier
# Import pandas library
import pandas as pd
# initialize list of lists
list_rows = [['Munich', 584, 1484226, 310.43, 'DE'],
['Paris', 1054, 2175601, 105.40, 'FR'],
['Amsterdam', 653, 1558755, 219.32, 'NL']]
# Create the pandas DataFrame
df = pd.DataFrame(list_rows, columns = list_names)
# print dataframe.
df
## city dist pop area country
## 0 Munich 584 1484226 310.43 DE
## 1 Paris 1054 2175601 105.40 FR
## 2 Amsterdam 653 1558755 219.32 NL
# array
arr_r <- array(c(1:4,
seq(10, 40, 10),
seq(100, 400, 100)),
dim = c(2,2,3) )
arr_r
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 10 30
## [2,] 20 40
##
## , , 3
##
## [,1] [,2]
## [1,] 100 300
## [2,] 200 400
rowSums(arr_r, dims = 2)
## [,1] [,2]
## [1,] 111 333
## [2,] 222 444
rowSums(arr_r, dims = 1)
## [1] 444 666
colSums(arr_r, dims = 1)
## [,1] [,2] [,3]
## [1,] 3 30 300
## [2,] 7 70 700
colSums(arr_r, dims = 2)
## [1] 10 100 1000
arr = np.array([[[ 1, 2],
[ 3, 4]],
[[ 10, 20],
[30, 40]],
[[100, 200],
[300, 400]]])
arr
## array([[[ 1, 2],
## [ 3, 4]],
##
## [[ 10, 20],
## [ 30, 40]],
##
## [[100, 200],
## [300, 400]]])
arr.sum(axis=0)
## array([[111, 222],
## [333, 444]])
arr.sum(axis=1)
## array([[ 4, 6],
## [ 40, 60],
## [400, 600]])
arr.sum(axis=2)
## array([[ 3, 7],
## [ 30, 70],
## [300, 700]])
n.b. move up to the main functions section and … include lambda and anonymous functions here. and… possibly map()
mathFun_R <- function(x,y) {
c(x + y, x - y)
}
# explicit return
mathFun_R <- function(x,y) {
a <- c(x + y, x - y)
return(a)
}
def mathFun_Py(x, y):
    """Add and subtract two numbers together"""
    result = (x + y, x - y)
    return result
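Because the Python function returns a tuple, the two results can be unpacked into separate names at the call. The sketch below also shows the same body as an anonymous (lambda) function and applies the function pairwise with `map()`. The function is restated under the hypothetical name `math_fun` so that the block runs on its own.

```python
def math_fun(x, y):
    """Return the sum and the difference of two numbers."""
    return x + y, x - y

# Tuple unpacking: one name per returned value
total, difference = math_fun(10, 4)

# The same logic as an anonymous function bound to a name
math_lambda = lambda x, y: (x + y, x - y)
lambda_result = math_lambda(10, 4)

# map() applies the function to the two lists pairwise
pairs = list(map(math_fun, [1, 2, 3], [10, 20, 30]))

print(total, difference, lambda_result, pairs)
```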
Description | R Operator | Python Operator |
---|---|---|
Equivalency | == | == |
Non-equivalency | != | != |
Greater-than (or equal to) | > (>=) | > (>=) |
Lesser-than (or equal to) | < (<=) | < (<=) |
Negation | !x | not x |
xx <- 1:10
xx == 6
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
xx != 6
## [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
xx >= 6
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
xx < 6
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
# Rel Op in Python
a = np.array([23, 6, 7, 9, 12])
a > 10
## array([ True, False, False, False, True])
Description | R Operator | Python Operator |
---|---|---|
AND | & , && | & , and |
OR | \| , \|\| | \| , or |
WITHIN | y %in% x | in , not in |
Identity | identical() | is , is not |
xx <- 1:6
# tails of a distribution
xx < 3 | xx > 4
## [1] TRUE TRUE FALSE FALSE TRUE TRUE
# Range in a distribution
xx > 3 & xx < 4
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
# Log Op in Python
# x = range(6)
# x = [*x]
# x
# type(x)
import numpy as np
x = np.array(range(6))
# type(x)
# tails of a distribution
# x < 3 or x > 4 raises a ValueError on an array, so test element by element
[i for i in x if i < 3 or i > 4]
## [0, 1, 2, 5]
# Range in a distribution
# x >= 3 and x <= 4
[i for i in x if i >= 3 and i <= 4]
## [3, 4]
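The list comprehensions above work around the fact that `and` and `or` are not element-wise. On NumPy arrays the element-wise forms are `&` and `|`, which is closer to the R code. A minimal sketch, assuming NumPy is installed; the parentheses are required because `&` and `|` bind more tightly than the comparisons.

```python
import numpy as np

x = np.arange(6)    # array([0, 1, 2, 3, 4, 5])

# Tails of a distribution: element-wise OR gives a boolean mask
tails_mask = (x < 3) | (x > 4)
tails = x[tails_mask].tolist()

# Range in a distribution: element-wise AND
middle_mask = (x >= 3) & (x <= 4)
middle = x[middle_mask].tolist()

print(tails_mask.tolist(), tails, middle)
```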
n.b. add python a.any() and a.all() and R any() and all().
x <- c("Caracas", "Bogotá", "Quito")
y <- c("Bern", "Berlin", "Brussels")
z <- c("Caracas", "Bogotá", "Quito")
# Are the objects identical?
identical(x, y)
## [1] FALSE
identical(x, z)
## [1] TRUE
# Is any TRUE
any(x == "Quito")
## [1] TRUE
# Are all TRUE (str_detect() comes from the stringr package)
all(str_detect(y, "^B"))
## [1] TRUE
x = ['Caracas', 'Bogotá', 'Quito']
y = ['Bern', 'Berlin', 'Brussels']
z = ['Caracas', 'Bogotá', 'Quito']
x == y
## False
x == z
## True
# Is any True
import numpy as np
x = np.array(x)
np.any(x == "Caracas")
## True
# Are all True
np.all(x == "Caracas")
## False
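The NumPy calls above have built-in counterparts that work on plain lists, which mirrors `any()` and `all()` in R more directly. The sketch also separates `==` (equal values) from `is` (the very same object), since only the former corresponds to what `identical()` tests in the R example.

```python
x = ['Caracas', 'Bogotá', 'Quito']
y = ['Bern', 'Berlin', 'Brussels']

# Built-in any() and all() take any iterable of booleans
any_quito = any(city == 'Quito' for city in x)
all_start_b = all(city.startswith('B') for city in y)

# == compares values; is compares object identity
x_copy = x[:]                # a new list with the same contents
equal_values = x == x_copy
same_object = x is x_copy

print(any_quito, all_start_b, equal_values, same_object)
```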
Dimensions | Use | Description |
---|---|---|
1 | x[index] | Isolate contents, keep container |
1 | x[-index] | Remove the indexed item(s), keep container |
1 | x[[index]] | Extract one content, discard container |
2 | x[row_index, col_index] | Isolate contents, keep container |
2 | x[col_index] | Short-cut for columns |
2 | x[[index]] | Extract one content, discard container |
n | x[row_index, col_index, dim_index] | Isolate contents, keep container |
Where index, row_index, col_index and dim_index are vectors of type integer, character or logical.
Dimensions | Use | Description |
---|---|---|
1 | x[index] | Extract one element, counting from 0 |
1 | x[-index] | Extract one element, counting from the end |
1 | x[index_1:index_2] | Slice (index_2 is excluded), keep container |
1 | x[index_1:index_2:stride] | Slice with an interval |
1 | x[index_1:index_2:-1] | Slice with reversal |
2 | x.loc[index_1:index_2] | Label-based selection (pandas) |
2 | x.iloc[index_1:index_2:stride] | Integer position-based selection (pandas) |
xx <- LETTERS[6:16]
xx[4]
## [1] "I"
xx[[4]]
## [1] "I"
cities_list[2]
## $Paris
## dist pop area country
## 1 1054 2175601 105.4 FR
cities = ['Toronto', 'Santiago', 'Berlin', 'Singapore', 'Kampala', 'New Delhi']
cities[0]
## 'Toronto'
cities[-1]
## 'New Delhi'
cities[1:2]
## ['Santiago']
cities[:2]
## ['Toronto', 'Santiago']
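The stride and reversal forms from the indexing table can be tried on the same list. A short sketch; the result names are only illustrative.

```python
cities = ['Toronto', 'Santiago', 'Berlin', 'Singapore', 'Kampala', 'New Delhi']

# Slice with an interval: every second element
every_second = cities[::2]

# Slice with reversal: the whole list backwards
reversed_all = cities[::-1]

# With a negative stride the start index is the larger one;
# the stop index (1) is still excluded
middle_back = cities[4:1:-1]

# A negative start counts from the end
last_two = cities[-2:]

print(every_second, reversed_all, middle_back, last_two)
```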
# class data.frame from vectors
cities_df <- data.frame(city = c("Munich", "Paris", "Amsterdam"),
dist = c(584, 1054, 653),
pop = c(1484226, 2175601, 1558755),
area = c(310.43, 105.4, 219.32),
country = c("DE", "FR", "NL"))
cities_df[2] # Data.frame
## dist
## 1 584
## 2 1054
## 3 653
cities_df[,2] # vector
## [1] 584 1054 653
cities_df[[2]] # vector
## [1] 584 1054 653
cities_df[2:3] # Data.frame
## dist pop
## 1 584 1484226
## 2 1054 2175601
## 3 653 1558755
cities_df[,2:3] # dataframe
## dist pop
## 1 584 1484226
## 2 1054 2175601
## 3 653 1558755
cities_tbl <- tibble(city = c("Munich", "Paris", "Amsterdam"),
dist = c(584, 1054, 653),
pop = c(1484226, 2175601, 1558755),
area = c(310.43, 105.4, 219.32),
country = c("DE", "FR", "NL"))
cities_tbl[2] # data frame
## # A tibble: 3 x 1
## dist
## <dbl>
## 1 584
## 2 1054
## 3 653
cities_tbl[,2] # data frame
## # A tibble: 3 x 1
## dist
## <dbl>
## 1 584
## 2 1054
## 3 653
cities_tbl[[2]] # vector
## [1] 584 1054 653
cities_tbl[2:3] # data frame
## # A tibble: 3 x 2
## dist pop
## <dbl> <dbl>
## 1 584 1484226
## 2 1054 2175601
## 3 653 1558755
cities_tbl[,2:3] # data frame
## # A tibble: 3 x 2
## dist pop
## <dbl> <dbl>
## 1 584 1484226
## 2 1054 2175601
## 3 653 1558755
df
## city dist pop area country
## 0 Munich 584 1484226 310.43 DE
## 1 Paris 1054 2175601 105.40 FR
## 2 Amsterdam 653 1558755 219.32 NL
df[1:]
## city dist pop area country
## 1 Paris 1054 2175601 105.40 FR
## 2 Amsterdam 653 1558755 219.32 NL
# position
df.iloc[0, 1]
## 584
df.iat[0, 1]
## 584
# label
df.loc[1:, 'city']
## 1 Paris
## 2 Amsterdam
## Name: city, dtype: object
data = {'Country': ['Belgium', 'India', 'Brazil'],
'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
'Population': [11190846, 1303171035, 207847528]}
df_2 = pd.DataFrame(data,columns=['Country', 'Capital', 'Population'])
df_2
## Country Capital Population
## 0 Belgium Brussels 11190846
## 1 India New Delhi 1303171035
## 2 Brazil Brasilia 207847528
cities_array <- c(1:16)
dim(cities_array) <- c(4,2,2)
cities_array
## , , 1
##
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
##
## , , 2
##
## [,1] [,2]
## [1,] 9 13
## [2,] 10 14
## [3,] 11 15
## [4,] 12 16
cities_array[1,2,2]
## [1] 13
cities_array[1,2,]
## [1] 5 13
cities_array[,2,1]
## [1] 5 6 7 8
# Python n-dimensional indexing
arr
## array([[[ 1, 2],
## [ 3, 4]],
##
## [[ 10, 20],
## [ 30, 40]],
##
## [[100, 200],
## [300, 400]]])
arr[1,1,1]
## 40
arr[:,1,1]
## array([ 4, 40, 400])
arr[1,:,1]
## array([20, 40])
arr[1,1,:]
## array([30, 40])
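The R and Python arrays above are filled differently: R fills the first index fastest (column-major), whereas NumPy defaults to row-major order. As a sketch, assuming NumPy is installed, passing `order='F'` to `reshape` reproduces the R `cities_array`, so the same elements are reached once the indices are shifted from 1-based to 0-based.

```python
import numpy as np

# Same values and fill order as: cities_array <- 1:16; dim(cities_array) <- c(4,2,2)
cities_array = np.arange(1, 17).reshape((4, 2, 2), order='F')

# R: cities_array[1,2,2]  ->  Python: [0, 1, 1]
single = int(cities_array[0, 1, 1])

# R: cities_array[1,2,]   ->  Python: [0, 1, :]
across_last = cities_array[0, 1, :].tolist()

# R: cities_array[,2,1]   ->  Python: [:, 1, 0]
down_first = cities_array[:, 1, 0].tolist()

print(single, across_last, down_first)
```

The three results match the R output shown earlier for the same positions (13, then 5 and 13, then 5 to 8).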