If you're coming from R, then you're probably already familiar with coding and with integrated development environments. Overall, Python and R are more similar than they are different, especially when it comes to economics and data science.
There are, however, some fundamental differences between the two languages. The biggest difference between Python and R is that Python is a general purpose programming language, while R began as a statistical language. But you won't really notice this unless you're writing something that looks more like production-grade software. (And, despite being a general language, Python has really fantastic support for statistics.) A second difference is that Python is more object-oriented while R is more functional. These are two different programming paradigms. Actually, both languages do a bit of both and, again, you're unlikely to notice any difference most of the time.
The biggest practical difference I have found in an economics and data science context, though, is that Python has more of a flavour of the bazaar--there are lots of people, and you can find everything under the Sun, but it can be a little bit chaotic--while R has the feel of a curated garden--there is a chief gardener (RStudio) tending a smaller number of more beautiful things, but the garden has boundaries.
There's a list of differences at the end of this chapter, but there are a couple of important gotchas to be aware of up front. First, R uses vectors, arrays, etc., that are indexed from 1. Like C++, Python is numbered from zero, with, eg, `a[0]` as the first element. Second, `<-` is the preferred assignment operator in R but in Python it's `=` (and `<-` isn't used). In fact, in R `a<-5` assigns `a` the value `5`, while `a<-5` in Python would return `True` or `False` based on whether `a` was less than `-5` or not!
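Both gotchas are easy to check in an interactive Python session. The snippet below is a small illustration; the variable names are arbitrary and chosen only for this example.

```python
# Python lists are indexed from zero
a = [10, 20, 30]
first = a[0]    # the first element; R would write a[1]
last = a[-1]    # negative indices count from the end (in R, a[-1] drops the first element)

# `=` assigns; `<-` is not an operator in Python
x = 5
# x<-5 is parsed as x < (-5): a comparison, not an assignment
is_less = x<-5

print(first, last, is_less)  # 10 30 False
print(x)                     # 5: x was not reassigned
```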
If you are coming from R, you're likely familiar with dplyr for data analysis and ggplot2 for plotting. There are Python equivalents that have very similar syntax to these that you can use to help you to become productive quickly in Python--though these libraries are not so popular in Python. Here are the Python equivalents to those R libraries:
- dplyr: This book recommends learning at least some pandas for data analysis in Python because it is a package that is comprehensive and ubiquitous, and it's the most popular Python library that performs similar functions to dplyr. pandas also has unrivaled documentation. However, if you want something closer to dplyr in terms of philosophy and syntax, there are a host of options available to you:
  - a Python package that is very similar to dplyr is plydata. It was created to be consistent with the visualisation package plotnine (featured below).
  - another library that gets close to dplyr is datar. Because it uses a consistent framework for data piping under the hood called pipda, datar seems highly extensible too. It also integrates with plotnine for visualisation.
  - tidypolars combines the syntax of dplyr with the best-in-class speed of the Python package polars.
  - siuba is a port of dplyr, tidyr, and other R libraries for data analysis.
  - dfply is another attempt to replicate the syntax of dplyr, but it doesn't seem to be maintained.
  - pyjanitor builds a range of extra features on top of pandas. Two of the things it adds are likely to make anyone coming from dplyr feel a bit more at home: better support for method chaining, and some functions that have the same names as the ones in dplyr and do the same things.
- ggplot2: the most similar Python version of this library is plotnine, while the most popular Python library that performs similar functions is a combination of matplotlib and seaborn, which builds on matplotlib. I think either those two together or plotnine are good choices, though plotnine's documentation is not (yet) as good and it's certainly not as widely used. (It's gaining popularity though.)
- data.table: if you use this library instead of dplyr, have no fear as there's an almost identical library in Python called datatable. It's not nearly as popular in Python as data.table is in R, but it's a very high quality library.
- here: Lots of people switching from R to Python ask what the equivalent of the `here()` function is. The "best practice" answer is that you shouldn't need one! It's good practice to have your Visual Studio Code (or other IDE) console and interactive Python window automatically start within the directory of your project; that is, you should always be "here" automatically. In Visual Studio Code, you can ensure that the interactive window starts in the root directory of your project by setting "Jupyter: Notebook File Root" to `${workspaceFolder}` in the Settings menu. For the integrated command line, change "Terminal › Integrated: Cwd" to `${workspaceFolder}` too. If you still need a replacement for `here`, then the pyprojroot package has you covered.
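To give a feel for what a `here()`-style helper does, here is a minimal sketch that uses only the standard library. It is not how pyprojroot is implemented: `find_project_root` is a hypothetical function written for this example, and the marker files it looks for (`.git`, `pyproject.toml`) are an assumption about what sits at the top of your project.

```python
from pathlib import Path


def find_project_root(start=None, markers=(".git", "pyproject.toml")):
    """Walk upwards from `start` until a folder containing a marker is found.

    Hypothetical helper for illustration; the marker names are an assumption.
    """
    current = Path(start or Path.cwd()).resolve()
    for folder in [current, *current.parents]:
        if any((folder / marker).exists() for marker in markers):
            return folder
    # No marker found anywhere: fall back to the starting directory
    return current


root = find_project_root()
# Build paths relative to the project root, roughly like here("data", "input.csv") in R
data_path = root / "data" / "input.csv"
print(data_path)
```

If your working directory is already the project root, as recommended above, `Path("data") / "input.csv"` on its own does the same job.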
You can run a full R session from Python (if you already have R installed). Here's an example:
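```python
import rpy2.robjects as ro
from rpy2.robjects.packages import importr

# Load R's base package so that its functions can be called from Python
base = importr('base')
# Fit a linear model in the embedded R session using R's built-in mtcars data
fit_full = ro.r("lm('mpg ~ wt + cyl', data=mtcars)")
print(base.summary(fit_full))
```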
To install R packages, use this:
```python
from rpy2.robjects.packages import importr

# R's install.packages lives in utils; rpy2 exposes it as install_packages
utils = importr('utils')
utils.install_packages('packagename')
```
Here are some tables of translations between base R and Python code. For more, see hyperpolyglot.
R | Python |
---|---|
`new_function <- function(a, b=5) {` | `def new_function(a, b=5):` |
`for (val in c(1,3,5)){` | `for val in [1,3,5]:` |
`a <- c(1,3,5,7)` | `a = [1,3,5,7]` |
`a <- c(3:9)` | `a = list(range(3,10))` |
`class(a)` | `type(a)` |
`a <- 5` | `a = 5` |
`a^2` | `a**2` |
`a%%5` | `a%5` |
`a & b` | `a and b` |
`a \| b` | `a or b` |
`rev(a)` | `a[::-1]` |
`a %*% b` | `a @ b` |
`paste("one", "two", "three", sep="")` | `'one' + 'two' + 'three'` |
`substr("hello", 1, 4)` | `'hello'[:4]` |
`strsplit('foo,bar,baz', ',')` | `'foo,bar,baz'.split(',')` |
`paste(c('foo', 'bar', 'baz'), collapse=',')` | `','.join(['foo', 'bar', 'baz'])` |
`gsub("(^[\n\t ]+\|[\n\t ]+$)", "", " foo ")` | `' foo '.strip()` |
`sprintf("%10s", "lorem")` | `'lorem'.rjust(10)` |
`paste("value: ", toString("8"))` | `'value: ' + str(8)` |
`toupper("foo")` | `'foo'.upper()` |
`nchar("hello")` | `len('hello')` |
`substr("hello", 1, 1)` | `'hello'[0]` |
`a = rbind(c(1, 2, 3), c('a', 'b', 'c'))` | `a = zip([1, 2, 3], ['a', 'b', 'c'])` |
`d = list(n=10, avg=3.7, sd=0.4)` | `d = {'n': 10, 'avg': 3.7, 'sd': 0.4}` |
`quit()` | `exit()` |
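A few of the Python entries above can be checked directly in a session. This is a short illustrative script using only built-ins; the R equivalents appear in the comments, and the `assert` statements simply pass silently when the stated result holds.

```python
# Sequences: range() excludes its end point, so c(3:9) needs range(3, 10)
a = list(range(3, 10))
assert a == [3, 4, 5, 6, 7, 8, 9]
assert a[::-1][0] == 9        # rev(a) puts the last element first

# Arithmetic
assert 7 ** 2 == 49           # 7^2
assert 7 % 5 == 2             # 7 %% 5

# Strings
assert "hello"[:4] == "hell"                              # substr("hello", 1, 4)
assert "foo,bar,baz".split(",") == ["foo", "bar", "baz"]  # strsplit
assert ",".join(["foo", "bar", "baz"]) == "foo,bar,baz"   # paste(..., collapse=",")
assert "lorem".rjust(10) == "     lorem"                  # sprintf("%10s", "lorem")
assert " foo ".strip() == "foo"                           # trim whitespace

# zip() is lazy: wrap it in list() to see the pairs
pairs = list(zip([1, 2, 3], ["a", "b", "c"]))
assert pairs == [(1, "a"), (2, "b"), (3, "c")]

# A named list in R becomes a dict in Python
d = {"n": 10, "avg": 3.7, "sd": 0.4}
assert d["avg"] == 3.7
```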
Assuming the use of pandas in Python, and the dplyr and tidyr packages in R.
R | Python |
---|---|
`head(df)` | `df.head()` |
`tail(df)` | `df.tail()` |
`nrow(df)` | `df.shape[0]` or `len(df)` |
`ncol(df)` | `df.shape[1]` or `len(df.columns)` |
`df$col_name` | `df['col_name']` or `df.col_name` |
None | `df.info()` |
`summary(df)` | `df.describe()` (not exactly the same) |
`df %>% arrange(c1, desc(c2))` | `df.sort_values(by=['c1','c2'], ascending=[True, False])` |
`df %>% rename(new_col = old_col)` | `df.rename(columns={'old_col': 'new_col'})` |
`df$smoker <- mapvalues(df$smoker,` | `df['smoker'] = df['smoker'].map({'yes':0, 'no':1})` |
`df$c1 <- as.character(df$c1)` | `df['c1'] = df['c1'].astype(str)` |
`unique(df$c1)` | `df['c1'].unique()` |
`length(unique(df$c1))` | `len(df['c1'].unique())` |
`max(df$c1, na.rm = TRUE)` | `df['c1'].max()` |
`df$c1[is.na(df$c1)] <- 0` | `df['c1'] = df['c1'].fillna(0)` |
`col_a <- c('a','b','c')` | `df = pd.DataFrame(dict(col_a=['a', 'b', 'c'], col_b=[1, 2, 3]))` |
`df <- read.csv("input.csv",` | `df = pd.read_csv("input.csv")` |
`write.csv(df, "output.csv", row.names = FALSE)` | `df.to_csv("output.csv", index = False)` |
`df[c(4:6)]` | `df.iloc[:, 3:6]` |
`mutate(df, c=a-b)` | `df.assign(c=df['a']-df['b'])` |
`distinct(select(df, col1))` | `df[['col1']].drop_duplicates()` |
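Several of these pandas calls can be combined with method chaining, which is the closest pandas gets to a dplyr pipeline. The sketch below assumes pandas is installed and uses a small made-up dataframe; the column names and values are invented for the example.

```python
import pandas as pd

# A small made-up dataframe with a missing value and a duplicated row
df = pd.DataFrame(
    {
        "a": [3, 1, 2, 2],
        "b": [1.0, None, 0.5, 0.5],
        "smoker": ["yes", "no", "no", "no"],
    }
)

result = (
    df.assign(b=lambda x: x["b"].fillna(0))                  # df$b[is.na(df$b)] <- 0
    .assign(c=lambda x: x["a"] - x["b"])                      # mutate(df, c = a - b)
    .assign(smoker=lambda x: x["smoker"].map({"yes": 0, "no": 1}))
    .rename(columns={"a": "first", "b": "second"})            # rename(first = a, second = b)
    .sort_values(by=["first", "c"], ascending=[True, False])  # arrange(first, desc(c))
    .drop_duplicates()                                        # distinct()
    .reset_index(drop=True)
)

print(result.shape)  # (3, 4): rows and columns, like nrow(df) and ncol(df)
print(result)
```

Each step returns a new dataframe and leaves `df` unchanged, which is what makes the chain read like a sequence of `%>%` steps.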
R | Python |
---|---|
character | string, aka str |
integer | integer, aka int |
logical | boolean, aka bool |
numeric | float or double |
complex | complex |
Single-element vector | Scalar |
Multi-element vector | List |
List of multiple types | Tuple |
Named list | Dict |
Matrix/Array | numpy ndarray |
NULL, TRUE, FALSE | None, True, False |
Inf | inf |
NaN | nan |
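To see the Python side of this table in action, you can ask Python for the type of a few values. This is a small illustration using only built-ins and the standard `math` module; the example values are arbitrary.

```python
import math

# One example value for each of the Python types in the table
values = [5, 3.7, "text", True, 2 + 1j, None, [1, 3, 5], (1, "a"), {"n": 10}]
type_names = [type(v).__name__ for v in values]
for value, name in zip(values, type_names):
    print(repr(value), "->", name)

# Special floating point values are created from strings (or via the math module)
inf = float("inf")
nan = float("nan")
print(inf == math.inf)   # True
print(nan == nan)        # False: nan is never equal to itself
print(math.isnan(nan))   # True: use math.isnan to test for nan
```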
R | Python |
---|---|
`<-` works as an assignment operator | `=` is the assignment operator |
Dots are valid in variable names, eg `var.iable` | Dots precede methods, eg `' strip whitespace '.strip()` |
Use of `$`, eg `df$col_name` | Equivalent is usually `.`, eg `df.col_name` |
Does not have compound assignments | `+=`, `-=`, `*=`, etc. are compound assignment operators |
`FALSE`, `F`, `0`, and `0.0` are false | `False`, `None`, `0`, `0.0`, `''`, `[]`, and `{}` are false |
Tends to fail silently, eg `a = c()`, `a[10]` evaluates as `NULL` | Tends to fail loudly, eg `a=[]`, `a[10]` throws an error |
No built-in decorator operator, but see decorator | Function decorator, `@` |
Walrus operator, `:=`, used in quasiquotation | Walrus operator, `:=`, combines an expression with an assignment (Python 3.8+) |
Pipe operator, `%>%` | No built-in pipe operator. Method chaining and `.pipe` are used as partial replacements for dataframes, and there are pipe extensions like pipetools and sspipe, but they're not widely used. |
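The Python column of this table can be tried out in a few lines. The following is an illustrative sketch using only built-ins; the names `shout` and `greet` are made up for the example.

```python
# Compound assignment
total = 10
total += 5  # same as total = total + 5

# Truthiness: zero, None, and empty strings or containers all count as false
falsy = [False, None, 0, 0.0, "", [], {}]
all_false = not any(falsy)

# Failing loudly: indexing past the end of a list raises an exception
a = []
try:
    a[10]
    outcome = "no error"
except IndexError:
    outcome = "IndexError"

# Walrus operator (Python 3.8+): assign inside an expression
data = [1, 3, 5, 7]
if (n := len(data)) > 3:
    message = f"{n} elements"


# Function decorator: wraps a function to change its behaviour
def shout(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()

    return wrapper


@shout
def greet(name):
    return f"hello {name}"


print(total, all_false, outcome, message, greet("r user"))
# 15 True IndexError 4 elements HELLO R USER
```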