Skip to content

Latest commit

 

History

History
182 lines (127 loc) · 3.79 KB

Pandas.md

File metadata and controls

182 lines (127 loc) · 3.79 KB

Pandas is a great data visualisation library that we can use to do data analysis with our code base

Basics

Dataframes

We can read in some data with

titles = pd.read_csv('data/titles.csv')
titles.head()
  • .head() and .tail() help us to be able to take snapshots from the dataframe itself, we can pass in a number (Eg. .head(4) to take a custom number of elements from the data frame)

We can do filtering on the data that we want by doing

c = cast
c = c[c.title == "Inception"]
c = c[c.n.isnull()]
c

We can then sort by a field by using the field of

c.sort_values('n') # Sort by a Field

We can also use special string methods by using the following syntax

This is for strings

titles[titles['title'].str.contains("Treasure
Island")==True].sort_values("year")

This is for dates

titles[titles['date'].dt.... ]

Series

We can read out a specific column by using

c.year # This gets a series

We can also get out the specific counts for this specific field by using value_counts(). Note that this gives a sorted series, so we'd get the largest -> smallest values by default

c.year.value_counts()

We can then sort by the value indexes by doing

c.year.value_counts().sort_index()
c.year.value_counts().sort_index().plot() # Generates a plot

Plots

We can do plots with the entire data frame or a single series

c = cast
c = c[c.character=="Kermit the Frog"]
c.plot(x='year',y='n',kind='scatter')

We can also generate plots from value_counts by doing

(t.year // 10 * 10).value_counts().sort_index().plot(kind='bar')

Modifying Dataset

We can then take subsets of the entire plot with

c = cast
c = c[["year","n"]]
c

We can then set the index of our dataframe by using

c = cast
c = c.set_index(['title']).sort_index() # A sorted index is much much faster
c = c.set_index(['title','year']).sort_index()

c.loc['Sleuth'] # We can then lookup indexes using the `.loc`

We can remove the indexes by doing

c.reset_index(['year','index'])

Aggregation

We can run aggregations using the .groupBy query

c = cast
c = c[c.name == "George Clooney"]
c.groupby(['year']).size().plot(ylim=0)

We can even run a groupBy on a transformation of the column

c = cast
c = c[c.name == "George Clooney"]
c.groupby(c.year // 10 * 10 ).size().plot(ylim=0)

We can also use a unstack operation. This transposes our rows and columns and is useful when we're using a unstack.

c  = cast
c = c[(c.character == 'Kermit the Frog') | (c.character == 'Oscar the Grouch')]
c = c.groupby(['character',c.year // 10 * 10]).size()
c = c.unstack(0).fillna(0) # note here that fillna helps usto fill in the values

Some common operations that we can work with groupby

# Getting the total size of each subset
c.groupby(['year','type']).size() 

# Get the max of each subset that we've grouped it by
c = c.groupby(['year'])[['n']].max()'

# Filter by the count value of each
c = cast
c = c[c.name == 'Frank Oz']
g = c.groupby(['year','title']).size()
g[g > 1]

Examples of when we might want to work with unstack

# This allows us to do a group by on year-type and then take the diff
c = cast[cast.year < 2115]
c = c.groupby(['year','type']).size()
c = c.unstack('type').fillna(0)
(c.actor/(c.actor + c.actress)).plot(ylim=[0,1])

# Aggregate by multiple examples
# Eg. compare the fraction of actor<>acctress for each year w.r.t n
c = cast
c = c[c.n <=3]
c = c.groupby(['year','type','n']).size()
c = c.unstack('type').fillna(0)
(c.actor / (c.actor+c.actress)).unstack('n').plot(ylim=[0,1])

Merging

We can also merge two separate datasets or a table on itself

c = cast
c =  cast[c.name == 'Ellen Page']
c.merge(release_dates)