Course in data science. Learn to analyze data of all types using the Python programming language. No programming experience is necessary.
Software covered:
- IPython environment and Jupyter notebooks
- Conda for package management and virtual environments
- Python 3
Course topics include:
- Introduction to/review of the command line
- Fundamentals of Python and its data types
- Data analysis packages Numpy and Pandas
- Plotting packages Matplotlib and Seaborn
- Statistics
- Regular expressions
- Interactive visualization
- Modules and classes
- Git and GitHub
- Luke Thompson, Ph.D.
- Lecturer at Scripps Institution of Oceanography, Research Associate at NOAA
- Email: [email protected], [email protected]
- Pages: GitHub, Google Scholar, NOAA profile + CV
- GitHub repository: https://github.com/cuttlefishh/python-for-data-analysis
- YouTube channel: https://www.youtube.com/channel/UCVZrIrWtcvTzYlrNx7RcDyg
- Learn Python 3 the Hard Way by Zed Shaw (Addison-Wesley) -- Step-by-step introduction to Python with no prior knowledge assumed; includes appendix Command Line Crash Course.
- Learning Python 3rd Edition by Mark Lutz (O'Reilly) -- Optional; more traditional introduction to Python as a computer language.
- Python for Data Analysis 2nd Edition by Wes McKinney (O'Reilly) -- Manual focused on Pandas, the popular Python package for data analysis, by its creator. GitHub page: https://github.com/wesm/pydata-book.
Note: O'Reilly Media titles are free to UCSD affiliates with Safari Books Online.
- Git for Windows -- BASH emulator and Git software for Windows
- Learning the Shell -- Great intro to the Unix shell
- Unix Tutorial by Julian Catchen -- From Evomics 2015 workshop in Czech Republic
- Learn the Command Line -- Codecademy
- Learn Unix The Hard Way by Zed Shaw -- More detail than you will get from Appendix A of Shaw's LPTHW
- IPython Interactive Computing and Visualization Cookbook
- Learning IPython for Interactive Computing and Data Visualization -- GitHub repo
- 10-minute Tour of Pandas by Wes McKinney -- Basic video tour
- R vs. Python for Data Analysis -- Fun cartoon to abate or fuel your biases
- Python Scripting for Computational Science by H. P. Langtangen -- Deeper and more mathematical treatment
- An Introduction To Applied Bioinformatics by Greg Caporaso
- A Dramatic Tour through Python's Data Visualization Landscape
- Just like anything else, you learn Python by doing. With a few exceptions, you're not going to break your computer by trying new commands. So just try it and see what happens. Print output of commands. Print values of variables. Kick the thing until it works.
- When you don't know how to do something, google it. You'll be amazed by the solutions you'll find to do thing x if you google "python thing x".
- Learn keyboard shortcuts, as many as you can. Tab-complete in the shell and IPython/Jupyter!
- Remember Zed's sage wisdom:
  - Practice every day.
  - Don't over-do it. Slow and steady wins the race.
  - It's alright to be totally lost at first.
  - When you get stuck, get more information.
  - Try to solve it yourself first.
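
As a small illustration of the "print it and see what happens" advice above, here is a minimal sketch of exploring an unfamiliar value by printing it, its type, and the result of each step. The variable names and sample data are made up for illustration.

```python
# Hypothetical sample data: depths (in meters) recorded as text.
raw = "5, 10, 25, 50"

# Print the value and its type to see what you are working with.
print(raw)
print(type(raw))

# Try an operation and print the result instead of guessing.
parts = raw.split(", ")
print(parts)

# Convert step by step, printing as you go.
depths = [int(p) for p in parts]
print(depths)
print(max(depths), sum(depths) / len(depths))
```
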
Weekly take-home assignments will follow the course schedule, reinforcing skills with exercises to analyze and visualize scientific data. Assignments will be given out on Thursdays and will be due the following Thursday, using TritonEd.
You will choose a data set, either your own or one provided in one of the texts, and write a Python program (or a set of Python programs, or a mixture of .ipynb and .py/.sh scripts) to carry out a revealing data analysis. Have a look at Shaw Ex43-52 and McKinney Ch10-12 for more ideas.
Requirements:
- Submit your project as either: a Jupyter notebook (or collection of notebooks), a Python script (or collection of scripts), or a combination of the two.
- Use pandas and at least three (3) additional libraries/packages, such as:
  - Plotting: matplotlib, seaborn
  - Statistics and modeling: statsmodels, scikit-learn
  - Bioinformatics: scikit-bio, biopython
  - Climate science: cdms, iris
  - Other domain-specific libraries/packages
- Use at least three (3) user-defined functions.
- Optional: Create user-defined modules and classes for use in your code.
- Optional: Share your code on GitHub.
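
To make the requirements above more concrete, here is a minimal sketch of how a project script might be organized around pandas and three user-defined functions. The data, column names, and function names are all hypothetical, and the sketch deliberately leaves out the three additional libraries a real project would need.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real data set.
CSV_TEXT = """site,depth_m,temp_c
A,5,18.2
A,50,12.1
B,5,17.5
B,50,11.8
"""


def load_data(source):
    """Read CSV data from a path or file-like object into a DataFrame."""
    return pd.read_csv(source)


def mean_by_group(df, group_col, value_col):
    """Return the mean of value_col for each value of group_col."""
    return df.groupby(group_col)[value_col].mean()


def warmest_row(df, value_col):
    """Return the row with the largest value in value_col."""
    return df.loc[df[value_col].idxmax()]


df = load_data(io.StringIO(CSV_TEXT))
site_means = mean_by_group(df, "site", "temp_c")
warmest = warmest_row(df, "temp_c")

print(site_means)
print(warmest["site"], warmest["depth_m"], warmest["temp_c"])
```

In a real project, `load_data` would be given the path to your own file, and further functions could handle plotting or statistics with the other libraries you choose.
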
Note: There are no midterm or final exams.
Schedule is subject to change.
The course consists of 20 lessons. It was originally taught as 2 lessons per week for 10 weeks, but the material can be covered at any pace.
Lessons 1-3 will be an introduction to the command line. By the end of this tutorial, everyone will be familiar with basic Unix commands.
Lessons 4-9 will be an introduction to programming using Python. The main text will be Shaw's Learn Python 3 the Hard Way. For those with experience in a programming language other than Python, Lutz's Learning Python will provide a more thorough introduction to programming in Python. We will learn to use IPython and IPython Notebooks (also called Jupyter), a much richer Python experience than the Unix command line or Python interpreter.
Lessons 10-18 will focus on Python packages for data analysis. We will work through McKinney's Python for Data Analysis, which is all about analyzing data, doing statistics, and making pretty plots. You may find that Python can emulate or exceed much of the functionality of R and MATLAB.
Lessons 19-20 conclude the course with two skills useful in developing code: writing your own classes and modules, and sharing your code on GitHub.
- Course material is available as .md or .ipynb files by clicking on the lesson number below.
- In addition to doing the readings, please follow along writing code (this is integral to the Shaw readings), and do any Study Drills (Shaw) and Chapter Quizzes (Lutz).
Lesson | Title | Readings | Topics | Assignment |
---|---|---|---|---|
1 | Overview | -- | Introductions and overview of course | Pre-course survey; Acquire texts |
2 | Command Line Part I | Shaw: Introduction, Ex0, Appendix A | Command line crash course; Text editors | Assignment 1: Basic Shell Commands |
3 | Command Line Part II | Yale: The 10 Most Important Linux Commands | Advanced commands in the bash shell | -- |
4 | Conda, IPython, and Jupyter Notebooks | Geohackweek: Introduction to Conda | Conda tutorial including Conda environments, Python packages, and pip; Python and IPython in the command line; Jupyter notebook tutorial and Python crash course | Assignment 2: Bash, Conda, IPython, and Jupyter |
5 | Python Basics, Strings, Printing | Shaw: Ex1-10; Lutz: Ch1-7 | Python scripts, error messages, printing strings and variables, strings and string operations, numbers and mathematical expressions, getting help with commands and IPython | -- |
6 | Taking Input, Reading and Writing Files, Functions | Shaw: Ex11-26; Lutz: Ch9,14-17 | Taking input, reading files, writing files, functions | Assignment 3: Python Fundamentals I |
7 | Logic, Loops, Lists, Dictionaries, and Tuples | Shaw: Ex27-39; Lutz: Ch8-13 | Logic and loops, lists and list comprehension, tuples, dictionaries, other types | -- |
8 | Python and IPython Review | McKinney: Ch1, Ch2, Ch3 | Review of Python commands, IPython review -- enhanced interactive Python shells with support for data visualization, distributed and parallel computation and a browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media | Assignment 4: Python Fundamentals II |
9 | Regular Expressions | Kuchling: Regular Expression HOWTO | Regular expression syntax; Command-line tools: grep, sed, awk, perl -e; Python examples: built-in and re module | -- |
10 | Numpy, Pandas and Matplotlib Crashcourse | Pratik: Introduction to Numpy and Pandas | Numpy, Pandas, and Matplotlib overview | Assignment 5: Regular Expressions |
11 | Pandas Part I | McKinney: Ch4, Ch5 | Introduction to NumPy and Pandas: ndarray, Series, DataFrame, index, columns, dtypes, info, describe, read_csv, head, tail, loc, iloc, ix, to_datetime | -- |
12 | Pandas Part II | McKinney: Ch6, Ch7, Ch8 | Data Analysis with Pandas: concat, append, merge, join, set_option, stack, unstack, transpose, dot-notation, values, apply, lambda, sort_index, sort_values, to_csv, read_csv, isnull | Assignment 6: Pandas Fundamentals |
13 | Plotting with Matplotlib | McKinney: Ch9; Johansson: Matplotlib 2D and 3D plotting in Python | Matplotlib tutorial from J.R. Johansson | -- |
14 | Plotting with Seaborn | Seaborn Tutorial | Seaborn tutorial from Michael Waskom | Assignment 7: Plotting |
15 | Pandas Time Series | McKinney: Ch11 | Time series data in Pandas | -- |
16 | Pandas Group Operations | McKinney: Ch10 | groupby, melt, pivot, inplace=True, reindex | Assignment 8: Time Series and Group Operations |
17 | Statistics Packages | Handbook of Biological Statistics | Statistics capabilities of Pandas, Numpy, Scipy, and Scikit-bio | -- |
18 | Interactive Visualization with Bokeh | Bokeh User Guide | Quickstart guide to making interactive HTML and notebook plots with Bokeh | Assignment 9: Statistics and Interactive Visualization |
19 | Modules and Classes | Shaw: Ex40-52 | Packaging your code so you and others can use it again | -- |
20 | Git and GitHub | GitHub Guides | Sharing your code in a public GitHub repository | Final Project |
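
Lesson 9 in the schedule above covers regular expressions with Python's built-in re module. As a rough sketch of what that looks like in practice, the example below pulls sample IDs and dates out of a few lines of text. The text lines and the pattern are invented for illustration and are not taken from the course materials.

```python
import re

# Hypothetical lines of text to search.
lines = [
    "sample_001 collected 2017-03-14 at station A",
    "sample_002 collected 2017-04-02 at station B",
    "no sample recorded",
]

# Named groups capture the sample ID and an ISO-style date.
pattern = re.compile(
    r"(?P<sample>sample_\d+) collected (?P<date>\d{4}-\d{2}-\d{2})"
)

matches = []
for line in lines:
    m = pattern.search(line)
    if m:  # search() returns None when the line does not match
        matches.append((m.group("sample"), m.group("date")))

print(matches)

# re.sub rewrites each match; here "station A" becomes "station_A".
renamed = re.sub(r"station ([A-Z])", r"station_\1", lines[0])
print(renamed)
```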