Unleashing your data potential: Wrangling data tables using Python and the pandas package

Woods Hole Oceanographic Institution

May 22, 2020

08:45 - 12:30 EST

Instructor: Karen Soenen

Helpers: Amber York, Brett Longworth, Stace Beaulieu

General Information

Workshops always have the cleanest, best examples of data tables to use, don’t they? These tables always seem to be immediately usable for analysis. Turning a raw table into one that is usable for analysis is a process called “data wrangling”. In this workshop we’ll show you how to get to that analysis-ready table using Python and the pandas package.

Doing this process correctly will not only make you more efficient, but it will also make your data easier to reuse in the future.

The workshop is sponsored by a WHOI Academic Programs Doherty Award and a DDVPR Technical Staff Training Award.

Who: This workshop is aimed at improving project efficiency and building technical skills. It is limited to 10 people at a time and will be held online as a Zoom meeting. Registration is required. Please contact stace@whoi.edu for availability.

Requirements:

Accessibility: We are dedicated to providing a positive and accessible learning environment for all. Please notify the instructors in advance of the workshop if you require any accommodations or if there is anything we can do to make this workshop more accessible to you.

Contact: Please email ksoenen@whoi.edu for more information.


Am I a data wrangler?

The process called “data wrangling”, i.e., manipulating data into a usable form and diagnosing data quality issues, often constitutes the most tedious and time-consuming aspect of analysis.

Regular expression xkcd comic
Kandel, S. et al. (2011). Research directions in data wrangling: Visualisations and transformations for usable and credible data. Inf. Vis., 10(4), 271-288.

This workshop is for you if you:


Code of Conduct

We will be using the Carpentries code of conduct for this workshop.

Everyone who participates in this workshop is required to conform to the Code of Conduct.


Surveys

Please be sure to complete these surveys before and after the workshop.

Pre-workshop Survey

Post-workshop Survey


Schedule & Syllabus

This workshop is based on a few workshops developed by the Carpentries (see https://carpentries.org for more information about the Carpentries organisation) and by Joe Futrelle (WHOI):


Part 1. Preparing your table - Best practices

TIME | SUBJECT | TOPICS COVERED | NOTEBOOK/EXERCISE | MORE RESOURCES
08:45 | Introduction | | |
09:00 | Formatting data tables in Spreadsheets | How do we format data in spreadsheets for effective data use? | | Carpentries: data table
09:10 | Exercise | How can this table be improved to start analysis in Python? | Exercise in breakout rooms; Datatable for exercise | Carpentries exercise and discussion
09:35 | Date Notation | Good approaches for handling dates in spreadsheets | | Carpentries: Dates
09:45 | Break | 15 minute break | |
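
The date-handling advice from the 09:35 session carries over directly to pandas. The sketch below is only an illustration: it assumes the common recommendation of storing year, month and day in separate columns (or as unambiguous ISO 8601 strings), and the column names and values are made up for the example, not taken from the workshop data.

```python
import pandas as pd

# Hypothetical table with the date split into three unambiguous columns.
df = pd.DataFrame({
    "year": [2020, 2020],
    "month": [5, 5],
    "day": [20, 22],
})

# pandas can assemble a datetime column from columns named year/month/day.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# ISO 8601 text (YYYY-MM-DD) survives a round trip through most spreadsheets.
df["iso_date"] = df["date"].dt.strftime("%Y-%m-%d")

print(df)
```

Keeping the separate columns alongside the ISO string means the table stays readable even if a spreadsheet program reinterprets the date cell.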


Part 2. Python and the Pandas library

TIME | SUBJECT | TOPICS COVERED | NOTEBOOK/EXERCISE | MORE RESOURCES
10:00 | Starting with Python | What is Python?; Data types; Mathematical operations; Lists | Notebook: Starting with Python | Carpentries: Intro to Python I; Carpentries: Intro to Python II; First commands, Notebook (Joe Futrelle, WHOI)
10:20 | The Pandas Library | What is Pandas?; How do I import data?; What is a dataframe?; How can I access specific data within my data set? | Notebook: The Pandas Library | Carpentries: Starting with data; Carpentries: Indexing, Slicing and Subsetting DataFrames in Python; Anatomy of a DF, Notebook (Joe Futrelle, WHOI)
10:35 | Exercise | | |
10:45 | Break | 15 minute break | |
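
As a preview of the topics in Part 2, here is a minimal sketch of a Python list, importing data with pandas, and accessing specific values in a dataframe. The CSV text, column names and values are invented for illustration and are not the workshop's dataset; in the notebooks you would normally pass a file path to `pd.read_csv` instead of an in-memory string.

```python
import io
import pandas as pd

# Plain Python: a list and a simple mathematical operation.
depths = [5, 50]
mean_depth = sum(depths) / len(depths)

# Hypothetical CSV content standing in for a file on disk.
csv_text = """station,depth_m,temperature_c
A,5,18.2
A,50,12.1
B,5,17.5
B,50,11.8
"""

# Import the data into a dataframe.
df = pd.read_csv(io.StringIO(csv_text))

# Inspect the dataframe: (number of rows, number of columns) and column types.
print(df.shape)
print(df.dtypes)

# Access one column, the first two rows, and a label-based subset.
temperatures = df["temperature_c"]
first_two = df.iloc[0:2]
deep = df.loc[df["depth_m"] == 50, ["station", "temperature_c"]]
print(deep)
```

`iloc` selects by position and `loc` selects by label or boolean condition, which is the distinction the indexing and slicing lesson focuses on.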


Part 3. Further manipulation of a data frame

TIME | SUBJECT | TOPICS COVERED | NOTEBOOK/EXERCISE | MORE RESOURCES
11:00 | Further manipulation of a data frame | Sorting; Unique values; Logical conditions; Summary statistics; Groups; Merging dataframes | Notebook: Further manipulation of a dataframe | Carpentries: Statistics, groups and basic math; Carpentries: Merging data; Querying and merging DFs, Notebook (Joe Futrelle, WHOI)
11:30 | Exercise | | |
11:45 | Break | 15 minute break | |
12:00 | Questions? | | |
12:15 | Wrap-up | | |
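
The Part 3 topics each map onto one or two pandas calls. The sketch below walks through them on two small made-up tables; the table names, column names and values are assumptions for illustration only and do not come from the workshop notebooks.

```python
import pandas as pd

# Hypothetical measurements and station metadata.
samples = pd.DataFrame({
    "station": ["A", "A", "B", "B", "C"],
    "depth_m": [5, 50, 5, 50, 5],
    "temperature_c": [18.2, 12.1, 17.5, 11.8, 16.9],
})
stations = pd.DataFrame({
    "station": ["A", "B"],
    "latitude": [41.5, 41.3],
})

# Sorting: warmest sample first.
sorted_samples = samples.sort_values("temperature_c", ascending=False)

# Unique values in a column.
unique_stations = samples["station"].unique()

# Logical condition: keep only samples shallower than 10 m.
shallow = samples[samples["depth_m"] < 10]

# Summary statistics for one column (count, mean, std, min, quartiles, max).
stats = samples["temperature_c"].describe()

# Groups: mean temperature per station.
mean_by_station = samples.groupby("station")["temperature_c"].mean()

# Merging: a left join keeps every sample, even for stations without metadata.
merged = samples.merge(stations, on="station", how="left")
print(merged)
```

Station C has no entry in `stations`, so its `latitude` is missing after the left merge; checking for such gaps is a typical data quality step when combining tables.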


Part 4. Extras


Setup

To participate in this workshop, you will need an up-to-date web browser and access to a spreadsheet program (Excel, LibreOffice, ...), Python, and Jupyter notebooks.

You only need to install these programs:

Detailed set-up instructions for your software can be found here (instructions from the Data Carpentry Ecology workshops with Python). You only need to install a spreadsheet program, plus Python and Jupyter notebooks (through Anaconda).

Please make sure you have installed all the required packages before the start of this workshop. We will be holding an on-line data lab with Stace Beaulieu on May 20 and can help you install the packages if necessary.
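
One quick way to check your installation before the workshop is to try importing the packages from Python. The helper below is a hypothetical sketch, not part of the official set-up instructions; `check_packages` is a name invented here, and it only checks `pandas` by default.

```python
import importlib


def check_packages(names):
    """Return a dict mapping each package name to its version, or None if missing."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Not every module exposes __version__, so fall back to "unknown".
            results[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            results[name] = None
    return results


status = check_packages(["pandas"])
for name, version in status.items():
    if version is None:
        print(f"{name}: NOT installed")
    else:
        print(f"{name}: installed (version {version})")
```

If a package is reported as not installed, the data lab mentioned above is a good place to ask for help.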

We maintain a list of common installation issues on the Configuration Problems and Solutions wiki page. It is written as a reference for instructors, but it may be useful to you as well.