Online data to your Python environment
Overview
Teaching: 25 min
Exercises: 20 min
Questions
How do I import an ERDDAP dataset into Python?
How do I interact with the dataset in Python?
Objectives
Importing data from an ERDDAP server into your Python environment
Interact with data
The erddapy library
In the previous lesson, we downloaded the dataset file to our local machine. This time we will skip the download and work with the dataset directly in your Python environment.
Erddapy is a package that takes advantage of ERDDAP's RESTful web services and builds the ERDDAP URL for virtually any request, such as searching for datasets, acquiring metadata, or downloading data.
Link to static Jupyter Notebook. Copy/Paste the code blocks into your own Jupyter Notebook
Import BCO-DMO temperature dataset - Oregon Coast
Part 1: Create the URL
From the dataset above, we are going to import the variables longitude, latitude, time, and Temperature. The time constraint will be between January 13th and January 16th, 2017.
Step 1: Initiate the ERDDAP URL constructor for a server (an erddapy server object).
# Import the ERDDAP class from the erddapy package
from erddapy import ERDDAP

e = ERDDAP(
    server="https://erddap.bco-dmo.org/erddap/",
    protocol="tabledap",
    response="csv",
)
Step 2: Populate the object with a dataset id, the variables of interest, and their constraints.
e.dataset_id = "bcodmo_dataset_817952"
e.variables = [
    "longitude",
    "latitude",
    "time",
    "Temperature",
]
e.constraints = {
    "time>=": "2017-01-13T00:00:00Z",
    "time<=": "2017-01-16T23:59:59Z",
}
Check the URL
# Print the URL - check
url = e.get_download_url()
print(url)
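Erddapy builds this URL for us, but it helps to see its structure. The sketch below hand-assembles an equivalent tabledap request URL from the same pieces. The exact percent-encoding erddapy applies may differ, so treat this as an illustration of the URL anatomy, not a replacement for get_download_url().

```python
# Hand-build an ERDDAP tabledap download URL, for comparison with erddapy's output.
# ERDDAP handles the actual escaping; this only shows the pieces of the request.
server = "https://erddap.bco-dmo.org/erddap"
dataset_id = "bcodmo_dataset_817952"
variables = ["longitude", "latitude", "time", "Temperature"]
constraints = {
    "time>=": "2017-01-13T00:00:00Z",
    "time<=": "2017-01-16T23:59:59Z",
}

# The query starts with the comma-separated variable list,
# followed by one &-separated clause per constraint.
query = ",".join(variables)
for op, value in constraints.items():
    query += f"&{op}{value}"

url = f"{server}/tabledap/{dataset_id}.csv?{query}"
print(url)
```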
Part 2: Import your dataset into pandas
We can import the csv response using erddapy's .to_pandas method.
# Convert URL to pandas dataframe
df_bcodmo = e.to_pandas(
parse_dates=True,
).dropna()
Check out your dataset in pandas
# print the dataframe to check what data is in there specifically.
df_bcodmo.head()
# print the column names
print (df_bcodmo.columns)
The temperature column has an unwieldy name; rename the column to correct this.
df_bcodmo.rename(columns={df_bcodmo.columns.values[3]: 'Temperature (degrees Celsius)'}, inplace=True)
print (df_bcodmo.columns)
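ERDDAP's csvp response appends the units to every column name, which is why the headers look like "longitude (degrees_east)". If you prefer bare variable names, a small sketch like this strips the unit suffix from all columns at once (the column names below are the ones from this lesson, used here only as an example):

```python
import pandas as pd

# Example frame with csvp-style "name (units)" headers, as in this lesson
df = pd.DataFrame(columns=[
    "longitude (degrees_east)",
    "latitude (degrees_north)",
    "time (UTC)",
    "Temperature (degrees Celsius)",
])

# Keep everything before the " (" that introduces the unit
df = df.rename(columns=lambda name: name.split(" (")[0])
print(list(df.columns))  # ['longitude', 'latitude', 'time', 'Temperature']
```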
Subset the tabular data further in pandas based on time. Step 1: convert the time column to a datetime object so that we can extract parts of the timestamp.
import pandas as pd

# convert to a datetime object to be able to work with it in pandas
print(df_bcodmo.dtypes)
df_bcodmo["time (UTC)"] = pd.to_datetime(df_bcodmo["time (UTC)"], format="%Y-%m-%dT%H:%M:%SZ")
print(df_bcodmo.dtypes)
Only select the rows for January 13th
df_bcodmo_13 = df_bcodmo[df_bcodmo["time (UTC)"].dt.day == 13]
df_bcodmo_13
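The .dt accessor works on any datetime column, so the same day-based filter can be tried on a small synthetic frame. The timestamps and temperatures below are made up, purely to show the mechanics:

```python
import pandas as pd

# Tiny synthetic table with timestamps on two different days (hypothetical values)
df = pd.DataFrame({
    "time (UTC)": pd.to_datetime([
        "2017-01-13T06:00:00Z",
        "2017-01-13T18:00:00Z",
        "2017-01-14T06:00:00Z",
    ]),
    "Temperature (degrees Celsius)": [9.1, 9.4, 8.8],
})

# Keep only the rows whose day-of-month is 13
df_13 = df[df["time (UTC)"].dt.day == 13]
print(len(df_13))  # 2
```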
When you inspect the dataset, you can see that some hours have multiple data points, while others have only one. Let's average the dataset over every hour using the groupby function.
df_bcodmo_13_average = df_bcodmo_13.groupby(df_bcodmo_13["time (UTC)"].dt.hour)[
    ["Temperature (degrees Celsius)", "longitude (degrees_east)", "latitude (degrees_north)"]
].mean().reset_index()
df_bcodmo_13_average
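You can confirm how many observations fall into each hour with groupby(...).size(), the same grouping pattern used for the mean above. A minimal sketch on made-up data (two points in hour 6, one in hour 18):

```python
import pandas as pd

# Hypothetical timestamps: hour 6 has two points, hour 18 has one
df = pd.DataFrame({
    "time (UTC)": pd.to_datetime([
        "2017-01-13T06:10:00Z",
        "2017-01-13T06:40:00Z",
        "2017-01-13T18:05:00Z",
    ]),
    "Temperature (degrees Celsius)": [9.0, 9.2, 8.5],
})

# Count observations per hour, then average them the same way as above
counts = df.groupby(df["time (UTC)"].dt.hour).size()
means = df.groupby(df["time (UTC)"].dt.hour)["Temperature (degrees Celsius)"].mean()
print(counts.to_dict())  # {6: 2, 18: 1}
print(means.to_dict())
```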
Plot your averaged dataset in pandas
df_bcodmo_13_average.plot(
    x="longitude (degrees_east)",
    y="latitude (degrees_north)",
    kind="scatter",
    c="Temperature (degrees Celsius)",
    colormap="YlOrRd",
)
Exercise:
Create the URL for this dataset with the variable POC instead of Temperature.
Answer
# Import the erddapy package
from erddapy import ERDDAP

e = ERDDAP(
    server="https://erddap.bco-dmo.org/erddap/",
    protocol="tabledap",
    response="csv",
)
e.dataset_id = "bcodmo_dataset_817952"
e.variables = [
    "longitude",
    "latitude",
    "time",
    "POC",
]
e.constraints = {
    "time>=": "2017-01-13T00:00:00Z",
    "time<=": "2017-01-16T23:59:59Z",
}

# Print the URL - check
url = e.get_download_url()
print(url)
Searching datasets using erddapy
Step 1: Initiate the ERDDAP URL constructor for a server (an erddapy server object).
# searching datasets based on keywords
from erddapy import ERDDAP

e = ERDDAP(
    server="https://erddap.bco-dmo.org/erddap",
    protocol="tabledap",
    response="csv",
)
Search with keywords:
import pandas as pd
url = e.get_search_url(search_for="Temperature OC1603B", response="csv")
print (url)
pd.read_csv(url)["Dataset ID"]
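The search response is itself a CSV, so anything pandas can do applies, for instance filtering the hit list by a word in the title. The sketch below uses a mocked-up search result so it runs offline; a real ERDDAP search response has additional columns beyond the two shown here:

```python
import io
import pandas as pd

# Mock of an ERDDAP search response (a real one has more columns)
csv_text = """Title,Dataset ID
CTD Temperature OC1603B,bcodmo_dataset_817952
Glider salinity transect,bcodmo_dataset_807119
"""
results = pd.read_csv(io.StringIO(csv_text))

# Keep only the hits whose title mentions temperature
hits = results[results["Title"].str.contains("Temperature", case=False)]
print(hits["Dataset ID"].tolist())  # ['bcodmo_dataset_817952']
```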
Inspect the metadata of the dataset with id bcodmo_dataset_817952:
#find the variables
info_url = e.get_info_url(dataset_id="bcodmo_dataset_817952")
pd.read_csv(info_url)
pd.set_option('display.max_rows', None) #make sure that jupyter notebook shows all rows
dataframe = pd.read_csv(info_url)
print (dataframe)
#get the unique variable names with pandas
dataframe["Variable Name"].unique()
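The info response stores one attribute per row, so ordinary pandas filtering answers questions like "what are the units of a variable?". Here is a sketch on a tiny mock that mirrors the layout of the info table; the variable names and values are invented for illustration:

```python
import pandas as pd

# Mock of a few rows from an ERDDAP "info" CSV (same column layout; values made up)
dataframe = pd.DataFrame({
    "Row Type": ["attribute", "attribute", "attribute"],
    "Variable Name": ["temp", "temp", "sal"],
    "Attribute Name": ["units", "long_name", "units"],
    "Value": ["degree_C", "Sea Water Temperature", "PSU"],
})

# Select the rows that record each variable's units
units = dataframe[dataframe["Attribute Name"] == "units"]
print(units[["Variable Name", "Value"]])
```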
Exercise: Inspect this BCO-DMO dataset
- What are the units of POC?
- Who is the Principal Investigator on this dataset?
- What is the start and end time of this dataset?
Exercise
What are the unique variables for “bcodmo_dataset_807119?”
Answer
# find the variables
info_url = e.get_info_url(dataset_id="bcodmo_dataset_807119")
pd.set_option('display.max_rows', None)  # make sure that Jupyter Notebook shows all rows
dataframe = pd.read_csv(info_url)
dataframe

# get the unique variable names with pandas
dataframe["Variable Name"].unique()
rerddap: a package for R users to work directly with ERDDAP servers
Information on using rerddap: https://docs.ropensci.org/rerddap/articles/Using_rerddap.html
Example from the following page: OOI Glider Data (accessed October 11, 2021):
The mission of the IOOS Glider DAC is to provide glider operators with a simple process for submitting glider data sets to a centralized location, enabling the data to be visualized, analyzed, widely distributed via existing web services and the Global Telecommunications System (GTS), and archived at the National Centers for Environmental Information (NCEI). The IOOS Glider DAC is accessible through rerddap (https://data.ioos.us/gliders/erddap/). Extracting and plotting salinity from part of the path of one glider deployed by the Scripps Institution of Oceanography:
library(rerddap)
urlBase <- "https://data.ioos.us/gliders/erddap/"
gliderInfo <- info("sp064-20161214T1913", url = urlBase)
glider <- tabledap(gliderInfo,
                   fields = c("longitude", "latitude", "depth", "salinity"),
                   'time>=2016-12-14', 'time<=2016-12-23',
                   url = urlBase)
glider$longitude <- as.numeric(glider$longitude)
glider$latitude <- as.numeric(glider$latitude)
glider$depth <- as.numeric(glider$depth)
require("plot3D")
# 'colors$salinity' is a colour palette defined earlier in the rerddap vignette
# (built from the cmocean colour maps); any colour vector can be substituted.
scatter3D(x = glider$longitude, y = glider$latitude, z = -glider$depth,
          colvar = glider$salinity, col = colors$salinity,
          phi = 40, theta = 25, bty = "g", type = "p",
          ticktype = "detailed", pch = 10, clim = c(33.2, 34.31), clab = "Salinity",
          xlab = "longitude", ylab = "latitude", zlab = "depth",
          cex = c(0.5, 1, 1.5))
Key Points
Two key packages are needed to import data from ERDDAP into Python: erddapy and pandas
Data can be downloaded locally or interacted with directly using erddapy
You can assess your dataset directly in Python