I am currently coding in Python in Google Colab. I am working with underwater glider data loaded via URLs from NOAA's ERDDAP site.
url = 'https://gliders.ioos.us/erddap/tabledap/ru28-20150917T1300.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2015-09-18T00%3A00%3A00Z&time%3C=2015-10-06T00%3A00%3A00Z'
url2 = 'https://gliders.ioos.us/erddap/tabledap/ru28-20140815T1405.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2014-08-16T00%3A00%3A00Z&time%3C=2014-09-04T00%3A00%3A00Z'
url3 = 'https://gliders.ioos.us/erddap/tabledap/ru28-20130813T1436.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2013-08-14T00%3A00%3A00Z&time%3C=2013-08-26T00%3A00%3A00Z'
url4 = 'https://gliders.ioos.us/erddap/tabledap/blue-20200819T1433.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2020-08-19T00%3A00%3A00Z&time%3C=2020-08-25T00%3A00%3A00Z'
url5 = 'https://gliders.ioos.us/erddap/tabledap/blue-20190815T1711.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2019-08-16T00%3A00%3A00Z&time%3C=2019-09-24T00%3A00%3A00Z'
url6 = 'https://gliders.ioos.us/erddap/tabledap/blue-20180806T1400.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2018-08-07T00%3A00%3A00Z&time%3C=2018-10-31T00%3A00%3A00Z'
url7 = 'https://gliders.ioos.us/erddap/tabledap/blue-20170831T1436.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2017-09-01T00%3A00%3A00Z&time%3C=2017-09-24T00%3A00%3A00Z'
Then I loaded the datasets:
data1 = pd.read_csv(url, skiprows=[1], parse_dates=['time'], index_col='time')
data2 = pd.read_csv(url2, skiprows=[1], parse_dates=['time'], index_col='time')
data3 = pd.read_csv(url3, skiprows=[1], parse_dates=['time'], index_col='time')
data4 = pd.read_csv(url4, skiprows=[1], parse_dates=['time'], index_col='time')
data5 = pd.read_csv(url5, skiprows=[1], parse_dates=['time'], index_col='time')
data6 = pd.read_csv(url6, skiprows=[1], parse_dates=['time'], index_col='time')
data7 = pd.read_csv(url7, skiprows=[1], parse_dates=['time'], index_col='time')
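A side note: each of these `pd.read_csv` calls re-downloads the CSV from ERDDAP every time the notebook runs. A small caching helper can avoid that; this is just a sketch, and `load_glider_csv` plus the local file path are my own names, not part of the original code:

```python
import os
import pandas as pd

def load_glider_csv(url, cache_path):
    """Read an ERDDAP CSV from a local cache if present, else download and cache it."""
    if os.path.exists(cache_path):
        # the cached copy was written without the units row, so no skiprows here
        return pd.read_csv(cache_path, parse_dates=['time'], index_col='time')
    df = pd.read_csv(url, skiprows=[1], parse_dates=['time'], index_col='time')
    df.to_csv(cache_path)
    return df

# data1 = load_glider_csv(url, 'ru28-20150917.csv')
```

On the second and later runs the function never touches the network, which also makes re-running the plotting cells much faster.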
and combined them into a single dataframe:
combined_df = pd.concat([data1, data2, data3, data4, data5, data6, data7], axis = 0)
Running combined_df.head() gives the following preview of the data:
profile_id latitude longitude depth temperature salinity density
time
2015-09-18 00:02:41+00:00 81 40.350986 -73.871552 20.09 14.0286 32.678837 1024.4777
2015-09-18 00:02:41+00:00 81 40.350986 -73.871552 20.73 13.8871 32.658794 1024.4943
2015-09-18 00:02:41+00:00 81 40.350986 -73.871552 21.05 13.8069 32.680794 1024.5292
2015-09-18 00:04:36+00:00 82 40.350817 -73.871420 21.05 13.8069 32.680794 1024.5292
2015-09-18 00:16:07+00:00 83 40.349812 -73.870636 20.76 13.9284 32.670765 1024.4951
I need to make a figure with 7 separate box plots, each containing the values from one of the datasets. I am focusing on temperature, salinity, and density. The x-axis would be time. Any help would be greatly appreciated.
Since each file appears to contain data from a single year, we can simplify the approach, and seaborn is a great help here. To make the code more readable (read: because we are too lazy to type repetitive things), we put the repeated tasks into a loop and store the required variables in lists.
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
url = 'https://gliders.ioos.us/erddap/tabledap/ru28-20150917T1300.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2015-09-18T00%3A00%3A00Z&time%3C=2015-10-06T00%3A00%3A00Z'
url2 = 'https://gliders.ioos.us/erddap/tabledap/ru28-20140815T1405.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2014-08-16T00%3A00%3A00Z&time%3C=2014-09-04T00%3A00%3A00Z'
url3 = 'https://gliders.ioos.us/erddap/tabledap/ru28-20130813T1436.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2013-08-14T00%3A00%3A00Z&time%3C=2013-08-26T00%3A00%3A00Z'
urls = [url, url2, url3] #<---add the remaining urls, this example is just for three of them
#because the download takes a while, we can simulate this with already downloaded files
#urls=["ru28-20140815T1405_0c34_1256_e732.csv", "ru28-20150917T1300_cc34_de4b_4c02.csv", "ru28-20130813T1436_5a0d_6ca1_4df0.csv"]
print("started loading")
#load file 1 into a dataframe and extract year as its identifier
combined_df = pd.read_csv(urls[0], skiprows=[1], parse_dates=['time'], index_col='time')
combined_df["year"] = combined_df.index.year
#we could also add another identifier in case years overlap between files
#combined_df["data_ID"] = 1
print("data file 1 is ready")
#load one url after the other and append it to the combined dataframe
for i, curr_url in enumerate(urls[1:]):
    tmp_data = pd.read_csv(curr_url, skiprows=[1], parse_dates=['time'], index_col='time')
    tmp_data["year"] = tmp_data.index.year
    #tmp_data["data_ID"] = i+2
    combined_df = pd.concat([combined_df, tmp_data], axis=0)
    print(f"data file {i+2} is ready")
print("finished downloads")
print("plotting now")
sns.set_theme(style="ticks", palette="pastel")
fig, axes = plt.subplots(3, figsize=(8, 10))
categ = ["temperature", "density", "salinity"]
cat_color = ["grey", "tab:orange", "yellow"]
for i, curr_ax in enumerate(axes.flat):
    sns.boxplot(x="year", y=categ[i], data=combined_df, color=cat_color[i], ax=curr_ax)
    sns.despine(offset=10, trim=True, ax=curr_ax)
plt.tight_layout(h_pad=2)
plt.show()
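If you would rather get all three panels from a single seaborn call instead of looping over axes yourself, a long-form reshape with `melt` followed by `catplot` does the same job. A minimal sketch, using a tiny made-up frame in place of the real combined_df:

```python
import pandas as pd
import seaborn as sns

# tiny stand-in for combined_df; the real values come from the ERDDAP downloads
combined_df = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014, 2015, 2015],
    "temperature": [13.8, 14.0, 15.1, 15.3, 13.9, 14.2],
    "salinity": [32.6, 32.7, 31.9, 32.0, 32.7, 32.6],
    "density": [1024.5, 1024.4, 1023.9, 1024.0, 1024.5, 1024.4],
})

# long form: one row per (year, variable, value) triple
long_df = combined_df.melt(id_vars="year",
                           value_vars=["temperature", "salinity", "density"],
                           var_name="variable", value_name="value")

# one boxplot row per variable; sharey=False because the units differ
g = sns.catplot(x="year", y="value", row="variable", data=long_df,
                kind="box", sharey=False, height=2.5, aspect=3)
```

The payoff is that seaborn handles the per-panel axes and labels for you; the cost is that all panels share one y-axis label, so for publication-quality labels the explicit loop above gives more control.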