I am currently coding in Python in Google Colab. I am working with underwater glider data loaded via URLs from NOAA's ERDDAP site.
url = 'https://gliders.ioos.us/erddap/tabledap/ru28-20150917T1300.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2015-09-18T00%3A00%3A00Z&time%3C=2015-10-06T00%3A00%3A00Z'
url2 = 'https://gliders.ioos.us/erddap/tabledap/ru28-20140815T1405.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2014-08-16T00%3A00%3A00Z&time%3C=2014-09-04T00%3A00%3A00Z'
url3 = 'https://gliders.ioos.us/erddap/tabledap/ru28-20130813T1436.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2013-08-14T00%3A00%3A00Z&time%3C=2013-08-26T00%3A00%3A00Z'
url4 = 'https://gliders.ioos.us/erddap/tabledap/blue-20200819T1433.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2020-08-19T00%3A00%3A00Z&time%3C=2020-08-25T00%3A00%3A00Z'
url5 = 'https://gliders.ioos.us/erddap/tabledap/blue-20190815T1711.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2019-08-16T00%3A00%3A00Z&time%3C=2019-09-24T00%3A00%3A00Z'
url6 = 'https://gliders.ioos.us/erddap/tabledap/blue-20180806T1400.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2018-08-07T00%3A00%3A00Z&time%3C=2018-10-31T00%3A00%3A00Z'
url7 = 'https://gliders.ioos.us/erddap/tabledap/blue-20170831T1436.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2017-09-01T00%3A00%3A00Z&time%3C=2017-09-24T00%3A00%3A00Z'
Then I loaded the datasets:
data1 = pd.read_csv(url, skiprows=[1], parse_dates=['time'], index_col='time')
data2 = pd.read_csv(url2, skiprows=[1], parse_dates=['time'], index_col='time')
data3 = pd.read_csv(url3, skiprows=[1], parse_dates=['time'], index_col='time')
data4 = pd.read_csv(url4, skiprows=[1], parse_dates=['time'], index_col='time')
data5 = pd.read_csv(url5, skiprows=[1], parse_dates=['time'], index_col='time')
data6 = pd.read_csv(url6, skiprows=[1], parse_dates=['time'], index_col='time')
data7 = pd.read_csv(url7, skiprows=[1], parse_dates=['time'], index_col='time')
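A side note: each of these `pd.read_csv` calls re-downloads the CSV from ERDDAP every time the notebook runs. A small caching helper can avoid that; this is just a sketch, and `load_glider_csv` plus the local file path are my own names, not part of the original code:

```python
import os
import pandas as pd

def load_glider_csv(url, cache_path):
    """Read an ERDDAP CSV from a local cache if present, else download and cache it."""
    if os.path.exists(cache_path):
        # the cached copy was written without the units row, so no skiprows here
        return pd.read_csv(cache_path, parse_dates=['time'], index_col='time')
    df = pd.read_csv(url, skiprows=[1], parse_dates=['time'], index_col='time')
    df.to_csv(cache_path)
    return df

# data1 = load_glider_csv(url, 'ru28-20150917.csv')
```

On the second and later runs the function never touches the network, which also makes re-running the plotting cells much faster.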
and combined them into a single dataframe:
combined_df = pd.concat([data1, data2, data3, data4, data5, data6, data7], axis = 0)
Running combined_df.head() gives the following preview of the data:
profile_id latitude longitude depth temperature salinity density
time
2015-09-18 00:02:41+00:00 81 40.350986 -73.871552 20.09 14.0286 32.678837 1024.4777
2015-09-18 00:02:41+00:00 81 40.350986 -73.871552 20.73 13.8871 32.658794 1024.4943
2015-09-18 00:02:41+00:00 81 40.350986 -73.871552 21.05 13.8069 32.680794 1024.5292
2015-09-18 00:04:36+00:00 82 40.350817 -73.871420 21.05 13.8069 32.680794 1024.5292
2015-09-18 00:16:07+00:00 83 40.349812 -73.870636 20.76 13.9284 32.670765 1024.4951
I need to make a figure with 7 separate box plots, each containing the values from one of the datasets. I am focusing on temperature, salinity, and density. The x-axis would be time. Any help would be greatly appreciated.
Since each file appears to contain data from a single year, we can simplify the approach, and seaborn is a great help here. To make the code more readable (read: because we are too lazy to type repetitive things), we put the repeated tasks into a loop and store the required variables in lists.
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
url = 'https://gliders.ioos.us/erddap/tabledap/ru28-20150917T1300.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2015-09-18T00%3A00%3A00Z&time%3C=2015-10-06T00%3A00%3A00Z'
url2 = 'https://gliders.ioos.us/erddap/tabledap/ru28-20140815T1405.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2014-08-16T00%3A00%3A00Z&time%3C=2014-09-04T00%3A00%3A00Z'
url3 = 'https://gliders.ioos.us/erddap/tabledap/ru28-20130813T1436.csv?profile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ctemperature%2Csalinity%2Cdensity&time%3E=2013-08-14T00%3A00%3A00Z&time%3C=2013-08-26T00%3A00%3A00Z'
urls = [url, url2, url3] #<---add the remaining urls, this example is just for three of them
#because the download takes a while, we can simulate this with already downloaded files
#urls=["ru28-20140815T1405_0c34_1256_e732.csv", "ru28-20150917T1300_cc34_de4b_4c02.csv", "ru28-20130813T1436_5a0d_6ca1_4df0.csv"]
print("started loading")
#load file 1 into a dataframe and extract year as its identifier
combined_df = pd.read_csv(urls[0], skiprows=[1], parse_dates=['time'], index_col='time')
combined_df["year"] = combined_df.index.year
#we could also add another identifier in case years overlap between files
#combined_df["data_ID"] = 1
print("data file 1 is ready")
#load one url after the other and append it to the combined dataframe
for i, curr_url in enumerate(urls[1:]):
    tmp_data = pd.read_csv(curr_url, skiprows=[1], parse_dates=['time'], index_col='time')
    tmp_data["year"] = tmp_data.index.year
    #tmp_data["data_ID"] = i+2
    combined_df = pd.concat([combined_df, tmp_data], axis=0)
    print(f"data file {i+2} is ready")
print("finished downloads")
print("plotting now")
sns.set_theme(style="ticks", palette="pastel")
fig, axes = plt.subplots(3, figsize=(8, 10))
categ = ["temperature", "density", "salinity"]
cat_color = ["grey", "tab:orange", "yellow"]
for i, curr_ax in enumerate(axes.flat):
    sns.boxplot(x="year", y=categ[i], data=combined_df, color=cat_color[i], ax=curr_ax)
    sns.despine(offset=10, trim=True, ax=curr_ax)
plt.tight_layout(h_pad=2)
plt.show()
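If you would rather get all three panels from a single seaborn call instead of looping over axes yourself, a long-form reshape with `melt` followed by `catplot` does the same job. A minimal sketch, using a tiny made-up frame in place of the real combined_df:

```python
import pandas as pd
import seaborn as sns

# tiny stand-in for combined_df; the real values come from the ERDDAP downloads
combined_df = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014, 2015, 2015],
    "temperature": [13.8, 14.0, 15.1, 15.3, 13.9, 14.2],
    "salinity": [32.6, 32.7, 31.9, 32.0, 32.7, 32.6],
    "density": [1024.5, 1024.4, 1023.9, 1024.0, 1024.5, 1024.4],
})

# long form: one row per (year, variable, value) triple
long_df = combined_df.melt(id_vars="year",
                           value_vars=["temperature", "salinity", "density"],
                           var_name="variable", value_name="value")

# one boxplot row per variable; sharey=False because the units differ
g = sns.catplot(x="year", y="value", row="variable", data=long_df,
                kind="box", sharey=False, height=2.5, aspect=3)
```

The payoff is that seaborn handles the per-panel axes and labels for you; the cost is that all panels share one y-axis label, so for publication-quality labels the explicit loop above gives more control.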