Pandas: Pivoting and plotting workflow

Lemming

Disclaimer: I am very new to Pandas.

I am doing numerical simulations and would like to use Pandas for the final data-evaluation. To keep things simple let's assume the following setup:

My simulations take a few input parameters (e.g. max and size). Each simulation then produces a number of observables as functions of time (e.g. f1(t) and f2(t)). In the end, the results of three different simulations could look like this:

import numpy as np
import pandas as pd

t1 = np.linspace(0, 2, 15)
t2 = np.linspace(0, 2, 21)
t3 = np.linspace(0, 1.5, 16)
df1 = pd.DataFrame({'max': t1.max(), 'size': t1.size, 't': t1, 'f1': t1**2+0, 'f2': t1**3+0})
df2 = pd.DataFrame({'max': t2.max(), 'size': t2.size, 't': t2, 'f1': t2**2+1, 'f2': t2**3+1})
df3 = pd.DataFrame({'max': t3.max(), 'size': t3.size, 't': t3, 'f1': t3**2+2, 'f2': t3**3+2})

Here max and size are the parameters of each simulation, t is the time axis, and f1 and f2 are the observables.


Say, as a first task, I would like to plot the values of f1 as a function of t for each set of parameters. After spending some time with the docs I found that the pivot_table function can rearrange my data in the right way.

df = pd.concat([df1, df2, df3])
df_ms = pd.pivot_table(df, index=['t'], values=['f1', 'f2'], columns=['max', 'size'])

Intermediate question: Is this the best way to do this? I know that DataFrame takes an index argument in its constructor. Would it be better to define t as the index at that point? (I couldn't get it working together with pivot_table)


Now we can use the plot method to plot the resulting data.

df_ms['f1'].plot()

The result, however, is unexpected. I understand that some data is missing, as pandas is forced to introduce NaNs when aligning the different t axes.

My question: Why doesn't the green curve show up at all? And why are the blue and red patches aligned? Is there a simple way to skip the NaNs in the plot, along the lines of what you would get by simply calling plt.plot(t, f1) in matplotlib?

Plot with missing data

I know that it is possible to fill the NaNs by interpolation. For the given case, second-order splines work very well.

df_ms['f1'].interpolate(method='spline', order=2).plot()

However, I am wondering why this should be necessary for simply plotting the data. Matplotlib's internal linear interpolation would be sufficient...

Plot with interpolation

DrV

The nans behave logically, but not always very intuitively.

If you plot a continuous line, a nan removes the line segments on both sides of the nan point. So, if your data (the green line) never has two valid numbers as adjacent elements, it will not be drawn. For example, if f1 is [nan, 1, nan, 1.2, nan, nan, 2.3], no segments can be drawn.
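To make that rule concrete, here is a small sketch using the example values above. It only counts how many line segments have two finite endpoints; it is an illustration of the reasoning, not how matplotlib is implemented internally.

```python
import numpy as np

# Example values from the text: no two finite numbers are adjacent.
f1 = np.array([np.nan, 1, np.nan, 1.2, np.nan, np.nan, 2.3])

finite = np.isfinite(f1)
# A segment between samples i and i+1 can only be drawn if both are finite.
drawable = finite[:-1] & finite[1:]
n_segments = int(drawable.sum())
print(n_segments)
```

With no drawable segments, a line-only plot of this data shows nothing, even though three valid samples exist.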

Fix #1: draw points instead of lines (plot(t, f1, 'o')), then you'll at least see all your data.
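In pandas terms, Fix #1 could look roughly like the sketch below. It rebuilds the question's df_ms and passes style='o' to DataFrame.plot so that every column is drawn as markers only. The loop that builds the frames and the Agg backend line are my own assumptions, added only to keep the example short and runnable without a display.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for an interactive window
import matplotlib.pyplot as plt

# Rebuild the three simulation results from the question.
frames = []
for t, offset in [(np.linspace(0, 2, 15), 0),
                  (np.linspace(0, 2, 21), 1),
                  (np.linspace(0, 1.5, 16), 2)]:
    frames.append(pd.DataFrame({'max': t.max(), 'size': t.size, 't': t,
                                'f1': t**2 + offset, 'f2': t**3 + offset}))
df = pd.concat(frames)
df_ms = pd.pivot_table(df, index=['t'], values=['f1', 'f2'], columns=['max', 'size'])

# Markers only: NaNs are skipped, and isolated points remain visible.
ax = df_ms['f1'].plot(style='o')
ax.set_ylabel('f1')
```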

Fix #2: remove all nans from your data before plotting. Let us assume t has all values but f1 is missing values:

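import numpy as np
import matplotlib.pyplot as plt

# t and f1 are assumed to be NumPy arrays of equal length
nonnans = ~np.isnan(f1)  # boolean mask; use ~ (not -) to invert it
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(t[nonnans], f1[nonnans])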

So, just create an array telling which of the samples are good, and use only those samples in plotting. (And in case you are wondering, the ax.plot stuff is equivalent to plt.plot but using the recommended object-oriented interface.)
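Applied to the pivoted frame from the question, the same idea might look like the sketch below: drop the NaNs of each column separately, then plot what is left against its own t values. The setup loop, the variable names sim_max and sim_size, and the Agg backend line are assumptions of mine and not part of the original post.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for an interactive window
import matplotlib.pyplot as plt

# Rebuild the three simulation results from the question.
frames = []
for t, offset in [(np.linspace(0, 2, 15), 0),
                  (np.linspace(0, 2, 21), 1),
                  (np.linspace(0, 1.5, 16), 2)]:
    frames.append(pd.DataFrame({'max': t.max(), 'size': t.size, 't': t,
                                'f1': t**2 + offset, 'f2': t**3 + offset}))
df = pd.concat(frames)
df_ms = pd.pivot_table(df, index=['t'], values=['f1', 'f2'], columns=['max', 'size'])

fig, ax = plt.subplots()
# Each column of df_ms['f1'] is one simulation, labelled by (max, size).
for (sim_max, sim_size), column in df_ms['f1'].items():
    valid = column.dropna()  # keep only the t values this simulation has
    ax.plot(valid.index, valid.values, label=f"max={sim_max}, size={sim_size}")
ax.set_xlabel('t')
ax.set_ylabel('f1')
ax.legend()
```

Each curve is then a continuous line through its own samples, which is what plt.plot(t, f1) would give for a single simulation.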

The way plot treats nans may feel a bit annoying at first, but it is very useful once you grasp it.
