Disclaimer: I am very new to Pandas.
I am doing numerical simulations and would like to use Pandas for the final data-evaluation. To keep things simple let's assume the following setup:
My simulations take a few input parameters (e.g. max and size). The simulation then produces a number of observables as functions of time (e.g. f1(t), f2(t)). In the end, the results of three different simulations could look like this:
import numpy as np
import pandas as pd

# Three simulations with different time axes
t1 = np.linspace(0, 2, 15)
t2 = np.linspace(0, 2, 21)
t3 = np.linspace(0, 1.5, 16)
# Scalar parameters ('max', 'size') are broadcast to every row
df1 = pd.DataFrame({'max': t1.max(), 'size': t1.size, 't': t1, 'f1': t1**2+0, 'f2': t1**3+0})
df2 = pd.DataFrame({'max': t2.max(), 'size': t2.size, 't': t2, 'f1': t2**2+1, 'f2': t2**3+1})
df3 = pd.DataFrame({'max': t3.max(), 'size': t3.size, 't': t3, 'f1': t3**2+2, 'f2': t3**3+2})
Here max and size are the parameters of each simulation, t is the time axis, and f1 and f2 are the observables.
Say, as a first task, I would like to plot the values of f1 as a function of t for each set of parameters. After spending some time with the docs I found that the pivot_table function can rearrange my data in the right way.
df = pd.concat([df1, df2, df3])
df_ms = pd.pivot_table(df, index=['t'], values=['f1', 'f2'], columns=['max', 'size'])
Intermediate question: Is this the best way to do this? I know that DataFrame takes an index argument in its constructor. Would it be better to define t as the index at that point? (I couldn't get it working together with pivot_table.)
Now we can use the plot method to plot the resulting data.
df_ms['f1'].plot()
The result, however, is unexpected. I understand that some data is missing, as pandas is forced to introduce NaNs when aligning the different t axes.
My question: Why doesn't the green curve show up at all? And why are the blue and red patches aligned? Is there a simple way to skip the NaNs in the plot, along the lines of what you would get by simply calling plt.plot(t, f1) in matplotlib?
I know that it is possible to fill the NaNs by interpolation. For the given case second order splines are quite ideal.
df_ms['f1'].interpolate(method='spline', order=2).plot()
However, I am wondering why this should be necessary for simply plotting the data. Matplotlib's internal linear interpolation would be sufficient...
The nans behave logically, but not always very intuitively.
If you plot a continuous line, a nan naturally removes the line segments on both sides of the nan point. So, if your data (green line) never has two valid numbers as adjacent elements, it will not be drawn at all. For example, if f1 is [nan, 1, nan, 1.2, nan, nan, 2.3], no segments can be drawn.
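To make the rule above concrete, here is a minimal sketch that counts how many line segments could be drawn for a given array. The helper name drawable_segments is made up for this illustration and is not part of NumPy or matplotlib; it only mirrors the reasoning that a segment needs two adjacent valid values.

```python
import numpy as np

def drawable_segments(y):
    """Count pairs of adjacent samples that are both non-nan.

    Hypothetical helper for illustration only: a line segment can be
    drawn between samples i and i+1 only if neither of them is nan.
    """
    y = np.asarray(y, dtype=float)
    ok = ~np.isnan(y)
    return int(np.count_nonzero(ok[:-1] & ok[1:]))

# The example from the text: every valid value is isolated by nans
f1 = [np.nan, 1, np.nan, 1.2, np.nan, np.nan, 2.3]
print(drawable_segments(f1))                             # 0

# One nan in the middle leaves a segment on each side of the gap
print(drawable_segments([0.0, 1.0, np.nan, 2.0, 3.0]))   # 2
```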
Fix #1: draw points instead of lines (plot(t, f1, 'o')), then you'll at least see all your data.
Fix #2: remove all nans from your data before plotting. Let us assume t has all values but f1 is missing values:
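import numpy as np
import matplotlib.pyplot as plt
# t and f1 are assumed to be NumPy arrays of equal length.
# Boolean mask of the valid samples; use ~ (logical not), since
# unary minus is not supported on boolean arrays in current NumPy.
nonnans = ~np.isnan(f1)
fig = plt.figure()
ax = fig.add_subplot(111)
# Plot only the samples where f1 is not nan
ax.plot(t[nonnans], f1[nonnans])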
So, just create an array telling which of the samples are good, and use only those samples in plotting. (And in case you are wondering, the ax.plot stuff is equivalent to plt.plot but using the recommended object-oriented interface.)
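Applied to the pivoted frame from the question, the same idea might look like the sketch below. It rebuilds the question's data, then drops the nans from each f1 column separately before plotting, so every curve keeps its own t axis. The dictionary name cleaned and the label format are choices made for this illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Rebuild the three simulations from the question
t1 = np.linspace(0, 2, 15)
t2 = np.linspace(0, 2, 21)
t3 = np.linspace(0, 1.5, 16)
frames = [
    pd.DataFrame({'max': t.max(), 'size': t.size, 't': t,
                  'f1': t**2 + k, 'f2': t**3 + k})
    for k, t in enumerate((t1, t2, t3))
]
df = pd.concat(frames)
df_ms = pd.pivot_table(df, index=['t'], values=['f1', 'f2'],
                       columns=['max', 'size'])

fig, ax = plt.subplots()
cleaned = {}  # illustrative container: one nan-free Series per parameter set
for key in df_ms['f1'].columns:          # key is a (max, size) tuple
    series = df_ms['f1'][key].dropna()   # keep only this simulation's samples
    cleaned[key] = series
    ax.plot(series.index, series.values,
            label=f"max={key[0]}, size={key[1]}")
ax.set_xlabel('t')
ax.set_ylabel('f1')
ax.legend()
```

Call plt.show() or fig.savefig(...) afterwards to see the figure. Because each column is cleaned on its own, no interpolation is needed and matplotlib simply connects the remaining points.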
The way plot treats nans may feel a bit annoying at first, but it is very useful once you grasp it.
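One way this behaviour can be put to use, as a small sketch with made-up sample values: inserting a nan on purpose lets a single plot call draw two disconnected pieces that share one colour and one legend entry.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two separate pieces of a curve, joined by a nan that acts as a "pen up"
x = np.array([0.0, 1.0, np.nan, 2.0, 3.0])
y = np.array([0.0, 1.0, np.nan, 0.5, 1.5])

fig, ax = plt.subplots()
(line,) = ax.plot(x, y, label='two pieces, one line object')
ax.legend()
```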