Python Pandas: Find index based on value in DataFrame

JLK

Is there a way to specify a DataFrame index (row) based on matching text inside the dataframe?

Every day I import a text file from the internet (http://oasis.pjm.com/doc/projload.txt) into a pandas DataFrame. I parse out only some of the data and calculate the peak value for each day. The data I need starts at the section headed "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW".

I only need part of the file for these calculations. I can manually specify which index line to start at, but that number can change from day to day because the authors add text to the top of the file, for example:

Updated as of: 05-05-2016 1700 Constrained operations ARE expected in the AEP, APS, BC, COMED, DOM,and PS zones on 05-06-2016. Constrained operations ARE expected in the AEP, APS, BC, COMED, DOM,and PS zones on 05-07-2016. The PS/ConEd 600/400 MW contract will be limited to 700MW on 05-06-16.

Is there a way to match text in the pandas DataFrame and get the index of that match? Currently I set the starting index manually with the variable 'day' in the code below. I would like this variable to hold the index (row) of the line that contains the text I want to match.

The code below works, but it will break if the line number (index) changes:

import pandas as pd
from openpyxl import load_workbook

def forecastload():
    wb = load_workbook(filename='pjmactualload.xlsx')
    ws = wb['PJM Load']
    printRow = 13
    # put this in iteration to pull 2 rows of data at a time (one for each day) for 7 days max
    day = 239
    while day < 251:
        # read the two rows (am and pm) that belong to a single day
        data = pd.read_csv("http://oasis.pjm.com/doc/projload.txt", skiprows=day, sep=r"\s+", header=None, nrows=2)

        # the pm row has no date field, so its 12 values sit in columns 1-12;
        # copy its first value (HE 13) into the empty column 13 so that
        # columns 0 and 1 can be dropped before checking 'max'
        data.at[1, 13] = float(data.at[1, 1])

        # get date for printing it with max load later on
        newDate = str(data.at[0, 0])

        # drop columns 0 and 1 (date and am/pm text); the date is already saved as newDate
        data = data.drop(columns=[0, 1])

        # pull out max and min value of the day
        dayMax = int(data.max().max())
        dayMin = int(data.min().min())

        # write date, max load and min load for that date to the worksheet
        actualMax = "Forecast Max"
        actualMin = "Forecast Min"
        maxResults = [newDate, dayMax, actualMax, dayMin, actualMin]
        for d, item in enumerate(maxResults, start=1):
            ws.cell(row=printRow, column=d).value = item
        printRow += 1
        day = day + 2
    wb.save('pjmactualload.xlsx')

ej_f

In this case I recommend using the command line to extract a dataset that you can then read with pandas and process however you like.

To retrieve the data you can use curl and grep:

$ curl -s http://oasis.pjm.com/doc/projload.txt | grep -A 17 "RTO COMBINED HOUR ENDING INTEGRATED FORECAST" | tail -n +5
 05/06/16 am   68640   66576   65295   65170   66106   70770   77926   83048   84949   85756   86131   86089
          pm   85418   85285   84579   83762   83562   83289   82451   82460   84009   82771   78420   73258
 05/07/16 am   66809   63994   62420   61640   61848   63403   65736   68489   71850   74183   75403   75529
          pm   75186   74613   74072   73950   74386   74978   75135   75585   77414   76451   72529   67957
 05/08/16 am   63583   60903   59317   58492   58421   59378   60780   62971   66289   68997   70436   71212
          pm   71774   71841   71635   71831   72605   73876   74619   75848   78338   77121   72665   67763
 05/09/16 am   63865   61729   60669   60651   62175   66796   74620   79930   81978   83140   84307   84778
          pm   85112   85562   85568   85484   85766   85924   85487   85737   87366   84987   78666   72166
 05/10/16 am   67581   64686   62968   62364   63400   67603   75311   80515   82655   84252   86078   87120
          pm   88021   88990   89311   89477   89752   89860   89256   89327   90469   87730   81220   74449
 05/11/16 am   70367   67044   65125   64265   65054   69060   76424   81785   84646   87097   89541   91276
          pm   92646   93906   94593   94970   95321   95073   93897   93162   93615   90974   84335   77172
 05/12/16 am   71345   67840   65837   64892   65600   69547   76853   82077   84796   87053   89135   90527
          pm   91495   92351   92583   92473   92541   92053   90818   90241   90750   88135   81816   75042
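
If you would rather stay in Python, the same lookup can be sketched with pandas string matching, which is also what the question asks for: find the row whose text matches and use its index. This is only a sketch. `find_header_row` and `rows_after_header` are hypothetical helper names, `SAMPLE` is a made-up stand-in for the real file (the three lines after the header are placeholders, not the real layout), and the defaults `skip=4` and `nrows=14` simply mirror `grep -A 17` and `tail -n +5` above.

```python
import pandas as pd

MARKER = "RTO COMBINED HOUR ENDING INTEGRATED FORECAST"

def find_header_row(text, marker=MARKER):
    """Return the 0-based line number of the first line containing marker."""
    lines = pd.Series(text.splitlines(), dtype="object")
    mask = lines.str.contains(marker, regex=False)
    matches = lines[mask].index
    if len(matches) == 0:
        raise ValueError("marker not found: " + marker)
    return int(matches[0])

def rows_after_header(text, marker=MARKER, skip=4, nrows=14):
    """Return the nrows lines that start skip lines below the header line."""
    start = find_header_row(text, marker) + skip
    return text.splitlines()[start:start + nrows]

# Made-up stand-in for the downloaded file: only the two data lines are
# copied from the output above, everything else is placeholder text.
SAMPLE = "\n".join([
    "Updated as of: 05-05-2016 1700",
    "",
    "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW",
    "(placeholder line)",
    "(placeholder line)",
    "(placeholder line)",
    " 05/06/16 am   68640   66576   65295   65170   66106   70770   77926   83048   84949   85756   86131   86089",
    "          pm   85418   85285   84579   83762   83562   83289   82451   82460   84009   82771   78420   73258",
])

header_row = find_header_row(SAMPLE)            # index of the matching line
first_day = rows_after_header(SAMPLE, nrows=2)  # am and pm rows of the first day
```

With the real file you would pass the downloaded text instead of `SAMPLE`. Assuming the gap between the header and the first data row really is four lines, `find_header_row(text) + 4` could then replace the hard-coded `skiprows` value in the question.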

Let's use the previous output (saved in the file rto.txt) to obtain more readable data using awk and sed:

$ awk '/^ [0-9]/{d=$1;print $0;next}{print d,$0}' rto.txt | sed 's/^ //;s/\s\+/,/g'
05/06/16,am,68640,66576,65295,65170,66106,70770,77926,83048,84949,85756,86131,86089
05/06/16,pm,85418,85285,84579,83762,83562,83289,82451,82460,84009,82771,78420,73258
05/07/16,am,66809,63994,62420,61640,61848,63403,65736,68489,71850,74183,75403,75529
05/07/16,pm,75186,74613,74072,73950,74386,74978,75135,75585,77414,76451,72529,67957
05/08/16,am,63583,60903,59317,58492,58421,59378,60780,62971,66289,68997,70436,71212
05/08/16,pm,71774,71841,71635,71831,72605,73876,74619,75848,78338,77121,72665,67763
05/09/16,am,63865,61729,60669,60651,62175,66796,74620,79930,81978,83140,84307,84778
05/09/16,pm,85112,85562,85568,85484,85766,85924,85487,85737,87366,84987,78666,72166
05/10/16,am,67581,64686,62968,62364,63400,67603,75311,80515,82655,84252,86078,87120
05/10/16,pm,88021,88990,89311,89477,89752,89860,89256,89327,90469,87730,81220,74449
05/11/16,am,70367,67044,65125,64265,65054,69060,76424,81785,84646,87097,89541,91276
05/11/16,pm,92646,93906,94593,94970,95321,95073,93897,93162,93615,90974,84335,77172
05/12/16,am,71345,67840,65837,64892,65600,69547,76853,82077,84796,87053,89135,90527
05/12/16,pm,91495,92351,92583,92473,92541,92053,90818,90241,90750,88135,81816,75042
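
For readers without awk and sed, the same clean-up can be sketched in plain Python. `to_csv_lines` is a hypothetical helper name, and like the awk script it assumes that a row either starts with a date or continues the previous date. The two input lines are the first day from the output above.

```python
def to_csv_lines(lines):
    """Carry each date forward onto its pm row and join the fields with commas."""
    out = []
    date = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0][0].isdigit():
            date = fields[0]          # row starts with a date: remember it
        else:
            fields = [date] + fields  # continuation row: prepend the last date
        out.append(",".join(fields))
    return out

raw = [
    " 05/06/16 am   68640   66576   65295   65170   66106   70770   77926   83048   84949   85756   86131   86089",
    "          pm   85418   85285   84579   83762   83562   83289   82451   82460   84009   82771   78420   73258",
]
csv_lines = to_csv_lines(raw)
```

The resulting strings can be written to a file or wrapped in `io.StringIO` and handed straight to `pd.read_csv`.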

Now read and reshape the above result (saved as rto2.txt) with pandas:

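import pandas as pd

# columns: date, am/pm period, then the 12 hourly values of that half-day
df = pd.read_csv("rto2.txt", names=["date", "period"] + list(range(1, 13)), index_col=[0, 1])
# one row per (date, period, hour), then a timestamp index built from those three fields
df = df.stack().reset_index().rename(columns={"level_2": "hour", 0: "value"})
df.index = pd.to_datetime(df.apply(lambda x: "{date} {hour}:00 {period}".format(**x), axis=1))
df = df.drop(columns=["date", "hour", "period"])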

At this point you have a beautiful time series :)

In [10]: df.head()
Out[10]: 
                     value
2016-05-06 01:00:00  68640
2016-05-06 02:00:00  66576
2016-05-06 03:00:00  65295
2016-05-06 04:00:00  65170
2016-05-06 05:00:00  66106
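
One thing to watch in the timestamps above: each row carries hour-ending columns 1 to 12, so the last value of the am row is formatted as "12:00 am" and parsed as midnight, while the last value of the pm row becomes "12:00 pm", i.e. noon. As far as I can tell the two swap places within the same date, which does not affect a daily min or max but matters if the hour itself is needed. A minimal sketch of one alternative keeps an explicit hour-ending number from 1 to 24 instead of a timestamp. The function name `to_hour_ending` and the column name `hour_ending` are illustrative, and the embedded CSV is just the first day from the output above.

```python
import io
import pandas as pd

CSV_TEXT = """05/06/16,am,68640,66576,65295,65170,66106,70770,77926,83048,84949,85756,86131,86089
05/06/16,pm,85418,85285,84579,83762,83562,83289,82451,82460,84009,82771,78420,73258
"""

def to_hour_ending(csv_text):
    """Return one row per (date, hour_ending) with hour_ending running 1..24."""
    names = ["date", "period"] + list(range(1, 13))
    wide = pd.read_csv(io.StringIO(csv_text), names=names)
    # one row per (date, period, hour column)
    tidy = wide.melt(id_vars=["date", "period"], var_name="hour", value_name="value")
    # am rows hold hours ending 1-12, pm rows hold hours ending 13-24
    offset = tidy["period"].map({"am": 0, "pm": 12})
    tidy["hour_ending"] = tidy["hour"].astype(int) + offset
    tidy["date"] = pd.to_datetime(tidy["date"], format="%m/%d/%y")
    tidy = tidy.sort_values(["date", "hour_ending"]).reset_index(drop=True)
    return tidy[["date", "hour_ending", "value"]]

hourly = to_hour_ending(CSV_TEXT)
```

Grouping `hourly` by its `date` column then gives the same daily statistics without depending on how hour 12 is parsed.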

To obtain the daily statistics:

In [11]: df.groupby(df.index.date).agg(["min", "max"])
Out[11]: 
            value       
              min    max
2016-05-06  65170  86131
2016-05-07  61640  77414
2016-05-08  58421  78338
2016-05-09  60651  87366
2016-05-10  62364  90469
2016-05-11  64265  95321
2016-05-12  64892  92583
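
To connect this back to the question, the daily figures can be laid out in the same order as the `maxResults` list there (date, max, label, min, label). This is a rough sketch that assumes the comma-separated text produced earlier; `daily_rows` is a made-up name and only the first day is embedded here.

```python
import io
import pandas as pd

CSV_TEXT = """05/06/16,am,68640,66576,65295,65170,66106,70770,77926,83048,84949,85756,86131,86089
05/06/16,pm,85418,85285,84579,83762,83562,83289,82451,82460,84009,82771,78420,73258
"""

def daily_rows(csv_text):
    """Return [date, max, 'Forecast Max', min, 'Forecast Min'] for each date."""
    names = ["date", "period"] + list(range(1, 13))
    wide = pd.read_csv(io.StringIO(csv_text), names=names)
    # one row per hourly value, keyed by date and am/pm period
    tidy = wide.melt(id_vars=["date", "period"], value_name="value")
    # sort=False keeps the dates in the order they appear in the file
    stats = tidy.groupby("date", sort=False)["value"].agg(["min", "max"])
    return [
        [date, int(row["max"]), "Forecast Max", int(row["min"]), "Forecast Min"]
        for date, row in stats.iterrows()
    ]

rows = daily_rows(CSV_TEXT)
```

Each entry of `rows` can then be written to the worksheet cell by cell, as the loop over `maxResults` in the question does.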

I hope this can help you.

Regards.

Edited at 2021-07-18
