How to use the distribution functions


[1]:
import datetime
print('Last updated: {}'.format(datetime.date.today().strftime('%d %B, %Y')))
Last updated: 03 July, 2019

Outline:

This guide demonstrates how to apply the dist_of_wind_speed() and dist() functions to a sample dataset using the following steps:

  • Import the brightwind library and some sample data

  • Use the dist_of_wind_speed() function to plot a frequency distribution from the sample data

  • Use the dist() function to plot other variables, such as temperature, and bin them appropriately

  • Modify the dist() call to plot one variable against another and bin appropriately

  • Use a custom aggregation function for binning in the dist() function


Import brightwind and data

[2]:
import brightwind as bw
[3]:
# specify location of existing sample dataset
filepath = r'C:\Users\Stephen\Documents\Analysis\demo_data.csv'
# load data as dataframe
data = bw.load_csv(filepath)
# apply cleaning
data = bw.apply_cleaning(data, r'C:\Users\Stephen\Documents\Analysis\demo_cleaning_file.csv')
# show first few rows of dataframe
data.head(5)
Cleaning applied. (Please remember to assign the cleaned returned DataFrame to a variable.)
[3]:
Spd80mN Spd80mS Spd60mN Spd60mS Spd40mN Spd40mS Spd80mNStd Spd80mSStd Spd60mNStd Spd60mSStd ... Dir78mSStd Dir58mS Dir58mSStd Dir38mS Dir38mSStd T2m RH2m P2m PrcpTot BattMin
Timestamp
2016-01-09 15:30:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 15:40:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 17:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 17:10:00 7.382 7.325 6.818 6.689 6.252 6.174 0.844 0.810 0.897 0.875 ... 4.680 118.8 5.107 115.6 5.189 0.954 100.0 934.0 0.0 12.71
2016-01-09 17:20:00 7.977 7.791 8.110 7.915 8.140 7.974 0.556 0.528 0.562 0.524 ... 3.123 115.9 2.960 113.6 3.540 0.863 100.0 934.0 0.0 12.69

5 rows × 29 columns


Distribution of the wind speed

First we get the frequency distribution of a wind speed column from our demo data. This shows how frequently each wind speed occurs in the series that is passed into the function.

[4]:
bw.dist_of_wind_speed(data.Spd80mN)
[4]:
../_images/tutorials_how_dist_function_works_10_0.png

Equally, we can use the freq_distribution() function to return the same graph.

[5]:
bw.freq_distribution(data.Spd80mN)
[5]:
../_images/tutorials_how_dist_function_works_12_0.png

Next, to get the data behind the distribution graph, we set return_data=True and assign the two returned values to variables. Here the plot is assigned to freq_plot and the table of the distribution by wind speed is assigned to freq_dist.

[6]:
freq_plot, freq_dist = bw.dist_of_wind_speed(data.Spd80mN, return_data=True)
freq_dist
[6]:
variable_bin
[-0.5, 0.5)      1.082160
[0.5, 1.5)       2.834629
[1.5, 2.5)       5.465434
[2.5, 3.5)       6.863837
[3.5, 4.5)       8.366253
[4.5, 5.5)       9.300273
[5.5, 6.5)      10.001051
[6.5, 7.5)      10.079849
[7.5, 8.5)       9.366464
[8.5, 9.5)       8.017441
[9.5, 10.5)      6.706241
[10.5, 11.5)     5.505358
[11.5, 12.5)     4.463123
[12.5, 13.5)     3.482875
[13.5, 14.5)     2.712755
[14.5, 15.5)     2.030889
[15.5, 16.5)     1.435175
[16.5, 17.5)     0.949779
[17.5, 18.5)     0.563144
[18.5, 19.5)     0.304686
[19.5, 20.5)     0.181761
[20.5, 21.5)     0.111368
[21.5, 22.5)     0.085102
[22.5, 23.5)     0.045178
[23.5, 24.5)     0.021013
[24.5, 25.5)     0.012608
[25.5, 26.5)     0.005253
[26.5, 27.5)     0.004203
[27.5, 28.5)     0.001051
[28.5, 29.5)     0.001051
[29.5, 30.5)     0.000000
Name: %frequency, dtype: float64
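
To make the table above easier to interpret, here is a minimal sketch of how a percentage frequency table with bins centred on whole numbers could be built using pandas alone. It is not brightwind's implementation. The synthetic Weibull series and the names wind_speed, edges and freq_pct are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic wind speeds in m/s (hypothetical stand-in for data.Spd80mN)
rng = np.random.default_rng(0)
wind_speed = pd.Series(rng.weibull(2.0, size=10_000) * 8.0)

# Bin edges at -0.5, 0.5, 1.5, ... so that each bin is centred on a whole number
edges = np.arange(-0.5, np.ceil(wind_speed.max()) + 1.0, 1.0)

# right=False gives half-open intervals such as [-0.5, 0.5), as in the table above
binned = pd.cut(wind_speed, bins=edges, right=False)

# Share of records falling in each bin, expressed as a percentage
freq_pct = binned.value_counts(sort=False, normalize=True) * 100
print(freq_pct.head())
```

Because every record falls into exactly one bin, the percentages in this sketch sum to 100.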

Distribution of anything!

The dist_of_wind_speed() function is a distribution function specifically designed for wind speed and is a wrapper of the dist() function. The dist() function can be used to plot the distribution of any variable. Here we plot the frequency of occurrence of different temperatures recorded by the temperature sensor. The function automatically finds the minimum and maximum of the data series and bins in units of 1.

[7]:
bw.dist(data.T2m)
[7]:
../_images/tutorials_how_dist_function_works_17_0.png

If we only want to see the data binned for a certain range, we can specify the bin edges using the bins argument. Also, T2m on the x-axis doesn't tell us anything meaningful about the data, so we can set the x_label argument to a more useful name such as Temperature.

[8]:
bw.dist(data.T2m, bins=[0,1,2,3,4,5,6,7,8,9,10,11,12], x_label='Temperature [deg C]')
[8]:
../_images/tutorials_how_dist_function_works_19_0.png
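
As a rough illustration of what restricting the bins does to the data, the sketch below bins a synthetic temperature series with the same edges using pandas. It assumes half-open bins and is not taken from brightwind's source. The names temperature, counts and outside are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic temperatures in deg C (hypothetical stand-in for data.T2m)
rng = np.random.default_rng(1)
temperature = pd.Series(rng.normal(8.0, 5.0, size=5_000))

# The same edges as in the call above: 0, 1, ..., 12
edges = list(range(0, 13))

# Values below 0 or at/above 12 do not belong to any bin and become NaN
binned = pd.cut(temperature, bins=edges, right=False)

counts = binned.value_counts(sort=False)  # records per bin, in bin order
outside = int(binned.isna().sum())        # records outside the chosen range
print(counts)
print("records outside the bins:", outside)
```

In this sketch, records outside the specified range are simply left out of the bins, so it is worth checking how many there are before reading too much into the plot.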

The dist() function is flexible enough to work with two separate variables, binning by one and plotting the other. As an example, we plot the mean 80 m wind speed against temperature, which shows the mean wind speed for specific temperature ranges. The bins argument specifies the bin edges in degrees Celsius, while the bin_labels argument specifies the names that represent these bins. The aggregation_method is set to mean to calculate the mean wind speed in each bin.

[9]:
bw.dist(data.Spd80mN, var_to_bin_against=data.T2m,
        bins=[-10, 4, 12, 18, 30], bin_labels=['freezing', 'cold', 'mild', 'hot'],
        aggregation_method='mean', x_label='Temperature')
[9]:
../_images/tutorials_how_dist_function_works_21_0.png
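
The idea of binning by one variable and aggregating another can be sketched with pandas cut and groupby. This is an illustration of the technique under stated assumptions, not brightwind's internal code. The DataFrame df, its columns and the synthetic values are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic paired records (hypothetical stand-ins for data.T2m and data.Spd80mN)
rng = np.random.default_rng(2)
n = 2_000
df = pd.DataFrame({
    "temperature": rng.uniform(-5.0, 28.0, size=n),
    "wind_speed": rng.weibull(2.0, size=n) * 8.0,
})

# Same bin edges and labels as in the call above
edges = [-10, 4, 12, 18, 30]
labels = ["freezing", "cold", "mild", "hot"]
df["temp_bin"] = pd.cut(df["temperature"], bins=edges, labels=labels, right=False)

# Mean wind speed of the records falling in each temperature bin
mean_speed = df.groupby("temp_bin", observed=False)["wind_speed"].mean()
print(mean_speed)
```

Note that four labels need five bin edges, since each label names the interval between two consecutive edges.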

The dist() function also gives us the flexibility to supply a custom aggregation method. In the example below we define a custom aggregation function that returns the mean of each bin plus two standard deviations.

[10]:
def custom_agg(x):
    # mean of the values in the bin plus two standard deviations
    return x.mean() + (2 * x.std())

bw.dist(data.T2m, bins=[0, 5, 10, 15, 20, 25, 30], aggregation_method=custom_agg)
[10]:
../_images/tutorials_how_dist_function_works_23_0.png
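
To check what a custom aggregation function returns for each bin, it can help to run it outside the plotting call. The sketch below does this with pandas groupby on synthetic data. It is an assumption about how one might verify the numbers, not a description of brightwind's internals. The names mean_plus_two_std, temperature and result are hypothetical.

```python
import numpy as np
import pandas as pd


def mean_plus_two_std(x):
    """Return the mean of x plus two sample standard deviations."""
    return x.mean() + 2 * x.std()


# Synthetic temperatures in deg C (hypothetical stand-in for data.T2m)
rng = np.random.default_rng(3)
temperature = pd.Series(rng.uniform(0.0, 30.0, size=3_000))

# Same bin edges as in the call above
edges = [0, 5, 10, 15, 20, 25, 30]
binned = pd.cut(temperature, bins=edges, right=False)

# Apply the custom function to the values that fall in each bin
result = temperature.groupby(binned, observed=False).agg(mean_plus_two_std)
print(result)
```

pandas' std() uses the sample standard deviation (ddof=1) by default, so a bin with a single record would return NaN here.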