How to become a correlation wizard with the brightwind library!


[1]:
import datetime
print('Last updated: {}'.format(datetime.date.today().strftime('%d %B, %Y')))
Last updated: 05 July, 2019

Outline:

This guide will demonstrate various techniques to correlate a sample site dataset to 4 separate reference datasets using the following steps:

  • Import the brightwind library, load some sample raw data and clean it

  • Import 4 sample reference datasets to use in the correlation

  • Perform an ordinary least squares correlation for 1 dataset and time period

  • Apply a long term adjustment to the measured wind speed time series based on the correlation results

  • Perform an ordinary least squares correlation for multiple datasets and time periods


Import brightwind and data

[2]:
# import the brightwind library
import brightwind as bw
[3]:
# specify the directory of the demo dataset
filepath = r'C:\Users\Stephen\Documents\Analysis\demo_data.csv'
# load the raw data
data = bw.load_csv(filepath)
# apply cleaning to the raw data
data = bw.apply_cleaning(data, r'C:\Users\Stephen\Documents\Analysis\demo_cleaning_file.csv')
# show the first 5 rows of the cleaned data
data.head(5)
Cleaning applied. (Please remember to assign the cleaned returned DataFrame to a variable.)
[3]:
Spd80mN Spd80mS Spd60mN Spd60mS Spd40mN Spd40mS Spd80mNStd Spd80mSStd Spd60mNStd Spd60mSStd ... Dir78mSStd Dir58mS Dir58mSStd Dir38mS Dir38mSStd T2m RH2m P2m PrcpTot BattMin
Timestamp
2016-01-09 15:30:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 15:40:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 17:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 17:10:00 7.382 7.325 6.818 6.689 6.252 6.174 0.844 0.810 0.897 0.875 ... 4.680 118.8 5.107 115.6 5.189 0.954 100.0 934.0 0.0 12.71
2016-01-09 17:20:00 7.977 7.791 8.110 7.915 8.140 7.974 0.556 0.528 0.562 0.524 ... 3.123 115.9 2.960 113.6 3.540 0.863 100.0 934.0 0.0 12.69

5 rows × 29 columns


Import reference datasets

For this example, we will import 4 MERRA-2 reanalysis datasets. These can be found as sample datasets within the brightwind library. We use the load_csv() function to load each dataset from a CSV file.

[4]:
# load each of the 4 reanalysis datasets to its own dataframe
Merra2_NE = bw.load_csv(r'C:\Users\Stephen\Documents\Analysis\MERRA-2_NE_2000-01-01_2017-06-30.csv')
Merra2_NW = bw.load_csv(r'C:\Users\Stephen\Documents\Analysis\MERRA-2_NW_2000-01-01_2017-06-30.csv')
Merra2_SE = bw.load_csv(r'C:\Users\Stephen\Documents\Analysis\MERRA-2_SE_2000-01-01_2017-06-30.csv')
Merra2_SW = bw.load_csv(r'C:\Users\Stephen\Documents\Analysis\MERRA-2_SW_2000-01-01_2017-06-30.csv')

# show the first 5 lines of NE dataset
Merra2_NE.head(5)
[4]:
WS50m_m/s WD50m_deg T2M_degC PS_hPa
DateTime
2000-01-01 00:00:00 6.840 275 4.06 997.17
2000-01-01 01:00:00 6.506 264 3.44 997.64
2000-01-01 02:00:00 7.095 254 3.12 998.39
2000-01-01 03:00:00 8.088 246 3.19 999.12
2000-01-01 04:00:00 8.386 239 3.18 999.61

Correlate to 1 dataset using 1 averaging period

Next we’ll perform a basic linear regression correlation, using the ordinary least squares method, between the 80m north orientated site anemometer and the MERRA-2 dataset located to the north east. We use the ‘averaging_prd’ argument to set the averaging period over which the correlation is carried out. This can be set to 10 minutes or multiples of 10 minutes (10T, 30T, depending on the interval of both datasets), hourly or multiples of 1 hour (1H, 2H, 3H, etc.), daily or multiples of 1 day (1D, 2D, 3D, 15D, etc.), monthly (1M) or yearly (1AS). Here we apply a monthly correlation. Finally, we set the ‘coverage_threshold’: the minimum coverage of data required within each averaging period, from both the reference and site datasets, for that period to be passed to the correlation. In this case we use a coverage threshold of 90%.

[5]:
# set up correlation
ord_lst_sq = bw.Correl.OrdinaryLeastSquares(Merra2_NE['WS50m_m/s'], data['Spd80mN'],
                                            averaging_prd='1M', coverage_threshold=0.90)
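
For example, the same correlation at daily resolution (a variant of the cell above, not part of the original notebook) would be set up as:

# same correlation, but averaged over daily periods; '1H' would give hourly results, etc.
ord_lst_sq_daily = bw.Correl.OrdinaryLeastSquares(Merra2_NE['WS50m_m/s'], data['Spd80mN'],
                                                  averaging_prd='1D', coverage_threshold=0.90)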

The correlation function returns a Python object. This then has to be run and plotted using the run() and plot() methods.

[6]:
# run correlation
ord_lst_sq.run()
{'Num data points': 16,
 'offset': -0.06161432045219376,
 'r2': 0.9407857754972966,
 'slope': 0.9955561116923611}

The correlation has now run and the main results are returned. The slope and offset of the linear regression line are displayed along with the \(R^2\) value. These results can be returned at any time using the .show_params() command.

[7]:
#show parameters
ord_lst_sq.show_params()
{'Num data points': 16,
 'offset': -0.06161432045219376,
 'r2': 0.9407857754972966,
 'slope': 0.9955561116923611}

To show the correlation plot along with the linear regression line we simply input:

[8]:
# show plot
ord_lst_sq.plot()
[8]:
../_images/tutorials_becoming_a_correlation_wizard_20_0.png

Now that we have a correlation we will want to adjust our data accordingly. We can synthesize monthly Spd80mN mean wind speeds from the Merra2_NE data using the synthesize() method. This synthesizes monthly means for the months that Spd80mN doesn’t cover and splices them with the months Spd80mN does cover.
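
Conceptually, this splicing can be sketched as follows (an illustrative sketch with hypothetical monthly values, assuming synthesize() predicts each month from the reference via the fitted line and keeps the measured means where they exist; this is not the library’s actual code):

import pandas as pd

# hypothetical monthly mean series for illustration only
ref_monthly = pd.Series([8.0, 7.5, 6.9],
                        index=pd.date_range('2000-01-01', periods=3, freq='MS'))
site_monthly = pd.Series([7.1],
                         index=pd.date_range('2000-03-01', periods=1, freq='MS'))
slope, offset = 0.9956, -0.0616  # rounded values from the correlation above

synthesized = slope * ref_monthly + offset  # predict every month from the reference
synthesized.update(site_monthly)            # keep the measured monthly means where available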

[9]:
# synthesize monthly means over the long term record
LT_monthly_means = ord_lst_sq.synthesize()
# Show the first 5 values
LT_monthly_means.head(5)
[9]:
Spd80mN_Synthesized
Timestamp
2000-01-01 9.300975
2000-02-01 11.099150
2000-03-01 7.999968
2000-04-01 6.842368
2000-05-01 6.191083

Apply a long term adjustment to the measured wind speed time series based on the correlation results

We can use the results of the above correlation to rescale the Spd80mN wind speed time series and put it into a long term context. First we’ll use the momm() (mean of monthly means) function to derive the long term mean of the dataset we synthesized above. This represents our target long term mean speed.
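
For context, the mean of monthly means averages the data month by month and then averages those monthly values with each calendar month weighted equally. A rough pandas sketch of this idea (illustrative only, not necessarily the library’s exact implementation) is:

import pandas as pd

def mean_of_monthly_means(series):
    # mean for every month on record
    monthly_means = series.resample('MS').mean()
    # average the monthly means for each of the 12 calendar months, so that
    # e.g. a record containing three Januarys doesn't over-weight January
    per_calendar_month = monthly_means.groupby(monthly_means.index.month).mean()
    return per_calendar_month.mean()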

[10]:
# derive long term target mean speed
LT_target_speed = bw.momm(LT_monthly_means)
# show the value
LT_target_speed
[10]:
7.623917277960366

Next, we’ll derive a scale factor we can apply to the measurements to achieve this long term mean.

[11]:
# derive scale factor as ratio of means
LT_scale_factor = LT_target_speed/data.Spd80mN.mean()
# show the value
LT_scale_factor
[11]:
1.0140026948884429

Then, we can use the scale_wind_speed() function to rescale the measurements based on the above factor, placing them in a long term context. We’ll also assign the result to a new variable within the existing dataframe.

[12]:
# scale wind speed
data['Spd80mN_LT'] = bw.scale_wind_speed(data.Spd80mN, LT_scale_factor)
# show first 5 timestamps. The new variable can be seen in the last column of the dataframe.
data.head(5)
[12]:
Spd80mN Spd80mS Spd60mN Spd60mS Spd40mN Spd40mS Spd80mNStd Spd80mSStd Spd60mNStd Spd60mSStd ... Dir58mS Dir58mSStd Dir38mS Dir38mSStd T2m RH2m P2m PrcpTot BattMin Spd80mN_LT
Timestamp
2016-01-09 15:30:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 15:40:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 17:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 17:10:00 7.382 7.325 6.818 6.689 6.252 6.174 0.844 0.810 0.897 0.875 ... 118.8 5.107 115.6 5.189 0.954 100.0 934.0 0.0 12.71 7.485368
2016-01-09 17:20:00 7.977 7.791 8.110 7.915 8.140 7.974 0.556 0.528 0.562 0.524 ... 115.9 2.960 113.6 3.540 0.863 100.0 934.0 0.0 12.69 8.088699

5 rows × 30 columns
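
As a quick sanity check (illustrative only, assuming scale_wind_speed() applies a simple multiplicative factor), the scaled series should match a direct multiplication:

import numpy as np
# the scaled series should equal the original multiplied by the scale factor
np.allclose(data['Spd80mN_LT'].dropna(), (data.Spd80mN * LT_scale_factor).dropna())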

Now, let’s confirm that the mean of our new variable is equal to our long term mean speed target.

[13]:
# show ratio of mean values
data.Spd80mN_LT.mean()/LT_target_speed
[13]:
1.0000000000000002

The resulting variable can then be used along with concurrent wind direction data (such as ‘data.Dir78mS’) to generate a tab file for wind flow modelling as shown in the tutorial for exporting a tab file.
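
As an illustrative sketch only (assuming the freq_table() and export_tab_file() functions behave as described in that tutorial, and using hypothetical site coordinates; check the tab file tutorial for the exact arguments), this might look like:

# build a frequency table from the long term speed and concurrent direction data,
# then export it as a .tab file (height and coordinates below are placeholders)
graph, tab = bw.freq_table(data.Spd80mN_LT, data.Dir78mS, return_data=True)
bw.export_tab_file(tab, 80, 51.4, -5.0, file_name='demo_tab_file.tab')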


Correlation for multiple datasets and time periods

Note: in a future release we plan to create a more user-friendly wrapper to perform multiple correlations at once.

As well as performing one correlation with specific parameters, the brightwind library can be used to correlate to multiple reference datasets using multiple averaging periods in one cell. The code below uses two nested ‘for’ loops to iterate through each of the MERRA-2 datasets and a number of averaging periods, and returns the results in a dataframe. In order to store the results in a dataframe, we first need to import the pandas library.

[14]:
import pandas as pd

We then add each of the datasets to a Python dictionary, so we can easily loop through them later.

[15]:
merra2_nodes = {
    'M2_NE': Merra2_NE,
    'M2_NW': Merra2_NW,
    'M2_SE': Merra2_SE,
    'M2_SW': Merra2_SW
}

The following code can then be used to run each for loop and add results to the new dataframe. Specific steps are explained in the comments throughout.

[16]:
# set the target and reference column names
target_column = 'Spd80mN'
ref_column = 'WS50m_m/s'

# create a list to store all of the results
correl_results = []

# set up first loop, which will step through the 4 reanalysis datasets
for node in merra2_nodes:
    # set up the second loop, nested within the first one. This one will step through a number of averaging periods.
    for avg_prd in ['1D','3D','7D','15D','1M']:

        # set up the correlation object, as done previously
        ord_lst_sq = bw.Correl.OrdinaryLeastSquares(merra2_nodes[node][ref_column], data[target_column],
                                                    averaging_prd=avg_prd, coverage_threshold=0.90)
        # run the correlation
        ord_lst_sq.run()

        # Append the results to the results list
        correl_results.append({
            '1, Reanalysis dataset': node,
            '2, Averaging Period': avg_prd,
            '3, Num data points': ord_lst_sq.params['Num data points'],
            '4, LT Reference MOMM': bw.momm(merra2_nodes[node][ref_column]),
            '5, R^2': ord_lst_sq.params['r2'],
            '6, Slope': ord_lst_sq.params['slope'],
            '7, Offset': ord_lst_sq.params['offset'],
            '8, LT Target wind speed': bw.momm(ord_lst_sq.synthesize()),
            '9, Adjustment of mean [%]': ((bw.momm(ord_lst_sq.synthesize()) / data[target_column].mean())-1)*100,
        })
{'Num data points': 509,
 'offset': -0.42487844744002934,
 'r2': 0.8950111609239269,
 'slope': 1.040347744890449}
{'Num data points': 166,
 'offset': -0.6361907996697532,
 'r2': 0.9347172373809141,
 'slope': 1.065506067469903}
{'Num data points': 71,
 'offset': -0.5882020865023719,
 'r2': 0.9340706234081555,
 'slope': 1.0608473791176414}
{'Num data points': 34,
 'offset': -0.3469504509964136,
 'r2': 0.9580669877674568,
 'slope': 1.0312563688477243}
{'Num data points': 16,
 'offset': -0.06161432045219376,
 'r2': 0.9407857754972966,
 'slope': 0.9955561116923611}
{'Num data points': 509,
 'offset': -0.09566950948358079,
 'r2': 0.8467424753668483,
 'slope': 0.9513356833362564}
{'Num data points': 166,
 'offset': -0.3094075956208962,
 'r2': 0.8958363060049799,
 'slope': 0.9763981012985756}
{'Num data points': 71,
 'offset': -0.2654084062535991,
 'r2': 0.890799351069137,
 'slope': 0.9719866140592953}
{'Num data points': 34,
 'offset': 0.15682108376560136,
 'r2': 0.9206266548035889,
 'slope': 0.9202250143158889}
{'Num data points': 16,
 'offset': 0.6453320138995284,
 'r2': 0.8772421506713155,
 'slope': 0.860974299364037}
{'Num data points': 509,
 'offset': -0.16478996565523255,
 'r2': 0.8456645998836172,
 'slope': 0.9676742260535489}
{'Num data points': 166,
 'offset': -0.38866696097764375,
 'r2': 0.8861973972237285,
 'slope': 0.9939478232735908}
{'Num data points': 71,
 'offset': -0.3904784183852624,
 'r2': 0.8993332053612523,
 'slope': 0.9970145362591194}
{'Num data points': 34,
 'offset': -0.13824125155034708,
 'r2': 0.9311687736097964,
 'slope': 0.9642528458509261}
{'Num data points': 16,
 'offset': 0.21439627305515263,
 'r2': 0.9162048374395771,
 'slope': 0.9216301350074104}
{'Num data points': 509,
 'offset': 0.12359784304781836,
 'r2': 0.7800512366998469,
 'slope': 0.897860529255207}
{'Num data points': 166,
 'offset': -0.13024511222955648,
 'r2': 0.8299573609509825,
 'slope': 0.9274217939487325}
{'Num data points': 71,
 'offset': -0.17839001354473732,
 'r2': 0.8392903588116132,
 'slope': 0.9348187140549352}
{'Num data points': 34,
 'offset': 0.17384790241582168,
 'r2': 0.8834584307562071,
 'slope': 0.8907949268795059}
{'Num data points': 16,
 'offset': 0.8554505335265273,
 'r2': 0.8414235670295263,
 'slope': 0.8092854017974048}

Each set of printed results above is the output of the run() method for one correlation. We have saved a summary of the results in ‘correl_results’, so let’s take a look at that now.

[17]:
# put the list into a DataFrame for ease of analysis.
correl_results_df = pd.DataFrame(correl_results)
correl_results_df
[17]:
1, Reanalysis dataset 2, Averaging Period 3, Num data points 4, LT Reference MOMM 5, R^2 6, Slope 7, Offset 8, LT Target wind speed 9, Adjustment of mean [%]
0 M2_NE 1D 509 7.707068 0.895011 1.040348 -0.424878 7.601000 1.095465
1 M2_NE 3D 166 7.707068 0.934717 1.065506 -0.636191 7.576759 0.773044
2 M2_NE 7D 71 7.707068 0.934071 1.060847 -0.588202 7.596172 1.031245
3 M2_NE 15D 34 7.707068 0.958067 1.031256 -0.346950 7.600418 1.087716
4 M2_NE 1M 16 7.707068 0.940786 0.995556 -0.061614 7.623917 1.400269
5 M2_NW 1D 509 8.114629 0.846742 0.951336 -0.095670 7.630865 1.492676
6 M2_NW 3D 166 8.114629 0.895836 0.976398 -0.309408 7.612278 1.245460
7 M2_NW 7D 71 8.114629 0.890799 0.971987 -0.265408 7.629251 1.471205
8 M2_NW 15D 34 8.114629 0.920627 0.920225 0.156821 7.624570 1.408951
9 M2_NW 1M 16 8.114629 0.877242 0.860974 0.645332 7.643720 1.663650
10 M2_SE 1D 509 8.087812 0.845665 0.967674 -0.164790 7.666124 1.961635
11 M2_SE 3D 166 8.087812 0.886197 0.993948 -0.388667 7.645255 1.684067
12 M2_SE 7D 71 8.087812 0.899333 0.997015 -0.390478 7.676901 2.104962
13 M2_SE 15D 34 8.087812 0.931169 0.964253 -0.138241 7.656873 1.838585
14 M2_SE 1M 16 8.087812 0.916205 0.921630 0.214396 7.678354 2.124289
15 M2_SW 1D 509 8.409316 0.780051 0.897861 0.123598 7.677981 2.119335
16 M2_SW 3D 166 8.409316 0.829957 0.927422 -0.130245 7.662884 1.918537
17 M2_SW 7D 71 8.409316 0.839290 0.934819 -0.178390 7.686714 2.235478
18 M2_SW 15D 34 8.409316 0.883458 0.890795 0.173848 7.664287 1.937200
19 M2_SW 1M 16 8.409316 0.841424 0.809285 0.855451 7.671619 2.034711

We can easily scan down through the results to see the different \(R^2\) values for different averaging periods and reanalysis datasets, and see other key metrics such as the predicted long term wind speed and the percentage adjustment from the measured site wind speed to the long term wind speed.

Now let’s scale the wind speed once again, this time based on the correlation with the highest \(R^2\) value in our table. First, let’s show all the parameters of the correlation with the max \(R^2\).

[18]:
# show correlation results when r2 is equal to the max r2
correl_results_df[correl_results_df['5, R^2'] == max(correl_results_df['5, R^2'])]
[18]:
1, Reanalysis dataset 2, Averaging Period 3, Num data points 4, LT Reference MOMM 5, R^2 6, Slope 7, Offset 8, LT Target wind speed 9, Adjustment of mean [%]
3 M2_NE 15D 34 7.707068 0.958067 1.031256 -0.34695 7.600418 1.087716

We can pull out the index of this correlation from our table by appending .index[0] and assigning the result to a variable.

[19]:
# get row with the best correlation
row = correl_results_df[correl_results_df['5, R^2'] == max(correl_results_df['5, R^2'])].index[0]
# show index
row
[19]:
3
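
As an aside, pandas’ idxmax() returns the index of the maximum value directly, so the same row index can be obtained in one step (an equivalent alternative, not from the original notebook):

# equivalent: idxmax() returns the index label of the maximum R^2 directly
row = correl_results_df['5, R^2'].idxmax()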

Then we can refer to the long term target speed from this correlation as follows.

[20]:
correl_results_df['8, LT Target wind speed'][row]
[20]:
7.600417545842798

This can be used to derive a scale factor in the same way we did before.

[21]:
# derive scale factor as ratio of means
LT_scale_factor = correl_results_df['8, LT Target wind speed'][row] / data[target_column].mean()
# show the value
LT_scale_factor
[21]:
1.010877163638877

Finally, we can create a second LT adjusted time series using the same commands as before, once again adding this to our existing DataFrame alongside the previous long term variable.

[22]:
# scale wind speed
data['Spd80mN_LT_2'] = bw.scale_wind_speed(data[target_column], LT_scale_factor)
# show first 5 timestamps. The new variable can be seen to the right of the dataframe.
data.head(5)
[22]:
Spd80mN Spd80mS Spd60mN Spd60mS Spd40mN Spd40mS Spd80mNStd Spd80mSStd Spd60mNStd Spd60mSStd ... Dir58mSStd Dir38mS Dir38mSStd T2m RH2m P2m PrcpTot BattMin Spd80mN_LT Spd80mN_LT_2
Timestamp
2016-01-09 15:30:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 15:40:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 17:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-09 17:10:00 7.382 7.325 6.818 6.689 6.252 6.174 0.844 0.810 0.897 0.875 ... 5.107 115.6 5.189 0.954 100.0 934.0 0.0 12.71 7.485368 7.462295
2016-01-09 17:20:00 7.977 7.791 8.110 7.915 8.140 7.974 0.556 0.528 0.562 0.524 ... 2.960 113.6 3.540 0.863 100.0 934.0 0.0 12.69 8.088699 8.063767

5 rows × 31 columns

Of course, \(R^2\) is not the only parameter that matters when deciding on a robust correlation method, so the analyst still has an important role! Nonetheless, this shows how easy the library makes it to compare large numbers of reference datasets and get back useful results fast. Upcoming updates to the library will aim to reduce the code required to perform the looping procedure, so watch this space!


Other correlation methods

The library also includes a host of other correlation functions such as:

  • SimpleSpeedRatio()

  • OrthogonalLeastSquares()

  • MultipleLinearRegression()

  • Speedsort()

  • SVR()

The SimpleSpeedRatio() function is simply the ratio of the overlapping target and reference wind speed time series.

The OrthogonalLeastSquares() function carries out an orthogonal least squares regression correlation.

The MultipleLinearRegression() function allows linear regression to be carried out between multiple reference datasets and the site data.

The Speedsort() function is a directional correlation method developed by Brian Hurley and Ciaran King as outlined in the paper “The Speedsort, DynaSort and Scatter wind correlation methods”, Wind Engineering Volume 29, No.3, pp 217-241.

The SVR() function is a support vector regression correlation method, a type of machine learning algorithm.
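
As a rough usage sketch, assuming these classes follow the same constructor pattern demonstrated above for OrdinaryLeastSquares (directional methods such as Speedsort() also require direction series, so check the API documentation for the exact arguments):

# orthogonal least squares, set up analogously to the ordinary least squares example above
orth_lst_sq = bw.Correl.OrthogonalLeastSquares(Merra2_NE['WS50m_m/s'], data['Spd80mN'],
                                               averaging_prd='1M', coverage_threshold=0.90)
orth_lst_sq.run()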

These functions will be the subject of future tutorials relating to correlation methods.