How to become a correlation wizard with the brightwind library!¶
[1]:
import datetime
print('Last updated: {}'.format(datetime.date.today().strftime('%d %B, %Y')))
Last updated: 05 July, 2019
Outline:¶
This guide will demonstrate various techniques to correlate a sample site dataset to 4 separate reference datasets using the following steps:
Import the brightwind library, load some sample raw data and clean it
Import 4 sample reference datasets to use in the correlation
Perform an ordinary least squares correlation for 1 dataset and time period
Apply a long term adjustment to the measured wind speed time series based on the correlation results
Perform an ordinary least squares correlation for multiple datasets and time periods
Import brightwind and data¶
[2]:
# import the brightwind library
import brightwind as bw
[3]:
# specify the directory of the demo dataset
filepath = r'C:\Users\Stephen\Documents\Analysis\demo_data.csv'
# load the raw data
data = bw.load_csv(filepath)
# apply cleaning to the raw data
data = bw.apply_cleaning(data, r'C:\Users\Stephen\Documents\Analysis\demo_cleaning_file.csv')
# show the first 5 rows of the cleaned data
data.head(5)
Cleaning applied. (Please remember to assign the cleaned returned DataFrame to a variable.)
[3]:
Timestamp | Spd80mN | Spd80mS | Spd60mN | Spd60mS | Spd40mN | Spd40mS | Spd80mNStd | Spd80mSStd | Spd60mNStd | Spd60mSStd | ... | Dir78mSStd | Dir58mS | Dir58mSStd | Dir38mS | Dir38mSStd | T2m | RH2m | P2m | PrcpTot | BattMin
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2016-01-09 15:30:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2016-01-09 15:40:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2016-01-09 17:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2016-01-09 17:10:00 | 7.382 | 7.325 | 6.818 | 6.689 | 6.252 | 6.174 | 0.844 | 0.810 | 0.897 | 0.875 | ... | 4.680 | 118.8 | 5.107 | 115.6 | 5.189 | 0.954 | 100.0 | 934.0 | 0.0 | 12.71 |
2016-01-09 17:20:00 | 7.977 | 7.791 | 8.110 | 7.915 | 8.140 | 7.974 | 0.556 | 0.528 | 0.562 | 0.524 | ... | 3.123 | 115.9 | 2.960 | 113.6 | 3.540 | 0.863 | 100.0 | 934.0 | 0.0 | 12.69 |
5 rows × 29 columns
Import reference datasets¶
For this example, we will import 4 MERRA-2 reanalysis datasets. These can be found as sample datasets within the brightwind library. We use the load_csv() function to load in each dataset from a csv file.
[4]:
# load each of the 4 reanalysis datasets to its own dataframe
Merra2_NE = bw.load_csv(r'C:\Users\Stephen\Documents\Analysis\MERRA-2_NE_2000-01-01_2017-06-30.csv')
Merra2_NW = bw.load_csv(r'C:\Users\Stephen\Documents\Analysis\MERRA-2_NW_2000-01-01_2017-06-30.csv')
Merra2_SE = bw.load_csv(r'C:\Users\Stephen\Documents\Analysis\MERRA-2_SE_2000-01-01_2017-06-30.csv')
Merra2_SW = bw.load_csv(r'C:\Users\Stephen\Documents\Analysis\MERRA-2_SW_2000-01-01_2017-06-30.csv')
# show the first 5 lines of NE dataset
Merra2_NE.head(5)
[4]:
DateTime | WS50m_m/s | WD50m_deg | T2M_degC | PS_hPa
---|---|---|---|---
2000-01-01 00:00:00 | 6.840 | 275 | 4.06 | 997.17 |
2000-01-01 01:00:00 | 6.506 | 264 | 3.44 | 997.64 |
2000-01-01 02:00:00 | 7.095 | 254 | 3.12 | 998.39 |
2000-01-01 03:00:00 | 8.088 | 246 | 3.19 | 999.12 |
2000-01-01 04:00:00 | 8.386 | 239 | 3.18 | 999.61 |
Correlate to 1 dataset using 1 averaging period¶
Next we’ll perform a basic linear regression correlation, using the Ordinary Least Squares method, between the 80m north-orientated site anemometer and the MERRA-2 dataset located to the North East. The ‘averaging_prd’ argument sets the averaging period over which the correlation is carried out. Depending on the interval of both datasets, this can be 10 minutes or multiples of 10 minutes (10T, 30T), hourly or multiples of 1 hour (1H, 2H, 3H, etc.), daily or multiples of 1 day (1D, 2D, 3D, 15D, etc.), monthly (1M) or yearly (1AS). Here we apply a monthly correlation. Finally, we set the ‘coverage_threshold’: the minimum coverage of data required within each averaging period, from both the reference and site datasets, for that period to be passed to the correlation. In this case we use a coverage threshold of 90%.
[5]:
# set up correlation
ord_lst_sq = bw.Correl.OrdinaryLeastSquares(Merra2_NE['WS50m_m/s'], data['Spd80mN'],
                                            averaging_prd='1M', coverage_threshold=0.90)
The correlation function returns a Python object. The correlation is then run and plotted using the object’s run() and plot() methods.
[6]:
# run correlation
ord_lst_sq.run()
{'Num data points': 16,
'offset': -0.06161432045219376,
'r2': 0.9407857754972966,
'slope': 0.9955561116923611}
The correlation has now run and the main results are returned: the slope and offset of the linear regression line are displayed along with the \(R^2\) value. These results can be returned at any time using the .show_params() method.
[7]:
#show parameters
ord_lst_sq.show_params()
{'Num data points': 16,
'offset': -0.06161432045219376,
'r2': 0.9407857754972966,
'slope': 0.9955561116923611}
To show the correlation plot along with the linear regression line we simply input:
[8]:
# show plot
ord_lst_sq.plot()
[8]:
[Plot output: scatter of the site monthly means against the reference monthly means, with the fitted linear regression line]
Now that we have a correlation we will want to adjust our data accordingly. We can synthesize monthly Spd80mN mean wind speeds from the Merra2_NE data using the synthesize() method. This synthesizes monthly means for the months that Spd80mN doesn’t cover and splices them with the months Spd80mN does cover.
[9]:
# synthesize monthly means over the long term record
LT_monthly_means = ord_lst_sq.synthesize()
# Show the first 5 values
LT_monthly_means.head(5)
[9]:
Timestamp | Spd80mN_Synthesized
---|---
2000-01-01 | 9.300975 |
2000-02-01 | 11.099150 |
2000-03-01 | 7.999968 |
2000-04-01 | 6.842368 |
2000-05-01 | 6.191083 |
Apply a long term adjustment to the measured wind speed time series based on the correlation results¶
We can use the results of the above correlation to rescale the Spd80mN wind speed time series and put it into a long term context. First we’ll use the momm() (mean of monthly means) function to derive the long term mean of the dataset we synthesized above. This represents our target long term mean speed.
[10]:
# derive long term target mean speed
LT_target_speed = bw.momm(LT_monthly_means)
# show the value
LT_target_speed
[10]:
7.623917277960366
Next, we’ll derive a scale factor we can apply to the measurements to achieve this long term mean.
[11]:
# derive scale factor as ratio of means
LT_scale_factor = LT_target_speed/data.Spd80mN.mean()
# show the value
LT_scale_factor
[11]:
1.0140026948884429
Then, we can use the scale_wind_speed() function to rescale the measurements based on the above factor, placing them in a long term context. We’ll also assign the result to a new variable within the existing dataframe.
[12]:
# scale wind speed
data['Spd80mN_LT'] = bw.scale_wind_speed(data.Spd80mN, LT_scale_factor)
# show first 5 timestamps. The new variable can be seen in the last column of the dataframe.
data.head(5)
[12]:
Timestamp | Spd80mN | Spd80mS | Spd60mN | Spd60mS | Spd40mN | Spd40mS | Spd80mNStd | Spd80mSStd | Spd60mNStd | Spd60mSStd | ... | Dir58mS | Dir58mSStd | Dir38mS | Dir38mSStd | T2m | RH2m | P2m | PrcpTot | BattMin | Spd80mN_LT
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2016-01-09 15:30:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2016-01-09 15:40:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2016-01-09 17:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2016-01-09 17:10:00 | 7.382 | 7.325 | 6.818 | 6.689 | 6.252 | 6.174 | 0.844 | 0.810 | 0.897 | 0.875 | ... | 118.8 | 5.107 | 115.6 | 5.189 | 0.954 | 100.0 | 934.0 | 0.0 | 12.71 | 7.485368 |
2016-01-09 17:20:00 | 7.977 | 7.791 | 8.110 | 7.915 | 8.140 | 7.974 | 0.556 | 0.528 | 0.562 | 0.524 | ... | 115.9 | 2.960 | 113.6 | 3.540 | 0.863 | 100.0 | 934.0 | 0.0 | 12.69 | 8.088699 |
5 rows × 30 columns
Now, let’s confirm that the mean of our new variable is equal to our long term mean speed target.
[13]:
# show ratio of mean values
data.Spd80mN_LT.mean()/LT_target_speed
[13]:
1.0000000000000002
The resulting variable can then be used along with concurrent wind direction data (such as ‘data.Dir78mS’) to generate a tab file for wind flow modelling, as shown in the tutorial for exporting a tab file.
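As a minimal sketch of that step (the exact usage is covered in the export tab file tutorial; the measurement height and site coordinates below are illustrative placeholders, not values from this dataset):
# sketch only: derive a frequency table from the LT-adjusted speed and concurrent direction,
# then export it; the height (80) and lat/long values are illustrative placeholders
graph, tab = bw.freq_table(data.Spd80mN_LT, data.Dir78mS, return_data=True)
bw.export_tab_file(tab, 80, 54.2, -7.5, file_name='demo_site_LT.tab')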
Correlation for multiple datasets and time periods¶
Note: in a future release we plan to create a more user-friendly wrapper to perform multiple correlations at once.
As well as performing one correlation with specific parameters, the brightwind library can be used to correlate to multiple reference datasets over multiple averaging periods in one cell. The code below uses two nested ‘for’ loops to iterate through each of the MERRA-2 datasets and a number of averaging periods, and returns the results in a dataframe. In order to assemble the results as a dataframe, we first need to import the pandas library.
[14]:
import pandas as pd
We then assign each of the datasets to a Python dictionary, so we can easily loop through them later.
[15]:
merra2_nodes = {
    'M2_NE': Merra2_NE,
    'M2_NW': Merra2_NW,
    'M2_SE': Merra2_SE,
    'M2_SW': Merra2_SW
}
The following code then runs the nested loops and appends each result to a results list. Specific steps are explained in the comments throughout.
[16]:
# set the target and reference column names
target_column = 'Spd80mN'
ref_column = 'WS50m_m/s'
# create a list to store all of the results
correl_results = []
# set up the first loop, which steps through the 4 reanalysis datasets
for node in merra2_nodes:
    # set up the second loop, nested within the first one, which steps through a number of averaging periods
    for avg_prd in ['1D', '3D', '7D', '15D', '1M']:
        # set up the correlation object, as done previously
        ord_lst_sq = bw.Correl.OrdinaryLeastSquares(merra2_nodes[node][ref_column], data[target_column],
                                                    averaging_prd=avg_prd, coverage_threshold=0.90)
        # run the correlation
        ord_lst_sq.run()
        # synthesize once and derive the long term target speed, so it isn't computed twice below
        lt_target_speed = bw.momm(ord_lst_sq.synthesize())
        # append the results to the results list
        correl_results.append({
            '1, Reanalysis dataset': node,
            '2, Averaging Period': avg_prd,
            '3, Num data points': ord_lst_sq.params['Num data points'],
            '4, LT Reference MOMM': bw.momm(merra2_nodes[node][ref_column]),
            '5, R^2': ord_lst_sq.params['r2'],
            '6, Slope': ord_lst_sq.params['slope'],
            '7, Offset': ord_lst_sq.params['offset'],
            '8, LT Target wind speed': lt_target_speed,
            '9, Adjustment of mean [%]': ((lt_target_speed / data[target_column].mean()) - 1) * 100,
        })
{'Num data points': 509,
'offset': -0.42487844744002934,
'r2': 0.8950111609239269,
'slope': 1.040347744890449}
{'Num data points': 166,
'offset': -0.6361907996697532,
'r2': 0.9347172373809141,
'slope': 1.065506067469903}
{'Num data points': 71,
'offset': -0.5882020865023719,
'r2': 0.9340706234081555,
'slope': 1.0608473791176414}
{'Num data points': 34,
'offset': -0.3469504509964136,
'r2': 0.9580669877674568,
'slope': 1.0312563688477243}
{'Num data points': 16,
'offset': -0.06161432045219376,
'r2': 0.9407857754972966,
'slope': 0.9955561116923611}
{'Num data points': 509,
'offset': -0.09566950948358079,
'r2': 0.8467424753668483,
'slope': 0.9513356833362564}
{'Num data points': 166,
'offset': -0.3094075956208962,
'r2': 0.8958363060049799,
'slope': 0.9763981012985756}
{'Num data points': 71,
'offset': -0.2654084062535991,
'r2': 0.890799351069137,
'slope': 0.9719866140592953}
{'Num data points': 34,
'offset': 0.15682108376560136,
'r2': 0.9206266548035889,
'slope': 0.9202250143158889}
{'Num data points': 16,
'offset': 0.6453320138995284,
'r2': 0.8772421506713155,
'slope': 0.860974299364037}
{'Num data points': 509,
'offset': -0.16478996565523255,
'r2': 0.8456645998836172,
'slope': 0.9676742260535489}
{'Num data points': 166,
'offset': -0.38866696097764375,
'r2': 0.8861973972237285,
'slope': 0.9939478232735908}
{'Num data points': 71,
'offset': -0.3904784183852624,
'r2': 0.8993332053612523,
'slope': 0.9970145362591194}
{'Num data points': 34,
'offset': -0.13824125155034708,
'r2': 0.9311687736097964,
'slope': 0.9642528458509261}
{'Num data points': 16,
'offset': 0.21439627305515263,
'r2': 0.9162048374395771,
'slope': 0.9216301350074104}
{'Num data points': 509,
'offset': 0.12359784304781836,
'r2': 0.7800512366998469,
'slope': 0.897860529255207}
{'Num data points': 166,
'offset': -0.13024511222955648,
'r2': 0.8299573609509825,
'slope': 0.9274217939487325}
{'Num data points': 71,
'offset': -0.17839001354473732,
'r2': 0.8392903588116132,
'slope': 0.9348187140549352}
{'Num data points': 34,
'offset': 0.17384790241582168,
'r2': 0.8834584307562071,
'slope': 0.8907949268795059}
{'Num data points': 16,
'offset': 0.8554505335265273,
'r2': 0.8414235670295263,
'slope': 0.8092854017974048}
Each set of printed results above is the output of the run() method for one dataset and averaging period. We have saved a summary of the results in ‘correl_results’, so let’s take a look at that now.
[17]:
# put the list into a DataFrame for ease of analysis.
correl_results_df = pd.DataFrame(correl_results)
correl_results_df
[17]:
 | 1, Reanalysis dataset | 2, Averaging Period | 3, Num data points | 4, LT Reference MOMM | 5, R^2 | 6, Slope | 7, Offset | 8, LT Target wind speed | 9, Adjustment of mean [%]
---|---|---|---|---|---|---|---|---|---
0 | M2_NE | 1D | 509 | 7.707068 | 0.895011 | 1.040348 | -0.424878 | 7.601000 | 1.095465 |
1 | M2_NE | 3D | 166 | 7.707068 | 0.934717 | 1.065506 | -0.636191 | 7.576759 | 0.773044 |
2 | M2_NE | 7D | 71 | 7.707068 | 0.934071 | 1.060847 | -0.588202 | 7.596172 | 1.031245 |
3 | M2_NE | 15D | 34 | 7.707068 | 0.958067 | 1.031256 | -0.346950 | 7.600418 | 1.087716 |
4 | M2_NE | 1M | 16 | 7.707068 | 0.940786 | 0.995556 | -0.061614 | 7.623917 | 1.400269 |
5 | M2_NW | 1D | 509 | 8.114629 | 0.846742 | 0.951336 | -0.095670 | 7.630865 | 1.492676 |
6 | M2_NW | 3D | 166 | 8.114629 | 0.895836 | 0.976398 | -0.309408 | 7.612278 | 1.245460 |
7 | M2_NW | 7D | 71 | 8.114629 | 0.890799 | 0.971987 | -0.265408 | 7.629251 | 1.471205 |
8 | M2_NW | 15D | 34 | 8.114629 | 0.920627 | 0.920225 | 0.156821 | 7.624570 | 1.408951 |
9 | M2_NW | 1M | 16 | 8.114629 | 0.877242 | 0.860974 | 0.645332 | 7.643720 | 1.663650 |
10 | M2_SE | 1D | 509 | 8.087812 | 0.845665 | 0.967674 | -0.164790 | 7.666124 | 1.961635 |
11 | M2_SE | 3D | 166 | 8.087812 | 0.886197 | 0.993948 | -0.388667 | 7.645255 | 1.684067 |
12 | M2_SE | 7D | 71 | 8.087812 | 0.899333 | 0.997015 | -0.390478 | 7.676901 | 2.104962 |
13 | M2_SE | 15D | 34 | 8.087812 | 0.931169 | 0.964253 | -0.138241 | 7.656873 | 1.838585 |
14 | M2_SE | 1M | 16 | 8.087812 | 0.916205 | 0.921630 | 0.214396 | 7.678354 | 2.124289 |
15 | M2_SW | 1D | 509 | 8.409316 | 0.780051 | 0.897861 | 0.123598 | 7.677981 | 2.119335 |
16 | M2_SW | 3D | 166 | 8.409316 | 0.829957 | 0.927422 | -0.130245 | 7.662884 | 1.918537 |
17 | M2_SW | 7D | 71 | 8.409316 | 0.839290 | 0.934819 | -0.178390 | 7.686714 | 2.235478 |
18 | M2_SW | 15D | 34 | 8.409316 | 0.883458 | 0.890795 | 0.173848 | 7.664287 | 1.937200 |
19 | M2_SW | 1M | 16 | 8.409316 | 0.841424 | 0.809285 | 0.855451 | 7.671619 | 2.034711 |
We can easily scan down through the results to see the different \(R^2\) values for different averaging periods and reanalysis datasets, and see other key metrics such as the predicted long term wind speed and the percentage adjustment from the measured site wind speed to the long term wind speed.
Now let’s scale the wind speed once again, this time based on the correlation with the highest \(R^2\) value in our table. First, let’s show all the parameters of the correlation with the max \(R^2\).
[18]:
# show correlation results when r2 is equal to the max r2
correl_results_df[correl_results_df['5, R^2'] == max(correl_results_df['5, R^2'])]
[18]:
 | 1, Reanalysis dataset | 2, Averaging Period | 3, Num data points | 4, LT Reference MOMM | 5, R^2 | 6, Slope | 7, Offset | 8, LT Target wind speed | 9, Adjustment of mean [%]
---|---|---|---|---|---|---|---|---|---
3 | M2_NE | 15D | 34 | 7.707068 | 0.958067 | 1.031256 | -0.34695 | 7.600418 | 1.087716 |
We can pull out the index of this correlation from our table by appending .index[0] and assigning the result to a variable.
[19]:
# get row with the best correlation
row = correl_results_df[correl_results_df['5, R^2'] == max(correl_results_df['5, R^2'])].index[0]
# show index
row
[19]:
3
Then we can refer to the long term target speed from this correlation as follows.
[20]:
correl_results_df['8, LT Target wind speed'][row]
[20]:
7.600417545842798
This can be used to derive a scale factor in the same way we did before.
[21]:
# derive scale factor as ratio of means
LT_scale_factor = correl_results_df['8, LT Target wind speed'][row] / data[target_column].mean()
# show the value
LT_scale_factor
[21]:
1.010877163638877
Finally, we can create a second long term adjusted time series using the same commands as before, once again adding it to our existing DataFrame alongside the previous long term variable.
[22]:
# scale wind speed
data['Spd80mN_LT_2'] = bw.scale_wind_speed(data[target_column], LT_scale_factor)
# show first 5 timestamps. The new variable can be seen to the right of the dataframe.
data.head(5)
[22]:
Timestamp | Spd80mN | Spd80mS | Spd60mN | Spd60mS | Spd40mN | Spd40mS | Spd80mNStd | Spd80mSStd | Spd60mNStd | Spd60mSStd | ... | Dir58mSStd | Dir38mS | Dir38mSStd | T2m | RH2m | P2m | PrcpTot | BattMin | Spd80mN_LT | Spd80mN_LT_2
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2016-01-09 15:30:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2016-01-09 15:40:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2016-01-09 17:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2016-01-09 17:10:00 | 7.382 | 7.325 | 6.818 | 6.689 | 6.252 | 6.174 | 0.844 | 0.810 | 0.897 | 0.875 | ... | 5.107 | 115.6 | 5.189 | 0.954 | 100.0 | 934.0 | 0.0 | 12.71 | 7.485368 | 7.462295 |
2016-01-09 17:20:00 | 7.977 | 7.791 | 8.110 | 7.915 | 8.140 | 7.974 | 0.556 | 0.528 | 0.562 | 0.524 | ... | 2.960 | 113.6 | 3.540 | 0.863 | 100.0 | 934.0 | 0.0 | 12.69 | 8.088699 | 8.063767 |
5 rows × 31 columns
Of course, \(R^2\) is not the only parameter that matters when deciding on a robust correlation method, so the analyst still has an important role to play! Nonetheless, this shows how easy the library makes it to compare large numbers of reference datasets and get back useful results fast. Upcoming updates to the library will aim to reduce the code required to perform the looping procedure, so watch this space!
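For instance, since correl_results_df is an ordinary pandas DataFrame, ranking the candidate correlations by \(R^2\) is a one-liner:
# rank the correlations by R^2, best first (plain pandas, nothing brightwind-specific)
correl_results_df.sort_values('5, R^2', ascending=False).head()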
Other correlation methods¶
The library also includes a host of other correlation functions such as:
SimpleSpeedRatio()
OrthogonalLeastSquares()
MultipleLinearRegression()
Speedsort()
SVR()
The SimpleSpeedRatio() function simply takes the ratio of the overlapping target and reference wind speed time series.
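As a conceptual illustration only, in plain pandas rather than via the library call, the idea reduces to a ratio of concurrent means:
# conceptual sketch of a simple speed ratio, not the library's SimpleSpeedRatio() itself
concurrent = pd.concat([Merra2_NE['WS50m_m/s'], data['Spd80mN']], axis=1).dropna()
simple_ratio = concurrent['Spd80mN'].mean() / concurrent['WS50m_m/s'].mean()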
The OrthogonalLeastSquares() function carries out an orthogonal least squares regression correlation.
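Assuming it follows the same constructor pattern as OrdinaryLeastSquares() above (an assumption worth checking against the API documentation), a usage sketch would be:
# sketch, assuming the same constructor pattern as OrdinaryLeastSquares()
orthog_lst_sq = bw.Correl.OrthogonalLeastSquares(Merra2_NE['WS50m_m/s'], data['Spd80mN'],
                                                 averaging_prd='1M', coverage_threshold=0.90)
orthog_lst_sq.run()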
The MultipleLinearRegression() function allows linear regression to be carried out between multiple reference datasets and the site data.
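A sketch, assuming the constructor accepts a list of reference series in place of the single series used earlier (again, verify against the API documentation):
# sketch, assuming a list of reference series is accepted in place of a single series
mlr = bw.Correl.MultipleLinearRegression([Merra2_NE['WS50m_m/s'], Merra2_NW['WS50m_m/s']],
                                         data['Spd80mN'], averaging_prd='1D', coverage_threshold=0.90)
mlr.run()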
The Speedsort() function is a directional correlation method developed by Brian Hurley and Ciaran King, as outlined in the paper “The Speedsort, DynaSort and Scatter wind correlation methods”, Wind Engineering, Volume 29, No. 3, pp. 217-241.
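As a directional method, it presumably needs reference and site wind direction series alongside the speeds. A sketch under that assumption follows; the exact class name, argument order and defaults should be checked against the API documentation:
# sketch only: assumed argument order of (ref speed, ref direction, site speed, site direction);
# verify the exact signature against the brightwind API documentation
spd_srt = bw.Correl.SpeedSort(Merra2_NE['WS50m_m/s'], Merra2_NE['WD50m_deg'],
                              data['Spd80mN'], data['Dir78mS'],
                              averaging_prd='1H', coverage_threshold=0.90)
spd_srt.run()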
The SVR() function is a support vector machine correlation method, which is a type of machine learning algorithm.
These functions will be the subject of future tutorials relating to correlation methods.