How to transform your data


[1]:
import datetime
print('Last updated: {}'.format(datetime.date.today().strftime('%d %B, %Y')))
Last updated: 26 June, 2019

Outline:

This tutorial demonstrates how to use the following functions to transform a sample dataset in various ways. The tutorial covers the following steps:

  1. Importing the brightwind library and loading some sample data

  2. Using the scale_wind_speed() function

  3. Using the average_data_by_period() function

  4. Using the adjust_slope_offset() function

  5. Using the offset_wind_direction() function

  6. Using the offset_timestamps() function

  7. Using the selective_avg() function


1. Importing the brightwind library and loading some sample data

[2]:
import brightwind as bw

The following commands will load the exisiting dataset and show the first few timestamps.

[3]:
# load existing sample dataset
data = bw.load_csv(bw.datasets.demo_data)

# show first few rows of dataframe
data.head(5)
[3]:
Spd80mN Spd80mS Spd60mN Spd60mS Spd40mN Spd40mS Spd80mNStd Spd80mSStd Spd60mNStd Spd60mSStd ... Dir78mSStd Dir58mS Dir58mSStd Dir38mS Dir38mSStd T2m RH2m P2m PrcpTot BattMin
Timestamp
2016-01-09 15:30:00 8.370 7.911 8.160 7.849 7.857 7.626 1.240 1.075 1.060 0.947 ... 6.100 110.1 6.009 112.2 5.724 0.711 100.0 935.0 0.0 12.94
2016-01-09 15:40:00 8.250 7.961 8.100 7.884 7.952 7.840 0.897 0.875 0.900 0.855 ... 5.114 110.9 4.702 109.8 5.628 0.630 100.0 935.0 0.0 12.95
2016-01-09 17:00:00 7.652 7.545 7.671 7.551 7.531 7.457 0.756 0.703 0.797 0.749 ... 4.172 113.1 3.447 111.8 4.016 1.126 100.0 934.0 0.0 12.75
2016-01-09 17:10:00 7.382 7.325 6.818 6.689 6.252 6.174 0.844 0.810 0.897 0.875 ... 4.680 118.8 5.107 115.6 5.189 0.954 100.0 934.0 0.0 12.71
2016-01-09 17:20:00 7.977 7.791 8.110 7.915 8.140 7.974 0.556 0.528 0.562 0.524 ... 3.123 115.9 2.960 113.6 3.540 0.863 100.0 934.0 0.0 12.69

5 rows × 29 columns


2. Using the scale_wind_speed() function

The following commands specify an arbitrary scale factor and apply this to one of the speed variables.

[4]:
# specify scale factor
scale_factor = 1.03

# apply scale factor to Spd80mN variable
Spd80mN_scaled = bw.scale_wind_speed(data.Spd80mN, scale_factor)

# print first 5 rows of result
Spd80mN_scaled.head(5)
[4]:
Timestamp
2016-01-09 15:30:00    8.62110
2016-01-09 15:40:00    8.49750
2016-01-09 17:00:00    7.88156
2016-01-09 17:10:00    7.60346
2016-01-09 17:20:00    8.21631
Name: Spd80mN, dtype: float64

Now we’ll repeat the same step, this time saving the resulting data in the existing dataframe. The new variable appears as the last column on the right hand side of the returned dataframe.

[5]:
# apply scale factor to Spd80mN variable, creating a new variable in the existing dataframe
data['Spd80mN_scaled'] = bw.scale_wind_speed(data.Spd80mN, scale_factor)

# show first few rows of updated dataframe, scaled wind speed is included in the last column
data.head(5)
[5]:
Spd80mN Spd80mS Spd60mN Spd60mS Spd40mN Spd40mS Spd80mNStd Spd80mSStd Spd60mNStd Spd60mSStd ... Dir58mS Dir58mSStd Dir38mS Dir38mSStd T2m RH2m P2m PrcpTot BattMin Spd80mN_scaled
Timestamp
2016-01-09 15:30:00 8.370 7.911 8.160 7.849 7.857 7.626 1.240 1.075 1.060 0.947 ... 110.1 6.009 112.2 5.724 0.711 100.0 935.0 0.0 12.94 8.62110
2016-01-09 15:40:00 8.250 7.961 8.100 7.884 7.952 7.840 0.897 0.875 0.900 0.855 ... 110.9 4.702 109.8 5.628 0.630 100.0 935.0 0.0 12.95 8.49750
2016-01-09 17:00:00 7.652 7.545 7.671 7.551 7.531 7.457 0.756 0.703 0.797 0.749 ... 113.1 3.447 111.8 4.016 1.126 100.0 934.0 0.0 12.75 7.88156
2016-01-09 17:10:00 7.382 7.325 6.818 6.689 6.252 6.174 0.844 0.810 0.897 0.875 ... 118.8 5.107 115.6 5.189 0.954 100.0 934.0 0.0 12.71 7.60346
2016-01-09 17:20:00 7.977 7.791 8.110 7.915 8.140 7.974 0.556 0.528 0.562 0.524 ... 115.9 2.960 113.6 3.540 0.863 100.0 934.0 0.0 12.69 8.21631

5 rows × 30 columns

Finally, we can compare the mean values of the scaled and unscaled speed time series. The ratio of these mean values is shown to equal the intended scale factor.

[6]:
# print unscaled mean speed
print('The unscaled mean speed is: \t {} m/s'.format(round(data.Spd80mN.mean(),2)))

# print scaled mean speed
print('The scaled mean speed is: \t {} m/s'.format(round(data.Spd80mN_scaled.mean(),2)))

# calculate ratio of mean values
ratio = data.Spd80mN_scaled.mean()/data.Spd80mN.mean()

# print ratio of mean values
print('The ratio of these values is: \t {}'.format(round(ratio,2)))
The unscaled mean speed is:      7.5 m/s
The scaled mean speed is:        7.72 m/s
The ratio of these values is:    1.03

3. Using the average_data_by_period() function

This function can be used to apply a window average to data, using a window which can be specified as any number of minutes (‘min’), hours (‘H’), days (‘D’), weeks (‘W’). ‘1M’ and ‘1AS’ can also used for monthly and annual averages respectively. The output is created within a new series.

[7]:
# then derive monthly averages
data_monthly = bw.average_data_by_period(data.Spd80mN, period='1M')

# show monthly averaged data for first 5 months
data_monthly.head(5)
[7]:
Timestamp
2016-01-01    9.252377
2016-02-01    8.904382
2016-03-01    6.395166
2016-04-01    6.598875
2016-05-01    8.729657
Freq: MS, Name: Spd80mN, dtype: float64

4. Using the adjust_slope_offset() function

Now, let’s imagine that data from the northern anemometer at 80 m was logged with standard slope and offset values of 0.765 and 0.35, and we need to rescale this data considering slope and offset values from a calibration certificate which is specific to this anemometer. The following command can be used to do this, once again adding a new variable to the dataframe.

[8]:
# Apply slope and offset adjustments
data['Spd80mN_adj'] = bw.adjust_slope_offset(data.Spd80mN,0.765, 0.35, 0.7642, 0.352)

Now let’s view the raw and adjusted speed values side-by-side.

[9]:
# concatenate raw and adjusted variables
summary = data[['Spd80mN', 'Spd80mN_adj']]

# show first few values
summary.head(5)
[9]:
Spd80mN Spd80mN_adj
Timestamp
2016-01-09 15:30:00 8.370 8.363613
2016-01-09 15:40:00 8.250 8.243739
2016-01-09 17:00:00 7.652 7.646364
2016-01-09 17:10:00 7.382 7.376646
2016-01-09 17:20:00 7.977 7.971024

5. Using the offset_wind_direction() function

Similarly, the following function can be used to apply a scalar offset to direction data from a given wind vane.

[10]:
# apply directional offset
data['Dir38mS_adj'] = bw.offset_wind_direction(data.Dir38mS, -10)

# concatenate raw and adjusted variables
summary = data[['Dir38mS', 'Dir38mS_adj']]

# show first few values
summary.head(5)
[10]:
Dir38mS Dir38mS_adj
Timestamp
2016-01-09 15:30:00 112.2 102.2
2016-01-09 15:40:00 109.8 99.8
2016-01-09 17:00:00 111.8 101.8
2016-01-09 17:10:00 115.6 105.6
2016-01-09 17:20:00 113.6 103.6

6. Using the offset_timestamps() function

This function can be used to apply an offset to the timestamps within a dataset. A few examples are provided below, first, let’s print the first timestamp of the raw data for comparison.

[11]:
# print first timestamp
print(data.index[0])
2016-01-09 15:30:00

Now we’ll add 1 hour and 30 minutes to the timestamps of the data and print the first timestamp once again.

[12]:
# add 90 minutes to timestamps
data = bw.offset_timestamps(data, '1.5H')

# print first timestamp
print(data.index[0])
2016-01-09 17:00:00

Negative offsets can also be applied.

[13]:
# subtract 2 hours from timestamps
data = bw.offset_timestamps(data, '-2H')

# print first timestamp
print(data.index[0])
2016-01-09 15:00:00

Offsets can also be applied in minutes (positive or negative).

[14]:
# add 7 minutes to timestamps
data = bw.offset_timestamps(data, '7min', overwrite=True)

# print first timestamp
print(data.index[0])
2016-01-09 15:07:00

Time offsets can also be restricted to a certain time period during the record using the following extra arguments. In this case we can see that the final adjustment has just been applied from the 2nd to the 4th timestamps.

[15]:
# subtract 7 minutes from timestamps between the datetime range below
data = bw.offset_timestamps(data, '-7min', date_from='2016-01-09 15:17:00', date_to='2016-01-09 16:47:00')

# show all data for first 5 timestamps
data.head(5)
[15]:
Spd80mN Spd80mS Spd60mN Spd60mS Spd40mN Spd40mS Spd80mNStd Spd80mSStd Spd60mNStd Spd60mSStd ... Dir38mS Dir38mSStd T2m RH2m P2m PrcpTot BattMin Spd80mN_scaled Spd80mN_adj Dir38mS_adj
2016-01-09 15:07:00 8.370 7.911 8.160 7.849 7.857 7.626 1.240 1.075 1.060 0.947 ... 112.2 5.724 0.711 100.0 935.0 0.0 12.94 8.62110 8.363613 102.2
2016-01-09 15:10:00 8.250 7.961 8.100 7.884 7.952 7.840 0.897 0.875 0.900 0.855 ... 109.8 5.628 0.630 100.0 935.0 0.0 12.95 8.49750 8.243739 99.8
2016-01-09 16:30:00 7.652 7.545 7.671 7.551 7.531 7.457 0.756 0.703 0.797 0.749 ... 111.8 4.016 1.126 100.0 934.0 0.0 12.75 7.88156 7.646364 101.8
2016-01-09 16:40:00 7.382 7.325 6.818 6.689 6.252 6.174 0.844 0.810 0.897 0.875 ... 115.6 5.189 0.954 100.0 934.0 0.0 12.71 7.60346 7.376646 105.6
2016-01-09 16:57:00 7.977 7.791 8.110 7.915 8.140 7.974 0.556 0.528 0.562 0.524 ... 113.6 3.540 0.863 100.0 934.0 0.0 12.69 8.21631 7.971024 103.6

5 rows × 32 columns

If the timestamp is adjustment is large enough to overlap the unadjusted data before/after the ‘date_from’/’date_to’, by default the function will not overwrite the unadjusted data. The argument ‘overwrite=True’ can be specified to change this default and overwrite the neighbouring timestamps.

Finally, note that the adjusted timestamps can be allocated to a completely new dataframe, leaving the input dataset unchanged.

[16]:
# create new dataset with 6H added to the original timestamps
data_adj = bw.offset_timestamps(data, '6H')

7. Using the selective_avg() function

This function can be used to create a time series of wind speed using data from two anemometers mounted at the same height, along with one wind vane. For each timestamp in the dataset, the function either averages the two wind speed values or only includes the upstream wind speed value if the other is in the wake of the mast. The wake is defined by winds approaching the mast from the opposite site of the boom within a directional sector of a given size (in this case 60 degrees).

[17]:
# derive selective average variable based on two 80 m anemeters and 78 m wind vane
data['sel_avg_80m'] = bw.selective_avg(data.Spd80mN, data.Spd80mS, wdir=data.Dir78mS, boom_dir_1=0, boom_dir_2=180, sector_width=60)
[18]:
# concatenate inputs and new output variable
summary = data[['Spd80mN', 'Spd80mS', 'Dir78mS','sel_avg_80m']]

# show the first 5 entries, none of which include waked anemometers
summary.head(5)
[18]:
Spd80mN Spd80mS Dir78mS sel_avg_80m
2016-01-09 15:07:00 8.370 7.911 114.2 8.1405
2016-01-09 15:10:00 8.250 7.961 114.4 8.1055
2016-01-09 16:30:00 7.652 7.545 117.8 7.5985
2016-01-09 16:40:00 7.382 7.325 124.5 7.3535
2016-01-09 16:57:00 7.977 7.791 120.9 7.8840