How to get some useful statistics using the brightwind library


[1]:
import datetime
print('Last updated: {}'.format(datetime.date.today().strftime('%d %B, %Y')))
Last updated: 26 June, 2019

Outline:

This guide will demonstrate how to get some useful statistics from a sample dataset using the following steps:

  • Import the brightwind library and some sample data

  • Find time continuity gaps within the sample data

  • Get some basic statistics on each of the columns from the sample dataset

  • Find the monthly coverage of the dataset or the coverage of any time period

  • Return the mean of monthly means of an anemometer or of a range of anemometers


[2]:
import brightwind as bw
[3]:
# specify location of existing sample dataset
filepath = r'C:\...\brightwind\datasets\demo\demo_data.csv'
# load data as dataframe
data = bw.load_csv(filepath)
# show first few rows of dataframe
data.head(5)
[3]:
Spd80mN Spd80mS Spd60mN Spd60mS Spd40mN Spd40mS Spd80mNStd Spd80mSStd Spd60mNStd Spd60mSStd ... Dir78mSStd Dir58mS Dir58mSStd Dir38mS Dir38mSStd T2m RH2m P2m PrcpTot BattMin
Timestamp
2016-01-09 15:30:00 8.370 7.911 8.160 7.849 7.857 7.626 1.240 1.075 1.060 0.947 ... 6.100 110.1 6.009 112.2 5.724 0.711 100.0 935.0 0.0 12.94
2016-01-09 15:40:00 8.250 7.961 8.100 7.884 7.952 7.840 0.897 0.875 0.900 0.855 ... 5.114 110.9 4.702 109.8 5.628 0.630 100.0 935.0 0.0 12.95
2016-01-09 17:00:00 7.652 7.545 7.671 7.551 7.531 7.457 0.756 0.703 0.797 0.749 ... 4.172 113.1 3.447 111.8 4.016 1.126 100.0 934.0 0.0 12.75
2016-01-09 17:10:00 7.382 7.325 6.818 6.689 6.252 6.174 0.844 0.810 0.897 0.875 ... 4.680 118.8 5.107 115.6 5.189 0.954 100.0 934.0 0.0 12.71
2016-01-09 17:20:00 7.977 7.791 8.110 7.915 8.140 7.974 0.556 0.528 0.562 0.524 ... 3.123 115.9 2.960 113.6 3.540 0.863 100.0 934.0 0.0 12.69

5 rows × 29 columns

Time Continuity

First we want to see if there are any gaps in the data. We can use the time_continuity_gaps function to identify periods where the gap between timestamps is not consistent with the typical interval between timestamps in the file(s). The function returns a pandas DataFrame showing the timestamp at the start of the missing period and the timestamp at the end of the missing period. An additional column shows how many days were lost in each missing period.

[4]:
bw.time_continuity_gaps(data)
[4]:
Date From Date To Days Lost
1 2016-01-09 15:40:00 2016-01-09 17:00:00 0.055556
17750 2016-05-11 23:00:00 2016-05-31 15:20:00 19.680556
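
Since time_continuity_gaps returns an ordinary pandas DataFrame, we can work with the result directly. As a minimal sketch, assuming the column is labelled 'Days Lost' as in the output above, the total downtime can be summed as follows:

gaps = bw.time_continuity_gaps(data)
# sum the 'Days Lost' column to get the total downtime in days
total_days_lost = gaps['Days Lost'].sum()
print('Total days lost to gaps: {:.2f}'.format(total_days_lost))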

Basic Statistics

Next we may want to get some basic statistics for each of the columns in the wind data file. The basic_stats function returns the count, mean, standard deviation, minimum and maximum of each column. This can be useful for a variety of checks; one example is confirming that calibrations have been applied to the anemometers by checking whether the minimum value for each anemometer matches the corresponding calibration offset.

[5]:
bw.basic_stats(data)
[5]:
count mean std min max
Spd80mN 95629.0 7.498665 3.998231 0.215 29.000
Spd80mS 95629.0 6.474298 4.457503 0.000 29.270
Spd60mN 95629.0 7.033594 3.809893 0.214 28.220
Spd60mS 95629.0 7.113664 3.905644 0.080 29.030
Spd40mN 95629.0 6.742682 3.738940 0.228 27.380
Spd40mS 95629.0 6.800116 3.816079 0.092 28.450
Spd80mNStd 95629.0 1.005663 0.540208 0.000 5.056
Spd80mSStd 95629.0 0.820888 0.596739 0.000 5.151
Spd60mNStd 95629.0 1.015741 0.536483 0.000 5.043
Spd60mSStd 95629.0 0.942060 0.535222 0.000 5.185
Spd40mNStd 95629.0 1.002585 0.515037 0.000 4.919
Spd40mSStd 95629.0 0.936986 0.522567 0.000 5.143
Spd80mNMax 95629.0 9.845375 5.137878 0.215 38.620
Spd80mSMax 95629.0 8.473476 5.754762 0.000 39.450
Spd60mNMax 95629.0 9.467539 5.007623 0.214 39.060
Spd60mSMax 95629.0 9.440672 5.066036 0.080 39.830
Spd40mNMax 95629.0 9.170213 4.936084 0.228 38.440
Spd40mSMax 95629.0 9.147638 4.996349 0.092 38.770
Dir78mS 95629.0 198.259766 78.632518 0.003 360.000
Dir78mSStd 95629.0 6.603149 5.931689 0.000 78.910
Dir58mS 95629.0 232.994314 76.145192 0.014 360.000
Dir58mSStd 95629.0 4.259346 6.249012 0.000 78.490
Dir38mS 95629.0 197.835100 84.050190 0.031 360.000
Dir38mSStd 95629.0 8.923607 6.420406 0.000 80.100
T2m 95629.0 7.116077 4.908406 -6.663 25.420
RH2m 95629.0 93.857024 9.649367 25.730 100.000
P2m 95629.0 952.968077 23.537472 592.200 1002.000
PrcpTot 95629.0 0.014461 0.085502 0.000 5.200
BattMin 95629.0 13.416010 0.565756 12.240 15.180
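
To make the calibration-offset check mentioned above easier to scan, we can restrict basic_stats to the anemometer columns and look only at the 'min' column of the result. This is a minimal sketch that simply slices the DataFrame before passing it in; the column names and the 'min' heading are taken from the table above.

# average 10-min wind speed columns from the table above
anemometer_cols = ['Spd80mN', 'Spd80mS', 'Spd60mN', 'Spd60mS', 'Spd40mN', 'Spd40mS']
# 'min' is one of the columns returned by basic_stats, as shown above
bw.basic_stats(data[anemometer_cols])['min']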

Data Coverage

Next we can check the coverage of each column in the dataset. By default, the coverage function returns monthly coverage.

[6]:
bw.coverage(data)
[6]:
Spd80mN_Coverage Spd80mS_Coverage Spd60mN_Coverage Spd60mS_Coverage Spd40mN_Coverage Spd40mS_Coverage Spd80mNStd_Coverage Spd80mSStd_Coverage Spd60mNStd_Coverage Spd60mSStd_Coverage ... Dir78mSStd_Coverage Dir58mS_Coverage Dir58mSStd_Coverage Dir38mS_Coverage Dir38mSStd_Coverage T2m_Coverage RH2m_Coverage P2m_Coverage PrcpTot_Coverage BattMin_Coverage
Timestamp
2016-01-01 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534 ... 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534
2016-02-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-03-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-04-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-05-01 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367 ... 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367
2016-06-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-07-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-08-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-09-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-10-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-11-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-12-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-01-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-02-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-03-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-04-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-05-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-06-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-07-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-08-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-09-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-10-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-11-01 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611 ... 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611

23 rows × 29 columns

Returning the coverage of all of the columns is more information than we need in this case! Instead, we can assign the anemometers to a list, by specifying the column headings from the table that correspond to their average 10-min values, and then pass only those columns to the coverage function.

[7]:
anemometers = ['Spd80mN', 'Spd80mS', 'Spd60mN', 'Spd60mS', 'Spd40mN', 'Spd40mS']
bw.coverage(data[anemometers])
[7]:
Spd80mN_Coverage Spd80mS_Coverage Spd60mN_Coverage Spd60mS_Coverage Spd40mN_Coverage Spd40mS_Coverage
Timestamp
2016-01-01 0.719534 0.719534 0.719534 0.719534 0.719534 0.719534
2016-02-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-03-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-04-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-05-01 0.365367 0.365367 0.365367 0.365367 0.365367 0.365367
2016-06-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-07-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-08-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-09-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-10-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-11-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2016-12-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-01-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-02-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-03-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-04-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-05-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-06-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-07-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-08-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-09-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-10-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
2017-11-01 0.748611 0.748611 0.748611 0.748611 0.748611 0.748611
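
Because the coverage result is itself a pandas DataFrame, we can filter it directly, for example to flag months that fall below a coverage threshold. The sketch below assumes the '_Coverage' suffix on the column names shown above and uses an illustrative 90% threshold.

cov = bw.coverage(data[anemometers])
# keep only the months where the Spd80mN coverage drops below 90%
cov[cov['Spd80mN_Coverage'] < 0.9]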

But what if we don't want monthly coverage? We can then use the period argument to return whatever time period we want, whether that is 10-min (period='10min'), hourly (period='1H'), daily (period='1D'), weekly (period='1W') or yearly (period='1AS'). Here we have opted to return the yearly coverage.

[8]:
bw.coverage(data[anemometers], period='1AS')
[8]:
Spd80mN_Coverage Spd80mS_Coverage Spd60mN_Coverage Spd60mS_Coverage Spd40mN_Coverage Spd40mS_Coverage
Timestamp
2016-01-01 0.922492 0.922492 0.922492 0.922492 0.922492 0.922492
2017-01-01 0.894406 0.894406 0.894406 0.894406 0.894406 0.894406
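
The same call pattern works for the other period strings listed above. For example, a short sketch of weekly coverage (output not shown here):

# weekly coverage of the anemometer columns
bw.coverage(data[anemometers], period='1W')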

Mean of monthly means

The mean of monthly means is a method of adjusting the average to take account of seasonal bias. For example, it removes the upward bias of a 1.5-year dataset that covers two windier winter periods but only one calm summer period. We can call the function in two ways: either by passing a specific column from the dataset, which returns a single value, or by passing a list of column names (in this case the anemometers), which returns the mean of monthly means for each column as a DataFrame.

[9]:
bw.momm(data.Spd80mN)
[9]:
7.556588194559553
[10]:
bw.momm(data[anemometers])
[10]:
MOMM
Spd80mN 7.556588
Spd80mS 6.587765
Spd60mN 7.081094
Spd60mS 7.163933
Spd40mN 6.785035
Spd40mS 6.844676
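
For practical work the bw.momm calls above are what we would use, but as a rough conceptual sketch the statistic can be illustrated with plain pandas: resample a speed series to calendar-month means and then average those monthly means, so that each month carries equal weight. This simplified version may differ in detail from brightwind's implementation, for example in how incomplete months are treated.

# monthly mean wind speed for one anemometer ('MS' = calendar month start)
monthly_means = data['Spd80mN'].resample('MS').mean()
# averaging the monthly means gives each month equal weight in the result
print(monthly_means.mean())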