Solar Data Tools User Guide

This guide serves as a reference for the typical Solar Data Tools user. We will go over the main user-facing classes and methods, with examples and relevant details for you to get started. As you will see, most of the methods in Solar Data Tools are automated, so you can get started with minimal effort and very little input from your side.

For a more comprehensive list of all methods and functions, you can check the API reference. However, this user guide likely provides you with all the functionality you need, while the rest of the functions listed in the API reference are more internal/helper functions that were not meant for users to interact with.

If any part of this guide is unclear, or you’d like to improve on these docs, please consider submitting an Issue or a PR!

Getting started with the DataHandler class

Most users will only need to interact with the DataHandler class. To instantiate it, pass a DataFrame containing the power data timeseries (with timestamp and power columns). Note that the DataFrame must have a DatetimeIndex, or the user must set the datetime_col kwarg.

We recommend that the timestamps be in the local timezone of the data. If there is a small shift in the timestamps, the pipeline will attempt to correct it. If the shift is large (8-10 hours), the pipeline will likely fail to correct it.

Let’s say we have a CSV file with power data that we want to analyze. We can load it into a DataFrame:

import pandas as pd
from solardatatools import DataHandler

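# Assumes the first CSV column holds the timestamps, so the index becomes a DatetimeIndex
df = pd.read_csv('path/to/your/data.csv', index_col=0, parse_dates=True)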

Then we can create a DataHandler object with this DataFrame:

dh = DataHandler(df)
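As a minimal sketch of the index requirement, the snippet below parses a small in-memory CSV so that the resulting DataFrame has a DatetimeIndex before it is handed to the DataHandler. The CSV text is a stand-in for a real file, and the column names `timestamp` and `power` are illustrative assumptions.

```python
import io

import pandas as pd

# In-memory stand-in for a real CSV file; column names are illustrative.
csv_text = """timestamp,power
2021-06-01 12:00:00,1.20
2021-06-01 12:05:00,1.31
2021-06-01 12:10:00,1.28
"""

# Parse the timestamp column and use it as the index.
df = pd.read_csv(io.StringIO(csv_text), index_col="timestamp", parse_dates=True)

# DataHandler expects a DatetimeIndex (or the datetime_col kwarg instead).
assert isinstance(df.index, pd.DatetimeIndex)

# With Solar Data Tools installed, the frame can then be passed along:
# from solardatatools import DataHandler
# dh = DataHandler(df)
```

If the timestamps live in a regular column instead, the datetime_col kwarg mentioned above is the alternative to setting the index yourself.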

If you know that your data is affected by daylight saving time, you can run the following method to correct for it:

dh.fix_dst()
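To see what "affected by daylight saving time" looks like in practice, the sketch below generates hourly local timestamps across a spring-forward date using pandas alone. The timezone and date are illustrative choices, and the snippet says nothing about how fix_dst works internally; it only shows the kind of wall-clock jump a logger following local time would record.

```python
import pandas as pd

# Hourly timestamps across the 2021 spring-forward date in US Pacific time.
# The timezone and date are illustrative choices.
local = pd.date_range(
    "2021-03-14 00:00",
    periods=4,
    freq=pd.Timedelta(hours=1),
    tz="America/Los_Angeles",
)

# Wall-clock hours as a logger following local time would record them.
# The 02:00 hour does not exist on this date, so it is skipped.
wall_clock_hours = [ts.hour for ts in local]
print(wall_clock_hours)
```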

The DataHandler object is now ready to be used for data processing and analysis.

A note on long-form vs. wide-form data

Timeseries data is often in wide-form: a DataFrame with a timestamp column and one or more data columns. This is what the DataHandler typically expects. However, it can also take data in long-form, such as the Redshift data, where some sites have more than one inverter. In that case, instantiate the DataHandler object with the convert_to_ts flag set to True:

dh = DataHandler(df, convert_to_ts=True)

This prompts the DataHandler to convert the data to wide-form before running the pipeline, given default index and column names intended to work with GISMo’s VADER Cassandra database implementation (see solardatatools.time_axis_manipulation.make_time_series).

For more information on long-form vs. wide-form, you can check out this nice writeup from the Seaborn documentation.
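For readers unfamiliar with the two layouts, here is a small pandas-only sketch of the difference. It uses a generic pivot with made-up column names (`timestamp`, `inverter`, `power`); it is not the conversion performed by make_time_series, which relies on its own default index and column names.

```python
import pandas as pd

# Long-form: one row per (timestamp, inverter) measurement.
# Column names and values are illustrative.
long_df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2021-06-01 12:00",
                "2021-06-01 12:00",
                "2021-06-01 12:05",
                "2021-06-01 12:05",
            ]
        ),
        "inverter": ["inv_a", "inv_b", "inv_a", "inv_b"],
        "power": [1.2, 0.9, 1.3, 1.0],
    }
)

# Wide-form: one row per timestamp, one column per inverter.
wide_df = long_df.pivot(index="timestamp", columns="inverter", values="power")
print(wide_df)
```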

Running the pipeline

The DataHandler.run_pipeline method is the main data processing and analysis pipeline offered by Solar Data Tools. It includes preprocessing, cleaning (e.g. fixing time shifts), and scoring data quality metrics (e.g. finding clear days, capacity changes, and any clipping).

To run the pipeline, simply call the method:

dh.run_pipeline()

This method can be passed a number of optional arguments to customize the pipeline. For example, you may need to specify the timezone of the data, or the solver to use for the capacity change detection. Most commonly, you will want to specify the power column name and whether to run a timeshift correction. Here is an example of how to run the pipeline with these arguments:

dh.run_pipeline(power_col='power', fix_shifts=True)

Note that the pipeline can take a while to run, depending on the size of the dataset and the solver you are using (from a couple of seconds up to a minute).

Once the pipeline is run, the DataHandler object will have a number of attributes that you can access to view the results of the analysis. The top-level report is accessed by calling the report method:

dh.report()

This prints a summary of the results of the pipeline, including the data quality metrics. To get a machine-readable version instead, which is useful for building a summary table across many files or columns, run:

dh.report(return_values=True, verbose=False)
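One way to build such a summary table is sketched below. It assumes that report(return_values=True, verbose=False) returns a dict-like mapping of metric names to values. The helper names `collect_reports` and `summarize` are hypothetical, and the example dictionaries use placeholder keys and values purely to show the resulting table shape, not real pipeline output.

```python
import pandas as pd


def collect_reports(df, power_cols):
    """Run the pipeline once per power column and gather the report values."""
    # Imported lazily so the summary helper below works without the package.
    from solardatatools import DataHandler

    reports = {}
    for col in power_cols:
        dh = DataHandler(df)
        dh.run_pipeline(power_col=col)
        reports[col] = dh.report(return_values=True, verbose=False)
    return reports


def summarize(reports):
    """Turn {name: report_dict} into a table with one row per data set."""
    return pd.DataFrame.from_dict(reports, orient="index")


# Placeholder values with hypothetical keys, only to show the table shape.
example_reports = {
    "site_a": {"metric_1": 0.9, "metric_2": 0.5},
    "site_b": {"metric_1": 0.7, "metric_2": 0.6},
}
summary = summarize(example_reports)
print(summary)
```

With real data, `summarize(collect_reports(df, ['power_1', 'power_2']))` would give one row per column, where the column names here are again placeholders.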

Plotting some pipeline results

The DataHandler object has a number of plotting methods that can be used to visualize the results of the pipeline. Here is a full list of the plotting methods available after running the main pipeline:

| Method | Description |
| --- | --- |
| DataHandler.plot_heatmap | Plot a heatmap of the data |
| DataHandler.plot_bundt | Make a “Bundt plot” of the data |
| DataHandler.plot_circ_dist | Plot the circular distribution of the data |
| DataHandler.plot_daily_energy | Plot the daily energy |
| DataHandler.plot_daily_signals | Plot the daily signals |
| DataHandler.plot_density_signal | Plot the density signal |
| DataHandler.plot_data_quality_scatter | Plot the data quality scatter |
| DataHandler.plot_capacity_change_analysis | Plot the capacity change analysis |
| DataHandler.plot_time_shift_analysis_results | Plot the time shift analysis results |
| DataHandler.plot_clipping | Plot the clipping |
| DataHandler.plot_cdf_analysis | Plot the CDF analysis |
| DataHandler.plot_daily_max_cdf_and_pdf | Plot the daily max CDF and PDF |
| DataHandler.plot_polar_transform | Plot the polar transform |

Note that the timeshift correction method must be run (by passing fix_shifts=True to run_pipeline) before calling the plot_time_shift_analysis_results method.
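The ordering constraint in the note above can be captured in a small helper. This is a sketch that assumes the plotting methods can be called without arguments; the function name `plot_pipeline_overview` and the `shifts_fixed` flag are hypothetical and not part of the library.

```python
def plot_pipeline_overview(dh, shifts_fixed):
    """Call a few plotting methods on a DataHandler that has run the pipeline.

    `shifts_fixed` should be True only if run_pipeline was called with
    fix_shifts=True; the time shift plot is skipped otherwise.
    """
    dh.plot_heatmap()
    dh.plot_daily_energy()
    if shifts_fixed:
        dh.plot_time_shift_analysis_results()


# Example usage, assuming `dh` is an existing DataHandler:
# dh.run_pipeline(power_col='power', fix_shifts=True)
# plot_pipeline_overview(dh, shifts_fixed=True)
```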

Examples of some of these plotting methods are shown below in the notebooks and examples section, such as the demo and the tutorial.

Running loss factor analysis

Once the main pipeline is run, you can run the loss factor analysis to estimate the loss factor of the system. This is done by calling the run_loss_factor_analysis method:

dh.run_loss_factor_analysis()

This method estimates the loss factor of the system, using Monte Carlo sampling to generate a distributional estimate of the degradation rate. The results are stored in dh.loss_analysis.

Once it terminates, you can visualize some of the results by calling the following functions:

| Method | Description |
| --- | --- |
| DataHandler.loss_analysis.plot_pie | Create a pie plot to visualize the breakdown of energy loss factors |
| DataHandler.loss_analysis.plot_waterfall | Create a waterfall plot to visualize the breakdown of energy loss factors |
| DataHandler.loss_analysis.plot_decomposition | Plot the estimated signal components found through decomposition |

Head over to our demo and tutorial to see these functions in action on real data.

Running clear sky model estimation

After the main pipeline is run, a clear sky model of the PV system power can be estimated by running:

dh.fit_statistical_clear_sky_model()

This fits a smooth, multiperiodic model of the instantaneous 90th percentile of the power data, as explained in this paper. Under the hood, this invokes the spcqe package. Check out this demo for more information.
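To make the phrase "instantaneous 90th percentile" more concrete, the sketch below computes a crude empirical stand-in: the 90th percentile across days for each time of day, on synthetic data. This is deliberately not the smooth, multiperiodic model fitted through spcqe; it has no smoothness or seasonal structure, and the data, the half-sine envelope, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

samples_per_day, n_days = 24, 60

# Idealized clear-sky shape: a half-sine between 06:00 and 18:00, zero at night.
hours = np.arange(samples_per_day)
envelope = np.clip(np.sin(np.pi * (hours - 6) / 12), 0.0, None)

# Synthetic power matrix (rows: time of day, columns: days), with a random
# attenuation per sample standing in for clouds.
attenuation = rng.uniform(0.2, 1.0, size=(samples_per_day, n_days))
power = envelope[:, None] * attenuation

# Empirical 90th percentile across days for each time of day.
q90 = np.quantile(power, 0.9, axis=1)
```

Because every synthetic sample is the envelope scaled down by some attenuation, the resulting quantile curve sits at or below the envelope at each time of day.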

Running clear sky data labeling

The clear sky labeling subroutine leverages the results from the main pipeline, the loss factor estimation, and the clear sky model fitting, all described above. After the main pipeline is run, the user may run:

dh.detect_clear_sky()

If either or both of the loss factor estimation and clear sky estimation modules have not been run, the DataHandler will run them automatically when detect_clear_sky is called. (The DataHandler will not re-run these modules if they’ve already been invoked; it simply reuses their outputs.)

Check out this demo for more details.

Other features

Orientation and Location estimation

The DataHandler also includes methods to estimate the position of the solar panels based on the data. This includes the location (latitude and longitude) and orientation (tilt and azimuth) of the system. The available methods are:

| Method | Description |
| --- | --- |
| DataHandler.setup_location_and_orientation_estimation | Sets up the location and orientation estimation for the system given a GMT offset |
| DataHandler.estimate_latitude | Estimates latitude |
| DataHandler.estimate_longitude | Estimates longitude |
| DataHandler.estimate_location_and_orientation | Estimates latitude, longitude, tilt and azimuth |
| DataHandler.estimate_orientation | Estimates tilt and azimuth |

To call the estimation methods, first run the setup_location_and_orientation_estimation method, passing it a GMT offset value. After that, you can call any of the four estimation methods. A demo of this feature can be found in the tutorial in cells 13-15.
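If you know the site's timezone but not its GMT offset, the offset can be derived with the Python standard library, as sketched below. The helper names are hypothetical. Whether the library expects the standard-time (non-DST) offset, and whether the offset may be passed positionally, are assumptions here, so check the API reference before relying on either.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def standard_gmt_offset_hours(tz_name):
    """Return the standard-time (non-DST) UTC offset of a timezone, in hours."""
    moment = datetime(2021, 1, 15, 12, 0, tzinfo=ZoneInfo(tz_name))
    # utcoffset() includes any DST adjustment in effect; dst() isolates it.
    standard = moment.utcoffset() - moment.dst()
    return standard.total_seconds() / 3600


def estimate_site_geometry(dh, gmt_offset):
    """Set up the estimators, then estimate location and orientation."""
    # Assumption: the GMT offset is accepted as the first positional argument.
    dh.setup_location_and_orientation_estimation(gmt_offset)
    dh.estimate_location_and_orientation()


# Example usage, assuming `dh` is a DataHandler that has run the pipeline:
# estimate_site_geometry(dh, standard_gmt_offset_hours("America/Los_Angeles"))
```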

Loading data

You can load data into a DataFrame using standard pandas functions:

import pandas as pd

df = pd.read_csv('path/to/your/data.csv', index_col=0, parse_dates=True)