Data IO

Data IO Module

This module contains functions for obtaining data from various sources.

solardatatools.dataio.get_pvdaq_data(sysid=2, api_key='DEMO_KEY', year=2011, delim=',', standardize=True)
This function queries one or more years of raw PV system data from NREL’s PVDAQ data service:

https://openei.org/wiki/PVDAQ/PVData_Map

Parameters:
  • sysid (int, optional) – The system ID to query. Default is 2.

  • api_key (str, optional) – The API key for authentication. Default is “DEMO_KEY”.

  • year (int or list of int, optional) – The year or list of years to query. Default is 2011.

  • delim (str, optional) – The delimiter used in the CSV file. Default is “,”.

  • standardize (bool, optional) – Whether to standardize the time axis. Default is True.

Returns:

A dataframe containing the concatenated data for all queried years.

Return type:

pd.DataFrame
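As a usage sketch, the code below requests several years for one system in a single call. Only `get_pvdaq_data` and its parameters come from the signature above; the helper `normalize_years` and the wrapper `fetch_pvdaq_years` are hypothetical names introduced for illustration. The library import is deferred into the wrapper so the helper can be run without the package installed or network access.

```python
def normalize_years(year):
    """Return the requested years as a sorted list without duplicates.

    Mirrors the documented `year` parameter, which accepts an int or a
    list of ints.
    """
    if isinstance(year, int):
        return [year]
    return sorted(set(year))


def fetch_pvdaq_years(sysid=2, year=2011, api_key="DEMO_KEY"):
    """Hypothetical wrapper: query PVDAQ for one system over one or more years."""
    # Imported lazily: the call below needs solardatatools and network access.
    from solardatatools.dataio import get_pvdaq_data

    return get_pvdaq_data(sysid=sysid, api_key=api_key, year=normalize_years(year))
```

Calling `fetch_pvdaq_years(sysid=2, year=[2011, 2012])` would then return one dataframe covering both years, per the return description above.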

solardatatools.dataio.load_cassandra_data(siteid, column='ac_power', sensor=None, tmin=None, tmax=None, limit=None, cluster_ip=None, verbose=True)

Deprecated since version 1.5.0: dataio.load_cassandra_data is deprecated and will be removed in Solar Data Tools 2.0. Use load_redshift_data instead.
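A minimal migration sketch, based only on comparing the two signatures on this page: `cluster_ip` exists only in the deprecated function, and `api_key` is required only by `load_redshift_data`. The helper name `cassandra_to_redshift_kwargs` is hypothetical, and the sketch assumes the shared parameters keep the same meaning in both functions, which the signatures suggest but this page does not state.

```python
def cassandra_to_redshift_kwargs(old_kwargs, api_key):
    """Translate load_cassandra_data keyword arguments for load_redshift_data."""
    new_kwargs = dict(old_kwargs)
    # cluster_ip appears only in the deprecated signature, so it is dropped.
    new_kwargs.pop("cluster_ip", None)
    # api_key is required by load_redshift_data and has no Cassandra counterpart.
    new_kwargs["api_key"] = api_key
    return new_kwargs
```

Note that the default for `verbose` differs between the two signatures (True before, False now), so callers relying on printed output may want to pass it explicitly.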

solardatatools.dataio.load_constellation_data(file_id, location='s3://pv.insight.misc/pv_fleets/', data_fn_pattern='{}_20201006_composite.csv', index_col=0, parse_dates=[0], json_file=False)

Load constellation data from a specified location.

This function reads a CSV file from a given location and optionally loads additional JSON metadata.

Parameters:
  • file_id (str) – Identifier for the file to load.

  • location (str, optional) – The base location where the data files are stored. Default is “s3://pv.insight.misc/pv_fleets/”.

  • data_fn_pattern (str, optional) – The pattern for the data file name. Default is “{}_20201006_composite.csv”.

  • index_col (int, optional) – Column to use as the row labels of the DataFrame. Default is 0.

  • parse_dates (list, optional) – List of column indices to parse as dates. Default is [0].

  • json_file (bool, optional) – Whether to load additional JSON metadata. Default is False.

Returns:

If json_file is True, a tuple of the DataFrame and the JSON metadata; otherwise just the DataFrame.

Return type:

tuple[pd.DataFrame, dict] or pd.DataFrame
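Because the return type depends on `json_file`, calling code may want to normalize the result to a fixed shape. The helper below is a hypothetical sketch, not part of the library: it returns a `(data, metadata)` pair in both cases, with `metadata` set to `None` when no JSON was loaded.

```python
def split_constellation_result(result):
    """Return (data, metadata) whether or not JSON metadata was loaded."""
    if isinstance(result, tuple):
        # json_file=True case: the documented return is (DataFrame, dict).
        data, metadata = result
        return data, metadata
    # json_file=False case: only the DataFrame is returned.
    return result, None
```

With this in place, `data, meta = split_constellation_result(load_constellation_data(file_id, json_file=flag))` works for either value of `flag`.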

solardatatools.dataio.load_pvo_data(file_index=None, id_num=None, location='s3://pv.insight.nrel/PVO/', metadata_fn='sys_meta.csv', data_fn_pattern='PVOutput/{}.csv', index_col=0, parse_dates=[0], usecols=[1, 3], fix_dst=True, tz_column='TimeZone', id_column='ID', verbose=True)

Wrapper function for loading data from the NREL partnership. This data is in a secure, private S3 bucket for use by the GISMo team only. However, the function can load any data set organized as a collection of CSV files with a single metadata file. The metadata file contains a sequential file index as well as a unique system ID number for each site. Either may be set by the user to retrieve data, but the ID number takes precedence if both are provided. The data files are assumed to be uniquely identified by the system ID number. The metadata file also contains a column with time zone information, which is used to correct for daylight saving time.

Parameters:
  • file_index – the sequential index number of the system

  • id_num – the system ID number (non-sequential)

  • location – string identifying the directory containing the data

  • metadata_fn – the location of the metadata file

  • data_fn_pattern – the pattern of data file identification

  • index_col – the column containing the index (see: pandas.read_csv)

  • parse_dates – list of columns to parse dates (see: pandas.read_csv)

  • usecols – columns to load from file (see: pandas.read_csv)

  • fix_dst – boolean; if True, use the provided time zone information to correct for daylight saving time in the data

  • tz_column – the column name in the metadata file that contains the timezone information

  • id_column – the column name in the metadata file that contains the unique system ID information

  • verbose – boolean; if True, print information about the retrieved file

Returns:

pandas dataframe containing system power data
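The lookup described above (ID number takes precedence over file index, and the ID fills the file-name pattern) can be sketched with the standard library alone. This is an illustration of the documented behavior, not the library's implementation. The metadata rows are made up, the helper names are hypothetical, and the sketch assumes `file_index` is a zero-based row position in the metadata file.

```python
import csv
import io

# Made-up metadata table for illustration; real files may have more columns.
METADATA_CSV = """ID,TimeZone
1001,America/Los_Angeles
2040,America/New_York
3317,America/Chicago
"""


def resolve_system_id(metadata_text, file_index=None, id_num=None, id_column="ID"):
    """Pick a system ID as described above: id_num wins if both are given."""
    if id_num is not None:
        return int(id_num)
    if file_index is not None:
        rows = list(csv.DictReader(io.StringIO(metadata_text)))
        # Assumption: file_index is a zero-based row position.
        return int(rows[file_index][id_column])
    raise ValueError("provide file_index or id_num")


def data_file_name(system_id, data_fn_pattern="PVOutput/{}.csv"):
    """Fill the file-name pattern with the system ID."""
    return data_fn_pattern.format(system_id)
```

Under these assumptions, `file_index=1` resolves to system 2040 and the file name `PVOutput/2040.csv`, while passing `id_num=3317` alongside it selects system 3317 instead.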

solardatatools.dataio.load_redshift_data(siteid: str, api_key: str, column: str = 'ac_power', sensor: int | list[int] | None = None, tmin: datetime | None = None, tmax: datetime | None = None, limit: int | None = None, verbose: bool = False) → DataFrame

Queries a SunPower dataset by site id and returns a Pandas DataFrame.

Request an API key by registering at https://pvdb.slacgismo.org and emailing slacgismotutorials@gmail.com with your information and use case.

Parameters:
  • siteid (str) – Site id to query

  • api_key (str) – API key for authentication to query data

  • column (str, optional) – Measurement name (meas_name) to query, defaults to “ac_power”

  • sensor (int | list[int] | None, optional) – Sensor index to query based on the number of sensors at the site id, defaults to None

  • tmin (datetime | None, optional) – Minimum timestamp to query, defaults to None

  • tmax (datetime | None, optional) – Maximum timestamp to query, defaults to None

  • limit (int | None, optional) – Maximum number of rows to query, defaults to None

  • verbose (bool, optional) – Option to print out additional information, defaults to False

Returns:

Pandas DataFrame containing the queried data.

Return type:

pd.DataFrame
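A hedged sketch of a time-bounded query follows. The parameter names match the signature above; the helper `redshift_query_kwargs`, the wrapper `fetch_site_window`, the site id, and the key are placeholders introduced for illustration. The check that `start` precedes `end` is an added safeguard, not documented library behavior. The library import is deferred so the argument-building step can run without the package, network access, or a valid key.

```python
from datetime import datetime


def redshift_query_kwargs(siteid, api_key, start, end, sensor=None, limit=None):
    """Assemble keyword arguments for a time-bounded load_redshift_data query."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    return {
        "siteid": siteid,
        "api_key": api_key,
        "column": "ac_power",
        "sensor": sensor,
        "tmin": start,
        "tmax": end,
        "limit": limit,
    }


def fetch_site_window(**kwargs):
    """Hypothetical wrapper around the documented function."""
    # Imported lazily: requires solardatatools, network access, and a valid key.
    from solardatatools.dataio import load_redshift_data

    return load_redshift_data(**kwargs)


kwargs = redshift_query_kwargs(
    siteid="EXAMPLE_SITE",   # placeholder, not a real site id
    api_key="YOUR_API_KEY",  # placeholder
    start=datetime(2020, 1, 1),
    end=datetime(2020, 2, 1),
    sensor=[0, 1],
)
```

Passing the result as `fetch_site_window(**kwargs)` would issue the query once a real site id and API key are substituted.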