PapcoDoc/documents/DataSetSpec
Audience: papco developers
Purpose: Official document describing papco's dataset model. This is similar to the QDataSet model used in Autoplot, and this document may serve that model as well.
Introduction
Papco's dataset is an IDL structure with a prescribed set of tags. With these conventions we hope to provide:
- transparent support for multi-dimensional data
- metadata for automatic labeling and data discovery
- operations such as slice, collapse, and append that work conveniently, efficiently, and correctly.
Tags
DATA (dblarr,required)
Data is stored in this variable. This is an intarr, fltarr, or dblarr with N dimensions, where N is the "rank" of the dataset.
DEPEND_0, DEPEND_1, ..., DEPEND_<N-1> (string,required)
These are strings, each naming another tag. That tag holds another dataset that "tags" the corresponding dimension. They are required in Papco's model, but in Autoplot they may be missing, in which case [0,1,2,3,...] is used.
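As an illustration of how DEPEND_<i> resolves, here is a minimal Python sketch. It is not papco code: a dict stands in for the IDL structure, the `container` dict and the `depend_values` helper are hypothetical names, and the outermost list is treated as dimension 0.

```python
# Sketch only: dicts stand in for IDL structures. "container" (hypothetical)
# plays the role of the structure whose tags are datasets.

def depend_values(container, ds, dim):
    """Return the values that tag dimension `dim` of dataset `ds`."""
    tag = ds.get("DEPEND_%d" % dim)
    if tag is None:
        # Autoplot-style fallback described above: use 0, 1, 2, ...
        data = ds["DATA"]
        for _ in range(dim):
            data = data[0]  # step down to the requested dimension
        return list(range(len(data)))
    # The string names another tag, which holds the tagging dataset.
    return container[tag]["DATA"]


container = {
    "time": {"NAME": "time", "UNITS": "t2000", "DATA": [0.0, 60.0, 120.0]},
    "flux": {"NAME": "flux", "DEPEND_0": "time", "DATA": [1.5, 2.5, 3.5]},
}

time_tags = depend_values(container, container["flux"], 0)
untagged = depend_values(container, {"DATA": [7, 8, 9, 10]}, 0)
```

Here `time_tags` is the DATA of the "time" dataset, while the dataset without DEPEND_0 falls back to its index values.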
NAME (string)
Indicates the IDL identifier to use for the dataset. Note this must be the same as the name of the tag containing the dataset, and it must match [a-z][a-z0-9]*.
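The two NAME rules can be checked mechanically. The following Python sketch is illustrative only; `name_ok` is a hypothetical helper, and it compares names literally, whereas IDL itself treats tag names case-insensitively.

```python
import re

# The pattern given in the spec: a lowercase letter, then lowercase
# letters or digits.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*")


def name_ok(tag, name):
    """True if NAME equals the containing tag and matches the pattern."""
    return name == tag and NAME_PATTERN.fullmatch(name) is not None


good = name_ok("flux", "flux")
mismatch = name_ok("flux", "flux2")       # NAME differs from the tag
bad_start = name_ok("2flux", "2flux")     # must start with a letter
```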
LABEL (string)
Human-consumable label for labeling the data. This should be concise and may rely on the title to provide context. NAME may be used for a label when LABEL is not found.
UNITS (string)
String identifying the units, providing context for the data. See papco_ds_units for canonical units. Times are represented with units like "t2000". If not specified, units are dimensionless (same as "").
VALIDMIN, VALIDMAX (double)
Data is valid if greater than validmin and less than validmax. If a whole number (such as zero) is to be included in the range, then an arbitrarily small amount should be subtracted from the min and added to the max. Note in QDataSet, validmin is inclusive, and FILL should be used to exclude the min.
FILLVAL (double)
Indicates the fill value. If not found, then -1e31 is used.
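The validity rules for VALIDMIN, VALIDMAX, and FILLVAL can be sketched as follows. This is a Python illustration under stated assumptions, not papco's implementation: `is_valid` and `EPS` are hypothetical names, and a missing VALIDMIN or VALIDMAX is assumed to leave that side unbounded, which the text above does not specify.

```python
DEFAULT_FILL = -1e31  # default fill value stated in the spec


def is_valid(value, ds):
    """Apply the exclusive valid range and the fill value to one number."""
    if value == ds.get("FILLVAL", DEFAULT_FILL):
        return False
    vmin = ds.get("VALIDMIN")
    vmax = ds.get("VALIDMAX")
    # Both bounds are exclusive; a missing bound is treated as unbounded
    # (an assumption of this sketch).
    if vmin is not None and not value > vmin:
        return False
    if vmax is not None and not value < vmax:
        return False
    return True


EPS = 1e-6  # the "arbitrarily small amount"; the size is illustrative

strict = {"VALIDMIN": 0.0, "VALIDMAX": 100.0}
widened = {"VALIDMIN": 0.0 - EPS, "VALIDMAX": 100.0 + EPS}

zero_strict = is_valid(0.0, strict)     # excluded: the bound is exclusive
zero_widened = is_valid(0.0, widened)   # included after widening
```

Widening the range by a small amount is what lets whole-number endpoints such as zero count as valid.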
SCALEMIN, SCALEMAX (double)
Recommended range for plotting and histograms.
SCALETYP (string)
If "log", this indicates the data is more uniformly spaced in log space than in linear space ("lin"). When this is "log", bin_width will be specified in the log space. "lin" is the default. In QDataSet this is scaleType.
CORRELATE_0, CORRELATE_1, ... (string)
Identifies tags that are other datasets taken at the same time. These are called planes in QDataSet. These must have the same geometry as the dataset.
BIN_WIDTH (double or dblarr)
This is the width of the bins of a dataset that tags a dimension (a "tags dataset"). Bins refer to the integration interval for the tagged measurement. If the scaling is log, then this is the log10 ratio.
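To make the linear and log cases concrete, the Python sketch below derives bin edges from a tag value and BIN_WIDTH. It assumes the tag sits at the center of its bin (the arithmetic center for "lin", the geometric center for "log"). The spec does not state where the tag sits, so treat that placement and the `bin_edges` helper as hypothetical.

```python
import math


def bin_edges(center, bin_width, scaletyp="lin"):
    """Return (lower, upper) edges of the bin around `center`."""
    if scaletyp == "log":
        # bin_width is log10(upper / lower), so each edge is half that
        # ratio away from the (assumed) geometric center.
        half = 10.0 ** (bin_width / 2.0)
        return center / half, center * half
    return center - bin_width / 2.0, center + bin_width / 2.0


lin_lo, lin_hi = bin_edges(50.0, 10.0)
log_lo, log_hi = bin_edges(100.0, 1.0, "log")
log_ratio = math.log10(log_hi / log_lo)  # recovers the log-space width
```

With a log-space width of 1, the upper edge is ten times the lower edge, whatever the tag value.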
BIN_PLUS, BIN_MINUS (double or dblarr)
These are deprecated and should not be used.
BINS (boolean)
If 1 (true), then the data will be [2,n], and each pair indicates the bounds of the integration intervals (bins).
FORMAT (string)
IDL-ready format string for formatting. Default is "(f10.2)"
FRAME (string)
When present and not empty, indicates that DATA is a fltarr(n,3) whose second dimension holds three-element vectors in the specified frame. Such datasets have rank 1. The DEPEND_1 tags dataset may be a "labels" dataset, with ds.data a string array of labels. Note this is inconsistent with BINS, which is fltarr(2,n).
DIM_SIZES (intarr,reserved)
Should be the same as size(ds.DATA). It is used to query the dataset without a copy of the data.
Operators may modify datasets
- dataset operators may replace values deemed invalid by fillval, validmin and validmax with NaN.
- dataset operators may introduce new metadata for efficiency, e.g. DIM_SIZES.
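The first point can be illustrated with a small Python sketch of such an operator. It is not papco code: `fill_invalid` is a hypothetical name, it handles rank 1 data only, and it assumes a missing VALIDMIN or VALIDMAX means no bound on that side.

```python
import math


def fill_invalid(ds):
    """Return a copy of a rank 1 dataset with invalid values set to NaN."""
    fill = ds.get("FILLVAL", -1e31)
    vmin = ds.get("VALIDMIN", -math.inf)  # assumed unbounded if absent
    vmax = ds.get("VALIDMAX", math.inf)
    out = dict(ds)  # shallow copy, so the metadata tags carry over
    out["DATA"] = [
        v if (v != fill and vmin < v < vmax) else math.nan
        for v in ds["DATA"]
    ]
    return out


raw = {"DATA": [1.0, -1e31, 250.0, 5.0], "VALIDMIN": 0.0, "VALIDMAX": 100.0}
cleaned = fill_invalid(raw)
```

The input dataset is left untouched, so a caller that needs the original fill values still has them.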
To be considered
- caa model, which is an array of pointers to dataset structures.
- IDL Object that hides implementation details. This is more like the Java QDataSet used in Autoplot.
- the dataset and its tags are both stored as tags of the dataset structure, and the tag "DATA" (string) identifies the tag that is the default dataset.
- make as similar to QDataSet as possible.
- Rank 0 datasets. Use for describing BIN_WIDTH, etc.
DataSet Rank
DataSet "rank" refers to the number of indices the dataset has. This is often the same as the number of physical dimensions the data occupy. For example, the rank 2 dataset Flux( Time, Energy ) occupies the two physical dimensions of time and energy. However, this is not necessarily the case: the dataset B_gsm( Time, Frame=3 ) is rank 2 but occupies four physical dimensions (time, x-gsm, y-gsm, z-gsm). Dataset rank is reduced by slicing or collapsing. Dataset rank can be zero, which is a scalar. The dataset operator implementation limits rank to 4.
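The way slicing and collapsing reduce rank can be sketched in Python with nested lists, where the outermost list is the first index. The helpers `rank`, `slice0`, and `collapse0` are hypothetical names, `collapse0` handles rank 2 data only, and collapsing by averaging is an assumption, since the text does not say how values are combined.

```python
def rank(data):
    """Count the indices needed to reach a single value."""
    n = 0
    while isinstance(data, list):
        n += 1
        if not data:
            break
        data = data[0]
    return n


def slice0(ds, index):
    """Slice at `index` of the first dimension; the rank drops by one."""
    out = {"DATA": ds["DATA"][index]}
    # The remaining dimensions shift down: DEPEND_1 becomes DEPEND_0, etc.
    dim = 1
    while "DEPEND_%d" % dim in ds:
        out["DEPEND_%d" % (dim - 1)] = ds["DEPEND_%d" % dim]
        dim += 1
    return out


def collapse0(ds):
    """Collapse the first dimension of a rank 2 dataset by averaging."""
    rows = ds["DATA"]
    out = slice0(ds, 0)  # reuse the DEPEND shifting
    out["DATA"] = [sum(col) / len(rows) for col in zip(*rows)]
    return out


# Flux( Time, Energy ): two times, three energies.
flux = {
    "DATA": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    "DEPEND_0": "time",
    "DEPEND_1": "energy",
}
spectrum = slice0(flux, 1)          # Flux( Energy ) at the second time
mean_spectrum = collapse0(flux)     # Flux( Energy ) averaged over time
```

Both results are rank 1 and are tagged by "energy" alone, which is the metadata bookkeeping a real operator would also need to do.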
DataSet types
independent variable
- For example, "the time tags"
- must be a rank one dataset
- if [2,n] and has the tag "bins", then the bins are explicitly identified.
- must not contain invalid values identified by valid_range or fill.
dependent variable
- has metadata depend_0 .. depend_<N-1> to indicate independent variables for each dimension
- if [3,n] and has tag "frame," then the 3-element dimension is the coordinate frame basis (x,y,z). (TODO: inconsistent with documentation above, check on this...)
DataSet Properties (Attributes)
- represented as tags of structure.
validmin, validmax
- valid if greater than validmin and less than validmax
- papco_ds_valid_range returns [ min, max, scale type ]
- scale type=0 for linear
- scale type=1 for log
- makes a guess if dataset doesn't specify.
DataSet Definition--Groping the Elephant
what a dataset is
- structured storage of data
- allows for operator definition
- as simple as possible
- preserves information of many data models
- data plus metadata
- extends IDL/Matlab arrays to allow for higher abstraction
what a dataset isn't
- a model for a particular dataset
- applicable to all data models
Like XML, it is not a language so much as a syntax. An application uses it by applying semantics to it.

