PapcoDoc/documents/DataSets

PaPCo’s Dataset Model

A dataset abstraction was introduced in papco 12 with several goals in mind. First, many modules read, store and plot datasets that are very similar in form, so a common abstraction greatly simplifies such modules, making maintenance easier and module development quicker. Second, the abstraction includes metadata (e.g. units) that allows a dataset from an unknown source to be used for science analysis. This idea of “data discovery” allows a papco module to use a new dataset without any new code.

Papco datasets are IDL structures with a number of required tags, inspired by CDF file conventions. The simplest dataset contains only the tag “data,” which must be one-dimensional and contain monotonically increasing values:

  xtags= { data:findgen(200)/10 }

A dataset of this type serves only to provide the independent-variable tags for another dataset. It is stored as a tag of that other dataset, and is named by the tag “depend_N,” where N is the dimension number:

  sineWave= { data:sin( xtags.data ), depend_0:'xtags', xtags:xtags }

or:

 energyTags= { data:findgen(30) }
 z= { data:randomu( s, 200, 30 ), depend_0:'xtags', depend_1:'energy', $
   xtags:xtags, energy:energyTags }
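
The way depend_N ties each dimension of “data” to a tags dataset can be sketched with built-in IDL routines only. This is a minimal illustration, not papco code: the procedure name ds_depend_demo is hypothetical, and the lookup shown is just one plausible way to resolve the names.

```idl
pro ds_depend_demo
   ; Build the two-dimensional example dataset from the text.
   xtags      = { data:findgen(200)/10 }
   energyTags = { data:findgen(30) }
   z = { data:randomu( seed, 200, 30 ), depend_0:'xtags', depend_1:'energy', $
         xtags:xtags, energy:energyTags }

   ; For each dimension, read the name stored in depend_N and find that tag.
   names = tag_names( z )                 ; tag names come back in upper case
   dims  = size( z.data, /dimensions )
   for n = 0, n_elements( dims ) - 1 do begin
      idep = where( names eq 'DEPEND_' + strtrim( n, 2 ) )
      itag = where( names eq strupcase( z.(idep[0]) ) )
      tags = z.(itag[0])
      print, 'dimension ', n, ': ', names[itag[0]], $
             n_elements( tags.data ), ' tags for ', dims[n], ' elements'
   endfor
end
```

Each tags dataset should have as many elements as the dimension it describes, which is what the printed line lets you compare.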

Any other tags are considered properties of the dataset. In addition, there are a number of properties used to label axes and identify fill data:

  units: a string that identifies the units of the data.  We’ll discuss units a little later.
  label: a string that identifies the dataset and is suitable for labeling an axis.
  binWidth: a double or double array that identifies the length of the interval to which the measurement is relevant, or the resolution of the measurement.
  format: an IDL format specifier, such as "(f10.3)"
  scalemin, scalemax
  scaletyp: "linear" or "log"

units property

Units are strings that tag the doubles of a dataset to make them meaningful quantities. They are displayed on plot axes, of course, but they can also be used for automatic units conversion for overplots. In addition, a set of “time location units” is defined, and these are used to locate data precisely on a time axis. For example, “mjd2000” indicates that each value is the number of decimal days elapsed since midnight on Jan 1, 2000. The conventional papco unit strings are returned in a structure by the routine papco_units:

 units= papco_units()
 timeTags= { data:14+dindgen(86400)/86400., units:units.mjd2000 }
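
To make the mjd2000 convention concrete, the sketch below converts one such value to a calendar date using the built-in JULDAY and CALDAT routines. It is only an illustration of the definition given above; the procedure name mjd2000_demo is hypothetical and papco may provide its own conversion routines.

```idl
pro mjd2000_demo
   ; Julian date of the mjd2000 epoch: midnight on Jan 1, 2000.
   epoch = julday( 1, 1, 2000, 0, 0, 0 )

   ; A sample time tag, 14.5 decimal days after the epoch.
   t = 14.5D

   ; Convert back to calendar fields; this lands on noon of Jan 15, 2000.
   caldat, epoch + t, month, day, year, hour, minute, second
   print, year, month, day, hour, minute, second
end
```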

dataset plotters

 papco_spectrogram, ds
 papco_lineplot, ds
 papco_radial_spectrogram, ds, radius="energy", angle="pitch"

dataset operators

 sliceds= papco_ds_slice( ds, time=34 )

Slices the N-dimensional dataset at time index 34 to produce an (N-1)-dimensional dataset. The range of the slice is documented as a string in the property "time_range." This is analogous to slice= reform( z[34,*,*] )

 energySpec= papco_ds_collapse( ds, 'pitch' )

Collapses the N-dimensional dataset by averaging over the pitch dimension, returning an (N-1)-dimensional dataset. The range of the pitch angles is returned in "pitch_range." For an array dimensioned (time, energy, pitch), this is analogous to z= total( z, 3 ) / n_pitch, where n_pitch is the number of pitch bins.

 ds= papco_ds_trim( ds, time=[1000,2000], energy=[0,10] )

Trims the number of elements in one or more dimensions. This is analogous to z= z[1000:2000,0:10,*]

 ds= papco_ds_transpose( ds )

Transposes the dataset dimensions. This is analogous to z= transpose(z)

 idx= lindgen(n_elements(ds.time.data)/5) * 5
 subds= papco_ds_array_index( ds, 0, idx )

Extracts a dataset using a subset of the indices along one dimension. Here, every fifth time index along dimension 0 is kept.

Analogs to IDL Array Operators

Here's a table that tries to equate familiar IDL array operations with papco dataset equivalents. In these examples, suppose that ds is a rank 3 dataset of FLUX(time,energy,pitch). z is a 3-d IDL array, with time as the first index, energy as the second index and pitch as the third index. (Note that we could say z=ds.data.)

IDL array operation                                          papco dataset equivalent
reform(z[*,*,3])                                             papco_ds_slice( ds, pitch=3 )
total(z,3)                                                   papco_ds_collapse( ds, 'pitch', /total )
z[2:4,*,*]                                                   papco_ds_trim( ds, time=[2,4] )
z[[1,2,3],*,*]                                               papco_ds_array_index( ds, 0, [1,2,3] )
transpose(z,[1,2,0])                                         papco_ds_transpose( ds, [ 1,2,0] )
interpolate( z, findgen(23),findgen(3), findgen(5), /grid )  papco_ds_grid( ds, findgen(23),findgen(3), findgen(5), /interpolate )
rebin( z, 4, 1 )                                             papco_ds_rebin( ds, [ 4, 1 ] )
[ z, z ]                                                     papco_ds_append( ds, ds )
size(z)                                                      papco_ds_size( ds )
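
The array side of these analogs can be checked directly with built-in IDL routines. The sketch below uses a small stand-in array (the sizes 6, 4 and 5 and the procedure name array_analog_demo are arbitrary choices for illustration) and prints the dimensions each operation produces; the papco dataset operators are not called.

```idl
pro array_analog_demo
   ; Stand-in for FLUX(time,energy,pitch): 6 times, 4 energies, 5 pitch bins.
   z = findgen( 6, 4, 5 )

   print, size( reform( z[*,*,3] ), /dimensions )       ; slice:     6 4
   print, size( total( z, 3 ), /dimensions )            ; collapse:  6 4
   print, size( z[2:4,*,*], /dimensions )               ; trim:      3 4 5
   print, size( z[[1,2,3],*,*], /dimensions )           ; index:     3 4 5
   print, size( transpose( z, [1,2,0] ), /dimensions )  ; transpose: 4 5 6
   print, size( [ z, z ], /dimensions )                 ; append:    12 4 5

   ; Averaging over pitch rather than summing: divide by the number of bins.
   zavg = total( z, 3 ) / (size( z, /dimensions ))[2]
   print, size( zavg, /dimensions )                     ; average:   6 4
end
```

Note that the dimension argument of total is counted from 1, while depend_N and the dimension argument of papco_ds_array_index are counted from 0.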

other functions

 if ( not papco_ds_valid( ds ) ) then stop

Returns 1 if the dataset is well-formed, 0 otherwise.
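
The kind of checks such a routine might perform can be sketched with built-in IDL routines. This is not the papco implementation: the function name example_ds_check is hypothetical, and the rules it applies (a “data” tag must exist, and every depend_N present must name a tags dataset whose length matches dimension N) are assumptions drawn from the description of the dataset model above.

```idl
function example_ds_check, ds
   ; Hypothetical helper, not part of papco. Returns 1 if ds looks well formed.
   if size( ds, /type ) ne 8 then return, 0            ; type 8 is a structure
   names = tag_names( ds )                             ; upper-case tag names
   if (where( names eq 'DATA' ))[0] eq -1 then return, 0

   dims = size( ds.data, /dimensions )
   for n = 0, n_elements( dims ) - 1 do begin
      ; Dimensions without a depend_N tag are simply skipped.
      idep = (where( names eq 'DEPEND_' + strtrim( n, 2 ) ))[0]
      if idep eq -1 then continue
      if size( ds.(idep), /type ) ne 7 then return, 0  ; type 7 is a string

      ; The named tag must exist and be a structure holding "data".
      itag = (where( names eq strupcase( ds.(idep) ) ))[0]
      if itag eq -1 then return, 0
      tags = ds.(itag)
      if size( tags, /type ) ne 8 then return, 0
      if (where( tag_names( tags ) eq 'DATA' ))[0] eq -1 then return, 0

      ; Its length must match the dimension it describes.
      if n_elements( tags.data ) ne dims[n] then return, 0
   endfor
   return, 1
end
```

For example, example_ds_check( { data:findgen(5) } ) returns 1, because a dataset with no depend_N tags passes every test in this sketch.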

 properties= papco_ds_properties( ds )

Returns a structure containing just the properties of the dataset. If the dataset contains no properties then { name:'' } is returned.

 idx= papco_ds_tagindex( ds, 'bin_width' )

Returns the index of the tag, or -1 if the tag is not present.

dataset builder

The routine papco_ds_builder is provided to make creating valid datasets easier.

 ; energy tags: 30 bin centers, evenly spaced in log10(energy)
 bin_width= 0.8
 log_energy_center= findgen(30)*bin_width+1.2
 units= papco_ds_units()
 energy= papco_ds_builder( 10^log_energy_center, units=units.eV, $
   label='energy', log=1, bin_width=bin_width )
 ; time tags: 2000 samples, 0.42 seconds apart, expressed in days
 bin_width=0.42D/86400
 time= papco_ds_builder( 230+findgen(2000)*bin_width, units=units.mjd2000,$
   label='time', bin_width=bin_width )
 ; pitch angle tags: centers of 18 bins, each 10 degrees wide
 pitch= papco_ds_builder( 10*(findgen(18)+0.5), units=units.degrees, $
   label='pitch angle', bin_width=10 )
 ; rank 3 dataset (time,energy,pitch); data_label is a label string set elsewhere
 input_data= papco_ds_builder( fltarr( 2000,30,18 ), label=data_label,$
   depend_0='time', depend_1='energy', depend_2='pitch', $
   time=time, energy=energy, pitch=pitch )