Creating new GBT pointing models involves processing astronomical observations (e.g. pointing runs) together with auxiliary data drawn from various log files. This MR specifies a utility that constructs a single binary-format file containing the log data re-sampled onto a common time interval.
This is a specification for creating a binary data file from FITS formatted files, in support of GBT pointing and focus analysis. This file will serve as one of the inputs to the (matlab-based) pointing analysis program.
All the metrology data, for example from the quadrant detector, accelerometers, inclinometers, are normally stored into FITS files with a separate FITS file for each device or in some cases, several FITS files per device. They are organized in terms of the device, manager, and sampler. The proposal here is to define a flexible way that one can specify the FITS files and parameters within those FITS files that are to be extracted and placed into a binary formatted file. All the input FITS files are FITS binary tables with the first field being the MJD time tag.
The request is that the output database be a flat file, with all the data in binary form as double precision (64 bits) IEEE-754 floating point. All of the data is to be selected for a specified time range and interpolated onto a specified time interval. Each 'row' of the output file will contain a timestamp (in MJD and fraction of day) as the first field, followed by all other specified fields.
This would be a general system to consolidate any group of FITS files into a common file, and it can serve many purposes, not just PTCS. Any analysis of GBT monitoring data can use this system. We suggest an option to put the collected data into a big FITS file, as an alternate option. In this form it can be easily imported into idl programs, or plotted with "fv".
The output database will consist of two files, the first an ascii file giving the selection and list of samplers, as described below; the second is a binary file containing all the selected and interpolated data.
The basic requirements are to allow the specification of a directory, time range, resampling interval, a subdirectory such as 'Weather-Weather1-weather1', fields within the sampler, and interpolation method parameters in a specification file.
The application shall read this file, and read the necessary files which match the time range, and re-sample the data. The data shall be written to either a FITS binary table or a native binary format file.
The input specification sets the time range, log file set, and resampling interval. Comments are introduced by a '#' in the first column and end at a newline.
The selection specification will contain the following elements:
The start time, end time and interval are set by assigning values to the variables 'Starttime', 'Endtime', and 'Interval' respectively, as in the example shown below.
Starttime=Start_Date Start_Time
Endtime=End_Date End_Time
Interval=0.1
Where: the dates are in yyyy-mm-dd, and the times are in HH:MM:SS referenced to UTC. The resampling interval is specified in seconds.
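Because the output rows are time-tagged in MJD, the date and time strings above have to be converted at some point. A minimal sketch of that conversion in Python, assuming the standard MJD epoch of 1858-11-17 00:00 UTC; the function name `to_mjd` is hypothetical and not part of this specification:

```python
from datetime import datetime, timezone

# Standard epoch of the Modified Julian Date: 1858-11-17 00:00:00 UTC.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def to_mjd(text):
    """Convert a 'yyyy-mm-dd HH:MM:SS' string (UTC) to MJD with fraction of day."""
    stamp = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    stamp = stamp.replace(tzinfo=timezone.utc)
    delta = stamp - MJD_EPOCH
    # Whole days plus the elapsed seconds of the current day as a fraction.
    return delta.days + delta.seconds / 86400.0
```

For example, `to_mjd("2006-11-29 07:00:00")` gives the MJD of the start time used in the example specification in this document.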
A Path prefix is normally required to set the location of the log directories. The 'Path' variable may be set more than once, as needed. The actual pathname applied to each 'with' statement is the most recent value. The Path variable must be set at least once.
3.1.3 FITS directory name, and optional FITS extension
A log subdirectory may be added by the 'with' keyword. Fields of the sampler are added with the 'add' keyword. If desired, the input field may be mapped to a different name using the 'as' keyword. The basic syntax to specify fields is intended to be readable:
with sub_dir_name add field1,field2 [as field_xyz], ... [using interpolation_method [ window window_size ] ]
Square brackets indicate optional clauses. Variations on this basic statement are described below.
To add the wind velocity column from the weather1 sampler:
with Weather-Weather1-weather1 add WINDVEL
Additional fieldnames may be specified by providing a comma separated field list, or additional 'add' statements.
The 'using' keyword allows specification of an interpolation method, optionally followed by the 'window' keyword and a window size.
with Weather-Weather1-weather1 add WINDVEL using mean window 9
There may be cases when writing a FITS format output file, that column names may clash. The add statement may contain the 'as' keyword, followed by the new field name. If during processing a name clash exists, a message will be output to the console.
with Accelerometer-Accelerometer1-AccelerometerData add X as X_1, Y as Y_1
Planned Enhancements not supported in the initial version:
- An 'add' statement may optionally specify a 'from' clause, which specifies a FITS table in a file.
- A custom routine may be specified by providing the procedure name in the 'using' clause.
This construct may be repeated as needed for multiple FITS files. A full example is shown in section 3.2.
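The statement syntax described above can be illustrated with a small parser. This is only a sketch under stated assumptions: it handles the single-line form of a 'with ... add ...' statement, it applies a trailing 'using' clause to every field on that line, and the names `FieldSpec` and `parse_statement` are hypothetical rather than taken from the design.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldSpec:
    """One requested column (hypothetical container, not from the MR)."""
    sampler: str            # log subdirectory named by 'with'
    field: str              # column name in the input FITS table
    out_name: str           # 'as' target, or the field name if no 'as' is given
    method: Optional[str]   # interpolation method from 'using', if any
    window: Optional[int]   # window size, if any


def parse_statement(line: str) -> List[FieldSpec]:
    """Parse a single-line 'with <subdir> add <fields> [using ...]' statement."""
    match = re.match(r"\s*with\s+(\S+)\s+add\s+(.*)$", line)
    if match is None:
        raise ValueError("expected: with <subdir> add <fields>")
    sampler, rest = match.group(1), match.group(2)

    method, window = None, None
    # Split off the optional trailing 'using <method> [window <n>]' clause.
    using = re.search(r"\s+using\s+(\w+)(?:\s+window\s+(\d+))?\s*$", rest)
    if using:
        method = using.group(1)
        window = int(using.group(2)) if using.group(2) else None
        rest = rest[:using.start()]

    specs = []
    for item in rest.split(","):
        words = item.split()
        if len(words) == 1:
            field, out_name = words[0], words[0]
        elif len(words) == 3 and words[1] == "as":
            field, out_name = words[0], words[2]
        else:
            raise ValueError("bad field item: %r" % item)
        specs.append(FieldSpec(sampler, field, out_name, method, window))
    return specs
```

Passing the accelerometer example above to `parse_statement` yields two `FieldSpec` entries whose output names are `X_1` and `Y_1`. A real implementation would also need to handle 'Path' assignments, comments, and statements spread over several lines.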
As an example, consider the following:
#=========================================================
# Example: log data specification
# comments may begin with #
# UT date and time range, and sampling interval in seconds.
#
Starttime=2006-11-29 07:00:00
Endtime=2006-11-30 14:45:00
Interval=0.1
# Set the path prefix. This may be set more than once, so it is
# context-sensitive.
Path=/home/gbtlogs
# The first set of fields from the sampler
with Inclinometer-Inclinometer-InclinometerData
add x1_angle, y1_angle, x2_angle, y2_angle using mean window 5
# Example use of the 'as' construct. This renames the column in the output file.
with Accelerometer-Accelerometer1-AccelerometerData
add X as X_1,
add Y as Y_1,
add Z as Z_1
with Accelerometer-Accelerometer2-AccelerometerData
add X as X_2,
add Y as Y_2,
add Z as Z_2
# Example specifying different methods for each field
with Accelerometer-Accelerometer3-AccelerometerData
add X as X_3 using mean,
add Y as Y_3 using median,
add Z as Z_3 using nearest
#
with Weather-Weather2-weather2 add WINDVEL, WINDDIR using median window 3
# Change Path to another directory
Path=/home/gbtdata/AGBT_XXX/Archivist
with ServoMonitor-ServoMonitor-Az_El_1Hz
add El_1Hz_Az, El_1Hz_El
# Include dcr data calling a custom interpolation method with a window of 3
# (Note: Future enhancement)
# Path=/home/gbtdata/AGBT_XXX/DCR
# Note fieldname and extension name are both 'DATA' in the DCR
# with DCR add DATA from DATA using custom_dcr_mean window 3
#=========================================================
This example has specified 17 quantities from 6 different samplers. The output file will consist of groups of 18 double precision floats, i.e, the MJD time stamp followed by the 17 selected quantities. There will thus be 1,143,000 groups, corresponding to tenth of a second sampling over the 31.75 hour time range. Thus a total file size of about 165 MBytes.
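The size estimate above can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers from the example specification; the variable names are illustrative.

```python
from datetime import datetime

# Values taken from the example specification.
start = datetime(2006, 11, 29, 7, 0, 0)
end = datetime(2006, 11, 30, 14, 45, 0)
interval_s = 0.1      # resampling interval in seconds
n_fields = 17         # selected quantities, excluding the MJD time stamp

span_s = (end - start).total_seconds()    # length of the time range in seconds
n_rows = round(span_s / interval_s)       # number of output groups
row_bytes = (1 + n_fields) * 8            # MJD plus fields, 8 bytes per double
total_bytes = n_rows * row_bytes

print(span_s / 3600.0, n_rows, row_bytes, total_bytes)
```

The time range is 31.75 hours (114,300 seconds), so there are 1,143,000 groups of 144 bytes each, for a total of 164,592,000 bytes.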
The output format depends upon the output filename extension. If the extension is '.fits' then a FITS file will be generated. If any other extension is given, a binary file is generated.
If the specification file above were saved into a file named "specification.input", then a session with the collation utility might look like:
$ log_collator -o output_file_name.bin -i specification.input
Or, if a FITS format output file is desired, just change the extension name:
$ log_collator -o output_file_name.fits -i specification.input
There are three general cases:
- If a time series is to be interpolated onto a finer grid, then unless otherwise specified, a linear interpolation between the nearest neighbors shall be used.
- If a time series is to be changed to a coarser sampling, then unless otherwise specified, the average of the finer spaced data shall be used unless the number of samples is 5 or less, otherwise a median shall be used.
- If the time series sampling matches the time grid, then unless otherwise specified, the data shall be copied without modification to the output, using a nearest neighbor approach. Note: if the data is sampled at the same rate, but skewed from the time grid, this default may cause a phase shift in the output data of up to Interval/2 seconds. If this is problematic, then 'linear' interpolation should be specified instead.
A requirement to support custom interpolation methods is under consideration, but will not be included in the initial version. The following built-in interpolation methods will be available:
- mean
- median
- linear
- nearest (i.e. nearest neighbor)
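The four built-in methods listed above could behave as in the following sketch. It assumes that the samples falling inside the window have already been selected and are sorted by time, and that a grid time outside the sampled range yields NaN for 'linear'. Neither assumption is stated in this MR, and the function name `interpolate` is hypothetical.

```python
import math
from bisect import bisect_left
from statistics import mean, median


def interpolate(times, values, t, method):
    """Estimate the value at grid time t from time-sorted window samples."""
    if not times:
        return math.nan                      # no data available in the window
    if method == "mean":
        return mean(values)
    if method == "median":
        return median(values)
    if method == "nearest":
        idx = min(range(len(times)), key=lambda i: abs(times[i] - t))
        return values[idx]
    if method == "linear":
        i = bisect_left(times, t)            # first sample with time >= t
        if i < len(times) and times[i] == t:
            return values[i]                 # grid time coincides with a sample
        if i == 0 or i == len(times):
            return math.nan                  # t is not bracketed by two samples
        t0, t1 = times[i - 1], times[i]
        v0, v1 = values[i - 1], values[i]
        return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("unknown interpolation method: %r" % method)
```

With samples at times 0, 1, 2 and values 10, 20, 60, a linear request at t = 0.5 falls halfway between the first two samples, while 'mean' and 'median' ignore t and summarise the whole window.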
If a field specification has set the window size, then the requested window will be used to select window_size samples centered on the time grid. The input data is assumed to be regularly sampled. If the window size is not specified, the following heuristics will be used:
- The input sample rate will be calculated, and a window size N shall be determined by N = (input_rate/output_rate). Once the window is set, the nearest window_size points will be selected for use by the interpolator/filter.
- If the output sampling rate is greater than that of the data source, values will be interpolated using linear interpolation.
- If the output sampling rate is greater than that of the data source, the window parameter is ignored (i.e. a window size of two is used).
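The window heuristics above can be sketched as follows. The rounding of N to the nearest integer and the function names `default_window` and `select_window` are assumptions made for illustration; the MR only gives the ratio of the two rates.

```python
def default_window(input_interval_s, output_interval_s):
    """Window size N = input_rate / output_rate, using sample intervals.

    Since rate = 1 / interval, the ratio of rates equals
    output_interval / input_interval.
    """
    if output_interval_s < input_interval_s:
        # Output is sampled faster than the source: the window is ignored
        # and the two bracketing points are used for linear interpolation.
        return 2
    # Assumption: round to the nearest whole number of samples, at least one.
    return max(1, round(output_interval_s / input_interval_s))


def select_window(times, t, n):
    """Return the indices of the n samples nearest to grid time t, in time order."""
    by_distance = sorted(range(len(times)), key=lambda i: abs(times[i] - t))
    return sorted(by_distance[:n])
```

For a 10 Hz source resampled to 1 Hz this gives a window of 10 samples; for a 1 Hz source resampled to 10 Hz it gives the two-point window used for linear interpolation.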
There will be cases where, for some reason, data for a portion of the time interval is unavailable. In this case, the affected fields shall be set to NaN (not-a-number) to indicate that no data was available, and the program shall still generate the output. When this occurs, the program should print alert information to the standard-error stream.
3.5.1 FITS Formatted Output
A mode will be available to write a FITS formatted file, with minimal header information such as TTYPE, TFORM, and TUNIT for each column, so that tools such as fv can be used to verify/view the final data product.
In the native binary mode, the output file will consist of rows of data, where the first field in each row is an MJD, followed by the fields in the order given in the selection specification. All fields will be in IEEE-754 64-bit double precision format. Note that the byte ordering will be the native (little-endian) format of an IA-32 machine, which is the opposite of the big-endian byte order used in binary FITS tables.
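A consumer of the binary file therefore has to know the number of fields per row (available from the accompanying ASCII file) and must read the values as little-endian doubles. A minimal sketch using Python's standard `struct` module; the function name `read_rows` is hypothetical:

```python
import io
import struct


def read_rows(stream, n_fields):
    """Yield (mjd, values) pairs from a collated binary stream.

    '<' selects little-endian byte order and 'd' a 64-bit IEEE-754 double;
    each row holds the MJD followed by n_fields values.
    """
    row = struct.Struct("<%dd" % (n_fields + 1))
    while True:
        chunk = stream.read(row.size)
        if len(chunk) < row.size:
            break                      # end of file (or a truncated final row)
        mjd, *values = row.unpack(chunk)
        yield mjd, values


# Round-trip demonstration on an in-memory buffer with two fields per row.
demo = io.BytesIO(struct.pack("<3d", 54068.5, 1.0, 2.0) +
                  struct.pack("<3d", 54068.6, 3.0, 4.0))
rows = list(read_rows(demo, 2))
```

The explicit '<' prefix makes the reader independent of the byte order of the machine it runs on. Missing data appears as NaN in the values and can be detected with `math.isnan`.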
Frank notes:
1. FITS files allow a "field" to be an array. If one of the specified fields is actually an array, the whole array is to be inserted into the output binary file.
(This case doesn't occur in the normal log files.)
I'm not sure if this is going to work. The semantics of multi-valued data are encoded in the FITS headers, and are mode or setup dependent. A consumer of the output file (without the original FITS header) would not have a complete specification of what the values mean. Specifically:
- Backend timestamps have different definitions than ordinary log files. For example, log file timestamps indicate the 'midpoint' of the data sampling period. However, most backends time-tag the FITS data with either the beginning or end of an integration. In addition the integration may include states of reference or cal signals.
- Backend data files often are multi-valued, where log files are not. As an example, a DCR 'datapoint' contains a block of data, which is the product of switching signal phases and the number of channels used.
In the interest of getting this out this cycle, I'd like to exclude the requirement to allow user-defined interpolation methods and multi-valued fields, at least for the initial version. We can revisit this later as an enhancement. Of course the design shall not preclude these enhancements.
This program will be implemented in python, using a combination of existing modules as a start. The output will be written using PyFITS, and in the case of a raw binary table, a final post-processing step will be performed to extract the information from the output FITS file.
Figure: Object Diagram
Description: The Executive uses the SpecificationParser to read the textual specification, and creates a number of FieldSpecification objects. The Executive then creates the OutputDataWriter, and uses the list of FieldSpecifications to create processing pipelines, each of which consists of an InputDataField, an Interpolator, and an OutputDataField. These three components are managed and abstracted by the Pipeline object. Processing is driven by the Executive, which loops over the time interval and over each pipeline, producing a new time grid value for a given time t.
There is only one output file, but many files may need to be read, each of which contains a number of columns that are the input data for one or more pipelines. This means that the Executive maintains an ordered list of the InputDataSrc objects, independent in number from the list of Pipeline objects.
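The processing loop described above might look like the following sketch. It is heavily simplified and is not the actual design: in-memory lists stand in for InputDataField, a plain callable stands in for the Interpolator, the output column name stands in for OutputDataField, and `run_executive` is a hypothetical stand-in for the Executive's main loop.

```python
import math


class Pipeline:
    """Bundles an input field, an interpolator and an output column name."""

    def __init__(self, name, times, values, interpolator):
        self.name = name                  # stands in for OutputDataField
        self.times = times                # stands in for InputDataField
        self.values = values
        self.interpolator = interpolator  # callable(times, values, t) -> float

    def value_at(self, t):
        return self.interpolator(self.times, self.values, t)


def nearest(times, values, t):
    """Nearest-neighbour interpolator; returns NaN when no samples exist."""
    if not times:
        return math.nan
    idx = min(range(len(times)), key=lambda i: abs(times[i] - t))
    return values[idx]


def run_executive(pipelines, t_start, n_steps, step):
    """Loop over the time grid, then over each pipeline, building output rows."""
    rows = []
    for k in range(n_steps):
        # Compute t from the index so rounding error does not accumulate.
        t = t_start + k * step
        rows.append([t] + [p.value_at(t) for p in pipelines])
    return rows
```

Each row produced this way begins with the grid time and is followed by one value per pipeline, in pipeline order, which matches the row layout required for the output file.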
What has to get done to integrate this completely into the system. This checklist must be completed before Cycle Integration Testing begins.
- Communication with Computing group needed? None.
- What documentation needs to be updated? Create on-line documentation reference/man page.
- Training Needed? Is this being released to staff astronomers or everyone right now? (Intended for PTCS staff only.)
- Notification Needed? No.
- User Manual is located here.
- Sparrow unit tests shall continue to pass
- Additional unit tests shall be written to verify components
- Run the program with the selection specification example in section 3.2 above, to create both a FITS file and a binary table.
- Verify that the data contained in the output file:
- begins and ends at the requested times and has no gaps
- is not time-shifted (to verify global time indexing)
- is correctly interpolated (to verify the interpolation selection algorithm, window sizing and interpolation method)
- is arranged with other data at the same interval (to verify time indexing between files)
- contains the correct fields in the correct order
- contains the correct column names (to verify column renaming/labelling)
- The sponsor shall compare the dataset produced with a dataset generated by another tool, such as gbtidl.
See Sponsor testing.
APPROVED: I acknowledge that my request is fully contained in this MR, and if the SDD delivers exactly what I specified, I will be happy.
ACCEPTED: I acknowledge that I have validated the completed code according to the acceptance tests, and I am happy with the results.
CCC Discussion Area
Revision r1.20 - 27 Mar 2007 - 16:07 GMT - FrankGhigo