
Increasing the Accessibility of GBT Data

Karen O'Neil, Nicole Radziwill, Ron Maddalena

Abstract

The Green Bank Telescope (GBT) currently outputs its raw data as a suite of binary FITS files, approximately one per component device on the telescope, which are then consolidated and pre-processed before being written into an AIPS++ Measurement Set for more extensive analysis. This design decision by the GBT project has essentially restricted astronomers to a single data analysis package and reduced the productivity of those who prefer other analysis packages. To maximize the scientific returns from the unique features of the GBT, and to support a broader cross-section of observers' backgrounds and interests, work is being done to combine raw GBT data from the disparate FITS files into a variety of standardized FITS file formats such as SDFITS and CLASS FITS. These files can then be analyzed using tools such as IDL, CLASS, Mathematica and Matlab. This poster describes prototyping exercises that were initiated in Green Bank during the summer of 2003 for the purpose of identifying how to make GBT data more readily accessible to a wider variety of data reduction tools. Although further refinement is needed to support the standard observing modes of the GBT in a production capacity, early results from the investigation demonstrate the feasibility and applicability of the approach.

1. Background

At present, a typical data set resulting from a GBT observation is organized within an observer's /home/gbtdata/projectname directory tree, composed of subdirectories for many devices, each of which contains FITS files from that device. There are subdirectories for the antenna, the receiver, the IF and LO systems, and the backends used. There is also a ScanLog.fits file which indexes all of the device FITS files according to scans. GBT data can be assimilated into the AIPS++ DISH utility by using d.import, or an AIPS++ Measurement Set can be created by using gbtmsfiller from the UNIX command line. Either step transforms the raw data into a representation that is sensible from the astronomical perspective.
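The directory layout described above can be inspected with a few lines of Python. The following is a minimal sketch, assuming only what the text states: one subdirectory per device under the project directory, and a ScanLog.fits file at the top level. The function name and the device names used in any call are hypothetical, and the sketch only lists files; it does not read FITS content.

```python
from collections import defaultdict
from pathlib import Path


def index_device_files(project_dir):
    """Group raw FITS files by the device subdirectory that holds them.

    Returns (scan_log, files_by_device). scan_log is the path to
    ScanLog.fits, or None if the project directory has no such file.
    """
    root = Path(project_dir)
    scan_log = root / "ScanLog.fits"
    files_by_device = defaultdict(list)
    for path in sorted(root.glob("*/*.fits")):
        # The parent directory name identifies the device (assumed layout).
        files_by_device[path.parent.name].append(path.name)
    return (scan_log if scan_log.is_file() else None), dict(files_by_device)
```

Calling index_device_files on a project directory returns a dictionary keyed by device name, which is a convenient starting point for deciding which device files belong to a given scan.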

Because the GBT was designed to produce its raw data as a collection of FITS files, it is a challenge to combine the information for analysis by any data reduction package. To fill data into an AIPS++ Measurement Set, the development team for that product spent up to two years resolving issues associated with the data itself, and was eventually able to produce the gbtmsfiller which is in use today. Prior to the launch of the GBT data accessibility exploration, IDL users (for example) had to follow a similar process independently, writing their own modules to extract and preprocess relevant information from the collection of GBT FITS files. Users of other packages are still faced with this barrier.

Because the raw data output from the GBT is segmented, several common issues are encountered regardless of which data analysis package an observer wishes to use. These data preprocessing functions, which currently exist in gbtmsfiller, could be componentized for general use. These include, but are not limited to, the following:

  1. the ability to select subsets of data to process
  2. associating an appropriate antenna pointing with each data sample
  3. generating a description of the frequency axis and polarization information for each data sample
  4. extracting Tcal values from receiver FITS files
  5. converting lags to spectra (required for spectrometer data only)

Users of other packages cannot take advantage of the solutions to these issues that were already worked out in the process of making GBT data available to AIPS++. Once a suite of data preprocessing components exists for general use, all that remains is to wrap the components within a script that also contains a data translation component; output formats that are palatable to various data analysis packages can then be generated (see Figure 1).
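The idea of wrapping reusable preprocessing components in a script with a final translation step can be sketched as follows. This is an illustration only, not the prototype code: all function names and the sample dictionary layout are hypothetical, and the pointing step simply looks up one pointing per scan, whereas a real component would have to match antenna samples to data samples in time.

```python
def run_pipeline(samples, preprocessors, translator):
    """Apply each preprocessing component in order, then translate."""
    for step in preprocessors:
        samples = step(samples)
    return translator(samples)


# Hypothetical components; each takes and returns a list of sample dicts.
def select_scans(wanted):
    """Build a component that keeps only samples from the wanted scans."""
    def step(samples):
        return [s for s in samples if s["scan"] in wanted]
    return step


def attach_pointing(pointings):
    """Build a component that adds an (az, el) pointing to each sample."""
    def step(samples):
        # Copy each sample and add the pointing recorded for its scan.
        return [dict(s, pointing=pointings[s["scan"]]) for s in samples]
    return step


def to_rows(samples):
    """Stand-in translator: flatten samples into (scan, az, el, value) rows."""
    return [(s["scan"], *s["pointing"], s["value"]) for s in samples]
```

Because every component has the same shape (a list of samples in, a list of samples out), supporting a new analysis package only requires a new translator; the preprocessing list stays unchanged.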

The demand for greater accessibility has been expressed within NRAO as well as by visiting observers. Several astronomers at Green Bank have expressed a desire to do experiments in IDL, making use of IDL modules relevant to astronomers that have been developed by third parties. This includes the Goddard Astronomy Users' Library which can be accessed at http://idlastro.gsfc.nasa.gov. Engineers working on the Precision Telescope Control System (PTCS) project (a major initiative currently underway which will provide the pointing, collimation and surface accuracy required to allow the GBT to operate effectively at 3mm) do much of their analysis in Matlab. Recently, these individuals have expressed the need to access data from astronomical observations within the Matlab application.

2. Goals and Objectives

The primary goal for this effort is to make GBT data more readily accessible to various data analysis packages including IDL, CLASS, AIPS++, AIPS, Matlab and Mathematica (based on the demands of NRAO astronomers and visiting observers). It is understood that each package has its own unique strengths and limitations, and not all packages may be able to reduce all types of GBT observations. However, with a clear understanding of what is possible with each package, an astronomer will have greater leverage in choosing the tool that best suits his or her needs for a particular investigation.

This is not exclusively a data format issue, although knitting together the disparate FITS files currently produced into one cohesive structure is one important step to enable many of the data paths. The intention is not to create a new, all-encompassing data format for the GBT, but to arrive at a reasonable representation that will make it straightforward to transition to future, standardized single dish data formats. (One possibility is the MBFITS specification that is under discussion by ALMA.)

Meeting several objectives will facilitate the accomplishment of these goals:

  1. Find an easier way to get data out of FITS files. This step has been accomplished through the development of FITS Query Language (FQL).
  2. Extract the preprocessing steps from gbtmsfiller and rewrite them in Python, so they can be used by multiple programs. This s
  3. Validate the preprocessing components against previously verified parts of gbtmsfiller.
  4. Use the Python preprocessing components to generate a unified representation for GBT data.
  5. Reduce basic continuum and spectral line observations; plot them using various analysis packages and examine for correctness

Once this process is complete, we will be able to verify the consistency of scientific results between data analysis packages (e.g. IDL vs. AIPS++, AIPS++ vs. CLASS, CLASS vs. IDL); until now we have not had two or more packages with which cross-comparisons can be performed. Being able to perform cross-comparisons will aid the process of commissioning data reduction for new capabilities on the GBT, ensuring that errors are caught well in advance of live observations using a new device.

3. Prototyping Exercises

Three types of data were evaluated during the initial exercises: continuum data taken with the Digital Continuum Receiver (DCR), and spectral line data from both the GBT Spectrometer and Spectral Processor.

Python was chosen as the programming language for all accessibility prototypes. This was done for several reasons: first, Python is a powerful language with the array handling needed for working with GBT data. It has a reasonably quick learning curve, and skilled software engineers in Green Bank with no prior knowledge of Python were able to produce useful results within 2-3 days of beginning to work with the language. Also, several ALMA prototypes are being written in Python, indicating that Python could become a core competency among software engineers throughout NRAO.

As of September 2003, proof of concept exercises have already been performed using IDL, and Matlab experiments are in progress. These experiments take advantage of FITS Query Language (FQL) and an intermediary data format that has been based on SDFITS. The next phase of prototype work to be completed by the end of the year will explore data accessibility by classic AIPS and CLASS.

3.1 FITS Query Language (FQL)

FQL is an implementation of Structured Query Language (SQL), the standard means of accessing data from any relational database, for the purpose of easily extracting data from FITS files. It takes advantage of cfitsio by adding a valuable data manipulation layer on top of it. Using FQL, a user can establish a connection to a simulated relational database of one or more FITS files. Using standard SQL syntax, data can be easily extracted based on user defined constraints. The development of FQL makes it much simpler to work with multiple GBT FITS files, since the files can be treated as one logical data set. However, data preprocessing must still be performed before the data is made astronomically sensible, in terms of scans and observing procedures.

FQL development has been very successful, and is now used in a production capacity. It plays a critical role in most of the prototype Python scripts that have been written. Future plans include optimizing its implementation for enhanced performance.
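The FQL interface itself is not shown here, so the following sketch does not use it. Instead it uses Python's standard sqlite3 module as a stand-in to illustrate the underlying idea: once several FITS binary tables are exposed as relational tables, a single SQL join can pull related rows from more than one file. The table names, column names, file paths and values are all hypothetical.

```python
import sqlite3


def build_demo_database():
    """Create in-memory tables standing in for two FITS binary tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scanlog (scan INTEGER, device TEXT, filepath TEXT)")
    conn.execute("CREATE TABLE antenna (filepath TEXT, azimuth REAL, elevation REAL)")
    conn.executemany(
        "INSERT INTO scanlog VALUES (?, ?, ?)",
        [
            (7, "Antenna", "Antenna/scan7.fits"),
            (7, "DCR", "DCR/scan7.fits"),
            (8, "Antenna", "Antenna/scan8.fits"),
        ],
    )
    conn.executemany(
        "INSERT INTO antenna VALUES (?, ?, ?)",
        [
            ("Antenna/scan7.fits", 120.0, 45.0),
            ("Antenna/scan7.fits", 120.5, 45.1),
            ("Antenna/scan8.fits", 130.0, 50.0),
        ],
    )
    return conn


def pointings_for_scan(conn, scan):
    """Return (azimuth, elevation) rows recorded by the antenna for one scan."""
    query = """
        SELECT a.azimuth, a.elevation
        FROM scanlog AS s
        JOIN antenna AS a ON a.filepath = s.filepath
        WHERE s.scan = ? AND s.device = 'Antenna'
        ORDER BY a.azimuth
    """
    return conn.execute(query, (scan,)).fetchall()
```

The join on the file path mirrors the role the text gives to ScanLog.fits, which indexes the device files by scan; the selected rows would still need preprocessing before they are astronomically meaningful.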

3.2 Data Format

Single Dish FITS (SDFITS), which was conceived by a group of representatives from several observatories in the late 1980s, is a convention on the use of FITS binary tables to exchange single dish data between different analysis packages. Variants of SDFITS are in use by Arecibo, JCMT, Parkes Multibeam and Parkes Mopra (in addition to being an export format from AIPS++), so the format is familiar to and recognized by many astronomers. For these prototype exercises, the SDFITS format was considered a target for unifying GBT data into a cohesive representation, due in part to the extensive expertise available in-house regarding the specification and its history.

A prototype Python script was written, using prototype data preprocessing components also in Python, to knit together the collection of FITS files that presently comprise GBT output data in much the same style as the AIPS++ gbtmsfiller. The output of the current prototype is a single FITS file containing data for any of the backends a user requests. To be precise, the single FITS file can contain data from any one, any two, or all three backends. The output format is inspired by the SDFITS format, containing many similar keywords, but is not yet SDFITS-compliant.

To generate the output, the user specifies how data are to be combined prior to being written to the output FITS file. The available options are state, integration and average. The state option preserves all the data from each switching state and no data are combined. In the integration option, the individual switching states are combined for each integration. In the average option, each integration combination is averaged over a scan.
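The three options can be illustrated with a small sketch. The text does not say how switching states are combined, so the sketch assumes a plain channel-by-channel mean as a placeholder; the real combination depends on the switching scheme. The data layout (a scan as a list of integrations, each mapping a state name to a spectrum) and all names are hypothetical.

```python
def mean_spectra(spectra):
    """Channel-by-channel mean of equal-length spectra."""
    return [sum(channel) / len(spectra) for channel in zip(*spectra)]


def combine(scan, mode):
    """Combine the data of one scan according to the chosen option.

    scan is a list of integrations; each integration is a dict that maps a
    switching-state name to a spectrum (a list of channel values).
    """
    if mode == "state":
        # Keep every switching state of every integration; combine nothing.
        return [spectrum for integ in scan for spectrum in integ.values()]
    # Placeholder assumption: states are combined by a simple mean.
    per_integration = [mean_spectra(list(integ.values())) for integ in scan]
    if mode == "integration":
        return per_integration
    if mode == "average":
        # Average the per-integration results over the whole scan.
        return [mean_spectra(per_integration)]
    raise ValueError("mode must be 'state', 'integration' or 'average'")
```

For a scan with N integrations and S switching states, the state option yields N x S spectra, the integration option yields N, and the average option yields one.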

The unified FITS file that is being produced by the prototype script for basic continuum and spectral line data, although sufficient to allow data to be imported and plotted within IDL, is not yet astronomically reasonable, so several changes are in process. The output of the next prototype, already being developed, will be a set of FITS files with each file corresponding to only one backend. To be precise, there could be one, two, or three FITS files as output. The output format will be strictly SDFITS-compliant for spectral line data, and SDFITS-inspired for continuum data. At this stage of evaluation, it appears that the output format will probably contain any number of HDUs: one primary HDU, followed by a sequence of binary table HDUs.

3.3 IDL Results

Data was assimilated into IDL in two steps: first, by obtaining a unified FITS file from disparate GBT FITS data using the Python prototype, and then, by including code in an IDL script to read in that FITS file. MRDFITS, from NASA Goddard, was used successfully. In addition, an enhanced IDL FITS reader (READFITS) was written that reads in a unified GBT FITS file, and returns a data structure containing two elements: an array of the data in the GBT FITS file, and a structure with keys corresponding to all of the header keywords and values corresponding to the values referenced by those keys. By evaluating MRDFITS and READFITS, it was determined that either is sufficient to import data into IDL.

A coarse grained evaluation was done for a handful of cases to determine the feasibility of easy access to GBT data using IDL. For one pointing observation, data from a single scan was plotted and fitted with a Gaussian (Figure 3). A map at L-Band, completed as an assignment in the 2003 Single Dish Summer School held in Green Bank, was produced in both IDL (Figure 4) and AIPS++ (Figure 5) with visually similar results.

Because these early results are promising, deeper investigations are in progress. For example, data viewed in IDL may already have gone through preprocessing steps (such as a Van Vleck correction and the conversion of lags to spectra for spectrometer data), so the accuracy of the preprocessing results is critical. Differences between the AIPS++ and Python preprocessing components were examined for a typical spectrometer case. Early results show that the two agree to within 1e-6 for 3-level data but only to within 1.5e-4 for 9-level data. The difference, which has been isolated to the FFT step, can be partially attributed to AIPS++ using a single precision FFT, whereas the Python components use a double precision FFT. Further investigation to clarify the meaning of these differences is certainly warranted.
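How arithmetic precision alone can make two otherwise identical lag-to-spectrum conversions disagree can be explored with a small experiment. The sketch below is not the AIPS++ or prototype code: it uses a direct cosine transform instead of an FFT, omits the Van Vleck correction, and emulates single precision by rounding every intermediate value to a 32-bit float. The function names and the transform convention are assumptions made for illustration.

```python
import math
import struct


def to_single(x):
    """Round a Python float (double precision) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lags_to_spectrum(lags, single=False):
    """Cosine transform of a one-sided autocorrelation function.

    lags[0] is the zero lag. With single=True, every input, product and
    partial sum is rounded to single precision to emulate float32 math.
    """
    rnd = to_single if single else (lambda v: v)
    n = len(lags)
    spectrum = []
    for k in range(n):
        total = rnd(lags[0])
        for j in range(1, n):
            weight = rnd(math.cos(math.pi * j * k / n))
            term = rnd(2.0 * rnd(lags[j]) * weight)
            total = rnd(total + term)
        spectrum.append(total)
    return spectrum


def max_relative_difference(a, b):
    """Largest channel difference, relative to the peak magnitude of b."""
    scale = max(abs(v) for v in b)
    return max(abs(x - y) for x, y in zip(a, b)) / scale
```

Running both variants on the same lags and passing the results to max_relative_difference gives a single number that can be compared against a tolerance, which is the kind of check described in the paragraph above.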

3.4 Matlab Results

Recent work making GBT data accessible to Matlab illustrates that accessibility is not solely an issue of data format. Matlab users in Green Bank demand real-time information from the telescope control system in addition to the ability to analyze data taken at a prior time, making this a more complicated data path to enable.

To satisfy the need for fast configuration of the telescope, another project is underway that required a new Python adapter to interface with the telescope monitor and control system. Recognizing that a global solution could be constructed, the project team is creating a standard interface using SOAP (Simple Object Access Protocol) which can be used by any language that understands SOAP. In addition to Python, this includes Perl, Ruby, and Java. Because Matlab allows the use of any Java object, and Java understands SOAP, a Java object was written to allow Matlab real-time access to all telescope control parameters.

A next step will be to make observation data accessible in the Matlab environment in a simple, streamlined manner. The PTCS Project Engineer currently accesses data by using internally developed Matlab modules to extract information from a single FITS file, which is then populated into an ODBC-compliant database. At this point, SQL is used to select and manipulate the data.

4. Accessibility Strategy

Making GBT data accessible to additional data analysis packages is being done in a staged approach, aligned with demand from visiting observers and other development priorities of the GBT project. IDL is being targeted immediately, because of the strong demand that has been expressed by visiting observers and local astronomers alike. Accessibility of GBT data to Matlab is also being addressed at the present time to support critical PTCS experiments. In the next stage, access to CLASS will be investigated to support a wider audience of radio astronomers, and accessibility to AIPS will be explored, in part to support research for GBT development projects now in their earliest stages. Mathematica, which has the fewest identified users to date, will be explored once solutions are in place for other packages which are used more widely.

Standard data preprocessing components can be reused for many of these cases, making the entire system more maintainable while granting easier access to a larger variety of data analysis programs. See Figure 6 for details.

5. Current Status and Future Plans

Despite the demonstrated ability to import and plot data in IDL, and to access raw data in Matlab, there is still much work remaining to be done. Errors in content and form in the unified output data format are being resolved during October 2003. The prototype programs and a memo describing their use are on track to be presented to a wider audience within NRAO by the middle of Q3 2003. NRAO astronomers can then provide comments based on the applicability of these programs to their own GBT data sets. After this time the feasibility of making the data sets a production offering can be accurately evaluated.

In Q4 2003 the following tasks are planned:

  1. Optimize FQL implementation for enhanced speed and performance
  2. Refine the implementation of SDFITS for GBT continuum and spectral line data, and circulate a memo detailing the prototype application for comments to the wider NRAO audience
  3. Complete the validation of all data preprocessing components
  4. Successfully view and manipulate GBT data from fundamental observing modes using AIPS and CLASS
  5. Evaluate whether data preprocessing components can be used to make future versions of gbtmsfiller more maintainable

AIPS++ is still being actively used in Green Bank for many purposes, including individual research and the commissioning of new instrumentation. Enhancements to the package will be made as appropriate, in response to specific demands from GBT development projects. Updated versions of AIPS++ will be made available three to four times a year for internal astronomers as well as visiting observers. IDL versions 5.5 and 6 are installed locally and available for public use, and CLASS has been installed but is not yet available. Matlab and Mathematica are licensed to individual staff members, who have purchased the packages for their own needs.

6. Acknowledgements

Scientific validity of this activity has relied, and will continue to rely upon, the contributions of NRAO astronomers Bob Garwood and Jim Braatz, in consultation with Bill Cotton. Bob Garwood is also leading the work to qualitatively and quantitatively assess the accuracy and viability of reusable preprocessing components, and contributes extensive knowledge about the processing of GBT data and internals of gbtmsfiller, which he wrote. Technical development has been made possible thanks to the work of Green Bank Software Engineer Eric Sessoms, who conceived the idea and developed the FQL utility, and built all initial versions of data preprocessing components in Python. Additional work was completed by Andrew Cowan, a 2003 Green Bank summer student from the University of Iowa, who is responsible for producing the comparison plots. The technical efforts for producing a suitable evolutionary data format are now being led by David Fleming, also a Software Engineer in Green Bank. Work to access GBT data in Matlab is being done by Software Engineers Ramon Creager and Paul Marganian. We also thank Kim Constantikes who is the lead user of Matlab as PTCS Project Engineer, as well as Carl Heiles and Tim Robishaw who have supplied us with tremendous insight about how they currently use IDL to analyze GBT data.

Revision r1.9 - 03 Oct 2003 - 18:24 GMT - NicoleRadziwill Content copyright © 1999-2007 by the contributing authors.
All material on this collaboration platform is the property of the contributing authors.