
GBTIDL Autoflagger

Modification Request #2 (C05 2006)



1. Introduction

GBTIDL needs the ability to flag large blocks of data according to one or more rules. An autoflagger flags data automatically based on such rules. The user should be able to invoke it with one, or perhaps a few, procedures typed at the command line, without needing to explore the data in detail.

2. Background

GBTIDL has the capability to flag specific parts of the data on disk. These flag rules are then applied when the data is read into GBTIDL. The rules can be fairly complicated and, in principle, any bit of data can be flagged. See the Introduction to Flagging and Blanking in GBTIDL for more details. In practice, it can be quite tedious to work out what data should be flagged and to set those flag rules by hand from the GBTIDL command line. Users of dish in the aips++ system have found the aips++ autoflagger tool to be useful in flagging large blocks of data. GBTIDL needs a similar capability for two reasons: to make GBTIDL as capable as dish, and thus attract dish users who would otherwise be unlikely to move to GBTIDL, and to provide the enhanced flagging tools needed to make flagging data in GBTIDL as easy as possible.

3. Requirements

At the moment this is simply a description of the aips++ autoflagger. We may want to modify these requirements to bring them more into line with what we really want to provide to GBT users. Another similar tool that it has been suggested we look at is the ATNF "pieflag" program.

The aips++ autoflagger has a number of possible flagging methods. The user selects any number of these methods (including the same method more than once, with different parameters each time) and then directs that the methods be "run" on the data. All selected methods are then performed by that one "run". Each method either flags the data directly, based on a calculation applied to that data (e.g. flag data outside of some range), or it constructs a statistic (or statistics) using multiple data values and flags the data based on that statistic (e.g. median flagging in the time domain). In the latter case, it isn't clear to me whether the order of the specified methods matters (e.g. if you ask to flag data based on a clip level and a median filtering, in that order, does it first flag on the clip level and then calculate the statistic based on the just-flagged values?).
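The ordering question above can be made concrete with a small sketch. This is only an illustration in Python, not the aips++ implementation: the function names, the threshold semantics, and the data values are all assumptions made for the example. It shows that if a later method builds its statistic only from data not yet flagged, the result can differ from running that method on its own.

```python
from statistics import median

def clip_flags(values, lo, hi):
    """Direct method: flag each value outside the closed range [lo, hi]."""
    return [not (lo <= v <= hi) for v in values]

def time_median_flags(values, threshold, prior_flags=None):
    """Statistic-based method (hypothetical semantics): flag values that
    deviate from the median of the not-yet-flagged values by more than
    threshold. Flags already set are kept."""
    if prior_flags is None:
        prior_flags = [False] * len(values)
    good = [v for v, f in zip(values, prior_flags) if not f]
    if not good:
        return list(prior_flags)
    med = median(good)
    return [f or abs(v - med) > threshold for v, f in zip(values, prior_flags)]

# One channel followed through five integrations (made-up numbers).
series = [1.0, 2.0, 60.0, 70.0, 3.0]

# Median flagging alone: the two large values pull the median up to 3.0.
alone = time_median_flags(series, threshold=1.5)

# Clip first, then build the median only from values that survived the clip.
clipped = clip_flags(series, lo=0.0, hi=10.0)
chained = time_median_flags(series, threshold=1.5, prior_flags=clipped)

print(alone)    # [True, False, True, True, False]
print(chained)  # [False, False, True, True, False]
```

In this example the first integration is flagged only when the median method runs without the clip, so under this interpretation the order of methods does change the outcome.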

The aips++ filler treats each field and spectral window as independent. We need to think hard about which data selections in the GBT data model are independent in the autoflagger. Probably these are IF number, feed number, and polarization, although the latter gets into the combination of data used in flagging, which I'll discuss shortly. Possibly we should simply allow any additional data selection to happen before the autoflagger is run; the aips++ autoflagger allows additional data selection as well.

The data values that can be used in setting the flags are more than simply the data values found in the measurement set. The user can also choose to combine data values. In addition, because the calibrated data is closely associated with the raw data, the user can choose to use the calibrated data to set the flags. Then, when the data is recalibrated, the raw data will have the same flags and the calibrated data will change as a result of the flagging operation. Autoflagging is often an iterative process. There is a fair amount of flexibility in how the data values are combined to form the quantity used in the testing. The default is to use abs(I), where I is XX+YY or RR+LL. Other choices are made by giving a string expression, where the following types of expressions are supported:

The following values of func are recognized: ABS, ARG, RE, IM, and NORM. Of these, the last 3 are not relevant for GBT data, where we deal with real quantities in our data model. I think "ARG" is just the argument, but let me know if I'm missing something. There are two complications with this as it relates to the GBT data model. First, we do not closely associate polarizations of the same integration, switching phase, feed, and IF; second, we do not closely associate the calibrated data with the raw data. On the former, we could probably make do with basic data selection rules and, certainly on raw data, come up with a reasonable guess as to paired rows based on polarizations.
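As a rough illustration of the "paired rows" idea, the sketch below groups rows that agree in everything except polarization and then forms the default abs(I) quantity, with I taken as XX+YY. It is written in Python for brevity; the row layout and the key names (scan, integration, phase, feed, ifnum, pol) are assumptions for the example, not the actual GBTIDL data container.

```python
from collections import defaultdict

def pair_polarizations(rows):
    """Group rows that differ only in polarization (hypothetical keys)."""
    groups = defaultdict(dict)
    for row in rows:
        key = (row["scan"], row["integration"], row["phase"],
               row["feed"], row["ifnum"])
        groups[key][row["pol"]] = row["data"]
    return groups

def abs_stokes_i(groups, pols=("XX", "YY")):
    """Return abs(XX + YY) per channel for every group holding both
    polarizations; groups missing one of them are skipped."""
    result = {}
    for key, by_pol in groups.items():
        if all(p in by_pol for p in pols):
            first, second = by_pol[pols[0]], by_pol[pols[1]]
            result[key] = [abs(x + y) for x, y in zip(first, second)]
    return result

# Made-up rows: one complete XX/YY pair and one unpaired XX row.
rows = [
    {"scan": 10, "integration": 0, "phase": 0, "feed": 1, "ifnum": 0,
     "pol": "XX", "data": [1.0, -4.0, 2.0]},
    {"scan": 10, "integration": 0, "phase": 0, "feed": 1, "ifnum": 0,
     "pol": "YY", "data": [1.5, 1.0, 2.5]},
    {"scan": 10, "integration": 1, "phase": 0, "feed": 1, "ifnum": 0,
     "pol": "XX", "data": [1.0, 1.0, 1.0]},
]
combined = abs_stokes_i(pair_polarizations(rows))
print(combined)  # {(10, 0, 0, 1, 0): [2.5, 3.0, 4.5]}
```

Skipping unpaired rows is only one possible policy; a real tool might instead warn, or fall back to flagging the single polarization on its own.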

The following methods relevant to single dish data are supported by the aips++ autoflagger (see the autoflagger documentation for more details).

Autoflag produces a graphical flagging report which consists of a summary and several plots showing the distribution of flags in frequency, time, and so on (I'm not sure exactly what else that might include).

A small survey of dish users has found that users most often use timemed and freqmed to flag data. They flag based on calibrated and raw data, based on single polarizations independently as well as using the default "I" combination. Autoflag can apparently reject entire rows if there are too many individual flags within a row (that wasn't obvious from the documentation, but one user swears by that). The flagging report is very valuable.
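The row rejection behaviour described above, along with the kind of count a flagging report might be built from, could look something like the following. This is a guess at the behaviour rather than a description of what autoflag actually does; the function names and the max_fraction parameter are hypothetical.

```python
def reject_rows(flags, max_fraction=0.5):
    """Flag an entire row when more than max_fraction of its channels
    are already flagged; other rows are copied unchanged."""
    out = []
    for row in flags:
        if row and sum(row) / len(row) > max_fraction:
            out.append([True] * len(row))
        else:
            out.append(list(row))
    return out

def flags_per_channel(flags):
    """Count, for each channel, how many rows flag it."""
    return [sum(column) for column in zip(*flags)]

# Three rows (integrations) by four channels, made-up flags.
flags = [
    [True, True, True, False],
    [False, True, False, False],
    [False, False, False, False],
]
rejected = reject_rows(flags, max_fraction=0.5)
per_channel = flags_per_channel(rejected)

print(rejected[0])   # [True, True, True, True]
print(per_channel)   # [1, 2, 1, 1]
```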

When asked if there were missing features in autoflagger, only one user (out of the 4 who replied) had any suggestions. He suggested flagging based on the RMS in a given channel (this seems similar to timemed) and on the RMS in frequency after doing a local spline fit (this seems like sprej). We should also go back and look over Toney's e-mail note from several months ago, where he had a procedure that is very much like what autoflagger should do.

4. Design

I think we should do this in phases. The first phase is to implement the methods using simple data selection without any combination of the data in the file. Since it is often useful to flag based on calibrated data, I think this phase should include the ability to "import" flags from one data file to another, so that you could run the autoflagger on calibrated data written to a file, import those flags to the raw data, and then recalibrate your data. "import" should include the ability to merge and replace flags; otherwise this iterative process could lead to a very large flag file that would seriously impact I/O performance. Many of these methods really don't make sense on the real raw data, where simply adding in the cal signal periodically makes timemed not particularly useful unless you consider each switching phase independently.
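One way to picture the merge and replace options for "import" is sketched below in Python. The representation of a flag rule as a (scan, integration, first channel, last channel) tuple is an assumption made for the example and is much simpler than a real GBTIDL flag rule. The point is that merging can coalesce overlapping rules, so that repeated imports do not make the flag file grow without bound.

```python
def coalesce(rules):
    """Merge duplicate, overlapping, or adjacent channel ranges that share
    the same (scan, integration). Each rule is a hypothetical
    (scan, integration, bchan, echan) tuple."""
    merged = []
    for scan, integ, bchan, echan in sorted(rules):
        if (merged and merged[-1][:2] == (scan, integ)
                and bchan <= merged[-1][3] + 1):
            last = merged[-1]
            merged[-1] = (scan, integ, last[2], max(last[3], echan))
        else:
            merged.append((scan, integ, bchan, echan))
    return merged

def import_flags(target, source, mode="merge"):
    """Return the new rule list for the target file."""
    if mode == "replace":
        return coalesce(source)
    if mode == "merge":
        return coalesce(list(target) + list(source))
    raise ValueError("mode must be 'merge' or 'replace'")

# Rules already on the raw data, and rules produced on the calibrated data.
raw_rules = [(10, 0, 100, 120), (11, 0, 1, 3)]
cal_rules = [(10, 0, 115, 130), (10, 1, 5, 5), (10, 0, 100, 120)]

merged = import_flags(raw_rules, cal_rules, mode="merge")
replaced = import_flags(raw_rules, cal_rules, mode="replace")

print(merged)    # [(10, 0, 100, 130), (10, 1, 5, 5), (11, 0, 1, 3)]
print(replaced)  # [(10, 0, 100, 130), (10, 1, 5, 5)]
```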

We'll likely need to worry about I/O performance, and we should probably develop a toolbox utility that would allow you to select the data without getting it and then iterate through that data, getting some "reasonable" chunk of the data at a time for these sorts of bulk processing operations. getchunk is useful now, but it leaves the choice of "reasonable" chunk up to the user (we default to an entire scan for the chosen IF and polarization).
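The select-then-iterate idea can be sketched as a generator. The example below is Python, with an in-memory dictionary standing in for the data file; iter_chunks, read_rows, and the fixed chunk_size are hypothetical names, and how a "reasonable" chunk size would actually be chosen is left open.

```python
def iter_chunks(selected, read_rows, chunk_size):
    """Yield the selected rows chunk_size rows at a time. Nothing is read
    until the caller asks for the next chunk."""
    for start in range(0, len(selected), chunk_size):
        yield read_rows(selected[start:start + chunk_size])

# Stand-in for the file: row index -> spectrum (made-up values).
fake_file = {i: [float(i)] * 4 for i in range(10)}
reads = []  # records the size of each read, to show the access pattern

def read_rows(indices):
    """Stand-in reader: fetch only the requested rows."""
    reads.append(len(indices))
    return [fake_file[i] for i in indices]

# Selection looks only at the index, not at the data itself.
selected = [i for i in fake_file if i % 2 == 0]

n_rows = sum(len(chunk) for chunk in iter_chunks(selected, read_rows, chunk_size=2))
print(reads)   # [2, 2, 1]
print(n_rows)  # 5
```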

Do we want to support multiple methods on a single pass through the data in this first phase, or should we leave that to a subsequent phase? I think we could support this in a single pass by adding fields to the guide structure that would hold each method and the parameters to be applied; the invocation of autoflag would then use those values. The other option would be to use an IDL object, but I believe we don't want to expose those to the user. Still, we could use that at a low level and store the object in the guide structure, like we store the I/O objects there.
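A minimal sketch of the "store methods and parameters, then run them all in one pass" approach follows. It is in Python, with a plain dictionary standing in for the guide structure; the autoflag_methods field, set_method, run_autoflag, and the two toy methods are all hypothetical names for illustration.

```python
def clip(values, flags, lo, hi):
    """Flag values outside [lo, hi], keeping flags already set."""
    return [f or not (lo <= v <= hi) for v, f in zip(values, flags)]

def channels(values, flags, bchan, echan):
    """Flag the channel range bchan..echan, keeping flags already set."""
    return [f or bchan <= i <= echan for i, f in enumerate(flags)]

METHODS = {"clip": clip, "channels": channels}

def set_method(guide, name, **params):
    """Queue a method and its parameters; nothing is run yet."""
    if name not in METHODS:
        raise ValueError("unknown method: " + name)
    guide["autoflag_methods"].append((name, params))

def run_autoflag(guide, rows):
    """Single pass over the rows, applying every queued method in order."""
    result = []
    for values in rows:
        flags = [False] * len(values)
        for name, params in guide["autoflag_methods"]:
            flags = METHODS[name](values, flags, **params)
        result.append(flags)
    return result

guide = {"autoflag_methods": []}  # stand-in for the guide structure
set_method(guide, "clip", lo=0.0, hi=10.0)
set_method(guide, "channels", bchan=0, echan=0)
set_method(guide, "clip", lo=-100.0, hi=2.5)  # same method, other parameters

flags = run_autoflag(guide, [[1.0, 20.0, 3.0, -5.0, 2.0]])
print(flags)  # [[True, True, True, True, False]]
```

Keeping the queue as plain (name, parameters) pairs means the user never handles an object directly, which matches the preference stated above.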

5. Add Any More Sections That Are Appropriate

6. Deployment Checklist

What has to get done to integrate this completely into the system? This checklist must be completed before Cycle Integration Testing begins.

7. Test Plan

Critical!! ChangeControlCommittee will be reviewing these.

Don't forget to include/acquire any additional GBT test time needed outside integration/regression testing! Get your requests in early!

Important! If possible, you should conduct as many of your tests as possible in offline modes and/or with a simulator. We should constantly endeavor to minimize our use of telescope time for testing!

7.1 Internal Testing

This section covers things like unit testing, simulator testing, and any other tests required to make sure this MR is ready for sponsor/integration/regression testing.

7.2 Sponsor Testing

This section is for the sponsor. What do you need to do in order to ensure that the MR is complete and correct? These tests are the prerequisite for sign off for the "accepted/delivered by sponsor" item in the "signatures" section.

7.3 Integration/Regression Tests

What do the integration/regression testers need to do in order to test this MR?


Signatures

APPROVED: I acknowledge that my request is fully contained in this MR, and if the SDD delivers exactly what I specified, I will be happy.

ACCEPTED: I acknowledge that I have validated the completed code according to the acceptance tests, and I am happy with the results.

Written symbol - name - date
Checked symbol - name - date
Approved by Sponsor symbol - name - date
Approved by CCC symbol - name - date
Accepted/Delivered by Sponsor symbol - name - date

Symbols:


CCC Discussion Area

Revision r1.2 - 25 Jul 2006 - 21:22 GMT - BobGarwood
Parents: IDLPlanUpdate
Content copyright © 1999-2007 by the contributing authors.
All material on this collaboration platform is the property of the contributing authors.