This document demonstrates FITS Query Lanaguage through a series of progressively more interesting examples. Some familiarity with FITS files is necessary. Some knowledge of Python or Structured Query Language (SQL) is helpful. The examples assume you are working with data from the Green Bank Telescope (GBT), but there is really nothing GBT-specific about FQL.
In Green Bank, source /home/sparrow/release/sparrow.bash (or sparrow.tcsh, as appropriate) to setup your computing environment to run FQL.
The idea of FQL grew out of an analogy between tables stored in FITS files and tables in a relational database system. It helps to understand that analogy before trying to work with the system.
FITS is a standard format for astronomical data; the name itself stands for "Flexible Image Transport System", but there is a lot more to FITS files than just images. Specifically, FITS files may hold zero or more ascii or binary tables containing arbitrary data.
A table is made up of some number of named and typed columns and some number of rows of data. The intersection of a row and a column (hereafter, a cell) contains data belonging to the row but described by the column. Because FITS is a self-describing data format, all the information needed to manipulate (read/write) those data cells is contained within the file and therefore accessible to a program.
A FITS table is also associated with some number of header keywords that are effectively constant across all the rows in the table. These keywords can be thought of as "virtual columns"; they do not exist in the physical table, and happen to always contain the same value, but there is no reason they can't be read and queried in the same manner as the rest of the data. We define one additional "virtual column", the ROW_NO (row number, 1 .. n) that uniquely identifies each row in the table.
A single table isn't very interesting. Not only are there already a number of tools for manipulating single FITS tables (e.g., fitsio for Python), a "relational" database with only one table isn't. A relational database stores both data and relationships between data items expressed in the values of the cells in certain columns.
The GBT is designed "as a laboratory filled with devices or instruments which have to be coordinated to accomplish observations." (See the <a href="http://www.gb.nrao.edu/~mclark/article/article.html">The Control Software Architecture for the Green Bank Telescope.) In the course of an observation, some set of devices is active, each recording data to its own FITS file. To make sense of this data, it is necessary to combine information from multiple files, and the relationships between those files becomes vitally important. Thus, the data files output by the telescope strongly resemble a relational database (in intent, if not form).
Because relationships between data items are so important to a relational database, the primary means of interacting with a database is through queries. A query asks the system for some subset of data defined by certain specified relations. Since there is a standard for database query languages (SQL), we have adopted SQL as the basis for FQL rather than construct an ad-hoc language.
Having said that queries against single tables are inherently uninteresting, they are still the simplest case, so that's where we'll start. A ScanLog? is a record of the devices used and the files generated in the course of a scan with the GBT. That seems like a good place to begin, so we'll write queries against a ScanLog?. Get one here (you do not need to uncompress it, FQL is happy to work with gzip compressed data). The full source for this example is in <a href="example1.py">example1.py
FQL is implemented in <a href="http://www.python.org/">Python, so FQL queries take the form of strings embedded within a Python script. The first step is just to let the Python interpreter know we want to use FQL, and we do this with a Python import statement.
The Python interface to FQL is modelled on standard database APIs in the same way FQL is itself modelled on SQL. Strictly speaking, the interface does not conform to the Python DB-API standard (it may, someday), but if you have used DB-API, ODBC, or JDBC, it will seem familiar. The key objects you will need to be concerned with are Connections and ResultSets?.
A Connection object represents a connection to a database. There is no real database in FQL, so we need to do a little bit of work to let FQL know what files we're interested in. The open() method informs the connection that we are interested in running queries agains a file and attaches a label to that file for use in the query text. It is not an error to open multiple files and, in fact, that is the usual mode of operation.
The code so far:
import fql
cnn = fql.Connection()
cnn.open("ScanLog.fits.gz", "ScanLog")
Now we get into FQL proper. A minimal query selects a set of columns from a set of tables. A FITS file can contain more than one table, so we specify the tables in our query using a dotted-name syntax of the form X.Y where X is the label attached to the file in an open() statement, and Y specifies the table (HDU) within the file. You can specify the table either by number (1-based, as is the FITS way) or by name (given by the EXTNAME keyword). A query can include multiple tables from the same file, and we'll see examples of that later on.
Columns are also specified in a dotted-name syntax of the form X.Y.Z, where X.Y is the name of a table and Z is the name of a column in that table. FQL does not support the SQL "*" syntax for selecting all columns. There is also no mechanism in FQL for finding the names of the columns in a table (if you don't count trial-and-error), but there are other tools for that, such as <a href="http://heasarc.gsfc.nasa.gov/docs/software/ftools/fv/">fv.
Run your query by asking the Connection to execute it.
Here's the code. Triple-quotes (""") are simply the Python syntax for multi-line strings. The query string is pure FQL.
query = """
select ScanLog.2.DATE-OBS
, ScanLog.2.SCAN
, ScanLog.2.FILEPATH
from ScanLog.2
"""
rst = cnn.executeQuery(query)
The result of executeQuery is a ResultSet. In Python, a ResultSet looks and acts just like a read-only list, so it is trivial to iterate through your results. Each individual result item contains a list of the values for the columns specified in your query, in order.
This code displays the results of a query:
while rst.hasMoreData():
print rst.getCurrentResult()
rst.moveNext()
If you've been following along, you now have enough code to run an FQL query and see the results. Try it now. I'll wait...
Running that code against the supplied ScanLog? dumps some 792 lines of data to your screen. That's a bit much to be helpful right now, so it would be good to have some way to restrict the data returned by the query.
A where clause specifies a logical condition that must be true of a data row in order for that row to be returned as part of the ResultSet. To keep things managable, lets just look at the data for the last scan, number 96.
If you run the program again with this query, you should get a set containing only 9 results.
query = """
select ScanLog.2.DATE-OBS
, ScanLog.2.SCAN
, ScanLog.2.FILEPATH
from ScanLog.2
where ScanLog.2.SCAN = 96
"""
Some people find it useful, when analysing data, to know something about what that data means. For instance, in the case of spectrometer data, we are interested in what spectral window that data corresponds to. When observing with the GBT, this data is split between three different files: the spectrometer FITS file, the IF manager FITS file, and the LO1 FITS file. How to make sense of this is documented in a memo on <a href="http://www.gb.nrao.edu/GBT/MC/doc/dataproc/iflo/iflo/iflo.html"> LO/IF/Frequency Handling.
In this section of the tutorial, we will deal with two of the three pieces of the spectral window puzzle. (We don't have all the tools at this point to do a complete example. Relax, we'll get there.) To play along, you'll need a sample <a href="Spectrometer.fits.gz"> spectrometer (26MB) file and a sample IF file. (Again, it is unnecessary to decompress the files.) The source for this section is in example2.py. Please see the GBT <a href="http://www.gb.nrao.edu/GBT/MC/doc/dataproc/index.html"> Data Processing Documentation for details on these files.
Now that we're dealing with more than one file, the FQL execution model comes into play. The conceptual model for FQL is the same as for SQL; create the cross-product of all the rows in all the tables listed in the from clause, test each row of the cross-product for inclusion in the result set using the where clause, and then extract from each passing row the columns specified in the select clause.
In writing FQL queries (or SQL queries, for that matter) it is frequently best to approach them from the inside-out. Take the process in three steps.
Which files and which tables contain the data relevant to your current problem? At the moment, we're interested in the frequency information for a spectrometer observation. For each port in use, we want to know its bandwidth, multiplier, sideband, and offset. This data is divided between the PORT table in the Spectrometer file and the primary table in the IF file. (Again, this is all documented in <a href="http://www.gb.nrao.edu/GBT/MC/doc/dataproc/iflo/iflo/iflo.html"> LO/IF/Frequency Handling).
This knowledge allows us to write the from clause of our query: from PORT, IF.2.
Now that we know where to look, how do we identify the specific data items we need? The IF manager file uses the column BACKEND to relate a data row to a specific backend device, in this case the spectrometer, then we can just look for the matching BANK and PORT.
We can now write the where clause of our query: where IF.2.BACKEND = 'Spectrometer' and IF.2.BANK = PORT.BANK and IF.2.PORT = PORT.PORT.
This step is usually the easiest. In this case, we want the Spectromter BANK, PORT, and BANDWIDTH, and the IF SFF_MULTIPLIER, SFF_SIDEBAND, and SFF_OFFSET.
Our select clause becomes: select PORT.BANK, PORT.PORT, PORT.BANDWIDTH, IF.2.SFF_MULTIPLIER, IF.2.SFF_SIDEBAND, IF.2.SFF_OFFSET.
We build up the query from the inside-out, now we need to assemble the pieces. Here's the full query:
select Spectrometer.PORT.BANK
, Spectrometer.PORT.PORT
, Spectrometer.PORT.BANDWDTH
, IF.2.SFF_MULTIPLIER
, IF.2.SFF_SIDEBAND
, IF.2.SFF_OFFSET
from Spectrometer.PORT
, IF.2
where IF.2.BACKEND = 'Spectrometer' and
IF.2.BANK = Spectrometer.PORT.BANK and
IF.2.PORT = Spectrometer.PORT.PORT
For this query to run, we only need to add the required files to the Connection under the names we used for them in the query.
cnn.open("Spectrometer.fits.gz", "Spectrometer")
cnn.open("IF.fits.gz", "IF")
The full source for this example is in example2.py.
Life as a query language would get dull pretty quickly if there were only one-to-one relations between data items. Fortunately (well, depending on your point of view), this isn't the situation. It is often the case that one data item in table A will correspond to a varying number of data items in table B. Sometimes we want to know all the details, but sometimes we would just like to see the data summarized in a useful form.
One place we encounter a one-to-many mapping in GBT data is with antenna pointing information. We'll collect data in a backend device over some period of time, but the antenna is not going to be focused on the same point in space throughout that time period. The antenna's position is sampled ten times each second, and that position is stored in the antenna FITS file. We can easily write a query to match spectrometer data with the set of antenna positions that were sampled while that data was being collected.
cnn = fql.Connection()
cnn.open("Spectrometer.fits.gz", "Spectrometer")
cnn.open("Antenna.fits.gz", "Antenna")
query = """
select Spectrometer.DATA.DMJD
, Antenna.2.RAJ2000
, Antenna.2.DECJ2000
from Spectrometer.DATA, Antenna.2
where Spectrometer.DATA.DMJD <= Antenna.2.DMJD and
Antenna.2.DMJD < Spectrometer.DATA.DMJD +
Spectrometer.DATA.HBTLNGTH *
Spectrometer.DATA.HBTPERSW *
Spectrometer.DATA.SWPERINT / (24 * 3600)
"""
rst = cnn.executeQuery(query)
Most of this query is straightforward. Return, for each spectrometer integration, the antenna positions sampled during that integration. However, the concept of "during that integration" is interesting because it illustrates a non-trivial arithmetic expression. While we know the start time of the integration, we don't have easy access to the end time. We can, however, calculate the length of the integration. Given the number of switching periods per integration (SWPERINT), the number of heartbeats per switching period (HBTPERSW), and the length of a heartbeat in seconds (HBTLNGTH), we just multiply and convert the result from seconds to fractions of a day...
The important points to take home are (a) FITS Query Language doesn't know anything about the meaning of your data, so instead (b) it supports arbitrary expressions with which you can express the meaning of your data.
If you run this query now, your initial reaction should be, "My god, is that slow!" It takes about 17.5 seconds on an Athalon
Two points made earlier bear repeating now. First, FQL is not GBT-specific and, in fact, knows absolutely nothing about your data. Second, in the conceptual execution model, FQL must form the cross-product of all the rows in all the tables as part of the execution of a query.
A third point is, in the absence of specific knowledge to the contrary, FQL is forced to implement the conceptual execution model literally.
That's fine, really. Though it may try to pretend, FQL is not a relational database and, in working with FITS files, is at a rather severe disadvantage when compared to a true database. What we can do, though, is this: cheat. We'll give FQL a hint about the meaning and structure of the data that it can use to speed-up query execution.
Like many GBT data files, the antenna FITS file is a log file. The entries are timestamped, and in order by timestamp. That's valuable knowledge, so let's give FQL the hint:
cnn.sort("Antenna", 2, "DMJD")
Sort(), in this case, tells the connection object that the second table (2) in the "Antenna" file (previously opened), is in sorted order by the column "DMJD" (the timestamp). Running the query with this hint now takes only 0.35 seconds, for a factor-of-50 speedup. It's now only
You might expect that uncompressing the data files would give an additional speedup, but this is not the case on modern hardware. The time saved by not performing the decompression is compensated for by the additional time required to read the uncompressed file from disk. Go ahead and save your disk space.
That's fine and good, but in this case we don't really want all the details of the antenna positions, we would just like to see the data summarized in a useful form. In fact, what we would like to know is the average antenna position for each integration. We need a way to tell FQL what we want averaged and how. The "what" is easy. Use the keyword AVG to request the average of the values in a certain column.
select Spectrometer.DATA.DMJD
, AVG(Antenna.2.RAJ2000)
, AVG(Antenna.2.DECJ2000)
The "how" takes a bit more thought. We don't want the average position over the course of the entire scan; we do want the average position over the course of an individual integration. So, we'll ask FQL to group the data by integration before computing the averages. In this example, we're identifying each integration by its start time, so:
group by Spectrometer.DATA.DMJD
The source for this example is in example4.py.
| Topic FitsQueryLanguage . { Edit | Attach | Ref-By | Printable | Diffs | r1.3 | > | r1.2 | > | r1.1 | More } |
| Revision r1.3 - 13 Aug 2008 - 20:10 GMT - AmyShelton |
Content copyright © 1999-2007 by the contributing authors. All material on this collaboration platform is the property of the contributing authors. |