
Grail Development Notes

12/13/2006

Field Callback improvements

Grail has been modified to improve the field level callback services.

In addition to providing callbacks where an entire parameter or sampler may be returned, Grail now accepts callback subscriptions for individual sampler/parameter fields. This can significantly simplify the processing of the callback, and also significantly reduce the Grail client's processing overhead, especially when all that is needed is one or two fields of a complex sampler.

All examples that follow use the quadrant detector and assume the following:

>>> from gbt.ygor import GrailClient
>>> from pprint import pprint
>>> cl = GrailClient("goauld", 18000, cb_port=19592)
>>> # Call-back function:
>>> def cat(device, sampler, value):
...      pprint(value)
...
>>> qd = cl.create_manager('QuadrantDetector')

To register a callback for a sampler/parameter field, one provides the desired field(s) -- minus the root sampler/parameter name -- as an extra parameter to both versions of GrailClient's reg_sampler() or reg_param(). To illustrate, this call registers a callback for the entire 'monitorData' sampler:

>>> qd.reg_sampler('monitorData', cat)

but this call registers a callback only for the 'n12VOK' field of the 'monitorData' sampler:

>>> qd.reg_sampler('monitorData', cat, 'n12VOK')

The value will be returned just as if qd.get_sampler_value('monitorData,n12VOK') had been called:

'1'

(this indicates that this power supply is OK).

Multiple fields for the sampler/parameter may be specified in the same subscription. These are separated by semicolons:

>>> qd.reg_sampler('monitorData', cat, 'n12VOK;n15VOK;p5VOK')

The values returned will now be a semicolon delimited list of values, with the ordering matching the order in which they were subscribed:

'1;1;1'

(Note that the callback still will be called every time the sampler is updated, not when the field is updated. So if a sampler updates, but the registered field has not changed, callbacks are received nonetheless.)
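
Since a multi-field subscription delivers its values as one semicolon-delimited string, a small wrapper can pair each value with its field name before handing it on. The following is only a sketch: make_field_callback is a hypothetical helper, not part of GrailClient, and it assumes that no field value itself contains a semicolon.

```python
def make_field_callback(fields, handler):
    """Wrap `handler` so it receives a dict of field name -> value string.

    `fields` is the same semicolon-delimited string that is passed to
    reg_sampler()/reg_param(). Hypothetical helper, not part of GrailClient.
    """
    names = fields.split(';')

    def callback(device, sampler, value):
        values = value.split(';')
        if len(values) != len(names):
            raise ValueError("expected %d values, got %d"
                             % (len(names), len(values)))
        handler(device, sampler, dict(zip(names, values)))

    return callback


received = []
fields = 'n12VOK;n15VOK;p5VOK'
cb = make_field_callback(fields, lambda dev, smp, d: received.append(d))

# Simulate Grail invoking the callback with a value like the one shown above.
cb('QuadrantDetector', 'monitorData', '1;1;1')
print(received[0])
```

With a live Grail, the wrapped callback would be registered in the usual way, e.g. qd.reg_sampler('monitorData', cb, fields).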

Using field callbacks for samplers/parameters with no fields

The field callback syntax requires a field name to specify a field. What of parameters/samplers that have only one root field? Up to now, there was no way to specify this. Grail has been modified to allow for this. For such samplers/parameters, one specifies the root field using '.'. The following example registers a field callback for the quadrant detector's 'state' parameter:

>>> qd.reg_param('state', cat, '.')

This is desirable because now there is no need to parse the entire 'state' parameter to get the value; the value is returned just as if qd.get_value('state') had been called:

>>> qd.off()
'OK'
>>> 'Off'
'Off'

>>> qd.on()
'OK'
>>> 'Activating'
'Activating'
'Aborting'
'Ready'

(Note that the 'cat' callback function above does nothing but print the value.)

Time stamps and field callbacks

In Ygor, samplers are returned with a time stamp. The time stamp is not an integral part of the sampler structure itself, however; it is merely returned with the sampler structure. Therefore, the normal sampler callback mechanism does not return one (yet). However, the time stamp values may be obtained using the field callback mechanism. Two special fields are recognized: TS:MJD and TS:seconds. If these fields are provided during the field callback subscription, the MJD and seconds-in-the-day values will be returned along with the values of any other subscribed fields:

>>> qd.reg_sampler('monitorData', cat, 'n12VOK;n15VOK;p5VOK;TS:MJD;TS:seconds')
>>>
'1;1;1;54082;69355.437475'

Here, returned with the three requested values, are the MJD and number of seconds elapsed during the day (UT).

NOTE: Grail can return time stamps this way for both samplers and parameters. It should be noted however that Ygor only supports time stamps for samplers; Grail adds the parameter time stamp when it receives the parameter callback from the manager.
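
The TS:MJD and TS:seconds values can be combined into an ordinary UTC time stamp on the client side. The sketch below assumes the two time stamp fields were subscribed last, as in the example above; timestamp_from_fields is a hypothetical helper name, and the MJD epoch (1858-11-17 00:00 UT) is the standard definition rather than anything specific to Grail.

```python
from datetime import datetime, timedelta, timezone

# Modified Julian Date 0 corresponds to 1858-11-17 00:00 UT.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def timestamp_from_fields(mjd, seconds):
    """Convert TS:MJD and TS:seconds strings into an aware UTC datetime."""
    return MJD_EPOCH + timedelta(days=int(mjd), seconds=float(seconds))


# Value string as delivered to the callback in the example above.
value = '1;1;1;54082;69355.437475'
*field_values, mjd, seconds = value.split(';')
stamp = timestamp_from_fields(mjd, seconds)
print(field_values, stamp.isoformat())
```

For the value shown, this works out to 13 December 2006 at 19:15:55 UT.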

05/30/2006

Grail Call Execution Speed

Occasionally, questions arise about the speed with which Grail can handle an individual SOAP request. This appears to be prompted by a perception that SOAP is slow. Therefore I have conducted a series of tests to lay this issue to rest.

In order to differentiate between the performance of Grail and the performance of the Python clients, I completed the C++ Grail client library, at least enough to conduct the tests. I assumed that the tests using the C++ Grail client would provide a minimum baseline time. The call times would be shared between both client and server, but would be the smallest possible times that could reasonably be expected to perform an RPC call to Grail. If RPC calls using Python clients show a significant time difference, these then could be ascribed to the Python client, and not to Grail.

The test was conducted with two versions of Grail: non-multithreaded SOAP, with HTTP 1.0 protocol, and multithreaded SOAP, which can support HTTP 1.1 (SOAPpy clients use HTTP 1.0 only). I wrote a test program using the C++ Grail client which did the following:

This amounts to 33 RPC calls to Grail in all. The calls were made from a client running on colossus to a Grail instance running on goauld. In the Python clients, RPC get and set calls were timed separately (more on this below). The times are as follows:

These RPC calls were fairly simple, but most calls to Grail are. The big difference between the two C++ times is that for an HTTP 1.0 call, a new connection is made for every RPC call; for HTTP 1.1, the same connection is reused for every RPC call. (This is why the single-threaded version of Grail's SOAP server must be HTTP 1.0; otherwise, one client would lock out all others.)

The cause of the difference observed between the times obtained with the Python clients vs. the C++ client is quite different. Python is inherently slower, which accounts for the base difference between the C++ and Python client calls. But note the difference in the set calls between Ygor GrailClient.py and Sparrow GrailClient.py. Set RPC calls are the ones that modify some parameter or state in a manager. The Sparrow GrailClient runs these calls through the Sparrow security module, which checks the gateway file for permissions on every RPC set call. A different paradigm may have to be used here to improve response times.
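
For anyone wishing to repeat this kind of measurement from Python, timing get and set calls separately might look something like the sketch below. It is not the test program used here: time_calls is a hypothetical helper, and fake_get/fake_set are stand-ins so that the example runs without a Grail server. In a real test they would be replaced by calls on a GrailClient manager object.

```python
from time import perf_counter


def time_calls(func, repeat=33):
    """Return (mean, worst) wall-clock seconds over `repeat` calls to func()."""
    durations = []
    for _ in range(repeat):
        start = perf_counter()
        func()
        durations.append(perf_counter() - start)
    return sum(durations) / len(durations), max(durations)


# Stand-ins for real RPC calls (assumptions, for illustration only).
def fake_get():
    return 'Ready'


def fake_set():
    return 'OK'


get_mean, get_worst = time_calls(fake_get)
set_mean, set_worst = time_calls(fake_set)
print("get: mean %.6f s, worst %.6f s" % (get_mean, get_worst))
print("set: mean %.6f s, worst %.6f s" % (set_mean, set_worst))
```

Reporting the worst case alongside the mean helps separate a steady per-call cost (such as a permissions check on every set) from occasional outliers.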

11/22/2005

11/18/2005

These changes help speed the telescope configuration by removing the delays and by allowing the config_tool to use the set_values_array() interface as needed, rather than having to make individual set_value calls.

08/16/2005

02/24/2005

An issue arose concerning the total number of threads that Grail can spawn. When Ron DuPlain made an attempt to register every manager in the GBT system, Grail failed to allocate more than 253 threads. Since Grail starts at least 3 threads per Manager client, and there are 142 managers in the current GBT Telescope system (release 5.1), Grail was attempting to create some 426 threads. The problem turns out to be that the default stack size per thread is 8MB. Since the process space is 3GB, this works out to about 380 threads before process space is exhausted, if the process space were given over entirely to thread stacks. In reality, other parts of the process get process space, so this number is closer to 250. Note that this memory is not actually used! The process space is simply the amount of memory that the process could conceivably address given the memory addressing hardware of the machine's architecture. Only portions of memory actually used are mapped to physical memory by the hardware. Thus, one can conceivably exhaust the entire process space by reserving portions of it for potential use (as happened here) while actually using only a few megabytes of physical memory.

I tested to see if this was the problem by using ulimit to temporarily set the default thread stack size to 128K and ran Grail on leeloo. This time Grail was able to create all the needed threads. I am implementing a more permanent fix by explicitly setting the stack size for threads created in Grail by using the pthread_attr_setstacksize() call before every call to pthread_create(). I am also modifying the Thread template class in ygor/libraries/Threads to allow this size to be specified (if it is not specified, the default set by ulimit applies). I am leaving alone threads spawned by other libraries, such as RPC++ (in ShVxClient), to minimize impact on any other applications. Though one of these is created for every manager client Grail creates, the savings elsewhere should allow Grail to meet its needs.
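
The arithmetic above, and the idea of requesting a smaller per-thread stack, can be illustrated in Python. This is only an analogy to the pthread_attr_setstacksize() fix in Grail's C++ code: threading.stack_size() is Python's own knob, the 256K value is an arbitrary choice for the illustration, and some platforms may reject a given size with ValueError or RuntimeError.

```python
import threading

KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

PROCESS_SPACE = 3 * GIB  # process space quoted in the note above


def max_threads(stack_size):
    """Upper bound on threads if the whole process space held only stacks."""
    return PROCESS_SPACE // stack_size


default_limit = max_threads(8 * MIB)   # 8MB default stacks
small_limit = max_threads(128 * KIB)   # 128K stacks, as in the ulimit test
print(default_limit, small_limit)      # 384 24576

# Ask for smaller stacks for threads created from here on.
previous = threading.stack_size(256 * KIB)
results = []


def worker(n):
    results.append(n)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Restore whatever stack size was in effect before.
threading.stack_size(previous)
```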

02/22/2005

Toney Minter found a bug in the way Grail handles dynamic String parameters. Ygor has two kinds of String parameters: static and dynamic. The static kind are the most common. These parameters use a programmer-defined hard limit on their length. The dynamic kind are much rarer. Their size is set by the user when the user sets the parameter. Grail was mishandling these so that they could not be set at all (for example, the polycoDatFile parameter in the SpectralProcessor). Grail was only allowing access to dynamic parameters by requiring an index to access their individual elements, but the individual string elements are not of interest; the entire string is (thus no index is needed to get/set the parameter).

Fixed this by testing the parameter to see if it is a BasicType::String, and if so, handling it differently. Dynamic non-String parameters are still handled the same as before.

01/06/2005

12/22/2004

12/16/2004

Grail Quality of Service Improvements

Grail suffered from not being able to handle large loads, and also did not insulate itself and its clients from a problem client. This means that heavy use of callbacks, or a misbehaving client, could freeze Grail and any other Grail clients. (See also Grail Test Plan.)

The following has been done to fix this problem:

One issue remains: SOAP Cleanup on the server side if a client dies is not yet working properly. This should be completed soon.

11/04/2004

Grail memory leaks fixed

The following Grail SOAP interface functions produced memory leaks:

In addition, callbacks from samplers and parameters also produced memory leaks.

These have all been fixed, and in the process I've streamlined the SOAP data structures I use, along with some of the code in these functions. None of these changes require any changes to any Grail clients. All have been tested by sending up to 100,000 requests to each of these functions and observing Grail memory use with top. All changes apply to both the Linux and Solaris Grail.

The problem was that I misunderstood how gSOAP manages memory in the SOAP serializer. After carefully reading the gSOAP manual (a comprehensive though not very direct document, and short on meaningful examples) I was able to figure out how to manage memory in dynamic gSOAP arrays. More details about gSOAP memory management can be found here.

10/28/2004

Grail ported to Linux!

Grail has been ported to Linux. In the Solaris Grail the SOAP interface thread would create the DeviceClients (including the recipient RPC server file handles) while handling a client request, and a different thread, the M&C RPC++ recipient server thread, would service the DeviceClient recipient servers. In Solaris, file handles are global and can be used in any thread like this. Under Linux, file handles to be used by a select() call must have been created in the same thread that runs the select() call. Somehow the SOAP client thread had to be able to notify the RPC select() thread (waiting on socket file handles) to create the DeviceClient for it. The solution was for Grail to make an RPC call to itself, one thread to the other, to have the RPC thread jump out of the select() call and create DeviceClient on behalf of the SOAP service thread. This was done by adding a CREATE_CLIENT RPC method to the Grail Status Service RPC service. The GrailStatusService class in turn has been merged into the DeviceClientMap class to allow this to be easily done.
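
The general pattern, in which one thread asks the thread blocked in select() to wake up and create a resource on its behalf, can be sketched in Python. This is an analogy only: Grail uses an RPC call to itself (the CREATE_CLIENT method), whereas the sketch uses a socket pair and a queue, and SelectLoop and its methods are hypothetical names.

```python
import queue
import select
import socket
import threading


class SelectLoop:
    """A thread that blocks in select() and creates clients on request."""

    def __init__(self):
        # One end is watched by select(); the other is written to wake it.
        self._wake_r, self._wake_w = socket.socketpair()
        self._requests = queue.Queue()
        self.clients = {}
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def create_client(self, name):
        """Called from any other thread; returns the created client."""
        done = threading.Event()
        self._requests.put((name, done))
        self._wake_w.send(b'x')          # kick the select() thread
        done.wait(timeout=5)
        return self.clients[name]

    def stop(self):
        done = threading.Event()
        self._requests.put((None, done))
        self._wake_w.send(b'x')
        done.wait(timeout=5)

    def _run(self):
        while True:
            readable, _, _ = select.select([self._wake_r], [], [])
            if self._wake_r in readable:
                self._wake_r.recv(1)     # one byte per queued request
                name, done = self._requests.get()
                if name is None:
                    done.set()
                    return
                # The "client" is created in the select() thread itself.
                self.clients[name] = (name, threading.get_ident())
                done.set()


loop = SelectLoop()
client = loop.create_client('LO1')
created_elsewhere = client[1] != threading.get_ident()
loop.stop()
print(client[0], created_elsewhere)
```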

Other Grail enhancements

08/03/2004

Grail now has a multithreaded SOAP server. Grail no longer blocks out service requests until the previous service request is handled. Instead, it starts a new thread to handle the request and immediately resumes listening for more requests. This allows Grail to support by default HTTP 1.1 connections, which keep the socket open by default until the client closes it. It also allows Grail to be more responsive to requests. Now a client will block only if the client desires access to a device that another client is currently accessing. In this case, as soon as the previous client finishes with the device, the next client can access it, even if the previous client keeps its connection open.
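
The blocking rule described above (clients wait only when they want the same device) amounts to one lock per device. A minimal Python sketch follows; it is an illustration of the idea rather than Grail's C++ implementation, and DeviceLocks and handle_request are hypothetical names.

```python
import threading
from collections import defaultdict


class DeviceLocks:
    """Hands out one lock per device name."""

    def __init__(self):
        self._guard = threading.Lock()          # protects the dict itself
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, device):
        with self._guard:
            return self._locks[device]


locks = DeviceLocks()
log = []


def handle_request(client, device):
    # Each request runs in its own thread and waits only for its device.
    with locks.lock_for(device):
        log.append((client, device))


# Simulate some earlier client that is still busy with LO1.
lo1 = locks.lock_for('LO1')
lo1.acquire()

a = threading.Thread(target=handle_request, args=('A', 'LO1'))
b = threading.Thread(target=handle_request, args=('B', 'DCR'))
a.start()
b.start()

b.join()                      # B wanted a different device: not blocked
a_was_blocked = a.is_alive()  # A is still waiting for LO1
lo1.release()                 # the earlier client finishes with LO1
a.join()
print(log, a_was_blocked)
```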

I have also modified the SOAP interface for Grail to fully support WSDL and also to support anonymous (as before) and named parameters (new) simultaneously. I am trying to bring the Grail SOAP interface up to the latest standards while getting it ready to migrate to gSOAP 2.6. WSDL has been successfully tested with SOAPpy and Grail built with gSOAP 2.3 and 2.6.

Part of the motivation to use HTTP 1.1 Keep-Alive connections was to improve throughput and turn-around times for clients who wish to periodically and repeatedly make requests to Grail, as noted in an earlier entry. I found that SOAPpy does not support this, but does allow different transport classes to be specified in the SOAPProxy class. So I wrote an HTTP 1.1 transport class (based largely on the older one) to do some timing tests. The results were surprising. First, here is the test code:

from time import time

from grailclient import GrailClient

cl = GrailClient("titan", 18000, cb_port=19591)

def test():
    # Time a single get_value() round trip to Grail.
    begin = time()
    cl.get_value('Accelerometer', 'state')
    end = time()
    print("Call took %s seconds" % (end - begin))

On executing test() with the old HTTP 1.0 transport class, this call typically took about 10 ms. Using the new HTTP11Transport class, the client behaved as expected: the connection to Grail remained open between calls. However, the timing went up by an order of magnitude, taking approximately 100 ms per call!

This counter-intuitive result led me to try another Python SOAP client, ZSI, with the same results. I also tried to test this using Perl and SOAP::Lite, but gave up (for now) because of my lack of Perl knowledge. I measured the time Grail was taking to process the requests, hoping to find something. Grail was spending most of its time waiting in the soap_recv() call. This means that the client still may be at fault, if it does not finish the transmission in good time. Building Grail with gSOAP 2.6 did not improve things. The bottom line is that with respect to execution times Grail will behave as before when called from an unmodified SOAPpy library.

07/22/2004

Optional verbosity re-introduced

Amy Shelton requested that I add back into Grail some of the verbosity that I eliminated for release 4.4. The reason she wanted this is for feedback during testing of turtle on the Antenna simulator. For this, she runs Grail interactively, and this feedback is useful in ensuring that Grail requests are going to this Grail and not to the real system on vortex.

To accommodate this request, I added the command line switches

    -v, --verbose

to Grail. This allows Grail to remain quiet when being run by TaskMaster, but print out useful information when run interactively.

Inconsistent state bug fixed

A bug was found, fixed, and patched in Grail 4.4 on 7/20. The bug manifested itself whenever an M&C device (say, DCR) went down and came back on-line again. The DeviceClient on Grail for that device would be left in a state where it believed it was subscribed to some parameters on the device, but was not. Thus values for 'state' would eventually become inconsistent with the actual value on the device and could conflict with values reported by CLEO, for example. Control of the device was unaffected, but, as far as previously registered parameters were concerned, Grail was blind (new parameter registrations worked OK).

The problem was caused by the recovery method not having been refactored to the new model of Grail parameter handling. This was fixed by having the recovery method re-register all subscribed parameters with the new recipient client on the resurrected device.
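
In outline, the fix keeps the list of subscribed parameters separate from the recipient connection and replays it after a reconnect. The Python sketch below shows only that idea; DeviceClientSketch and FakeRecipient are hypothetical stand-ins and do not reflect the actual C++ classes.

```python
class FakeRecipient:
    """Stand-in for a recipient client on a device (assumption)."""

    def __init__(self):
        self.registered = []

    def register(self, parameter):
        self.registered.append(parameter)


class DeviceClientSketch:
    def __init__(self, connect):
        self._connect = connect        # callable returning a fresh recipient
        self._subscribed = set()       # survives device restarts
        self.recipient = connect()

    def subscribe(self, parameter):
        self._subscribed.add(parameter)
        self.recipient.register(parameter)

    def recover(self):
        """Called when the device comes back on-line."""
        self.recipient = self._connect()
        # Replay every subscription on the new recipient, so the client's
        # belief about what is subscribed matches reality again.
        for parameter in sorted(self._subscribed):
            self.recipient.register(parameter)


client = DeviceClientSketch(FakeRecipient)
client.subscribe('state')
client.subscribe('status')
old_recipient = client.recipient

client.recover()
print(client.recipient.registered)
```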

The TIME_WAIT problem

The current Grail has a single threaded SOAP server, and requires clients to connect, transact, and disconnect for every transaction made with Grail (This is the default model for HTTP 1.0).

This causes a problem that was noted recently when Paul Marganian started working on a lightweight M&C status screen that uses Grail. Paul's code makes many requests of Grail every second. Because of the nature of TCP connections, each Grail connection leaves a file handle on the host machine unused and unusable for a period set in the networking stack of that machine (on Solaris, this is 240 seconds by default). This phenomenon can be seen by running the following command on a command line on the Grail host, after making a series of Grail requests:

[rcreager@titan rcreager]$netstat -a | grep 18000
      *.18000              *.*                0      0 33232      0 LISTEN
titan.18000          lycaste.gb.nrao.edu.4619 16126      0 33580      0 TIME_WAIT
titan.18000          lycaste.gb.nrao.edu.4700 16126      0 33580      0 TIME_WAIT
titan.18000          lycaste.gb.nrao.edu.4717 16126      0 33580      0 TIME_WAIT
titan.18000          lycaste.gb.nrao.edu.4732 16126      0 33580      0 TIME_WAIT
titan.18000          lycaste.gb.nrao.edu.4745 16126      0 33580      0 TIME_WAIT
titan.18000          lycaste.gb.nrao.edu.4758 16126      0 33580      0 TIME_WAIT

Here, 6 requests were made in quick succession on a Grail running on the Solaris host titan, port 18000. One can easily see that, if many requests are made continuously every second, titan might eventually run out of file handles, and no new requests to any services on titan will work until some of these sockets have finally timed out (this depends on how heavily titan was loaded to begin with, and the rate at which requests are made of Grail or any other titan services). This is not a bug, and using tricks to avoid this is Not A Good Thing! (See the "Programming Unix Sockets in C" FAQ for a detailed explanation of why the TIME_WAIT state is necessary for any newly closed TCP socket.)

There are two possible solutions to this problem:

  1. Change the TIME_WAIT time-out period on the host machine to something much lower than 240 seconds
  2. Keep the socket open when the same client is making multiple transactions with Grail

The first one is easy to do, if you are the sysadmin for that machine. The downside is that this is a system variable, and thus affects all TCP communications on that machine. Lowering it to something like 10-15 seconds should not break anything though, and indeed Joe Bandt did do this to virgo when he ran into the same problem with the Antenna Characterization SOAP interface he uses on the Antenna manager.

The second solution is somewhat more involved, but has several things going for it. The solution involves making Grail's SOAP interface support the HTTP Keep-Alive option. This also requires making the SOAP interface multithreaded: Keep-Alive would lock out other clients if the interface were to remain single-threaded. Internally, Grail is already multi-threaded, so no other changes would be required.

There are a few good reasons to pursue this second approach:

  1. Using a multithreaded SOAP server would make Grail scale better; a client wishing to make a Grail transaction would not have to wait for another client to finish first, provided both clients weren't after the same device.
  2. Using Keep-Alive would improve the transaction turnaround time, as now the transaction would not involve opening a socket and then closing it (on both ends). Keep-Alive in conjunction with asynchronous messaging would greatly improve data throughput for sampler callbacks as well. (The use of Keep-Alive is recommended by the gSOAP Manual to improve performance.)
  3. The number of sockets left in the TIME_WAIT state on the server now will be dependent on the recent (within 4 minutes) number of Grail clients, not the number of transactions. This will be a far lower and less variable number.

Of course, the client end must also support Keep-Alive, otherwise the system will behave as before. I have yet to figure out how to make SOAPpy do this, but SOAPpy is based on Python's httplib and httplib can do this. I have emailed one of the SOAPpy maintainers, asking if there is a ready way to do this; lacking an answer, I will have to go through the SOAPpy source. Fortunately, it is not very big.
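
The effect of Keep-Alive on the number of connections can be seen with nothing but the Python standard library. The sketch below is not Grail or SOAPpy code: it uses the modern http.client and http.server modules (the successors of httplib), a throwaway local server, and plain GET requests instead of SOAP calls. The server records the client-side port of each request; with a reused connection all requests arrive from the same port.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

seen_client_ports = []


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # enables persistent connections

    def do_GET(self):
        seen_client_ports.append(self.client_address[1])
        body = b"OK"
        self.send_response(200)
        # Content-Length is required so the client knows where the
        # response ends without the socket being closed.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass                        # keep the example quiet


# Port 0 lets the operating system pick a free port.
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One connection object, several requests: the socket stays open.
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
for _ in range(3):
    conn.request("GET", "/")
    conn.getresponse().read()       # must drain before the next request
conn.close()

server.shutdown()
server.server_close()
print(len(seen_client_ports), len(set(seen_client_ports)))
```

Three requests over one connection leave at most one socket to pass through TIME_WAIT, instead of three.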

06/30/2004

Grail verbosity reduced

The old Grail was fairly verbose, making it vulnerable to being aborted on a SIGXFSZ signal (file size exceeded) as its log file approached the limit set in vortexProc.conf.

All non-error output has been wrapped in #if defined(DEBUG)/#endif conditional directives, which means that if compiled with no DEBUG defined Grail is a whole lot quieter.

New 'loConfig' bug fixed.

Why did this problem reappear? This is actually a new bug, with the same symptoms as the old 'loConfig' bug. Grail used the virtual function Panel::reportComplete() to indicate that all registered parameter values have been loaded from the manager. Since the old Grail registered all parameters up front, this was OK. Only one reportComplete() is received on registering all parameters, and therefore there is a guarantee that all parameters registered have valid data. (In the old 'loConfig' bug, the test to see if reportComplete() had been received fell through prematurely; see earlier notes on 'loConfig'.)

The new Grail registers parameters on demand. On creating a manager client, Grail registers 'state' and 'status' and requests values for those. When configuring, this is rapidly followed by a series of new parameter requests. This results in multiple reportComplete() calls being received. A race condition could set in where the new parameter requests could see the original 'state' and 'status' reportComplete() (or reportComplete() for a previous parameter request) and think that values had been received for the new parameters. The fix was to give each parameter its own condition variable and wait for it to be broadcast in Panel::reportParameter(), when the actual data is received. This is much more positive: either there is real data or there is a real error.
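
The per-parameter wait can be sketched in Python with one threading.Condition per parameter. This is an illustration of the idea, not the C++ code: ParameterSlot is a hypothetical name, and the error text is borrowed from the Grail log message. The point of the example is that data arriving for 'state' does not satisfy a wait on 'loConfig'.

```python
import threading


class ParameterSlot:
    """Holds one parameter's value plus its own condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._has_value = False
        self._value = None

    def report(self, value):
        """Called when real data for this parameter arrives."""
        with self._cond:
            self._value = value
            self._has_value = True
            self._cond.notify_all()   # broadcast to every waiter

    def wait_for_value(self, timeout):
        with self._cond:
            # wait_for() re-checks the predicate, so a wakeup without
            # data for *this* parameter is never mistaken for success.
            if not self._cond.wait_for(lambda: self._has_value, timeout):
                raise TimeoutError("Device failed to respond")
            return self._value


slots = {'state': ParameterSlot(), 'loConfig': ParameterSlot()}

# Data arrives for 'state' only.
slots['state'].report('Ready')
state_value = slots['state'].wait_for_value(timeout=1.0)

try:
    slots['loConfig'].wait_for_value(timeout=0.05)
    lo_config_timed_out = False
except TimeoutError:
    lo_config_timed_out = True

print(state_value, lo_config_timed_out)
```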

fixGoHang helps Grail too!

The new Grail (4.4) is vulnerable to an interesting Manager/Panel interaction issue: if a parameter has no value (such as a dynamic array with 0 elements), no reportParameter() is received after issuing a getValue/getValues (and no reportComplete() either, if this was the only parameter registered). This is the issue fixGoHang addresses: it actually sets these parameters to have at least one element. Joe is looking into the proper fix for this. One alternative is to not allow parameters with no values; another is for the Manager to send reportParameter()/reportComplete() even if no value exists, with some way to tell the Panel user that there is no value associated with that parameter (by setting the 'len' parameter in reportParameter() to 0, for example).

In the current release, this problem manifests itself when either (or both) the Antenna or LO1 Managers is restarted. Grail may report problems with either of these two devices, or the ScanCoordinator, which also may deal with these two devices. The config tool will throw an exception or generate an error message with the following text in it:

Device failed to respond

A look at the Grail log (/home/gbt/etc/log/vortex/Grail.< pid >.< date&time >) will show an entry like this:

53187 12:44:53
Caught DeviceClientException:   Device: ScanCoordinator.ScanCoordinator
                                Problem: Device failed to respond
                                Location: ParameterCache::subscribe(int)

The short-term fix is to run fixGoHang.

04/01/2004

I have started work refactoring Grail. The biggest functional change from the 4.2 version is that parameters will no longer be automatically registered for callbacks and cached. This will occur on demand as parameters are needed. This will considerably reduce network traffic between Grail and the M&C system.

Other refactoring:

These changes have resulted in a considerable slimming down of DeviceClient accompanied by only a very modest increase in complexity in the ParameterCache (formerly GenericRecipient) class, and the creation of a similar companion SamplerCache class.

03/25/2004

'loConfig' bug

Caused by fresh data buffer containing bogus data.

Symptoms

Sometimes, Grail returns an error message on an attempt to read or set an enum parameter. Because this has first and most often been observed with the LO1 parameter loConfig, I call this the 'loConfig bug'.

In the Python based GrailClient, the error looks like this:

>>> LO1.get_value("loConfig")
<Fault SOAP-ENV:Client: Device error: LO1.LO1: No value returned for parameter loConfig>
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "grailclient.py", line 421, in get_value
    return self.cl.get_value(self.dev, path)
  File "grailclient.py", line 249, in get_value
    return self.cl.get_value(device, path)
  File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 362, in __call__
    return self.__r_call(*args, **kw)
  File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 384, in __r_call
    self.__hd, self.__ma)
  File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 306, in __call
    raise p
SOAPpy.Types.faultType: <Fault SOAP-ENV:Client: Device error: LO1.LO1: No value returned for parameter loConfig>

A look through Grail's logs reveals this:

1 - 17:1:32.8962: accepted connection from IP = 192.33.116.175 socket = 5
LO1.LO1 initialized correctly
DataNamedValues::value2Name(-1431655766, 1627e0, 200): EnumerationParser::findName(1627e0, 200, -1431655766) failed, error code -3
Parameter::get_value(): data = aa aa aa aa
Parameter::get_value() failure: 0 = _ddp->getFieldValueStr(loConfig, 18ee18, 4, 1627e0, 0)
Caught DeviceClientException:   Device: LO1.LO1
                                Problem: No value returned for parameter loConfig
                                Location: DeviceClient::get_value()
 request served in 0:0:0.252247

If the attempt to read/set the value had succeeded, it would look like this:

2 - 17:2:21.6822: accepted connection from IP = 192.33.116.175 socket = 5
 request served in 0:0:0.005856

Finally, under the wrong conditions, this can cause Grail to hang, because it exposes a synchronization error caused by the read exception unwinding leaving a mutex set. (This has since been fixed in the latest Grail, but the loConfig bug itself remains.)

Cause of the loConfig bug:

The bug will occur when a read or set operation is requested of a DeviceClient which has not yet been constructed. The sequence of events is as follows:

  1. Request is made to set or get a parameter (say loConfig) from a DeviceClient object (say LO1) which does not yet exist.
  2. Grail creates the LO1 client
  3. Grail instructs the LO1 client to getValues(), to populate parameter cache
  4. Grail waits for LO1 to notify reportComplete
  5. Grail gets/sets the value

The problem occurs between items 4 and 5. The test for notification of reportComplete was prematurely falling through. Thus, Grail was attempting to use a value before it was actually received from the manager. When this happens, what follows depends on the type of parameter. For all numeric parameters, an improbable value might be returned, such as a NaN for a voltage, etc. But no exception will be thrown. For an enum, an exception will be thrown, because a bad value means that the findName routine (of class EnumerationParser in the DataDescription library) will fail. Thus the problem is not confined to enums, just made very visible by them.
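
The log excerpt earlier in this entry shows both the raw buffer contents (aa aa aa aa) and the integer handed to the enumeration lookup (-1431655766). The short Python check below confirms that these are the same thing: four 0xaa fill bytes read as a signed 32-bit integer. The enumeration table in the sketch is purely hypothetical (the real loConfig names are not given here); it only illustrates why a lookup on such a value finds nothing.

```python
import struct

# The four bytes reported by Parameter::get_value() in the Grail log.
raw = bytes([0xaa, 0xaa, 0xaa, 0xaa])

# Interpret them as a signed 32-bit integer. All four bytes are equal,
# so the byte order chosen here makes no difference.
(value,) = struct.unpack('<i', raw)
print(value)   # -1431655766

# Hypothetical enumeration table, for illustration only.
EXAMPLE_ENUM_NAMES = {0: 'ExampleA', 1: 'ExampleB'}
name = EXAMPLE_ENUM_NAMES.get(value)
print(name)    # None: no name corresponds to the fill pattern
```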

Solution

This problem was fixed by switching to the new TCondition<> condition variable template class in Ygor/libraries/Threads. This condition variable works, so Grail always properly waits for the reportComplete.

-- RamonCreager - 19 Mar 2004

Revision r1.20 - 13 Dec 2006 - 21:26 GMT - RamonCreager
Content copyright © 1999-2007 by the contributing authors.
All material on this collaboration platform is the property of the contributing authors.