| <<O>> Difference Topic GrailNotes (r1.20 - 13 Dec 2006 - RamonCreager) |
| Changed: | |
| < < |
Grail Development Notes |
| > > |
Grail Development Notes |
| Changed: | |
| < < |
05/30/2006
Grail Call Execution Speed |
| > > |
12/13/2006
Field Callback improvements
Grail has been modified to improve the field-level callback services. In addition to providing callbacks where an entire parameter or sampler may be returned, Grail now accepts callback subscriptions for individual sampler/parameter fields. This can significantly simplify the processing of the callback, and also significantly reduce the Grail client's processing overhead, especially when all that is needed is one or two fields of a complex sampler. All examples that follow use the quadrant detector and assume the following:
>>> from gbt.ygor import GrailClient
>>> from pprint import pprint
>>> cl = GrailClient("goauld", 18000, cb_port=19592)
>>> # Callback function: print whatever value Grail delivers
>>> def cat(device, sampler, value):
...     pprint(value)
...
>>> qd = cl.create_manager('QuadrantDetector')
To register a callback for a sampler/parameter field, one provides the desired field(s) -- minus the root sampler/parameter name -- as an extra parameter to both versions of GrailClient's reg_sampler() or reg_param(). To illustrate, this call registers a callback for the entire 'monitorData' sampler:
>>> qd.reg_sampler('monitorData', cat)
but this call registers a callback only for the 'n12VOK' field of the 'monitorData' sampler:
>>> qd.reg_sampler('monitorData', cat, 'n12VOK')
The value will be returned just as if qd.get_sampler_value('monitorData,n12VOK') had been called:
'1'
(This indicates that this power supply is OK.) Multiple fields for the sampler/parameter may be specified in the same subscription. These are separated by semicolons:
>>> qd.reg_sampler('monitorData', cat, 'n12VOK;n15VOK;p5VOK')
The values returned will now be a semicolon-delimited list of values, with the ordering matching the order in which they were subscribed:
'1;1;1'
(Note that the callback is still called every time the sampler is updated, not only when the field changes. So if a sampler updates but the registered field has not changed, callbacks are received nonetheless.)
Using field callbacks for samplers/parameters with no fields
The field callback syntax requires a field name to specify a field. What of parameters/samplers that have only one root field? Up to now, there was no way to specify this. Grail has been modified to allow for this. For such samplers/parameters, one specifies the root field using '.'. The following example registers a field callback for the quadrant detector's 'state' parameter:
>>> qd.reg_param('state', cat, '.')
This is desirable because now there is no need to parse the entire 'state' parameter to get the value; the value is returned just as if qd.get_value('state') had been called:
>>> qd.off()
'OK'
>>> 'Off'
'Off'
>>> qd.on()
'OK'
>>> 'Activating'
'Activating'
'Aborting'
'Ready'
(Note that the 'cat' callback function above does nothing but print the value.)
Time stamps and field callbacks
In Ygor, samplers are returned with a time stamp. The time stamp is not an integral part of the sampler structure itself, however; it is merely returned with the sampler structure. Therefore, the normal sampler callback mechanism does not return one (yet). However, the time stamp values may be obtained using the field callback mechanism. Two special fields are recognized: TS:MJD and TS:seconds. If these fields are provided during the field callback subscription, the MJD and seconds-in-the-day values will be returned along with the values of any other subscribed fields:
>>> qd.reg_sampler('monitorData', cat, 'n12VOK;n15VOK;p5VOK;TS:MJD;TS:seconds')
>>>
'1;1;1;54082;69355.437475'
Here, returned with the three requested values, are the MJD and number of seconds elapsed during the day (UT).
NOTE: Grail can return time stamps this way for both samplers and parameters. It should be noted however that Ygor only supports time stamps for samplers; Grail adds the parameter time stamp when it receives the parameter callback from the manager.
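As a rough illustration of how a client might consume these values, the sketch below splits a semicolon-delimited callback value back into named fields and converts the TS:MJD / TS:seconds pair into a UT date and time. The helper names parse_field_callback and timestamp_of are hypothetical and not part of GrailClient, and the sketch assumes that no field value itself contains a semicolon.

```python
from datetime import datetime, timedelta

# Modified Julian Date 0 corresponds to 1858-11-17 00:00 UT.
MJD_EPOCH = datetime(1858, 11, 17)


def parse_field_callback(fields, value):
    """Pair each subscribed field name with its returned value.

    `fields` is the subscription string given at registration time and
    `value` is the string delivered to the callback (hypothetical helper).
    """
    names = fields.split(';')
    values = value.split(';')
    if len(names) != len(values):
        raise ValueError("expected %d values, got %d" % (len(names), len(values)))
    return dict(zip(names, values))


def timestamp_of(parsed):
    """Convert the TS:MJD and TS:seconds entries into a datetime (UT)."""
    mjd = int(parsed['TS:MJD'])
    seconds = float(parsed['TS:seconds'])
    return MJD_EPOCH + timedelta(days=mjd, seconds=seconds)


parsed = parse_field_callback('n12VOK;n15VOK;p5VOK;TS:MJD;TS:seconds',
                              '1;1;1;54082;69355.437475')
print(parsed['n12VOK'], timestamp_of(parsed))
```

Because the values arrive in subscription order, keeping the subscription string next to the callback is enough to label them; no parsing of the full sampler is needed.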
05/30/2006
Grail Call Execution Speed |
| Changed: | |
| < < |
11/22/2005 |
| > > |
11/22/2005 |
| Changed: | |
| < < |
11/18/2005 |
| > > |
11/18/2005 |
| Changed: | |
| < < |
08/16/2005 |
| > > |
08/16/2005 |
| Changed: | |
| < < |
02/24/2005 |
| > > |
02/24/2005 |
| Changed: | |
| < < |
02/22/2005 |
| > > |
02/22/2005 |
| Changed: | |
| < < |
01/06/2005 |
| > > |
01/06/2005 |
| Changed: | |
| < < |
12/22/2004 |
| > > |
12/22/2004 |
| Changed: | |
| < < |
12/16/2004
Grail Quality of Service Improvements |
| > > |
12/16/2004
Grail Quality of Service Improvements |
| Changed: | |
| < < |
11/04/2004
Grail memory leaks fixed |
| > > |
11/04/2004
Grail memory leaks fixed |
| Changed: | |
| < < |
10/28/2004
Grail ported to Linux! |
| > > |
10/28/2004
Grail ported to Linux! |
| Changed: | |
| < < |
Other Grail enhancements |
| > > |
Other Grail enhancements |
| Changed: | |
| < < |
08/03/2004 |
| > > |
08/03/2004 |
| Changed: | |
| < < |
07/22/2004 |
| > > |
07/22/2004 |
| Changed: | |
| < < |
Optional verbosity re-introduced |
| > > |
Optional verbosity re-introduced |
| Changed: | |
| < < |
Inconsistent state bug fixed |
| > > |
Inconsistent state bug fixed |
| Changed: | |
| < < |
The
|
| > > |
The
|
| Changed: | |
| < < |
06/30/2004 |
| > > |
06/30/2004 |
| Changed: | |
| < < |
Grail verbosity reduced |
| > > |
Grail verbosity reduced |
| Changed: | |
| < < |
New 'loConfig' bug fixed. |
| > > |
New 'loConfig' bug fixed. |
| Changed: | |
| < < |
fixGoHang helps Grail too! |
| > > |
fixGoHang helps Grail too! |
| Changed: | |
| < < |
04/01/2004 |
| > > |
04/01/2004 |
| Changed: | |
| < < |
03/25/2004 |
| > > |
03/25/2004 |
| Changed: | |
| < < |
'loConfig' bug |
| > > |
'loConfig' bug |
| Changed: | |
| < < |
Symptoms |
| > > |
Symptoms |
| Changed: | |
| < < |
Cause of the loConfig bug: |
| > > |
Cause of the loConfig bug: |
| Changed: | |
| < < |
Solution |
| > > |
Solution |
| <<O>> Difference Topic GrailNotes (r1.19 - 01 Jun 2006 - RamonCreager) |
| Changed: | |
| < < | Occasionally, questions arise about the speed with which Grail can handle an individual SOAP request. This appears to be prompted by a perception that SOAP is slow. Therefore I have conducted a series of tests to lay this issue to rest. In order to measure Grail, and not the performance of the Python clients, I completed the C++ Grail client library, at least enough to conduct the test. The test was conducted with two versions of Grail: non-multithreaded SOAP, with HTTP 1.0 protocol, and multithreaded SOAP, which can support HTTP 1.1. I wrote a test program using the C++ Grail client which did the following: |
| > > | Occasionally, questions arise about the speed with which Grail can handle an individual SOAP request. This appears to be prompted by a perception that SOAP is slow. Therefore I have conducted a series of tests to lay this issue to rest. In order to differentiate between the performance of Grail and the performance of the Python clients, I completed the C++ Grail client library, at least enough to conduct the tests. I assumed that the tests using the C++ Grail client would provide a minimum baseline time. The call times would be shared between both client and server, but would be the smallest possible times that could reasonably be expected to perform an RPC call to Grail. If RPC calls using Python clients show a significant time difference, these then could be ascribed to the Python client, and not to Grail. The test was conducted with two versions of Grail: non-multithreaded SOAP, with HTTP 1.0 protocol, and multithreaded SOAP, which can support HTTP 1.1 (SOAPpy clients use HTTP 1.0 only). I wrote a test program using the C++ Grail client which did the following: |
| Changed: | |
| < < |
This amounts to 33 RPC calls to Grail in all. The calls were made from a client running on colossus to a Grail instance running on goauld. The times are as follows:
|
| > > |
This amounts to 33 RPC calls to Grail in all. The calls were made from a client running on colossus to a Grail instance running on goauld. In the Python clients, RPC get and set calls were timed separately (more on this below). The times are as follows:
|
| <<O>> Difference Topic GrailNotes (r1.18 - 30 May 2006 - RamonCreager) |
| Added: | |
| > > |
05/30/2006
Grail Call Execution Speed
Occasionally, questions arise about the speed with which Grail can handle an individual SOAP request. This appears to be prompted by a perception that SOAP is slow. Therefore I have conducted a series of tests to lay this issue to rest. In order to measure Grail, and not the performance of the Python clients, I completed the C++ Grail client library, at least enough to conduct the test. The test was conducted with two versions of Grail: non-multithreaded SOAP, with HTTP 1.0 protocol, and multithreaded SOAP, which can support HTTP 1.1. I wrote a test program using the C++ Grail client which did the following:
colossus to a Grail instance running on goauld. The times are as follows:
|
| <<O>> Difference Topic GrailNotes (r1.17 - 22 Nov 2005 - RamonCreager) |
| Added: | |
| > > |
11/22/2005
11/18/2005
config_tool to use the set_values_array() interface as needed, rather than having to make individual set_value calls. |
| <<O>> Difference Topic GrailNotes (r1.16 - 22 Aug 2005 - RamonCreager) |
| Added: | |
| > > |
08/16/2005
|
| <<O>> Difference Topic GrailNotes (r1.15 - 25 Feb 2005 - RamonCreager) |
| Changed: | |
| < < |
|
| > > |
An issue arose concerning the total number of threads that Grail can spawn. When Ron DuPlain made an attempt to register every manager in the GBT system, Grail failed to allocate more than 253 threads. Since Grail starts at least 3 threads per Manager client, and there are 142 managers in the current GBT Telescope system (release 5.1), Grail was attempting to create some 426 threads. The problem turns out to be that the default stack size per thread is 8 MB. Since the process space is 3 GB, this works out to about 380 threads before process space is exhausted, if the process space were given over entirely to thread stacks. In reality, other parts of the process get process space, so this number is closer to 250. Note that this memory is not actually used! The process space is simply the amount of memory that the process could conceivably address given the memory addressing hardware of the machine's architecture. Only portions of memory actually used are mapped to physical memory by the hardware. Thus, one can conceivably exhaust the entire process space by reserving portions of it for potential use (as happened here) while actually using only a few megabytes of physical memory.
I tested to see if this was the problem by using ulimit to temporarily set the default thread stack size to 128K and ran Grail on leeloo. This time Grail was able to create all the needed threads. I am implementing a more permanent fix by explicitly setting the stack size for threads created in Grail by using the pthread_attr_setstacksize() call before every call to pthread_create(). I am also modifying the Thread template class in ygor/libraries/Threads to allow this size to be specified (if it is not, the default set by ulimit applies). I am leaving alone threads spawned by other libraries, such as RPC++ (in ShVxClient), to minimize impact on any other applications. Though one of these is created for every manager client Grail creates, the savings elsewhere should allow Grail to meet its needs.
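The arithmetic behind these numbers can be checked with a short sketch. It is written in Python rather than the C++ used by Grail, so it uses Python's threading.stack_size() as a stand-in for pthread_attr_setstacksize(); the function names max_threads and run_with_stack_size are hypothetical, and binary units (1 MB = 1024 * 1024 bytes) are assumed.

```python
import threading

MIB = 1024 * 1024
GIB = 1024 * MIB


def max_threads(process_space, stack_size):
    """Upper bound on threads if the address space held nothing but stacks."""
    return process_space // stack_size


def run_with_stack_size(n_threads, stack_size):
    """Start n_threads trivial threads with an explicit per-thread stack size."""
    previous = threading.stack_size(stack_size)  # applies to threads created next
    try:
        done = []
        threads = [threading.Thread(target=done.append, args=(i,))
                   for i in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return len(done)
    finally:
        threading.stack_size(previous)  # restore the earlier setting


print(142 * 3)                            # threads Grail tried to create: 426
print(max_threads(3 * GIB, 8 * MIB))      # 384 with the 8 MB default stack
print(max_threads(3 * GIB, 128 * 1024))   # 24576 with 128K stacks
```

With 8 MB stacks the theoretical ceiling (384) is already below the 426 threads needed, which is consistent with the failure described above; with 128K stacks the ceiling is far out of reach.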
02/22/2005
Toney Minter found a bug in the way Grail handles dynamic String parameters. Ygor has two kinds of String parameters: static and dynamic. The static kind are the most common. These parameters use a programmer-defined hard limit on their length. The dynamic kind are much rarer. Their size is set by the user when the user sets the parameter. Grail was mishandling these so that they could not be set at all (for example, the polycoDatFile parameter in the SpectralProcessor). Grail was only allowing access to dynamic parameters by requiring an index to access their individual elements, but the individual string elements are not of interest; the entire string is (thus no index is needed to get/set the parameter).
This was fixed by testing the parameter to see if it is a BasicType::String, and if so, handling it differently. Dynamic non-String parameters are still handled the same as before. |
| <<O>> Difference Topic GrailNotes (r1.14 - 24 Feb 2005 - RamonCreager) |
| Added: | |
| > > |
02/24/2005
01/06/2005
|
| <<O>> Difference Topic GrailNotes (r1.13 - 30 Dec 2004 - RamonCreager) |
| Changed: | |
| < < | Grail suffers from not being able to handle large loads, and also does not insulate itself and its clients from a problem client. This means that heavy use of callbacks, or a misbehaving client, could freeze Grail and any other Grail clients. (See also Grail Test Plan.) |
| > > | Grail suffered from not being able to handle large loads, and also did not insulate itself and its clients from a problem client. This means that heavy use of callbacks, or a misbehaving client, could freeze Grail and any other Grail clients. (See also Grail Test Plan.) |
| <<O>> Difference Topic GrailNotes (r1.12 - 22 Dec 2004 - RamonCreager) |
| Changed: | |
| < < |
12/17/2004 |
| > > |
12/22/2004 |
| Changed: | |
| < < |
|
| > > |
|
| Added: | |
| > > |
|
| <<O>> Difference Topic GrailNotes (r1.11 - 17 Dec 2004 - RamonCreager) |
| Added: | |
| > > |
12/17/2004
|
| <<O>> Difference Topic GrailNotes (r1.10 - 16 Dec 2004 - RamonCreager) |
| Added: | |
| > > |
12/16/2004
Grail Quality of Service Improvements
Grail suffers from not being able to handle large loads, and also does not insulate itself and its clients from a problem client. This means that heavy use of callbacks, or a misbehaving client, could freeze Grail and any other Grail clients. (See also Grail Test Plan.) The following has been done to fix this problem:
|
| <<O>> Difference Topic GrailNotes (r1.9 - 04 Nov 2004 - RamonCreager) |
| Changed: | |
| < < |
10/28/2004 |
| > > |
11/04/2004
Grail memory leaks fixed
The following Grail SOAP interface functions produced memory leaks:
top. All changes apply to both the Linux and Solaris Grail.
|
| Added: | |
| > > |
The problem was that I misunderstood how gSOAP manages memory in the SOAP serializer. After carefully reading the gSOAP manual (a comprehensive though not very direct document, and short on meaningful examples), I was able to figure out how to manage memory in dynamic gSOAP arrays. More details about gSOAP memory management can be found here.
10/28/2004 |
| <<O>> Difference Topic GrailNotes (r1.8 - 28 Oct 2004 - RamonCreager) |
| Added: | |
| > > |
10/28/2004
Grail ported to Linux!
Grail has been ported to Linux. In the Solaris Grail the SOAP interface thread would create the DeviceClients (including the recipient RPC server file handles) while handling a client request, and a different thread, the M&C RPC++ recipient server thread, would service the DeviceClient recipient servers. In Solaris, file handles are global and can be used in any thread like this. Under Linux, file handles to be used by a select() call must have been created in the same thread that runs the select() call. Somehow the SOAP client thread had to be able to notify the RPC select() thread (waiting on socket file handles) to create the DeviceClient for it. The solution was for Grail to make an RPC call to itself, one thread to the other, to have the RPC thread jump out of the select() call and create the DeviceClient on behalf of the SOAP service thread. This was done by adding a CREATE_CLIENT RPC method to the Grail Status Service RPC service. The GrailStatusService class in turn has been merged into the DeviceClientMap class to allow this to be easily done.
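The general shape of this hand-off can be sketched in Python. Note that this is not Grail's mechanism: Grail makes an RPC call to itself, whereas the sketch below swaps in a local socket pair as the wake-up channel. The class SelectLoop and its methods are hypothetical names, and the "client" created here is only a placeholder tuple.

```python
import queue
import select
import socket
import threading


class SelectLoop:
    """A thread that blocks in select() and creates objects on request."""

    def __init__(self):
        # The socket pair is the wake-up channel (stand-in for the self-RPC).
        self._wake_r, self._wake_w = socket.socketpair()
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, name="select-loop",
                                        daemon=True)
        self._thread.start()

    def create_client(self, name):
        """Callable from any thread; the creation runs in the select() thread."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((name, reply))
        self._wake_w.send(b"x")          # forces select() to return
        return reply.get(timeout=5)

    def stop(self):
        self._requests.put(None)
        self._wake_w.send(b"x")
        self._thread.join(timeout=5)
        self._wake_r.close()
        self._wake_w.close()

    def _run(self):
        while True:
            readable, _, _ = select.select([self._wake_r], [], [])
            if self._wake_r in readable:
                self._wake_r.recv(1)     # one byte per queued request
                request = self._requests.get()
                if request is None:
                    return
                name, reply = request
                # Placeholder for creating the DeviceClient in this thread.
                reply.put((name, threading.current_thread().name))


loop = SelectLoop()
print(loop.create_client("DCR"))
loop.stop()
```

The requesting thread blocks only until the select() thread has done the work on its behalf, which mirrors the role of the SOAP service thread in the description above.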
Other Grail enhancements
|
| <<O>> Difference Topic GrailNotes (r1.7 - 03 Aug 2004 - RamonCreager) |
| Added: | |
| > > |
08/03/2004
Grail now has a multithreaded SOAP server. Grail no longer blocks out service requests until the previous service request is handled. Instead, it starts a new thread to handle the request and immediately resumes listening for more requests. This allows Grail to support by default HTTP 1.1 connections, which keep the socket open by default until the client closes it. It also allows Grail to be more responsive to requests. Now a client will block only if the client desires access to a device that another client is currently accessing. In this case, as soon as the previous client finishes with the device, the next client can access it, even if the previous client keeps its connection open. I have also modified the SOAP interface for Grail to fully support WSDL and also to support anonymous (as before) and named parameters (new) simultaneously. I am trying to bring the Grail SOAP interface up to the latest standards while getting it ready to migrate to gSOAP 2.6. WSDL has been successfully tested with SOAPpy and Grail built with gSOAP 2.3 and 2.6. Part of the motivation to use HTTP 1.1 Keep-Alive connections was to improve throughput and turn-around times for clients who wish to periodically and repeatedly make requests to Grail, as noted in an earlier entry. I found that SOAPpy does not support this, but does allow different transport classes to be specified in the SOAPProxy class. So I wrote an HTTP 1.1 transport class (based largely on the older one) to do some timing tests. The results were surprising. First, here is the test code:
from time import time
from grailclient import GrailClient
cl = GrailClient("titan", 18000, cb_port=19591)
def test():
    # Time a single get_value() round trip to Grail.
    begin = time()
    cl.get_value('Accelerometer', 'state')
    end = time()
    print("Call took %s seconds" % (end - begin))
On executing test() with the old HTTP 1.0 transport class, this call typically took about 10 ms. Using the new HTTP11Transport class, the client behaved as expected: the connection to Grail remained open between calls. However, the timing went up by an order of magnitude, taking approximately 100 ms per call!
This counter-intuitive result led me to try another Python SOAP client, ZSI, with the same results. I also tried to test this using Perl and SOAP::Lite, but gave up (for now) because of my lack of Perl knowledge. I measured the time Grail was taking to process the requests, hoping to find something. Grail was spending most of its time waiting in the soap_recv() call. This means that the client still may be at fault, if it does not finish the transmission in good time. Building Grail with gSOAP 2.6 did not improve things. The bottom line is that with respect to execution times Grail will behave as before when called from an unmodified SOAPpy library. |
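A single timed call is noisy, so a comparison such as 10 ms versus 100 ms is easier to trust when repeated. The sketch below is one possible way to do that; time_calls is a hypothetical helper, and fake_call is only a stand-in for a real request such as cl.get_value('Accelerometer', 'state'), which needs a running Grail.

```python
from time import perf_counter


def time_calls(func, repeat=20):
    """Call func() `repeat` times; return (fastest, mean) duration in seconds."""
    durations = []
    for _ in range(repeat):
        begin = perf_counter()
        func()
        durations.append(perf_counter() - begin)
    return min(durations), sum(durations) / len(durations)


def fake_call():
    # Stand-in workload; replace with the Grail request being measured.
    sum(range(1000))


fastest, mean = time_calls(fake_call)
print("fastest %.6f s, mean %.6f s" % (fastest, mean))
```

Reporting the fastest time alongside the mean helps separate a consistently slow transport from occasional outliers.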
| <<O>> Difference Topic GrailNotes (r1.6 - 22 Jul 2004 - RamonCreager) |
| Added: | |
| > > |
Optional verbosity re-introduced
Amy Shelton requested that I add back into Grail some of the verbosity that I eliminated for release 4.4. The reason she wanted this is for feedback during testing of turtle on the Antenna simulator. For this, she runs Grail interactively, and this feedback is useful in ensuring that Grail requests are going to this Grail and not to the real system on vortex.
To accommodate this request, I added the command line switches
-v, --verbose
to Grail. This allows Grail to remain quiet when being run by TaskMaster, but print out useful information when run interactively.
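The behavior of such a switch can be sketched as follows. Grail itself is written in C++, so this Python fragment only illustrates the idea; parse_args and make_logger are hypothetical names, and only the -v / --verbose switches come from the note above.

```python
import argparse


def parse_args(argv=None):
    """Parse the verbosity switch; quiet unless -v/--verbose is given."""
    parser = argparse.ArgumentParser(description="verbosity switch sketch")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print informational messages as well as errors")
    return parser.parse_args(argv)


def make_logger(verbose):
    """Return a log function that always prints errors, and info if verbose."""
    def log(message, error=False):
        if error or verbose:
            print(message)
            return True
        return False
    return log


log = make_logger(parse_args(["--verbose"]).verbose)
log("request received for Antenna")        # printed only in verbose mode
log("device failed to respond", error=True)  # always printed
```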
|
| Changed: | |
| < < | Bug found and fixed and patched in Grail 4.4 on 7/20. The bug manifested itself whenever a device went down, whether it crashed or was terminated, and came back up again. The DeviceClient on Grail for that device would be left by this in a state where it believed it was subscribed to some parameters on the device, but was not. Thus values for 'state' could conflict with values reported by CLEO, for example. Control was unaffected, but, as far as previously registered parameters were concerned, Grail was blind (new parameter registrations worked OK). |
| > > | Bug found and fixed and patched in Grail 4.4 on 7/20. The bug manifested itself whenever an M&C device (say, DCR) went down and came back on-line again. The DeviceClient on Grail for that device would be left by this in a state where it believed it was subscribed to some parameters on the device, but was not. Thus values for 'state' would eventually become inconsistent with the actual value on the device and could conflict with values reported by CLEO, for example. Control of the device was unaffected, but, as far as previously registered parameters were concerned, Grail was blind (new parameter registrations worked OK). |
| Changed: | |
| < < | The current Grail has a single threaded SOAP server, and requires clients to connect, transact, and disconnect for every transaction made with Grail (This is the default model for HTTP). |
| > > | The current Grail has a single threaded SOAP server, and requires clients to connect, transact, and disconnect for every transaction made with Grail (This is the default model for HTTP 1.0). |
| Changed: | |
| < < | This causes a problem that was noted recently when Paul Marganian started working on a lightweight M&C status screen that uses Grail. Paul's code makes many requests of Grail every second. Because of the nature of TCP connections, each Grail connection leaves a file handle on the host machine unused and unusable for a period set in the networking stack of that machine (on Solaris, this is 240 seconds by default). This phenomenon can be seen by running the following command on a command line on the Grail host, after making a series of Grail requests: |
| > > | This causes a problem that was noted recently when Paul Marganian started working on a lightweight M&C status screen that uses Grail. Paul's code makes many requests of Grail every second. Because of the nature of TCP connections, each Grail connection leaves a file handle on the host machine unused and unusable for a period set in the networking stack of that machine (on Solaris, this is 240 seconds by default). This phenomenon can be seen by running the following command on a command line on the Grail host, after making a series of Grail requests: |
| Changed: | |
| < < |
Here, 6 requests were made in quick succession on a Grail running on the Solaris host titan. One can easily see that, by continuously making many requests a second, titan might eventually run out of file handles and no new requests to any services on titan will work until some of these sockets have finally timed out (this depends on how heavily titan was loaded to begin with, and the rate at which requests are made of Grail or any other titan services). This is not a bug, and using tricks to avoid this is Not A Good Thing (see the "Programming Unix Sockets in C" FAQ for a detailed explanation of why the TIME_WAIT state is necessary for any newly closed socket.
|
| > > |
Here, 6 requests were made in quick succession on a Grail running on the Solaris host titan, port 18000. One can easily see that, by continuously making many requests a second, titan might eventually run out of file handles and no new requests to any services on titan will work until some of these sockets have finally timed out (this depends on how heavily titan was loaded to begin with, and the rate at which requests are made of Grail or any other titan services). This is not a bug, and using tricks to avoid this is Not A Good Thing! (See the "Programming Unix Sockets in C" FAQ for a detailed explanation of why the TIME_WAIT state is necessary for any newly closed TCP socket.)
There are two possible solutions to this problem:
virgo when he ran into the same problem with the Antenna Characterization SOAP interface he uses on the Antenna manager.
The second solution is somewhat more involved, but has several things going for it. The solution involves making Grail's SOAP interface support the HTTP Keep-Alive option. This also requires making the SOAP interface multithreaded: Keep-Alive would lock out other clients if the interface were to remain single-threaded. Internally, Grail is already multi-threaded, so no other changes would be required.
There are a few good reasons to pursue this second approach:
SOAPpy do this, but SOAPpy is based on Python's httplib and httplib can do this. I have emailed one of the SOAPpy maintainers, asking if there is a ready way to do this; lacking an answer, I will have to go through the SOAPpy source. Fortunately, it is not very big. |
| <<O>> Difference Topic GrailNotes (r1.5 - 22 Jul 2004 - RamonCreager) |
| Added: | |
| > > |
07/22/2004
Inconsistent state bug fixed
Bug found and fixed and patched in Grail 4.4 on 7/20. The bug manifested itself whenever a device went down, whether it crashed or was terminated, and came back up again. The DeviceClient on Grail for that device would be left by this in a state where it believed it was subscribed to some parameters on the device, but was not. Thus values for 'state' could conflict with values reported by CLEO, for example. Control was unaffected, but, as far as previously registered parameters were concerned, Grail was blind (new parameter registrations worked OK). The problem was caused by the recovery method not having been refactored to the new model of Grail parameter handling. This was fixed by having the recovery method re-register all subscribed parameters with the new recipient client on the resurrected device. The
The current Grail has a single threaded SOAP server, and requires clients to connect, transact, and disconnect for every transaction made with Grail (This is the default model for HTTP).
This causes a problem that was noted recently when Paul Marganian started working on a lightweight M&C status screen that uses Grail. Paul's code makes many requests of Grail every second. Because of the nature of TCP connections, each Grail connection leaves a file handle on the host machine unused and unusable for a period set in the networking stack of that machine (on Solaris, this is 240 seconds by default). This phenomenon can be seen by running the following command on a command line on the Grail host, after making a series of Grail requests:
|
| <<O>> Difference Topic GrailNotes (r1.4 - 01 Jul 2004 - RamonCreager) |
| Changed: | |
| < < |
NOTE: This bug fix exposed an interesting Manager/Panel interaction issue: If a parameter has no value (such as a dynamic array with 0 elements), no reportParameter() is received after issuing a getValue/getValues (and no reportComplete() either, if this was the only parameter registered). This is the issue FixGoHang addresses: it actually sets these parameters to have at least one element. Joe is looking into the proper fix for this. Alternatives could be to not have parameters with no values, or to send reportParameter()/reportComplete() even if no value exists, with some way to show that there is no value associated with that parameter (by setting the 'len' parameter in reportParameter() to 0, for example).
|
| > > |
fixGoHang helps Grail too!
The new Grail (4.4) is vulnerable to an interesting Manager/Panel interaction issue: If a parameter has no value (such as a dynamic array with 0 elements), no reportParameter() is received after issuing a getValue/getValues (and no reportComplete() either, if this was the only parameter registered). This is the issue fixGoHang addresses: it actually sets these parameters to have at least one element. Joe is looking into the proper fix for this. One alternative could be to not allow parameters with no values; another is for the Manager to send reportParameter()/reportComplete() even if no value exists, with some way to tell the Panel user that there is no value associated with that parameter (by setting the 'len' parameter in reportParameter() to 0, for example).
In the current release, this problem manifests itself when either (or both) the Antenna or LO1 Managers is restarted. Grail may report problems with either of these two devices, or the ScanCoordinator, which also may deal with these two devices. The config tool will throw an exception or generate an error message with the following text in it:
Device failed to respond
A look at the Grail log (/home/gbt/etc/log/vortex/Grail.< pid >.< date&time >) will show an entry like this:
53187 12:44:53
Caught DeviceClientException: Device: ScanCoordinator.ScanCoordinator
Problem: Device failed to respond
Location: ParameterCache::subscribe(int)
The short-term fix is to run fixGoHang. |
| <<O>> Difference Topic GrailNotes (r1.3 - 30 Jun 2004 - RamonCreager) |
| Added: | |
| > > |
06/30/2004
Grail verbosity reduced
The old Grail was fairly verbose, making it vulnerable to being aborted on a SIGXFSZ signal (file size exceeded) as its log file approached the limit set in vortexProc.conf. All non-error output has been wrapped in #if defined(DEBUG)/#endif conditional directives, which means that if compiled with no DEBUG defined Grail is a whole lot quieter.
New 'loConfig' bug fixed.
Why did this problem reappear? This is actually a new bug, with the same symptoms as the old 'loConfig' bug. Grail used the virtual function Panel::reportComplete() to indicate that all registered parameter values have been loaded from the manager. Since the old Grail registered all parameters up front, this was OK. Only one reportComplete() is received on registering all parameters, and therefore there is a guarantee that all parameters registered have valid data. (In the old 'loConfig' bug, the test to see if reportComplete() had been received fell through prematurely; see earlier notes on 'loConfig'.)
The new Grail registers parameters on demand. On creating a manager client, Grail registers 'state' and 'status' and requests values for those. When configuring, this is rapidly followed by a series of new parameter requests. This results in multiple reportComplete() calls being received. A race condition could set in where the new parameter requests could see the original 'state' and 'status' reportComplete() (or reportComplete() for a previous parameter request) and think that values had been received for the new parameters. The fix was to give each parameter its own condition variable and wait for it to be broadcast in Panel::reportParameter(), when the actual data is received. This is much more positive: either there is real data or there is a real error.
NOTE: This bug fix exposed an interesting Manager/Panel interaction issue: If a parameter has no value (such as a dynamic array with 0 elements), no reportParameter() is received after issuing a getValue/getValues (and no reportComplete() either, if this was the only parameter registered). This is the issue FixGoHang addresses: it actually sets these parameters to have at least one element. Joe is looking into the proper fix for this. Alternatives could be to not have parameters with no values, or to send reportParameter()/reportComplete() even if no value exists, with some way to show that there is no value associated with that parameter (by setting the 'len' parameter in reportParameter() to 0, for example).
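The idea of one condition variable per parameter can be sketched in Python. This is an illustration only, not Grail's C++ code: the class ParameterCache here, its method names, and the timeout value are all assumptions made for the example. The timeout branch corresponds to the case where no reportParameter() ever arrives.

```python
import threading


class ParameterCache:
    """Each parameter gets its own condition variable and 'ready' flag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def _entry(self, name):
        with self._lock:
            if name not in self._entries:
                self._entries[name] = {"cond": threading.Condition(),
                                       "value": None, "ready": False}
            return self._entries[name]

    def report_parameter(self, name, value):
        """Called when data for `name` actually arrives from the manager."""
        entry = self._entry(name)
        with entry["cond"]:
            entry["value"] = value
            entry["ready"] = True
            entry["cond"].notify_all()   # wake every waiter on this parameter

    def wait_for(self, name, timeout=1.0):
        """Block until this parameter has data; other parameters are ignored."""
        entry = self._entry(name)
        with entry["cond"]:
            if not entry["cond"].wait_for(lambda: entry["ready"], timeout=timeout):
                raise TimeoutError("Device failed to respond: %s" % name)
            return entry["value"]


cache = ParameterCache()
threading.Timer(0.05, cache.report_parameter, args=("state", "Ready")).start()
print(cache.wait_for("state"))
```

Because a waiter is released only by data for its own parameter, a completion notice for 'state' or 'status' cannot be mistaken for the arrival of a later request's value.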
|
| Changed: | |
| < < |
The problem occurs between items 4 and 5. Sometimes (not always!) the reportComplete is received from the manager and processed before Grail has had a chance to copy and process the parameter data received from the manager. When this happens, what follows depends on the type of parameter. For all numeric parameters, an improbable value might be returned, such as a NaN for a voltage, etc. But no exception will be thrown. For an enum, an exception will be thrown, because a bad value means that the findName routine (of class EnumerationParse in the DataDescription library) will fail. Thus the problem is not confined to enums, just made very visible by them.
This looks like a race condition. This feeling is reinforced by the fact that reading the value again immediately after failure results in success, despite the fact that no new getValues() call was made to get a new value. The problem seems strange because both the processing of parameter data and the processing of reportComplete happens sequentially in what is supposed to be one thread of execution. However, I do not yet know exactly how PanelServer and Recipient handle communications. Executing info threads from within gdb revealed many more threads than expected. It is possible that Recipient is starting threads for each communication received; this would explain the race condition.
|
| > > |
The problem occurs between items 4 and 5. The test for the reportComplete notification was falling through prematurely. Thus, Grail was attempting to use a value before it was actually received from the manager. When this happens, what follows depends on the type of parameter. For all numeric parameters, an improbable value might be returned, such as a NaN for a voltage, etc. But no exception will be thrown. For an enum, an exception will be thrown, because a bad value means that the findName routine (of class EnumerationParser in the DataDescription library) will fail. Thus the problem is not confined to enums, just made very visible by them.
|
| Changed: | |
| < < |
The best solution is to ensure that when a reportComplete event is received, all parameters have been processed. This means looking into how PanelServer sends the values and how Recipient handles them. It may mean changes to these libraries.
Meanwhile, a quick patch would be to wait 1 or 2 seconds after receiving the reportComplete notification before releasing the DeviceClient object for use. This would apply ONLY when creating a new one, not every time an object is used!
|
| > > |
This problem was fixed by switching to the new TCondition<> condition variable template class in Ygor/libraries/Threads. This condition variable works, so Grail always properly waits for the reportComplete. |
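The difference between a wait that can fall through early and one that cannot is sketched below. This is a Python illustration using the standard `threading.Condition`, not the C++ `TCondition<>` template, whose interface is not shown in these notes. The class `ReportCompleteGate` and its methods are hypothetical names. The point is that the waiter checks an explicit "complete" flag under the lock, so it returns only after the value has actually been stored.

```python
import threading


class ReportCompleteGate:
    """Hypothetical stand-in for the 'wait for reportComplete' step."""

    def __init__(self):
        self._cond = threading.Condition()
        self._complete = False
        self.value = None

    def deliver(self, value):
        # Called from the receiving thread: store the data first, then
        # set the flag and notify, all while holding the lock.
        with self._cond:
            self.value = value
            self._complete = True
            self._cond.notify_all()

    def wait_for_value(self, timeout=5.0):
        # wait_for() re-checks the predicate after every wakeup, so the
        # caller cannot proceed until the flag has really been set.
        with self._cond:
            if not self._cond.wait_for(lambda: self._complete, timeout=timeout):
                raise TimeoutError("no reportComplete received")
            return self.value


gate = ReportCompleteGate()
# Simulate the manager's reply arriving shortly afterwards on another thread.
threading.Timer(0.05, gate.deliver, args=("value-from-manager",)).start()
result = gate.wait_for_value()
```

Because the flag and the value are written under the same lock that the waiter holds when testing the flag, the waiter can never observe "complete" without also seeing the data.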
| <<O>> Difference Topic GrailNotes (r1.2 - 01 Apr 2004 - RamonCreager) |
| Changed: | |
| < < |
'loConfig' bug |
| > > |
04/01/2004
I have started work refactoring Grail. The biggest functional change from the 4.2 version is that parameters will no longer be automatically registered for callbacks and cached. Instead, registration and caching will occur on demand, as parameters are needed. This will considerably reduce network traffic between Grail and the M&C system. Other refactoring:
03/25/2004
'loConfig' bug |
| <<O>> Difference Topic GrailNotes (r1.1 - 19 Mar 2004 - RamonCreager) |
| Added: | |
| > > |
%META:TOPICINFO{author="RamonCreager" date="1079722260" format="1.0" version="1.1"}%
%META:TOPICPARENT{name="RamonCreager"}%
Grail Development Notes
'loConfig' bug
Caused by fresh data buffer containing bogus data.
Symptoms
Sometimes Grail returns an error message on an attempt to read or set an enum parameter. Because this was first, and has most often been, observed with the LO1 parameter loConfig, I call this the 'loConfig bug'.
In the Python based GrailClient, the error looks like this:
>>> LO1.get_value("loConfig")
<Fault SOAP-ENV:Client: Device error: LO1.LO1: No value returned for parameter loConfig>
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "grailclient.py", line 421, in get_value
return self.cl.get_value(self.dev, path)
File "grailclient.py", line 249, in get_value
return self.cl.get_value(device, path)
File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 362, in __call__
return self.__r_call(*args, **kw)
File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 384, in __r_call
self.__hd, self.__ma)
File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 306, in __call
raise p
SOAPpy.Types.faultType: <Fault SOAP-ENV:Client: Device error: LO1.LO1: No value returned for parameter loConfig>
A look through Grail's logs reveals this:
1 - 17:1:32.8962: accepted connection from IP = 192.33.116.175 socket = 5
LO1.LO1 initialized correctly
DataNamedValues::value2Name(-1431655766, 1627e0, 200): EnumerationParser::findName(1627e0, 200, -1431655766) failed, error code -3
Parameter::get_value(): data = aa aa aa aa
Parameter::get_value() failure: 0 = _ddp->getFieldValueStr(loConfig, 18ee18, 4, 1627e0, 0)
Caught DeviceClientException: Device: LO1.LO1
Problem: No value returned for parameter loConfig
Location: DeviceClient::get_value()
request served in 0:0:0.252247
If the attempt to read/set the value had succeeded, it would look like this:
2 - 17:2:21.6822: accepted connection from IP = 192.33.116.175 socket = 5
request served in 0:0:0.005856
Finally, under the wrong conditions, this can cause Grail to hang, because it exposes a synchronization error: the exception thrown by the failed read unwinds the stack and leaves a mutex locked. (This has since been fixed in the latest Grail, but the loConfig bug itself remains.)
Cause of the loConfig bug:
The bug will occur when a read or set operation is requested of a DeviceClient which has not yet been constructed. The sequence of events is as follows:
The problem occurs between items 4 and 5. Sometimes (not always!) the reportComplete is received from the manager and processed before Grail has had a chance to copy and process the parameter data received from the manager. When this happens, what follows depends on the type of parameter. For all numeric parameters, an improbable value might be returned, such as a NaN for a voltage, etc. But no exception will be thrown. For an enum, an exception will be thrown, because a bad value means that the findName routine (of class EnumerationParser in the DataDescription library) will fail. Thus the problem is not confined to enums, just made very visible by them.
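The log above shows the unfilled buffer as `data = aa aa aa aa`. A short Python sketch makes the difference between numeric and enum parameters concrete: the same four bytes decode silently as a number, but an enum lookup has nothing to map them to. The enum table and the `find_name` helper here are hypothetical stand-ins (the real loConfig names and the C++ findName routine are not reproduced in these notes); only the byte pattern and the resulting integer come from the log.

```python
import struct

raw = bytes([0xAA, 0xAA, 0xAA, 0xAA])  # "data = aa aa aa aa" from the log

# All four bytes are identical, so byte order does not affect the result.
as_int = struct.unpack("<i", raw)[0]    # signed 32-bit integer: -1431655766
as_float = struct.unpack("<f", raw)[0]  # 32-bit float: finite but meaningless

# Hypothetical enum table; the real loConfig value names are not given here.
enum_names = {0: "CONFIG_A", 1: "CONFIG_B"}


def find_name(value):
    """Illustrative stand-in for an enum value-to-name lookup."""
    try:
        return enum_names[value]
    except KeyError:
        raise ValueError("no enum name for value %d" % value)


print(as_int)    # matches the value passed to value2Name in the log
print(as_float)  # decodes without complaint, which is why numerics fail silently
```

Calling `find_name(as_int)` raises, which corresponds to the enum case where the failure becomes visible, while the numeric decodings simply return a wrong number.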
This looks like a race condition. This feeling is reinforced by the fact that reading the value again immediately after failure results in success, despite the fact that no new getValues() call was made to get a new value. The problem seems strange because both the processing of parameter data and the processing of reportComplete happen sequentially in what is supposed to be one thread of execution. However, I do not yet know exactly how PanelServer and Recipient handle communications. Executing info threads from within gdb revealed many more threads than expected. It is possible that Recipient is starting threads for each communication received; this would explain the race condition.
Solution
The best solution is to ensure that when a reportComplete event is received, all parameters have been processed. This means looking into how PanelServer sends the values and how Recipient handles them. It may mean changes to these libraries.
Meanwhile, a quick patch would be to wait 1 or 2 seconds after receiving the reportComplete notification before releasing the DeviceClient object for use. This would apply ONLY when creating a new one, not every time an object is used!
-- RamonCreager - 19 Mar 2004 |
|
Revision r1.1 - 19 Mar 2004 - 18:51 GMT - RamonCreager Revision r1.20 - 13 Dec 2006 - 21:26 GMT - RamonCreager |
Content copyright © 1999-2007 by the contributing authors. All material on this collaboration platform is the property of the contributing authors. |