>>> from gbt.ygor import GrailClient
>>> from pprint import pprint
>>> cl = GrailClient("goauld", 18000, cb_port=19592)
>>> # Call-back function:
>>> def cat(device, sampler, value):
...     pprint(value)
...
>>> qd = cl.create_manager('QuadrantDetector')
To register a callback for a sampler/parameter field, one provides the desired field(s), minus the root sampler/parameter name, as an extra parameter to both versions of GrailClient's reg_sampler() or reg_param(). To illustrate, this call registers a callback for the entire 'monitorData' sampler:
>>> qd.reg_sampler('monitorData', cat)
but this call registers a callback only for the 'n12VOK' field of the 'monitorData' sampler:
>>> qd.reg_sampler('monitorData', cat, 'n12VOK')
The value will be returned just as if qd.get_sampler_value('monitorData,n12VOK') had been called:
'1'
(This indicates that this power supply is OK.) Multiple fields for the sampler/parameter may be specified in the same subscription. These are separated by semicolons:
>>> qd.reg_sampler('monitorData', cat, 'n12VOK;n15VOK;p5VOK')
The values returned will now be a semicolon delimited list of values, with the ordering matching the order in which they were subscribed:
'1;1;1'
(Note that the callback will still be called every time the sampler is updated, not only when the field is updated. So if a sampler updates but the registered field has not changed, callbacks are received nonetheless.)
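Since the callback receives a single semicolon-delimited string, a client will usually want to pair each value with the field that was requested. The following is a minimal sketch of one way to do that. The helpers `split_fields` and `make_field_callback` are hypothetical and not part of GrailClient, and the sketch assumes that no value contains a semicolon itself.

```python
def split_fields(fields, value):
    """Pair each subscribed field name with its returned value.

    Both arguments are ';'-delimited strings in the same order, as
    described above. Hypothetical helper, not part of GrailClient.
    """
    names = fields.split(';')
    values = value.split(';')
    if len(names) != len(values):
        raise ValueError("expected %d values, got %d" % (len(names), len(values)))
    return dict(zip(names, values))


def make_field_callback(fields, handler):
    """Wrap `handler` so that it receives a dict instead of the raw string."""
    def callback(device, sampler, value):
        handler(device, sampler, split_fields(fields, value))
    return callback
```

With these in place, a registration could look like `qd.reg_sampler('monitorData', make_field_callback(fields, cat), fields)`, where `fields` is the same string in both positions so that names and values stay in step.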
>>> qd.reg_param('state', cat, '.')
This is desirable because now there is no need to parse the entire 'state' parameter to get the value; the value is returned just as if qd.get_value('state') had been called:
>>> qd.off()
'OK'
>>> 'Off'
'Off'
>>> qd.on()
'OK'
>>> 'Activating'
'Activating'
'Aborting'
'Ready'
(Note that the 'cat' callback function above does nothing but print the value.)
>>> qd.reg_sampler('monitorData', cat, 'n12VOK;n15VOK;p5VOK;TS:MJD;TS:seconds')
>>>
'1;1;1;54082;69355.437475'
Here, returned with the three requested values, are the MJD and number of seconds elapsed during the day (UT).
NOTE: Grail can return time stamps this way for both samplers and parameters. It should be noted however that Ygor only supports time stamps for samplers; Grail adds the parameter time stamp when it receives the parameter callback from the manager.
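The two time stamp fields can be combined into an ordinary date and time on the client side. This is a small sketch, assuming only the standard definition of MJD (day 0 is 1858-11-17 00:00 UT); the function name `timestamp_to_datetime` is made up for the illustration.

```python
from datetime import datetime, timedelta

# MJD 0 corresponds to 1858-11-17 00:00 UT.
MJD_EPOCH = datetime(1858, 11, 17)


def timestamp_to_datetime(mjd, seconds):
    """Convert the 'TS:MJD' and 'TS:seconds' strings to a naive UT datetime."""
    return MJD_EPOCH + timedelta(days=int(mjd), seconds=float(seconds))


# The value string from the example above; the last two fields are the time stamp.
fields = '1;1;1;54082;69355.437475'.split(';')
stamp = timestamp_to_datetime(fields[-2], fields[-1])
print(stamp.isoformat())
```

For the sample value shown above, this gives a time on 13 Dec 2006, shortly after 19:15 UT.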
Grail::show_managers()
projectId value 30 times, in 30 separate calls to Grail::set_value()
Grail::show_params()
Grail::show_samplers()
colossus to a Grail instance running on goauld. In the Python clients, RPC get and set calls were timed separately (more on this below). The times are as follows:
DeviceClientMap, an RPCserver, is a singleton, which means the first function to call DeviceClientMap::instance() will create it. When initializing, Grail first creates the rpc_task thread to make sure DeviceClientMap gets built in the right thread. On the RHEL4 machines, this thread did not run far enough before DeviceClientMap::instance() was called somewhere else in Grail's initialization procedure, causing DeviceClientMap to fail to respond to connections. Since the function of this service is to enable the creation of DeviceClients in the RPC service thread, this meant that no DeviceClient objects could be created when Grail runs on RHEL4 machines. The solution was to place a condition variable that allowed the Grail initialization to hold until rpc_task finishes creating the DeviceClientMap object, thus ensuring that it gets built in the right thread.
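The shape of that fix can be sketched in Python, using a condition variable so that every caller of `instance()` holds until the designated thread has built the object. This is an illustration only: the real code is C++, and the method name `create_in_this_thread` and the `owner` attribute are invented here.

```python
import threading


class DeviceClientMap(object):
    """Stand-in for the singleton; it records which thread built it."""
    _instance = None
    _ready = threading.Condition()

    def __init__(self):
        self.owner = threading.current_thread().name

    @classmethod
    def create_in_this_thread(cls):
        # Called only by the RPC thread: build the object, then wake waiters.
        with cls._ready:
            cls._instance = cls()
            cls._ready.notify_all()

    @classmethod
    def instance(cls, timeout=5.0):
        # Every other caller blocks here until the RPC thread has built it.
        with cls._ready:
            if not cls._ready.wait_for(lambda: cls._instance is not None, timeout):
                raise RuntimeError("DeviceClientMap was never created")
            return cls._instance


def rpc_task():
    DeviceClientMap.create_in_this_thread()
    # ... a real thread would now enter its RPC service loop ...


rpc_thread = threading.Thread(target=rpc_task, name="rpc_task")
rpc_thread.start()
owner = DeviceClientMap.instance().owner   # the main thread holds here if needed
rpc_thread.join()
```

The key point is that `instance()` never constructs the object itself, so the order in which the threads happen to run no longer decides which thread owns it.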
sleep() calls in ProcessParameterList(), in soap_handlers.cc, which together amounted to 3 seconds per configured manager.
ProcessParameterList() allowing the Grail client to specify whether, after setting all values, through the 'prepare' flag:
config_tool to use the set_values_array() interface as needed, rather than having to make individual set_value calls.
SamplerCache.cc, which used reference counts to keep track of whether a sampler stream should be stopped or started. Reference counts were lost if the new sampler stream did not respond with a value, causing an exception prior to the reference count increment. The new code drops the reference count idea, instead deciding whether to start a sampler stream by checking directly with the sampler stream to see if it is already started. This required modifications to the Monitor library. The code also now checks to see if there are any subscribers to the sampler before stopping the sampler stream by checking the EventManager. This required changes to the EventManager library. This is much more positive and does not result in a sampler stream being shut off while there are subscribers. For polls to sampler values, the call will re-start the stream if it has been shut down before, since it does not rely on reference counts. Finally, the problem where the sampler does not return a value when data is started has been solved in the newest sampler/monitor libraries.
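The difference from reference counting can be sketched as follows. All class and method names here are hypothetical stand-ins (the real code lives in SamplerCache.cc and the Monitor and EventManager libraries); the sketch only shows the decision logic of asking the stream and the subscriber list directly.

```python
class SamplerStream(object):
    """Hypothetical stand-in for a Monitor-library sampler stream."""
    def __init__(self):
        self.started = False

    def start(self):
        self.started = True

    def stop(self):
        self.started = False


class SamplerCache(object):
    def __init__(self, stream, subscribers):
        self.stream = stream
        self.subscribers = subscribers   # stand-in for the EventManager's subscriber list

    def ensure_started(self):
        # Ask the stream itself instead of trusting a counter.
        if not self.stream.started:
            self.stream.start()

    def subscribe(self, callback):
        self.ensure_started()
        self.subscribers.append(callback)

    def unsubscribe(self, callback):
        self.subscribers.remove(callback)
        # Stop only when nobody is left listening.
        if not self.subscribers:
            self.stream.stop()

    def poll(self):
        # A poll restarts a stream that was shut down earlier.
        self.ensure_started()
```

Because no counter is kept, an exception between "start" and "count" cannot leave the bookkeeping out of step with the actual state of the stream.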
sampler and parameter can now return internal information on the handling by Grail of the specified parameter. Type help sampler or help parameter on the tim command line for more information.
manager, samplers, and parameters now accept a switch, -r, which specifies that only registered managers, samplers and parameters, respectively, should be displayed.
ulimits to temporarily set the default thread stack size to 128K and ran Grail on leeloo. This time Grail was able to create all the needed threads. I am implementing a more permanent fix by explicitly setting the stack size for threads created in Grail by using the pthread_attr_setstacksize() call before every call to pthread_create(). I am also modifying the Thread template class in ygor/libraries/Threads to allow this size to be specified (if not, the default is the ulimits set default.) I am leaving alone threads spawned by other libraries, such as RPC++ (in ShVxClient), to minimize impact on any other applications. Though one of these is created for every manager client Grail creates, the savings elsewhere should allow Grail to meet its needs.
polycoDatFile parameter in the SpectralProcessor.) Grail was only allowing access to dynamic parameters by requiring an index to access their individual elements, but the individual string elements are not of interest, the entire string is (thus no index is needed to get/set the parameter.)
Fixed this by testing the parameter to see if it is a BasicType::String, and if so, handling it differently. Dynamic non-String parameters are still handled the same as before.
reg_sampler() or reg_parameter() SOAP interface call, except that these now support an extra parameter, a ';' delimited list of fields for the sampler or parameter. On callback, a ';' delimited list of values will be returned, in the same order as was specified in the registration call.
soap_handlers.cc file to remove the GrailCallback and GrailCallbackMap classes into their own files, GrailCallback.[h,cc]. This was done to achieve a more logical code layout: these two classes had little bearing on SOAP interface handlers.
tim, the status client.
CB_CLIENTS to the Status Server. This allows tim to display the callback URLs of all clients that have registered a callback. Useful for investigating problems.
DeviceClient::send_values(). This function was making a call to PanelRemote::newParam() with two Parameter member functions as arguments, Parameter::data() and Parameter::data_length(). The order of execution of the two arguments was significant, and is implementation dependent when placed in an argument list. On the Solaris version, the calls executed in the expected order, but on the Linux version, the order was reversed. This meant that the newParam() call was being made (on the Linux version) with new data but with the old data size. Fixed this by removing the significance of the order of execution.
ServerBase::poll_with_qos()). This means that no matter how busy Grail gets, the polling loop will still operate correctly, ensuring reconnection to re-started managers.
get_parameter()
get_sampler()
get_values_array()
show_managers()
show_params()
show_samplers()
top. All changes apply to both the Linux and Solaris Grail.
The problem was that I misunderstood how gSOAP manages memory in the SOAP serializer. After a careful reading of the gSOAP manual (a comprehensive though not very direct document, and short on meaningful examples), I was able to figure out how to manage memory in dynamic gSOAP arrays. More details about gSOAP memory management can be found here.
select() call must have been created in the same thread that runs the select() call. Somehow the SOAP client thread had to be able to notify the RPC select() thread (waiting on socket file handles) to create the DeviceClient for it. The solution was for Grail to make an RPC call to itself, one thread to the other, to have the RPC thread jump out of the select() call and create DeviceClient on behalf of the SOAP service thread. This was done by adding a CREATE_CLIENT RPC method to the Grail Status Service RPC service. The GrailStatusService class in turn has been merged into the DeviceClientMap class to allow this to be easily done.
get_values_array() call. Melinda Mello added this feature to cut down on the network traffic to Grail. Clients that must make repeated periodic calls to Grail for sampler and parameter values may now use get_values_array() to batch these requests into one single call. The result is far fewer sockets left in a TIME_WAIT state.
SOAPpy does not support this, but does allow different transport classes to be specified in the SOAPProxy class. So I wrote an HTTP 1.1 transport class (based largely on the older one) to do some timing tests. The results were surprising. First, here is the test code:
from grailclient import *
from time import time

cl = GrailClient("titan", 18000, cb_port=19591)

def test():
    # Time a single get_value() round trip to Grail.
    begin = time()
    cl.get_value('Accelerometer', 'state')
    end = time()
    print("Call took %s seconds" % (end - begin))
On executing test() with the old HTTP 1.0 transport class, this call typically took about 10 ms. Using the new HTTP11Transport class, the client behaved as expected: the connection to Grail remained open between calls. However, the timing went up by an order of magnitude, taking approximately 100 ms per call!
This counter-intuitive result led me to try another Python SOAP client, ZSI, with the same results. I also tried to test this using Perl and SOAP::Lite, but gave up (for now) because of my lack of Perl knowledge. I measured the time Grail was taking to process the requests, hoping to find something. Grail was spending most of its time waiting in the soap_recv() call. This means that the client still may be at fault, if it does not finish the transmission in good time. Building Grail with gSOAP 2.6 did not improve things. The bottom line is that with respect to execution times Grail will behave as before when called from an unmodified SOAPpy library.
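When comparing transports like this, a single timed call can be misleading, so it may help to repeat the call and look at the spread. The helper below is a generic sketch, not part of grailclient; `time_calls` is a made-up name, and it is written for a current Python 3 rather than the interpreter used in the test above.

```python
from time import perf_counter


def time_calls(func, repeat=20):
    """Call func() `repeat` times; return (min, mean, max) in seconds."""
    samples = []
    for _ in range(repeat):
        begin = perf_counter()
        func()
        samples.append(perf_counter() - begin)
    return min(samples), sum(samples) / len(samples), max(samples)
```

It could be used as `time_calls(lambda: cl.get_value('Accelerometer', 'state'))`, once with each transport class, to compare the minimum and the mean rather than one sample.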
vortex.
To accommodate this request, I added the command line switches
-v, --verbose
to Grail. This allows Grail to remain quiet when being run by TaskMaster, but print out useful information when run interactively.
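The same quiet-by-default pattern is easy to reproduce in a client script. This is only a sketch of the idea using Python's standard argparse module; Grail itself is a C++ program, and the `parse_args` and `log` helpers are invented for the illustration.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="verbose-switch sketch")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print progress information to the console")
    return parser.parse_args(argv)


def log(args, message):
    # Stay silent unless the switch was given.
    if args.verbose:
        print(message)
```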
TIME_WAIT problem
[rcreager@titan rcreager]$ netstat -a | grep 18000
*.18000 *.* 0 0 33232 0 LISTEN
titan.18000 lycaste.gb.nrao.edu.4619 16126 0 33580 0 TIME_WAIT
titan.18000 lycaste.gb.nrao.edu.4700 16126 0 33580 0 TIME_WAIT
titan.18000 lycaste.gb.nrao.edu.4717 16126 0 33580 0 TIME_WAIT
titan.18000 lycaste.gb.nrao.edu.4732 16126 0 33580 0 TIME_WAIT
titan.18000 lycaste.gb.nrao.edu.4745 16126 0 33580 0 TIME_WAIT
titan.18000 lycaste.gb.nrao.edu.4758 16126 0 33580 0 TIME_WAIT
Here, 6 requests were made in quick succession to a Grail running on the Solaris host titan, port 18000. It is easy to see that a client making many requests per second could eventually exhaust titan's file handles, after which no new requests to any service on titan will work until some of these sockets have finally timed out (how soon this happens depends on how heavily titan was loaded to begin with, and on the rate at which requests are made of Grail or any other titan service). This is not a bug, and using tricks to avoid this is Not A Good Thing! (See the "Programming Unix Sockets in C" FAQ for a detailed explanation of why the TIME_WAIT state is necessary for any newly closed TCP socket.)
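To watch how the number of lingering sockets grows under load, the netstat output can be counted by a small script. The sketch below assumes the Solaris-style layout shown above (local address in the first column as host.port, state in the last column); other platforms format this output differently. `count_time_wait` is a hypothetical helper.

```python
def count_time_wait(netstat_output, port):
    """Count sockets on local `port` that are in the TIME_WAIT state."""
    suffix = ".%d" % port
    count = 0
    for line in netstat_output.splitlines():
        columns = line.split()
        # Solaris netstat: local address first, state last.
        if len(columns) >= 2 and columns[0].endswith(suffix) and columns[-1] == "TIME_WAIT":
            count += 1
    return count


sample = """\
*.18000 *.* 0 0 33232 0 LISTEN
titan.18000 lycaste.gb.nrao.edu.4619 16126 0 33580 0 TIME_WAIT
titan.18000 lycaste.gb.nrao.edu.4700 16126 0 33580 0 TIME_WAIT
"""
print(count_time_wait(sample, 18000))   # prints 2
```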
There are two possible solutions to this problem:
TIME_WAIT time-out period on the host machine to something much lower than 240 seconds
virgo when he ran into the same problem with the Antenna Characterization SOAP interface he uses on the Antenna manager.
The second solution is somewhat more involved, but has several things going for it. The solution involves making Grail's SOAP interface support the HTTP Keep-Alive option. This also requires making the SOAP interface multithreaded: Keep-Alive would lock out other clients if the interface were to remain single-threaded. Internally, Grail is already multi-threaded, so no other changes would be required.
There are a few good reasons to pursue this second approach:
TIME_WAIT state on the server now will be dependent on the recent (within 4 minutes) number of Grail clients, not the number of transactions. This will be a far lower and less variable number.
SOAPpy do this, but SOAPpy is based on Python's httplib and httplib can do this. I have emailed one of the SOAPpy maintainers, asking if there is a ready way to do this; lacking an answer, I will have to go through the SOAPpy source. Fortunately, it is not very big.
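The combination described here (a multithreaded server that keeps connections open, and a client that reuses one connection for several requests) can be tried out in isolation with Python's standard library, where `http.client` is the current name of httplib. This is a toy sketch and not Grail's gSOAP interface: the handler, the "OK" reply and the `demo` function are all invented for the illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class KeepAliveHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 makes connections persistent by default.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"OK"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        # A persistent connection needs an explicit length for each reply.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def demo(requests=3):
    """Serve on an ephemeral port and send several requests over one socket."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), KeepAliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    local_ports = set()
    replies = []
    try:
        for _ in range(requests):
            conn.request("GET", "/")
            response = conn.getresponse()
            replies.append(response.read())      # drain before reusing the socket
            local_ports.add(conn.sock.getsockname()[1])
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return replies, local_ports
```

If the connection really is reused, `demo()` reports a single local port for all requests, which is the property that keeps the number of TIME_WAIT sockets tied to the number of clients rather than the number of transactions.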
#if defined(DEBUG)/#endif conditional directives, which means that if compiled with no DEBUG defined Grail is a whole lot quieter.
Panel::reportComplete() to indicate that all registered parameter values have been loaded from the manager. Since the old Grail registered all parameters up front, this was OK. Only one reportComplete() is received on registering all parameters, and therefore there is a guarantee that all parameters registered have valid data. (In the old 'loConfig' bug, the test to see if reportComplete() had been received fell through prematurely; see earlier notes on 'loConfig'.)
The new Grail registers parameters on demand. On creating a manager client, Grail registers 'state' and 'status' and requests values for those. When configuring, this is rapidly followed by a series of new parameter requests. This results in multiple reportComplete() calls being received. A race condition could set in where the new parameter requests could see the original 'state' and 'status' reportComplete() (or the reportComplete() for a previous parameter request) and think that values had been received for the new parameters. The fix was to give each parameter its own condition variable and wait for it to be broadcast in Panel::reportParameter(), when the actual data is received. This is much more positive: either there is real data or there is a real error.
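The per-parameter wait can be sketched in Python with one `threading.Event` per parameter (an Event is a small wrapper around a condition variable). The class and method names below are illustrative only and do not mirror the actual C++ ParameterCache; the point is that a waiter for one parameter cannot be released by data arriving for another.

```python
import threading


class ParameterCache(object):
    """Hypothetical sketch: one Event per parameter instead of one shared flag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = {}
        self._values = {}

    def _event_for(self, name):
        with self._lock:
            return self._events.setdefault(name, threading.Event())

    def report_parameter(self, name, value):
        # Called when real data for `name` arrives from the manager.
        with self._lock:
            self._values[name] = value
        self._event_for(name).set()

    def wait_for_value(self, name, timeout=5.0):
        # Blocks until data for this specific parameter has been reported.
        if not self._event_for(name).wait(timeout):
            raise RuntimeError("Device failed to respond: %s" % name)
        with self._lock:
            return self._values[name]
```

A report for 'state' or 'status' sets only its own event, so a request for a newly registered parameter either sees its own data or times out with an error.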
reportParameter() is received after issuing a getValue/getValues (and no reportComplete() either, if this was the only parameter registered). This is the issue fixGoHang addresses: it actually sets these parameters to have at least one element. Joe is looking into the proper fix for this. Alternatives could be to not allow parameters with no values; another is for the Manager to send reportParameter()/reportComplete() even if no value exists, with some way to tell the Panel user that there is no value associated with that parameter (by setting the 'len' parameter in reportParameter() to 0, for example).
In the current release, this problem manifests itself when either (or both) the Antenna or LO1 Managers is restarted. Grail may report problems with either of these two devices, or the ScanCoordinator, which also may deal with these two devices. The config tool will throw an exception or generate an error message with the following text in it:
Device failed to respond
A look at the Grail log (/home/gbt/etc/log/vortex/Grail.<pid>.<date&time>) will show an entry like this:
53187 12:44:53
Caught DeviceClientException: Device: ScanCoordinator.ScanCoordinator
Problem: Device failed to respond
Location: ParameterCache::subscribe(int)
The short-term fix is to run fixGoHang.
ygor/libraries/Util directory as EventDispatcher.h.
DeviceClientMap.cc
LO1 parameter loConfig, I call this the 'loConfig bug'.
In the Python based GrailClient, the error looks like this:
>>> LO1.get_value("loConfig")
<Fault SOAP-ENV:Client: Device error: LO1.LO1: No value returned for parameter l
oConfig>
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "grailclient.py", line 421, in get_value
return self.cl.get_value(self.dev, path)
File "grailclient.py", line 249, in get_value
return self.cl.get_value(device, path)
File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 362, in __call__
return self.__r_call(*args, **kw)
File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 384, in __r_call
self.__hd, self.__ma)
File "F:\bin\Python\Lib\site-packages\SOAPpy\Client.py", line 306, in __call
raise p
SOAPpy.Types.faultType: <Fault SOAP-ENV:Client: Device error: LO1.LO1: No value
returned for parameter loConfig>
A look through Grail's logs reveals this:
1 - 17:1:32.8962: accepted connection from IP = 192.33.116.175 socket = 5
LO1.LO1 initialized correctly
DataNamedValues::value2Name(-1431655766, 1627e0, 200): EnumerationParser::findNa
me(1627e0, 200, -1431655766) failed, error code -3
Parameter::get_value(): data = aa aa aa aa
Parameter::get_value() failure: 0 = _ddp->getFieldValueStr(loConfig, 18ee18, 4, 1627e0, 0)
Caught DeviceClientException: Device: LO1.LO1
Problem: No value returned for parameter loConfig
Location: DeviceClient::get_value()
request served in 0:0:0.252247
If the attempt to read/set the value had succeeded, it would look like this:
2 - 17:2:21.6822: accepted connection from IP = 192.33.116.175 socket = 5
request served in 0:0:0.005856
Finally, under the wrong conditions, this can cause Grail to hang, because it exposes a synchronization error: the stack unwinding caused by the read exception leaves a mutex locked. (This has since been fixed in the latest Grail, but the loConfig bug itself remains.)
DeviceClient which has not yet been constructed. The sequence of events is as follows:
loConfig) from a DeviceClient object (say LO1) which does not yet exist.
LO1 client
LO1 client to getValues(), to populate parameter cache
LO1 to notify reportComplete
reportComplete was prematurely falling through. Thus, Grail was attempting to use a value before it was actually received from the manager. When this happens, what follows depends on the type of parameter. For all numeric parameters, an improbable value might be returned, such as a NaN for a voltage, etc. But no exception will be thrown. For an enum, an exception will be thrown, because a bad value means that the findName routine (of class EnumerationParser in the DataDescription library) will fail. Thus the problem is not confined to enums, just made very visible by them.
reportComplete.
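The log excerpt above makes this concrete: the four data bytes `aa aa aa aa`, read as a signed 32-bit integer, are exactly the -1431655766 that the enumeration lookup rejected. The sketch below decodes those bytes both ways. The enumeration table and `find_name` are hypothetical stand-ins for the DataDescription lookup; since all four bytes are identical, the byte order chosen does not affect the result.

```python
import struct

raw = bytes.fromhex("aaaaaaaa")      # the four bytes shown in the log

as_int = struct.unpack(">i", raw)[0]
as_float = struct.unpack(">f", raw)[0]

# Hypothetical enumeration table; the real names come from the DataDescription library.
ENUM_NAMES = {0: "first", 1: "second"}


def find_name(value):
    """Stand-in for the enumeration lookup: unknown values are an error."""
    if value not in ENUM_NAMES:
        raise KeyError("no enumeration name for %d" % value)
    return ENUM_NAMES[value]


print(as_int)      # prints -1431655766
print(as_float)    # a tiny negative number, accepted without complaint
```

Calling `find_name(as_int)` raises, which corresponds to the visible enum failure, while the same bytes read as a float simply yield an improbable value and no error.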
-- RamonCreager - 19 Mar 2004
Revision r1.20 - 13 Dec 2006 - 21:26 GMT - RamonCreager