david.pippenger

data_engine/qdump/dr. nimbus/fixqueue issues

Discussion created by david.pippenger on Apr 6, 2010
Latest reply on Apr 7, 2010 by keith_k

Basically, my problem is that I have a fairly large hub layer (190 hubs), and on occasion one of those hubs has an issue such as a full disk. When this occurs, the QoS data seems to become partially truncated in a way that still lets it pass through the hub queues. When it reaches the data_engine, the data_engine decides there is an issue, claims to pop the message, and then restarts. This leaves us with an ever-growing queue, since the data_engine spends more time restarting than processing its queue.

 

So our solution was to try to fix the queue. First we tried qdump and found it totally broken: it fails to pop messages from the queue, with the errors shown in the log at the end of this post. Next we tried fixqueue and were again thwarted, since it does nothing at all without the -v option, which spews endless debug output to the screen and makes it painfully slow. Even then, the resulting "fixed" queue seems to contain the same erroneous messages that cause data_engine to restart.

 

I have also tried using Dr. Nimbus to diagnose and fix the queue, but it simply pops one message, fails to parse its contents, and then promptly crashes.

 

If I'm using outdated versions of these tools, support has not been able to confirm it: they indicated that the versions of fixqueue and qdump I have, from around 2008, are the most current they have available. qdump reports version 1.00, and fixqueue has no version check flag. My Dr. Nimbus is version 1.5.3, which as far as I can tell is current as well.

 

So if anyone has more current versions of these tools, I would appreciate being pointed in the right direction. It really feels like I just have out-of-date tools. I would also welcome any ideas on how to repair my queues so I don't have to keep dropping 20+ GB chunks of QoS data whenever an issue arises.

 

--Dave P.

 

Here is the log from qdump. The odd part is that hubpost_bulk looks like a valid hub request, yet qdump rejects it with "Command 'hubpost_bulk' not in command-list" and fails to pop the message.

 

Apr  2 13:26:29:426 qdump: ****************[ Starting ]****************
Apr  2 13:26:29:442 qdump: CONNECT: 325498(1904) 127.0.0.1/1753->127.0.0.1/48000
Apr  2 13:26:29:442 qdump: sockWrite: first 20 bytes of buf =
Apr  2 13:26:29:442 qdump:  127.0.0.1/1753->127.0.0.1/48000 (141):
Apr  2 13:26:29:442 qdump: SREQUEST: probe_checkin ->127.0.0.1/48000
Apr  2 13:26:30:441 qdump:  127.0.0.1/1753<-127.0.0.1/48000 (442):
Apr  2 13:26:30:441 qdump: RREPLY: status=OK(0) <-127.0.0.1/48000  h=37 d=386
Apr  2 13:26:30:441 qdump: CLOSE: 325498 127.0.0.1/1753
Apr  2 13:26:30:441 qdump: CONNECT: 325090(1884) 127.0.0.1/1763->127.0.0.1/48000
Apr  2 13:26:30:441 qdump: sockWrite: first 20 bytes of buf =
Apr  2 13:26:30:441 qdump:  127.0.0.1/1763->127.0.0.1/48000 (121):
Apr  2 13:26:30:441 qdump: SREQUEST: gethub ->127.0.0.1/48000
Apr  2 13:26:30:441 qdump:  127.0.0.1/1763<-127.0.0.1/48000 (357):
Apr  2 13:26:30:441 qdump: RREPLY: status=OK(0) <-127.0.0.1/48000  h=37 d=301
Apr  2 13:26:30:441 qdump: CLOSE: 325090 127.0.0.1/1763
Apr  2 13:26:30:457 qdump: CONNECT: 325090(1880) 127.0.0.1/1764->127.0.0.1/48000
Apr  2 13:26:30:457 qdump: sockWrite: first 20 bytes of buf =
Apr  2 13:26:30:457 qdump:  127.0.0.1/1764->127.0.0.1/48000 (121):
Apr  2 13:26:30:457 qdump: SREQUEST: gethub ->127.0.0.1/48000
Apr  2 13:26:30:457 qdump:  127.0.0.1/1764<-127.0.0.1/48000 (357):
Apr  2 13:26:30:457 qdump: RREPLY: status=OK(0) <-127.0.0.1/48000  h=37 d=301
Apr  2 13:26:30:457 qdump: CLOSE: 325090 127.0.0.1/1764
Apr  2 13:26:30:457 qdump: contacting HUB at 10.75.27.240:48002
Apr  2 13:26:30:457 qdump: CONNECT: 325090(1900) 10.75.27.240/1765->10.75.27.240/48002
Apr  2 13:26:30:457 qdump: sockWrite: first 20 bytes of buf =
Apr  2 13:26:30:457 qdump:  10.75.27.240/1765->10.75.27.240/48002 (166):
Apr  2 13:26:30:457 qdump: SREQUEST: subscribe ->10.75.27.240/48002
Apr  2 13:26:31:316 qdump:  10.75.27.240/1765<-10.75.27.240/48002 (54):
Apr  2 13:26:31:316 qdump: RREPLY: status=OK(0) <-10.75.27.240/48002  h=37 d=0
Apr  2 13:27:06:077 qdump:  10.75.27.240/1765<-10.75.27.240/48002 (4096):
Apr  2 13:27:06:077 qdump:  10.75.27.240/1765<-10.75.27.240/48002 (156):
Apr  2 13:27:06:077 qdump: got MSG on session 0x325090
Apr  2 13:27:06:077 qdump: RREQUEST: hubpost_bulk <-10.75.27.240/48002  h=215 d=4016
Apr  2 13:27:06:077 qdump: DISPATCHING hubpost_bulk   , user-data (4016 bytes)
Apr  2 13:27:06:077 qdump: Command 'hubpost_bulk' not in command-list
Apr  2 13:27:06:077 qdump: sockWrite: first 20 bytes of buf =
Apr  2 13:27:06:077 qdump:  10.75.27.240/1765->10.75.27.240/48002 (55):
Apr  2 13:27:06:077 qdump: SREPLY: status = 11(command not found) ->10.75.27.240/48002
Apr  2 13:27:09:684 qdump:  10.75.27.240/1765<-10.75.27.240/48002 (226):
Apr  2 13:27:09:684 qdump: got MSG on session 0x325090
Apr  2 13:27:09:684 qdump: RREQUEST: _close <-10.75.27.240/48002  h=208 d=0
Apr  2 13:27:09:684 qdump: DISPATCHING _close         , user-data (0 bytes)
Apr  2 13:27:09:684 qdump: Command '_close' not in command-list
Apr  2 13:27:09:684 qdump: sockWrite: first 20 bytes of buf =
Apr  2 13:27:09:684 qdump:  10.75.27.240/1765->10.75.27.240/48002 (55):
Apr  2 13:27:09:684 qdump: SREPLY: status = 11(command not found) ->10.75.27.240/48002
Apr  2 13:27:09:684 qdump: sockCloseErr: 325090 10.75.27.240/48002 socket disconnected
Apr  2 13:27:09:684 qdump: got ERROR on session 0x325090, errorcode: 10053
Apr  2 13:27:09:684 qdump: The subscriber-channel was disconnected by HUB
Apr  2 13:27:09:684 qdump: CLOSE: 325090 10.75.27.240/1765
Apr  2 13:27:09:684 qdump: CONNECT: 326a60(1904) 127.0.0.1/2070->127.0.0.1/48000
Apr  2 13:27:09:684 qdump: sockWrite: first 20 bytes of buf =
Apr  2 13:27:09:684 qdump:  127.0.0.1/2070->127.0.0.1/48000 (147):
Apr  2 13:27:09:700 qdump: SREQUEST: port_unregister ->127.0.0.1/48000
Apr  2 13:27:09:700 qdump:  127.0.0.1/2070<-127.0.0.1/48000 (54):
Apr  2 13:27:09:700 qdump: RREPLY: status=error(1) <-127.0.0.1/48000  h=37 d=0
Apr  2 13:27:09:700 qdump: CLOSE: 326a60 127.0.0.1/2070
Apr  2 13:27:09:700 qdump: nimEnd
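
For anyone who wants to sift through a qdump log like the one above without reading every socket line, here is a rough Python sketch that splits the text on its timestamps and keeps only the entries reporting a rejected command or a non-OK status. It assumes nothing beyond the log format visible in this post; the function names split_entries and find_failures are made up for illustration and are not part of any Nimsoft tool.

```python
import re

# Every entry in the log above starts with a stamp like "Apr  2 13:26:29:426 qdump: ".
STAMP = re.compile(r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}:\d{3} qdump: ")

# Matches "status=OK(0)", "status = 11(command not found)", "status=error(1)".
STATUS = re.compile(r"status\s*=\s*(\w+)\(")


def split_entries(raw):
    """Split a (possibly flattened) qdump log into (timestamp, message) pairs."""
    starts = [m.start() for m in STAMP.finditer(raw)]
    entries = []
    for begin, end in zip(starts, starts[1:] + [len(raw)]):
        stamp, _, message = raw[begin:end].partition(" qdump: ")
        # Collapse runs of whitespace left over from the original layout.
        entries.append((stamp, " ".join(message.split())))
    return entries


def find_failures(entries):
    """Keep entries that reject a command or carry a status other than OK."""
    failures = []
    for stamp, message in entries:
        status = STATUS.search(message)
        rejected = "not in command-list" in message
        if rejected or (status and status.group(1) != "OK"):
            failures.append((stamp, message))
    return failures


# A short excerpt copied from the log in this post.
sample = (
    "Apr  2 13:27:06:077 qdump: RREQUEST: hubpost_bulk <-10.75.27.240/48002  h=215 d=4016 "
    "Apr  2 13:27:06:077 qdump: Command 'hubpost_bulk' not in command-list "
    "Apr  2 13:27:06:077 qdump: SREPLY: status = 11(command not found) ->10.75.27.240/48002 "
    "Apr  2 13:27:09:700 qdump: RREPLY: status=error(1) <-127.0.0.1/48000  h=37 d=0"
)

for stamp, message in find_failures(split_entries(sample)):
    print(stamp, "|", message)
```

Run against the full log, this should narrow things down to the two "not in command-list" rejections, their status 11 replies, and the final port_unregister error, which are the lines that matter here.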
