Queue issues

Blog Post created by jonhcw on Dec 23, 2015

Recently I've come across issues with my integration message queues getting stuck on the hub. At first I dismissed it as one of those random issues with corrupted queue files that you get occasionally. But then it happened again. And again. Reviewing my code that handles PDSs didn't reveal any glaring issues with unhandled exceptions or anything of the sort, so I decided to investigate the queue files themselves more closely. This is all based on my own investigation; none of it is official information from CA.


Backup your stuck queues

When you've got a queue that is not being processed, you likely want to get it going as quickly as possible. If there are no obvious problems with the probe and its dependencies (like database access for data_engine), one of the things you often end up doing is just resetting the queue, which you can do from the hub GUI. It is a good idea, however, to first make a backup of the queue files so you can investigate the issue and possibly repost the messages. The way to go about it is to navigate to your %nimdir%\hub\q directory. There you will find a list of subdirectories (if you're not on a legacy version), named after the queue they represent. You'll want to copy the whole folder of the queue you're backing up: these days queues consist of multiple files if they are large enough, each containing a chunk of the queue data. Do not just go into the folder and rename the files: when you empty the queue from the GUI, all files under the respective directory will be deleted, so be sure to move the entire folder to a completely different location.
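The backup step can be sketched in a few lines. Here's a minimal Python sketch; the function name and all paths are my own placeholders, so substitute the values for your environment:

```python
import shutil
from pathlib import Path

def backup_queue(nimdir: str, queue_name: str, backup_root: str) -> Path:
    """Copy a whole queue directory (all of its chunk files) somewhere safe.

    Copying the directory as a unit matters: emptying the queue from the
    hub GUI deletes every file under <nimdir>/hub/q/<queue_name>.
    """
    src = Path(nimdir) / "hub" / "q" / queue_name
    dst = Path(backup_root) / queue_name
    shutil.copytree(src, dst)
    return dst

# Example call (placeholder paths):
# backup_queue(r"C:\Program Files (x86)\Nimsoft", "my_queue", r"D:\queue_backups")
```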


Investigate and fix

You can use the qtool made by CA to get some information on the queue files: get some stats, display messages, mark messages as read, and a few other things. For some reason the tool hasn't really been made publicly available, but you can grab it at qtool experience or ask support.


Qtool is a handy little tool for getting some quick information and reposting messages, but it's not comprehensive. For example, if the queue contains messages for multiple subjects, you can't choose to repost just some of them.


When it's not enough

If you can't get what you need with the tools provided, you'll have to go to the source and open up the queue files yourself. As with most such files, I usually start out with Notepad++.


The structure of a queue file

The queue files are pretty simple. When opening a queue up in Notepad++, you'll notice a bunch of boxy characters with text such as "NUL", "ACK", "SOH" and some others with some plain text thrown in there.



PDS is a NUL-delimited structure, meaning all fields in it are separated with the control character '\0', which is NUL. A PDS row entry consists of four fields: "key", "type of value", "length of value", "value data". For example: nimid<NUL>7<NUL>10<NUL>abcd12345<NUL> (the declared length is 10 because it counts the nine value characters plus the terminating NUL).
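Based on that layout, a row can be pulled apart mechanically. Here's a minimal Python sketch of my own reading of the format (not an official parser), assuming the declared length includes the value's terminating NUL, as in the nimid example above:

```python
def parse_pds_row(buf: bytes, pos: int = 0):
    """Parse one PDS row: key, type, and length are NUL-terminated text,
    followed by the value. Assumes the declared length includes the
    value's terminating NUL."""
    fields = []
    for _ in range(3):
        end = buf.index(b"\x00", pos)
        fields.append(buf[pos:end].decode("ascii"))
        pos = end + 1
    key, vtype, length = fields[0], int(fields[1]), int(fields[2])
    value = buf[pos:pos + length - 1].decode("ascii")  # drop the trailing NUL
    return (key, vtype, value), pos + length           # next row starts here

row, nxt = parse_pds_row(b"nimid\x007\x0010\x00abcd12345\x00")
print(row)   # ('nimid', 7, 'abcd12345')
```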



A queue file basically consists of a bunch of PDSs, with some additional bytes thrown in to separate them from each other and denote the status of each message. Your queue file should start with a sequence like one of these:

-- read message
-- unread message


When you see such a sequence, it always means that a new message is beginning. If the first character is a NUL, the message has been read: it has already been processed. When all the messages in a queue file are in this state, the hub can remove that queue file. If the first character is an EOT, the following message has not been processed. After this sequence (ending in four SOHs) there will always be exactly eight more characters before the actual PDS begins. I don't know the meaning of the first two characters here (after the SOHs), but in my use cases I haven't found them to bear any relevant purpose.


With this knowledge, you can parse messages from the queue file yourself.
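As a sketch of that framing (again, my own interpretation, not official): given the offset where a header begins, the status and the PDS offset fall out directly. NUL is 0x00, EOT is 0x04, SOH is 0x01.

```python
SOH4 = b"\x01\x01\x01\x01"   # the four SOHs that end a message header

def header_info(data: bytes, hdr_start: int):
    """Return (status, pds_offset) for the header beginning at hdr_start.

    Per the description above: the header's first byte is NUL (read) or
    EOT (unread), the header ends with four SOHs, and exactly eight bytes
    of unknown meaning follow before the PDS data begins.
    """
    status = {0x00: "read", 0x04: "unread"}.get(data[hdr_start], "unknown")
    soh_end = data.index(SOH4, hdr_start) + len(SOH4)
    return status, soh_end + 8

# Synthetic example only -- real headers may carry extra bytes between
# the status byte and the SOHs:
data = b"\x04" + SOH4 + b"\x00" * 8 + b"nimid\x007\x0010\x00abcd12345\x00"
print(header_info(data, 0))   # ('unread', 13)
```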


The problem at hand

The effect

The issue with my custom probe no longer being able to read the queue can be seen with qtool: it will essentially show the key of one field as part of the value of the previous field. To get a better understanding of the issue, I wrote a couple of custom parsers to troubleshoot the queue files. In short, it turns out there are a lot of messages that do not adhere to the PDS format described above. It all boils down to messages that contain a field of type '16', which is the FLOAT type in PDS. There are messages like the following:


-- the first NUL is the end of the previous row's value field. What follows is invalidly formatted, since the value field is not NUL-terminated; "used" is actually the key of the next row.
-- really problematic invalid formatting


Bear in mind that I'm reading the queue with a probe made with the .NET SDK. These are partial PDSs from problematic messages. The first one does not cause a crash, but the value of "diff" will be clipped at the end and will be 4235. I'm not certain what happens, but I assume it reads size - 1 characters from the value field, since you don't need NUL termination in .NET strings, after which the string is converted to a double. The second one is the really problematic kind of message, as it throws the probe into a loop, trying to read the same message over and over again. Again, I'm not sure exactly what happens, but I suspect there's an index or format violation. The really unfortunate thing is that the SDK "eats" the exceptions thrown and logs nothing.
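To illustrate my guess at what happens with the first kind of message, here's a small Python sketch. The byte values are hypothetical (only fragments of the real messages are shown above); the point is how a length-driven read both clips the value and happens to land the parser on the next key:

```python
def read_value(buf: bytes, pos: int, length: int):
    # My guess at the SDK's behavior: read length - 1 characters (.NET
    # strings need no NUL terminator) and advance the full declared length.
    return buf[pos:pos + length - 1].decode("ascii"), pos + length

# Hypothetical malformed fragment: a float field whose value lost its
# terminating NUL, so the declared length (5) equals the number of value
# bytes actually on disk.
buf = b"diff\x0016\x005\x0042350used\x007\x00"
value, pos = read_value(buf, 10, 5)   # the value field starts at offset 10
print(value)              # '4235' -- the last character is clipped
print(buf[pos:pos + 4])   # b'used' -- the parser lands on the next row's key
```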


The cause

In my experience, most of these ill-formed messages come from the ews_response probe. More specifically, with ews_response the issue seems to occur when the value of a float field is actually an integer and the dot is removed by the SDK. I've also reproduced the same issue with a simple .NET application, whereas a similar application made with the C SDK works properly. So I suspect the issue lies within the Java/.NET SDKs, or at least some versions of them, and possibly the probe framework too.

Update: currently it seems like this is all due to a bug in the .NET SDK. I cycle a lot of messages through a probe made with .NET, and it seems to be this one that breaks the float fields. The ews_response alarms seem to be alright before they actually hit that part.


The solution

Well, there are no good options yet, in my case. You can alleviate the issue by breaking all subjects down into their own queues, in which case you shouldn't lose all functionality should an issue arise with one particular message. You could then also take down just the affected queue and reset it, try to fix it with qtool, and repost the rest of the messages. If the message volume for that subject is high, you're still likely to lose some messages, unless you shut down the hub altogether while you operate on the queues. Breaking down to a queue per subject does add complexity on the probe side, though, and I'd rather not have a bunch of queues, as the probes can subscribe to quite a few.


While I'm waiting for CA to confirm and fix the issue, I'm writing a somewhat more sophisticated (or just more targeted) tool that can dig out and remove the problematic messages, and also repost messages depending on their subject, age, etc.