Saturday, February 25, 2012

missing notifications

Hello,

We are building a simple distributed application around Service Broker in which each queue is serviced by a Windows service. For each of what we call "request" queues, we have a Windows service running constantly as a "listener" on an event queue. When a message arrives on the request queue, the listener service instantiates a reader to process the request.

As we are in development, things get out of sync at times and the event queues sometimes stop getting the event notification message. I can usually correct this by emptying the queues, receiving all messages from the "request" and "event" queues, but sometimes it becomes necessary to drop and recreate the broker objects. Can someone tell me what the failure points of event notifications are and what corrective actions are needed at each of them?

For instance, if I see that a queue has a state of "Notified" in dm_broker_queue_monitors, I just do a receive to drain the appropriate queue. I have been unable to determine what state the service broker is in when I have to delete and recreate the event notification so I don't know what is causing it and how to programmatically determine when it is necessary to take this action. We would of course prefer not to get to this state but since it can happen during development, I would guess this can happen in production so I need to develop the recovery logic to detect and correct for this condition.

Jim

If you consume the notification message from the 'events' queue but fail to launch the reader child thread/process that consumes the messages from the 'request' queue, you will end up in exactly the situation you described. Once a notification is sent, the queue monitor goes into the 'Notified' state until a RECEIVE occurs on the queue that triggered the event. It does NOT continue to send notifications; otherwise it could end up sending notifications forever to a service that has stopped listening.

The most likely cause of the problem is that you are failing to launch the child thread/process when you receive the notification. See my reply on the previous thread ('5 times a charm' http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=202594&SiteID=1) for a suggestion on how to process messages when the processing involves some inherently unreliable work (like launching a thread or a process).
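To see which state each queue monitor is currently in, a query along these lines can help. It is only a minimal sketch: it assumes it is run in the database that owns the queues, and the column aliases are illustrative.

```sql
-- One row per queue monitor in the current database.
-- state is expected to be INACTIVE, NOTIFIED or RECEIVES_OCCURRING.
SELECT q.name                   AS queue_name,
       m.state                  AS monitor_state,
       m.tasks_waiting,
       m.last_activated_time,
       m.last_empty_rowset_time
FROM sys.dm_broker_queue_monitors AS m
JOIN sys.service_queues AS q
  ON q.object_id = m.queue_id
WHERE m.database_id = DB_ID();
```

A row that stays in NOTIFIED while messages are sitting in that queue suggests that the notification was consumed but nothing has issued a RECEIVE against the queue since.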

Also, make sure to check the 'External Activator' sample at http://www.gotdotnet.com/codegallery/codegallery.aspx?id=9f7ae2af-31aa-44dd-9ee8-6b6b6d3d6319 , it does exactly what you want.

HTH,
~ Remus

|||

Hi Remus,

No, there is something else going on. When I get this condition, I manually do a receive in query analyzer until I clean out the "request" queue. When I have emptied the queue completely and then do a new send, the messages show up in the request queue but I get no notification event. If I drop and recreate the event notification, I will then see the event message on the next request message.

Jim

|||

Does any error show up in the notifications error queue [EventNotificationErrorsQueue]?

What is the state of the notification (it will be in sys.event_notifications) and of the dialog that carries the notification (it will be in sys.conversation_endpoints)?

HTH,
~ Remus

|||

Hi Remus,

I don't see a state in sys.event_notifications, but all of the event notifications I created are listed there (6). As for sys.conversation_endpoints, I do see some conversations in a disconnected state, but when we have our problem, I have a stored proc that goes through sys.conversation_endpoints and ends all these conversations with cleanup. Even after doing that, and with all queues apparently empty, I have had to recreate the event notification.

I don't have a scenario to recreate this reliably so I just have to wait until we muck things up enough for it to happen and I am looking for what things I need to check. FYI, we are really using this as a monolog, the use pattern is begin dialog, send, end dialog. I have a periodic task that looks for missing task completions and just resends the request message if a task fails to complete.

Jim

|||

You need to investigate what those disconnected conversations are. Are they initiators or targets? What 'far_service' do they have? What near service (service_id)? Are they system (is_system)? Is the state 'DO' or 'DI'?
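Those questions can be answered with a single query against sys.conversation_endpoints. The following is a sketch rather than a definitive diagnostic; it assumes it is run in the database where the disconnected conversations show up, and the aliases are illustrative.

```sql
-- List disconnected endpoints with the properties asked about above.
-- 'DI' = disconnected inbound, 'DO' = disconnected outbound.
SELECT ce.conversation_handle,
       ce.conversation_id,
       ce.is_initiator,           -- 1 = initiator side, 0 = target side
       s.name       AS near_service,
       ce.far_service,
       ce.is_system,
       ce.state,
       ce.state_desc
FROM sys.conversation_endpoints AS ce
JOIN sys.services AS s
  ON s.service_id = ce.service_id
WHERE ce.state IN ('DI', 'DO')
ORDER BY ce.far_service, ce.conversation_id;
```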

If you end the dialog that carries the notifications you are basically canceling the notification. Make sure you don't end the dialogs on the 'events' queue, that would cause the problem of having to drop and recreate the notification.

You definitely shouldn't need a periodic task that resends the message; avoiding such problems is the whole purpose of Service Broker.

One problem with the begin/send/end pattern is that you get no information on whether the message was delivered. In general, a better pattern is to do begin/send and to do the end on the initiator side as a response to the end from the target side.

HTH,
~ Remus

|||

Hi Remus,

No, I am not ending the dialog on the events queue, at least not directly or intentionally. Are there errors that would cause the dialog to end?

Yes, I realize that I could have used the service broker to monitor for task completion by keeping a conversation active and responding with a completion message. However, I would still need some other task responsible for deciding the task wasn't going to complete (service died, someone shut down that server, network unplugged etc) even though the message was successfully received, and generating a new request. I'm not looking for a message failure but a task failure in a service on any system connected to a Service Broker queue.

I'm relatively sure that the disconnected conversations are due to us developers debugging our services and not allowing processing to complete. While debugging, we are doing sends out of query analyzer manually, we are stopping and starting our code that reads the queues, etc. The more we learn, the more stable things are getting.

So where can I find a list of things to work through when notifications stop coming? For example, I saw the state of one of my queues was NOTIFIED a couple of days ago and nothing I could do would get it back to INACTIVE. I tried clearing all conversations, made sure all of my queues were empty etc but I wound up having to drop and recreate the event notification.

Jim

PS - Aside from these little problems we are creating for ourselves, SSB is awesome!!

|||

Hi Jim,

I'm really glad you like SSB! I'm a big fan myself ;-)

There is not much I can do to help you in this case without a repro case or a more precise description of the problem. Rushi might have some additional ideas; he knows the activation machinery and queue monitors better than I do.

For your development environment, there is one big switch button that you can use:

ALTER DATABASE [dbname] SET DISABLE_BROKER;
ALTER DATABASE [dbname] SET ENABLE_BROKER;

This will basically reinitialize everything broker related in the database, including queue monitors and such. Of course, this is not a production solution, as it requires an exclusive lock on the database (no users connected).

To investigate what's happening, check the following:

- is the notification dialog still active? look in sys.event_notifications and in sys.conversation_endpoints
- is the notification message lingering on the sender's database? look in sys.transmission_queue
- check state of the notification dialog on the target side (the conversation_id is the same on both sides). Compare the send_sequence_number from the initiator with the receive_sequence_number from the target
- make sure that if the monitoring service has consumed a notification, it actually launched a thread or a process that did a RECEIVE from that queue.
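The first checks in this list could be sketched as the two queries below. Several assumptions are baked in: the notification dialog is matched to its event notification by comparing far_service with the notification's target service name, which is a heuristic and not an official link between the two views; the queries are run in the database where the event notification was created; and, as far as I know, the catalog columns for the sequence numbers are named send_sequence and receive_sequence.

```sql
-- 1) Is there still a dialog for each event notification?
--    A NULL conversation_id means no matching endpoint was found.
SELECT en.name          AS notification_name,
       en.service_name  AS target_service,
       ce.conversation_id,
       ce.state_desc,
       ce.send_sequence,
       ce.receive_sequence
FROM sys.event_notifications AS en
LEFT JOIN sys.conversation_endpoints AS ce
  ON  ce.far_service COLLATE Latin1_General_BIN
      = en.service_name COLLATE Latin1_General_BIN
  AND ce.is_initiator = 1;

-- 2) Are notification messages stuck on the sender's side?
--    transmission_status usually explains why a message has not left.
SELECT tq.to_service_name,
       tq.from_service_name,
       tq.enqueue_time,
       tq.message_type_name,
       tq.transmission_status
FROM sys.transmission_queue AS tq
ORDER BY tq.enqueue_time;
```

To compare both sides of the dialog, the conversation_id from the first result can be looked up in sys.conversation_endpoints in the target database.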

HTH,
~ Remus

|||

Jim Stallings wrote:

No, I am not ending the dialog on the events queue, at least not directly or intentionally. Are there errors that would cause the dialog to end?

I think these are all the conditions that could end the notification dialog from the target side:
- explicit END DIALOG, or END DIALOG .. WITH ERROR
- ALTER DATABASE [...] SET ERROR_BROKER_CONVERSATIONS
- Sender service is denied SEND permission on the target service. Can only happen on the first message (so you won't get any notification at all)
- restore of target database to a point back in time when the target did not yet exist. At the next notification message sent the target will reply with an error that will end the dialog.
- END DIALOG ... WITH CLEANUP. It will wipe out the target without actually notifying the initiator, so the initiator will be errored out only the next time it sends a notification message, same as above.

ALTER DATABASE [...] SET NEW_BROKER is a special case: it will wipe out the target without notifying the initiator, but since the broker is completely erased, the initiator can never find the old broker again to deliver the message. If the initiator happens to be in the same database as the target, then the initiator is wiped out as well, so there is no problem.

Just for the record, there are two more conditions that could end the dialog, but they cannot happen on notifications:
- sending an incorrectly formatted XML message, this cannot happen on notifications since notifications message bodies are outside your control
- dialog timeout, but notification dialogs have the maximum timeout, so it cannot happen until sometime in 2074.

HTH,
~ Remus

|||

Ending the dialog at the notification service: You could check the [EventNotificationErrorsQueue] to find error messages in case the dialog was closed due to some error. Also, you said that you have a stored proc that deliberately ends dialogs at the notification service; are you sure this is not being called somehow?

Programming pattern: Could you explain what the application is and why you are choosing the fire-and-forget (begin/send/end) pattern? I did not understand what you mean by 'task wasn't going to complete due to server shutdown, network unplugged, etc'. The reliable transport of Service Broker guarantees that messages will be delivered even if you run into such problems. The standard pattern we recommend is that the initiator begins a dialog, sends the request and goes about its own business (maybe accepting the next request from the user and doing another begin/send). Sometime in the future, the message is delivered and a queue reader receives the message, processes it and sends back a response. If the initiator doesn't really need any extra information, the response could be as simple as ending the dialog. A background thread or a periodic program on the initiator side receives the end dialog response and ends the dialog on its side.

Task failures: If the request cannot be processed immediately, then we come across some interesting problems. A typical example is that the request generates an HttpRequest which could fail. You could roll back the transaction, but then you are going to get the message back almost immediately the next time you do a RECEIVE. If you roll back 5 times, your queue is going to get disabled. A common solution is that if the request cannot be processed immediately, you log it to a table and begin a conversation timer to retry the request after some X units of time.
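The log-and-retry idea could look roughly like the sketch below. Everything named here is hypothetical: the dbo.PendingRequests table, the dbo.RequestQueue queue and the message type name are made up for illustration, and the sketch assumes the conversation is still open on the target side when the timer is started.

```sql
-- Hypothetical bookkeeping table, created once:
-- CREATE TABLE dbo.PendingRequests (
--     conversation_handle uniqueidentifier NOT NULL PRIMARY KEY,
--     message_body        varbinary(max)   NULL,
--     attempts            int              NOT NULL);

DECLARE @handle uniqueidentifier, @type sysname, @body varbinary(max);

BEGIN TRANSACTION;

WAITFOR (
    RECEIVE TOP (1)
        @handle = conversation_handle,
        @type   = message_type_name,
        @body   = message_body
    FROM dbo.RequestQueue                -- hypothetical queue name
), TIMEOUT 5000;

IF @handle IS NOT NULL AND @type = N'//example.com/PrintRequest'  -- hypothetical type
BEGIN
    -- Suppose the external work (for example the HTTP call) just failed.
    -- Instead of rolling back, record the request and ask for a wake-up.
    INSERT INTO dbo.PendingRequests (conversation_handle, message_body, attempts)
    VALUES (@handle, @body, 1);

    BEGIN CONVERSATION TIMER (@handle) TIMEOUT = 60;   -- seconds
END;

COMMIT TRANSACTION;
```

Because the transaction commits, the failed attempt does not count as a rollback against the queue. When the timer expires, a http://schemas.microsoft.com/SQL/ServiceBroker/DialogTimer message arrives on the same conversation, and the reader can look up the saved request by conversation_handle and try again.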

My queue is stuck in NOTIFIED state: The NOTIFIED state indicates that we delivered the notification and now it's your job to perform a RECEIVE on the queue. A correctly written app will never orphan the queue in the NOTIFIED state but will always try to schedule something that eventually comes around and does the RECEIVE we've been waiting for. The RECEIVE moves the queue monitor state machine to the RECEIVES_OCCURRING state. When the queue is drained and the last queue reader releases the conversation group lock (i.e. commits or rolls back the transaction doing the receive), we notice that there is nothing in the queue and take the queue monitor back to the INACTIVE state. There is no way to reset the queue monitor from NOTIFIED directly to INACTIVE. You can only move it from NOTIFIED to RECEIVES_OCCURRING. It's only when the queue is drained that we go back to the INACTIVE state.
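For a development box, draining a queue by hand so that the monitor can get back to INACTIVE might look like the sketch below. The queue name dbo.RequestQueue is hypothetical, and note that the loop simply discards whatever it receives, so it is not something to run against messages you care about.

```sql
-- Development-only sketch: receive and discard until the queue is empty.
DECLARE @rows int;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    -- Each RECEIVE returns messages from a single conversation group,
    -- so several iterations may be needed even for a short queue.
    RECEIVE TOP (100)
        conversation_handle,
        message_type_name,
        message_body
    FROM dbo.RequestQueue;               -- hypothetical queue name

    SET @rows = @@ROWCOUNT;

    COMMIT TRANSACTION;
END;
```

Messages belonging to conversation groups that are locked by another open transaction will not be returned, so a debugger session holding a transaction open can leave the queue looking non-empty after the loop finishes.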

INACTIVE is not EMPTY: INACTIVE does not really mean queue is empty. We could also go into INACTIVE state when all messages in the queue belong to conversation groups that are currently locked by some transaction or the other. In such a scenario, it doesn't make sense to activate a new queue reader since it will be unable to RECEIVE any messages. So INACTIVE is really the union of EMPTY and NO_UNLOCKED_MESSAGES state.

The external activator that we've released is a good example of how to write the kind of app you are trying to build. You should certainly take a look at that. We will be releasing an update to that shortly.

|||

Hi Rushi

This application uses a highly modified version of the ServiceBrokerInterface project from the activator example so yes, I have been through your examples many times including the PDC 2005.

As to what this project is, it is a print queue that gets kicked off by a periodic stored proc that examines a legacy database for new jobs. When a job is found, it is put into an active print table and two Windows services receive a message to begin their processing. One does an HttpRequest to an outside vendor; the other creates a PDF from the data in the print table.

These may or may not be on the same server as other processes, but they are definitely not on the server running Service Broker. Yes, I could have done a rollback/retry as you suggest, but regardless, if we have a failure, we have to send this information to some controlling process to decide whether it's a one-time failure that should be retried, a single request that fails every time due to bad data and should be logged or retried, whether we should send an alert to the users or sys admin, etc.

We also need to send the controlling process a success message when a task completes so that it can begin the next step: send to printer, archive files, etc. I just elected to send completion messages directly to the controller process rather than using a Service Broker conversation. Using a conversation, I would have to use activation for each initiator queue, which would mean starting another service for each queue, because I have built the application so that my event listener Windows services do a receive with WAITFOR, which lets them check for shutdown and/or configuration changes between receives. Whether I use Service Broker conversations to encapsulate task completion or just fire and forget, as you describe it, I still wind up with a controlling process listening to some queue.

With a more standard implementation as you suggest, my controller would listen to all task queues; in mine I have a dedicated controller queue that all the tasks report completion to. In any case, since I will have services running on multiple systems, I do need a timer thread running in my controller to see if a service has stopped processing a queue and take whatever actions are called for: alerts, redirect to another queue, etc. That was its primary function, and I wanted the logic for handling this not to be tied up in completing a conversation.

Thanks for the explanation on the queue state. When I was seeing this problem, it wasn't that the queue was inactive when I still had messages, it was that it was in Notified state and I couldn't receive from the notification queue. This can result from the developer stepping through the listener code and not completing the transaction but the lock should release either when he stopped and the connection terminated or when the command timed out so I'm at a loss as to what else to look at when this occurs.

Thanks

Jim

|||

There is a well-known race condition that can cause similar symptoms. The race occurs when the following conditions are met:

1) a given queue-reader is configured to a given max (e.g. 1)

2) there are currently max queue-readers running

3) a notification is delivered to the notification queue.

If these conditions occur while the queue reader(s) are still running but in the process of shutting down, it is quite likely that the notification queue will be drained but the messages in the app queue will not be processed, leaving the app queue in the NOTIFIED state and the notification queue in the RECEIVING state.

To deal with this the activating process should keep record of notifications received while the max queue readers condition is in effect and kick another off (just in case) when the # of queue readers falls below the max.

This is what the current version of our external activator sample does. (I'm not sure if this one is posted to the web yet or not).

I'm not sure if this condition is what you are seeing or not, but thought it worthy of calling out for both you and others reading this thread anyway.

-Gerald

|||

I believe that I have finally tracked down my problem. When I do an "end conversation with cleanup", my event notification gets dropped. I thought that since I still had an entry in sys.dm_broker_queue_monitors, that the event notification was active. I didn't see anything in the help for "end conversation" that led me to believe that the cleanup option should do this.

Thanks,

Jim
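A cleanup routine that avoids this could skip any conversation whose far service is the target of an event notification. The following is only a sketch under assumptions: matching on far_service is a heuristic, and the notification, queue and service names in the second part (RequestQueueActivated, dbo.RequestQueue, EventService) are hypothetical.

```sql
-- End stale conversations WITH CLEANUP, but leave alone any dialog
-- whose far service is the target of an event notification.
DECLARE @handle uniqueidentifier;

DECLARE stale CURSOR LOCAL FAST_FORWARD FOR
    SELECT ce.conversation_handle
    FROM sys.conversation_endpoints AS ce
    WHERE ce.state IN ('DI', 'DO')
      AND NOT EXISTS (
            SELECT 1
            FROM sys.event_notifications AS en
            WHERE en.service_name COLLATE Latin1_General_BIN
                  = ce.far_service COLLATE Latin1_General_BIN);

OPEN stale;
FETCH NEXT FROM stale INTO @handle;

WHILE @@FETCH_STATUS = 0
BEGIN
    END CONVERSATION @handle WITH CLEANUP;
    FETCH NEXT FROM stale INTO @handle;
END;

CLOSE stale;
DEALLOCATE stale;

-- If a notification dialog was already cleaned up, recreating the
-- notification starts a fresh one (all names here are hypothetical).
DROP EVENT NOTIFICATION RequestQueueActivated ON QUEUE dbo.RequestQueue;

CREATE EVENT NOTIFICATION RequestQueueActivated
    ON QUEUE dbo.RequestQueue
    FOR QUEUE_ACTIVATION
    TO SERVICE 'EventService', 'current database';
```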
