Thursday, August 28, 2008

SQL Server Data Services Beta Encounters First Major Unscheduled Downtime

• Update 8/28/2008: Added link to SSDS Blog Post of 8/27/2008 and comment to this post.

Update 8/27/2008: Moved from LINQ and Entity Framework Posts for 8/25/2008+ due to size and updated.

08/26/2008 07:04 PDT: The SQL Server Data Services (SSDS) team reported by e-mail what appeared to be unscheduled downtime. The message consisted of two instances of “Maintenance Notification” and nothing else.

08/26/2008 07:16 PDT: SSDS was reported to be recovering and functioning.

08/26/2008 07:54 PDT: SSDS was reported to be healthy. (The delay between the start of the downtime and the message timestamp isn’t known.) The message said “Details to follow.”

08/26/2008 08:14 PDT: The preceding message appears to have been a false alarm, because this message followed:

We are experiencing an unplanned downtime of the service. We are working with utmost urgency to correct the problem and to bring the service back up.

When:
START: 08/26/2008, 06:00am
END: Unknown

Impact Alert:
SSDS is unavailable for all beta users

08/26/2008 08:45 PDT: Here’s the error message from my SSDSNwindEntitiesCS project (see Updated SQL Server Data Services (SSDS) Test Harness: Northwind REST and SOAP Uploads of 7/27/2008):

[Screenshot of the error message]

The promised 09:00 and 10:00 messages reported that SSDS was still down. At about 10:45 PDT, the error message changed to:

[Screenshot of the changed error message]

The 12:00 PDT message read “We are experiencing internal network issues, thereby leading to an unplanned downtime of the service,” and the client again returned the earlier error message.

08/26/2008 14:10 PDT:

Status: Resolved 08/26/2008 01:30PM PDT

We have resolved the issue. The service has been restored and is fully functional.

We were able to isolate the problem, to an internal network issue that was preventing our servers from properly communicating. The issue has since been resolved and the service is back online.

The SSDSNwindEntitiesCS test harness is running as expected except for a hang that occurred when attempting to download the list of containers and entities on opening.

The problem reported was similar to that experienced by Amazon in their last major outage on July 20, 2008, as quoted here from CNet’s Amazon S3: For now at least, sometimes you have to reboot the cloud article of 7/21/2008:

"As a distributed system, the different components of S3 need to be aware of the state of each other. For example, this awareness makes it possible for the system to decide which redundant physical storage server to route a request to. We experienced a problem with those internal system communications, leaving the components unable to interact properly, and customers unable to successfully process requests. After exploring several alternatives, the team determined it had to take the service offline to restore proper communication and then bring service online again. These are sophisticated systems and it generally takes a while to get to root cause in such a situation," Amazon said. "We will be providing our customers with more information when we've fully investigated the incident."

The SSDS team is to be commended for keeping testers up to date on the progress of bringing the service back up. Their response contrasted with Amazon’s failure to alert customers to its February outage, as reported in Amazon Web Services Outage: Causes And Remedies of 2/16/2008.

• Update 8/28/2008: Soumitra Sengupta responds in a comment to this post.

The SSDS is unavailable for all beta users? (2008-08-26 06:00 PDT) [Resolved] thread in the SQL Server Data Services (SSDS) - Getting Started forum recounts Mike Amundsen’s problems during the outage.

• Update 8/28/2008: Soumitra Sengupta’s We experience our first major unscheduled downtime post of 8/27/2008 concludes:

The team is hard at work figuring out definitively what the root cause was and we will report back our findings and provide more details soon. Trust can only be built through transparency and consistently delivering on your promise.

4 comments:

Anonymous said...

Thanks Roger for commending the team. Truth be told, we did learn from the AWS/S3 outage on what not to do. You know how we like to go quiet but this was certainly not the time to be quiet. Appreciate you calling this out.

Roger Jennings (--rj) said...

@Soumitra,

What's ironic is that I was in the process of completing a "TechBrief" on SSDS for Redmond Developer News when I received the first few messages.

The tech review will be in the September 15 issue.

Anonymous said...

Yikes, that is bad timing on our part. Sorry about that.

Roger Jennings (--rj) said...

@Soumitra,

Bad timing, indeed, but I didn't mention it in the "TechBrief."

Cheers