Just one day…
I rolled into work this morning a few minutes later than I had planned. Oh well.
The guy who was supposed to let me in was running late in any case. I got bored and went around to the warehouse to see if I could get into the front offices.
Yup. Through the loading dock, past the sales offices, up the front stairs, past the accounting… I even remembered my office key.
They gave me a key to my office, but not one to the front door. So how am I supposed to deal with off-hours emergencies? Whatever. If I don’t have a key, they can’t complain that I didn’t take care of the problem.
I dropped my crap onto my desk and immediately logged into the cluster. First task was to back up the existing databases. I had written a script to determine the database names and back everything up. But where was it? I think it’s on the Intranet server. Why is the Intranet server offline? Time to head to the server room.
I find a very amusing sight. Due to the failure of one of our A/C units, we now have two industrial-size portable A/C units in the front area of the server room. One exhaust hose is stuck out of the window in the server room door. It was actually hanging from a coat tree, held in place with a printer cable. Very funny. I was around when the glass was removed yesterday. Very amusing to watch the head of Networking use a screwdriver to pry apart the security glass. He removed the screws from the frame on the inside, but the glass was actually glued to both frames. Let’s just say taking it apart wasn’t pretty for the door and it took a definite effort from two network engineers. It kind of looked like monkeys trying to hump a football, but that’s a totally different subject.
Anyway. The other exhaust hose was blocking the doorway. It ran out of the server room and another 10 feet across the floor toward the restrooms.
So, I wedge my way through the door frame, avoiding the ladder in place behind the door. It’s there to support the window-exhaust hose. It was rather pleasant in front of the A/C units. I fired up the Intranet server and returned to my desk. About this time the networking guy shows up. He has to reboot the phone system.
I do my backups, shut down replication and dive into applying the service pack. Everything goes smoothly. There are no problems with my install. Reboot the dormant node. Roll the cluster. Reboot the new dormant node. Log into the database. The connection string is correct for the new version. Run a test query. Everything looks good. Pull up the replication software interface. Restart mirroring on the AS400 side. Done. No problems. Restart the SQL side. Failure. Retry. Failure. Hmm… “Connection failed.” Maybe it lost its connection while the SP was applied? The install required stopping the SQL instance. Exit the application and reopen it. Restart SQL mirroring. Success.
Hmph! Cool.
I’m done. I start talking with the networking guy, discussing what I need him to back up on the cluster. He doesn’t get it. Clusters are new to him and he doesn’t grok them just yet. I shift into ‘teacher’ mode. I draw him a couple of diagrams, then show him how everything looks in RDP.
I can access any node of the cluster directly, but to leverage the power of a cluster I should use the front end, the actual cluster IP address/name. If I need the cluster to do work, I should ask the cluster to process my request and not individual nodes. If I ask individual nodes, I just defeated all of the benefits provided by the cluster. So we log into the cluster, which is a virtual connection to the active node. From a maintenance perspective this gets tricky, because I need a full-system backup of the cluster interface, and backups of the details for each node. The cluster interface exposes all of the shared aspects of a cluster. The individual nodes are still unique and discrete. So, I need those details as well.
“See? When 01 is the active node, the cluster instance shows that we are actually on 01. If I roll this over, we will be on 02. See? In both cases, I am accessing the cluster using the front-end, the cluster name/IP. Get it?”
“It takes about 15 seconds to roll the cluster when everything is working correctly.”
As for the database, there is only one way to access it.
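For anyone who prefers code to whiteboard diagrams, here is a minimal sketch of that idea: one front-end name, two nodes, and requests that always land on whichever node is active. It is a toy model, not how Microsoft clustering is actually implemented. The class, the method names and the front-end name "CLUSTER" are all made up for illustration; only the node names 01 and 02 come from the conversation above.

```python
class Cluster:
    """Toy model of an active/passive cluster behind one front-end name."""

    def __init__(self, name, nodes):
        if len(nodes) < 2:
            raise ValueError("need at least two nodes to roll the cluster")
        self.name = name              # hypothetical front-end name
        self.nodes = list(nodes)      # the individual, discrete nodes
        self.active_index = 0         # the first node starts out active

    @property
    def active(self):
        """The node currently answering for the cluster name."""
        return self.nodes[self.active_index]

    def roll(self):
        """Roll the cluster: the next node becomes the active one."""
        self.active_index = (self.active_index + 1) % len(self.nodes)
        return self.active

    def handle(self, request):
        """Clients talk to the cluster name; the active node does the work."""
        return f"{self.active} handled {request!r} via {self.name}"


cluster = Cluster("CLUSTER", ["01", "02"])
before = cluster.handle("SELECT 1")   # served by 01
cluster.roll()                        # roll over to 02
after = cluster.handle("SELECT 1")    # same front end, now served by 02
print(before)
print(after)
```

The caller never changes what it connects to. Only the node behind the name changes, which is why asking an individual node directly throws away the benefit.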
“That’s one way to present a cluster. There is another way and this is actually how I built up my Beowulf cluster…”
The model is similar to a corporate network diagram. There are 3 zones to a cluster: the outside, the cluster interface and the background workers. A Beowulf cluster presents a single machine to the user. You log in, submit a job and it returns the results. And you as the user have no idea what really happened. What is happening is that the main machine, the Alpha, presents you with a single user interface. (I always use wolf pack terms to describe my clusters.) The members of the cluster are hidden behind the scenes: wolf(1), wolf(2)… wolf(n). Usually, I have an Omega that doesn’t actually do any work. It monitors the pack so I can watch performance stats in real-time. This machine is usually the crappiest one of the lot. It doesn’t need to provide real power, so why waste the resource?
The minimum size of a Beowulf cluster in this configuration is 3: Alpha, Worker/Wolf (1), and Omega. A corporate network will have 3 zones as well: the open internet, the DMZ, and the internal network. A machine in the open ‘net can’t see a computer on the internal network and all requests for work have to pass through the DMZ machine.
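A rough sketch of the wolf pack layout, assuming a simple round-robin hand-off from the Alpha to the workers (the scheduling policy is my assumption here, and real Beowulf clusters use proper job schedulers and message passing). The class and attribute names are invented for the example, and the job runs locally as a stand-in for remote execution on a wolf.

```python
from collections import Counter
from itertools import cycle


class Pack:
    """Toy model: Alpha in front, wolves doing the work, Omega watching."""

    def __init__(self, worker_count):
        if worker_count < 1:
            raise ValueError("a pack needs at least one worker")
        self.wolves = [f"wolf({i})" for i in range(1, worker_count + 1)]
        self._next_wolf = cycle(self.wolves)   # assumed round-robin policy
        self.omega_stats = Counter()           # Omega only records, never works

    def submit(self, job, *args):
        """Alpha: the single entry point the user sees."""
        wolf = next(self._next_wolf)
        result = job(*args)           # stand-in for running the job on `wolf`
        self.omega_stats[wolf] += 1   # Omega notes which wolf took the job
        return result

    def size(self):
        """Total machines: Alpha + workers + Omega."""
        return len(self.wolves) + 2


pack = Pack(worker_count=1)           # the minimum configuration
total = pack.submit(sum, [1, 2, 3])   # the user just gets a result back
print(total, pack.size(), dict(pack.omega_stats))
```

The user only ever calls `submit`. Which wolf ran the job is visible to the Omega's stats and to nobody else.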
Make sense? It doesn’t really matter if it doesn’t. It’s just what I do.
Our implementation of a Microsoft cluster, at work, is similar to a Beowulf cluster from a database perspective: there is only one way for me to access the database. It is dissimilar in that I can still access any node directly.
I spend about 20 minutes explaining all of this. Of course, there were a lot more details that relate only to our implementation, plus all of the other crap I use to fill in the gaps in my lit’ model.
I ‘think’ the network dude gets it now. Hopefully, we can make some progress on this topic next week. He hasn’t been backing up my cluster correctly, so he needs to alter his plan and include the details from the individual nodes. I need to implement the new maintenance plans which manage the backup process for the databases. Together we should be able to restore the whole system if we have a catastrophic failure.
It’s done. Everything worked. Nothing blew up. And I was actually finished in less than 1.5 hours.
I ran a couple of errands, then came home. I’ve been in front of my computers for about 9 hours now, writing C# code, tweaking my database, adding new test data, creating new stored procedures, setting up a version control system so I don’t lose anything, plus doing laundry, running the dishwasher (twice), all while listening to Queensryche and Geoff Tate. Everything’s in the kitchen in my lit’ apartment so I don’t have to move very far to manage this exercise in multi-tasking.
Tomorrow, I don’t know what I am doing. Probably… I’ll have breakfast with Christine. Maybe Tony will come along. Maybe not. I want to disappear for the day. I should go down to my Mom’s but I need one day a week completely to myself. I just need to not have a schedule or pressure to do anything in particular. That’s my one day to unwind and if I don’t take it, I tend to have a shitty week. One of the joys of experience is I know what I can and can’t do these days. I can work for 6 days, but not 12. I need a day where I have the option of leaving everyone and everything behind.
Just one day…
