Test vCO Cluster Node Failure

Welcome to the vCO High Availability Lab. In this brief video, we'll step through the process of verifying that your load balanced vCO Cluster is operating as expected. In this lab you will test the vCO High Availability features in vCO 5.5. The lab environment contains a clustered vCO environment sitting behind a round robin load balancer provided by vCNS. For testing purposes, 1:1 NATs have been set up for each of the two nodes of the cluster. vco-lb.rainpole.com is the load balanced vCO Server URL. vco-1.rainpole.com is node 1 of the cluster. vco-2.rainpole.com is node 2 of the cluster.

Now let's get started by verifying that our load balanced vCO Cluster is available. Before we click on the client to log in, let's go ahead and open a pair of Putty windows. In our lab environment I've already added shortcuts here. vco-l-01a maps to node 1 of our cluster, so I'll load that up and get connected. I'll position node 1 on the bottom left of my screen... get that resized a little bit. Now I'll launch another Putty window and we'll get that connected to node 2 of our vCO Cluster.

OK, we have both of our nodes visible to us in our Putty windows here. So let's go ahead and keep an eye on our log files. In each one of the windows, type: tail -f /var/log/vco/app-server/server.log - and repeat for the other window. Great, now we can see activity that takes place on either of the two nodes of the cluster.

OK, now that we have a tail running against each one of the log files on our two nodes, we'll go ahead and start the vCO client by clicking on the Start Orchestrator Client shortcut from our browser. I'm going to go ahead and minimize our browser window here so we can still see our Putty windows. Once the vCO Client Login window comes up, confirm that you have the load balanced address in the Host Name box: vco-lb.rainpole.com:8281. Also confirm that the username is administrator@corp.local and go ahead and log in with the credentials.
Once you click on login, take note of the two Putty windows that are sitting in the background. We can see that node 1 has picked up the request from our round robin load balancer. Here we've confirmed connectivity to our cluster. We've seen that one of the nodes within the cluster has picked up and is actively running the requests.

Next, we'll confirm that the cluster resumes a workflow upon a node failure. To best illustrate the cluster behavior for a failed node, we need a long running workflow, so we're going to create one that writes System log information that shows up in the Putty windows. I'll click on my workflows tab in the vCO client and I'm going to add a new folder for my class lab. Now that I have a project folder to work with I'll go ahead and create my new workflow. I will simply call this "Long Running Workflow".

Now that the workflow is created we need to go ahead and put some logic in it. We'll start off by clicking over to the schema and we'll build out our loop. The idea here is to create a loop that is going to run a System.log() statement several times over a long period of time. That will give us an opportunity to see consistent log messages happening on the active node of our cluster, and when we kill that active node by turning it off, we should be able to see the other node of the cluster pick up where the powered off node left off.

I'll go ahead and start off my loop with a custom decision. Put that right here at the beginning. Next up, I want to put in my Scriptable task. I will use this for my System logging. Next, I will put in my Increase Counter element so that I can increment my counter. And finally I want to put in a Sleep so that I have a little bit of a delay before the next round of the loop. OK, it looks like I have all the base elements here for the loop that I want to create. Let's go ahead and actually make it a loop.
Start off by selecting the last connector line that goes to the End point and deleting that. Then I'll mouse over my Sleep element here to get the blue arrow icon. Click and drag that back over to my Custom Decision. Now let's go ahead and get these positioned so that this actually looks like a loop. I think that's good enough.

Now let me go ahead and start off with my Sleep item here. Let's take a look at that - what kind of inputs do we need? sleepTime in seconds... I'm going to set my timer to sleep for about 2 seconds. I'm going to click on Source Parameter. We don't have an existing parameter yet so I'll just accept the default name. I do want this as an Attribute so I'll leave that as an attribute and I will set a default value of 2. Click OK. Nothing is coming out, so I can go ahead and close that.

My Increase Counter - I want to go ahead and edit that because I need to have my "counter". I will go ahead and create our new counter for that - type number - and we do want an Attribute. And we want this to be initialized with a 0. Type in 0 and click OK. And if you started off with "In" then do the same for "Out", or vice versa. Since I already have my counter created I'll go ahead and just select that. And now I can close my Increase Counter.

So I've got my Sleep done and I've got my counter. Now let's do our Scriptable task here. The first thing I'm going to do is change the name to something a little more descriptive. We will call it "Display Message". Alright, coming into this - we want our counter variable to come in. Click on Select. We don't need to send anything out. This is a really, really simple workflow that we're creating here. We'll just do a System.log("Counter: "+counter); So that is going to perform a System.log - so it will generate an entry in the server.log file that we have open in the two Putty windows.
It'll start off with a label of "Counter: " and with each call in this particular workflow, whatever the value of the counter is will come after the colon. I'll go ahead and click on Close.

Now finally we need to adjust our Custom Decision here. I'll click on Edit there... In... We actually need two things to come in. One: we need our counter, so I'll go ahead and select that. And the second thing we need is our iteration count. So let's go ahead and create a new parameter. We want it to be just an Attribute, and I'm going to call this "iterations". Alright, this is how many times to run the loop. Now since I do want this to be a very long running workflow and we have a 2 second delay with each iteration of the loop, I'm going to go ahead and set our iteration count to 300. Now click OK, then click on Scripting so that we can set our logic to determine whether or not the loop should continue. Fortunately the logic is very easy here. All we have to do is return a true or a false. return counter < iterations; So as long as our counter is less than iterations the loop is going to continue running. Click on Close. Click Validate. Make sure that we don't have any errors. Then Close. So now we can Save and Close. I'll click on Continue Anyway.

Now we should be able to run our workflow. I clicked on the green Play icon to start the workflow. I'm clicking on the Logs tab here so that we can see that our counter is incrementing every 2 seconds. And in the background here we can see that node 1 is displaying our log messages - here - counter 6, counter 7 - and that's continuing to go.

Now in order to view how the cluster handles a node failure, we'll need to simulate one of our nodes crashing. So to do that I'm going to go back into the vSphere Web Client. Let me get things resized here so we can see what's going on. vSphere Web Client - I'll get logged back in with my administrative credentials.
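Stepping back to the workflow for a moment: as a rough sketch, the loop we just built could be written in plain JavaScript like this. It is only an approximation of the schema (Custom Decision, Display Message, Increase Counter, Sleep), not what vCO generates. The System object here is a stand-in so the sketch runs outside of vCO, and the sleep helper, the logLines array, and the runLongRunningWorkflow name are assumptions made up for illustration. With the lab values of 300 iterations and a 2 second sleep, the loop would run for roughly 10 minutes.

```javascript
// Stand-in for vCO's System scripting object so the sketch runs outside vCO.
// Only System.log() appears in the lab; logLines is an assumption for illustration.
const logLines = [];
const System = {
  log: function (message) {
    logLines.push(message);
    console.log(message);
  }
};

// Hypothetical helper that blocks for the given number of seconds,
// playing the role of the Sleep element.
function sleep(seconds) {
  const end = Date.now() + seconds * 1000;
  while (Date.now() < end) {
    // busy-wait; acceptable for a short demo only
  }
}

// Runs the loop: Custom Decision -> Display Message -> Increase Counter -> Sleep.
function runLongRunningWorkflow(iterations, sleepTime) {
  let counter = 0;                      // "counter" attribute, initialized to 0
  while (counter < iterations) {        // Custom Decision: return counter < iterations;
    System.log("Counter: " + counter);  // Display Message scriptable task
    counter = counter + 1;              // Increase Counter
    sleep(sleepTime);                   // Sleep for sleepTime seconds
  }
  return counter;
}

// Small demo values; the lab uses iterations = 300 and sleepTime = 2.
const finalCounter = runLongRunningWorkflow(3, 0);
```

The decision runs before each pass, so the first message logged carries the initial counter value of 0 and the last one carries iterations minus 1.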
Once I've logged in I'll click on VMs and Templates. Select the correct Virtual Center Server: vc-l-01b. I'll expand that out - we have our two vCO nodes here: vco-l-01a and 02a, and node 1 matches up to 01a and node 2 matches up to 02a. Since node 1 is currently our active node in the cluster processing that long running workflow, that's the one I want to simulate the failure on. Now I'll right click on that particular VM and select Shutdown Guest OS. I'll click Yes to confirm shutdown.

Now we can see in the background here that the active node is still processing the loop. We have 63, 64, 65, 66, 67, 68 - in the meantime this operating system has already gotten the signal to go ahead and shut down. So over the next couple of minutes we'll see a number of messages indicating that the OS is stopping... now the workflow engine has stopped. Now the cluster is using its heartbeat to determine that one of the nodes has stopped. And we see that node 2 has indeed picked up with the counter at 78, 79, 80, 81.

So here we see that one of the nodes went down and the other node has indeed picked up. It looks like it started up at 78 - if I scroll back up here on node 1 - 77 was the last entry that we had on node 1 and 78 was the first entry that I saw here on node 2. So we didn't lose any iterations of our loop. That concludes our vCO Cluster test.
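The comparison we just did by eye, the last counter on node 1 against the first counter on node 2, could also be done with a small script. This is a minimal sketch in plain JavaScript, assuming the log lines have been copied out of the two Putty windows into arrays. The function names and the sample lines are hypothetical, and real server.log entries carry timestamps and other fields around the "Counter: " text.

```javascript
// Pulls the counter values out of log lines that contain "Counter: <n>".
function extractCounters(lines) {
  const counters = [];
  for (const line of lines) {
    const match = /Counter: (\d+)/.exec(line);
    if (match) {
      counters.push(Number(match[1]));
    }
  }
  return counters;
}

// Compares the failed node's last counter with the surviving node's first
// counter and lists any values that fall in between (lost iterations).
function findFailoverGap(failedNodeLines, survivingNodeLines) {
  const before = extractCounters(failedNodeLines);
  const after = extractCounters(survivingNodeLines);
  if (before.length === 0 || after.length === 0) {
    throw new Error("both logs must contain at least one Counter entry");
  }
  const last = before[before.length - 1];
  const first = after[0];
  const missing = [];
  for (let n = last + 1; n < first; n++) {
    missing.push(n);
  }
  return { last: last, first: first, missing: missing };
}

// Hypothetical excerpts that mirror the values seen in the video.
const node1Lines = ["Counter: 76", "Counter: 77"];
const node2Lines = ["Counter: 78", "Counter: 79"];
const result = findFailoverGap(node1Lines, node2Lines);
console.log(result.missing.length === 0
  ? "no iterations lost"
  : "missing: " + result.missing.join(", "));
```

An empty missing list matches what we saw in the lab: node 1 stopped at 77 and node 2 started at 78.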