
Wednesday, November 30, 2011

From the Annals of Cyber Monday


When it's too late to take the stitch in time that would have saved nine ...

I've had better Mondays. Coming off the Thanksgiving weekend I spent my re-entry day dealing with two critical problems where I work. I started thinking that instead of "Cyber Monday" it should be called "Blue Monday."

The first issue was that one of our printers, an expensive laser photo-imaging device, was down. Indications were that one or more of its lasers were not working (it has red, green, and blue ones). There are also three computers driving this beast. The first prepares images to be printed; the second is dedicated to feeding them to the printer, because it's a delicate operation that requires tight synchronization. Neither of those systems had any problems, but the one that sends the images lit up with a warning screen about the bad lasers.

A call to tech support for the printer reminded me to check the third computer, which is tucked away inside the machine and normally not used by the operator. Unlike the first two computers, which run HP Unix, this shy unit runs a command-line-only version of ... you'll never guess ... Windows 98! Which kind of shows how old the whole thing is. Anyway, this computer has a job to do in controlling the lasers, and it was simply powered off.

But why? It's supposed to come on when you start the machine. Turned out the little button battery that keeps the CMOS memory alive when it's powered down had finally run out of juice. (That's where all the BIOS settings are saved.) After all, it was at least 10 years old. Apparently when that computer failed to boot up properly the system just shut it down. Tech support coached me on how to put the BIOS settings in manually so we could get up and running, and we sent out for a replacement battery. For want of a $6 part the whole thing was dead. Problem solved.

In the middle of all this the second problem came up. One of our Mac users couldn't get on the network. I couldn't find any obvious reason. Then a Windows user came to report that he had the same problem -- in fact "the whole network" was down. Well, I knew that wasn't true because I'd just been on the network at my desk. And the Mac user confirmed that at least some of the other Macs were not having any problem. While I was looking into the network settings on the Windows machine it suddenly started working again. So, maybe a temporary glitch? But no, the other computers were still having problems.

We now had a select group of Windows PCs, Macs, and even one Unix machine exhibiting the connection problem, scattered across our two buildings. It couldn't be a network switch, because most likely that would have affected all the computers connected to it, but I restarted the switches anyway. No help. We tried our wireless access point and could connect to it, but could not access the Internet, so it seemed to be having the same problem -- it couldn't connect to the network gateway to the Internet. It seemed like it had to be something on the server, but the server appeared to be working normally and to have Internet access itself. Time to call in our friendly IT support company, which does all the heavy lifting on our server issues.

Meanwhile I noticed that all the machines having the problem had similarly misconfigured IP addresses. Our network uses 10.1.1.x addresses, and these all had 192.168.3.x addresses. I tried manually configuring one of them with a correct address. It still didn't work, but now we were thinking about DHCP.
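For anyone who would rather let the computer do the squinting, the check I was doing by eye can be sketched in a few lines of Python. This is only an illustration: the /24 mask is my assumption (all I really know is that our addresses look like 10.1.1.x), and the machine names and addresses in the sample are made up.

```python
import ipaddress

# Assumption: "10.1.1.x" means a /24 network; the real mask isn't stated here.
OFFICE_NET = ipaddress.ip_network("10.1.1.0/24")

def misconfigured(hosts):
    """Return the names of hosts whose address lies outside the office network."""
    return [name for name, addr in hosts.items()
            if ipaddress.ip_address(addr) not in OFFICE_NET]

# Hypothetical sample data, not our real inventory.
sample = {
    "mac-design": "192.168.3.4",
    "pc-frontdesk": "10.1.1.57",
    "unix-rip": "192.168.3.9",
}
print(misconfigured(sample))
```

Run against a real list of machines, something like this would have pointed at the 192.168.3.x stragglers right away.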

[If you want to know, that stands for Dynamic Host Configuration Protocol, which is simply a procedure by which a server can control the addresses of the computers on its network. When each computer starts up it shouts down the hallway, "Hey, I'd like to join the network -- what address can I use?" Everyone else on the network ignores this request, but the server shouts back something like "10.1.1.203."]
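If the hallway picture helps, here is a toy version of it in Python. It is a sketch of the idea only, not the real protocol: actual DHCP is a four-message exchange (discover, offer, request, acknowledge) with leases that expire, and none of that is modeled. The class name, the pool range, and the client names are all invented for illustration.

```python
import ipaddress

class ToyDhcpServer:
    """Hands out one address per client from a small pool. Not real DHCP."""

    def __init__(self, network, first_host, last_host):
        base = int(ipaddress.ip_network(network).network_address)
        # Addresses still available, lowest first.
        self.free = [ipaddress.ip_address(base + n)
                     for n in range(first_host, last_host + 1)]
        self.leases = {}  # client id -> address already handed out

    def request(self, client_id):
        """Answer a client's shout: reuse its old address or take the next free one."""
        if client_id in self.leases:
            return self.leases[client_id]
        if not self.free:
            return None  # pool exhausted, nobody answers
        addr = self.free.pop(0)
        self.leases[client_id] = addr
        return addr

# Hypothetical pool of 10.1.1.200 through 10.1.1.203.
server = ToyDhcpServer("10.1.1.0/24", 200, 203)
print(server.request("mac-1"))
print(server.request("pc-7"))
```

The point to notice is that the clients take whatever answer comes back. Nothing in the exchange tells them whether the machine shouting back is the one that is supposed to.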

The first thing the support tech told me was that the DHCP service was not running on the server. Aha! That made sense. Only those computers that had recently rebooted, or whose addresses had to be renewed, were having connection issues, because they couldn't get an address on the network. All those that had been left on over the weekend were still running normally. So, just as simple as starting the DHCP service up again, right?

Not really. Because we didn't know what had caused it to shut down. And we didn't know where those bogus IP addresses were coming from. Normally if a computer asks for an address and can't get one, it either ends up with no address or falls back to a self-assigned 169.254.x.x one; it doesn't pick up an address from some other private range. The network engineer deduced that something else on the network -- either an unauthorized device or a piece of malware -- was acting as a DHCP server. If that was the case, then it would cause the Windows server to stop its own DHCP service to avoid conflicts. And it would explain where all those similar IP addresses were coming from.

We had not added any new equipment, so the only thing to do was try to find out which machine was the culprit. Time to play detective. The engineer began unplugging one system at a time from the patch panel in the server room. After each one he tried to connect to the network with his iPad. As soon as he was unable to get an address, he knew he had unplugged the offending device. The answer was port 18 on the patch panel. Using our sketchy documentation of our wiring plan we located the guilty party.
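The engineer's procedure is a plain one-at-a-time elimination, and it only works because our own DHCP service was stopped, so any address handed out had to be coming from the impostor. A minimal sketch in Python, assuming each cable is plugged back in before the next one is pulled (I don't actually remember whether he did that). The function name and the callback are hypothetical stand-ins for "try to get an address on the iPad."

```python
def find_culprit_port(ports, gets_address_without):
    """Return the first port whose removal stops addresses being handed out.

    ports: patch-panel port numbers, in the order they are unplugged.
    gets_address_without: hypothetical callback; given the port currently
        unplugged, returns True if a test client still obtains an address.
    """
    for port in ports:
        if not gets_address_without(port):
            return port  # pulling this cable silenced the rogue server
    return None  # every port tried and addresses kept coming

# Simulated 24-port panel with the rogue machine on port 18.
culprit = find_culprit_port(range(1, 25), lambda port: port != 18)
print(culprit)
```

In the worst case this means pulling every cable on the panel, which is about how it felt.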

If your prejudices are like mine, you probably think that either someone had plugged in their own router somewhere, or else one of the Windows machines had gotten a virus. But surprise -- it was a Mac Pro workstation.

So now we knew where and what, but not how. Why would one of the Macs suddenly decide to start acting as a DHCP server? Consulted by phone, a Mac guru identified the problem by that "3" in the IP address, which is not commonly used. It's an address range used by the Internet connection sharing service on the Mac, by means of which one computer can share its network access with a group of other computers, and which works by acting as a DHCP server.

But why would this service suddenly be switched on? We questioned the users -- had they made any changes, had network problems that they tried to solve? No, but, now that we mentioned it, the only unusual thing about that Mac was that over the weekend it had been moved from one desk to another, so it was plugged into a different port on the wall. But why should that change anything?

Of course, it shouldn't. But the Mac Pro has two ethernet ports on the back, which can be configured differently. Sure enough, the port the cable had originally been plugged into did NOT have Internet connection sharing enabled, while the second port did. All that had happened is that when the computer was moved the network cable had been switched to that second port, with all the resulting complications. Normally it would not have been a problem, and everything would have "just worked" as Macs are famous for. But somewhere in the past someone must have configured that second port for some long-forgotten reason, and it had emerged to bite us.

They say you're always supposed to learn from your mistakes, but that implies that you can figure out what your mistakes were. What we learned from this is that the littlest things -- like that bad CMOS battery -- can have far-reaching implications. In an organization of any size, where the network must run smoothly, even a small glitch can bring everything to a standstill. In this case, we spent several hours with production impaired while we tried to find something that should have been apparent as soon as it happened.

If an IT support person -- me, for example -- had moved the Mac instead of one of the users, then the problem still might have come up. After all, there was a 50/50 chance that I might plug into the "wrong" ethernet port, and even if I were careful about testing the connection the problem might not have been noticeable on that computer because it was the one causing the problem. But at least I would have known that this was the last thing that had changed on the network, and I could have started looking there. I might have just disconnected it, and when that solved the problem we could have left it disconnected until we had a chance to get down to the source of the problem. Instead, since I was working in the dark, we had to take the long way around to the solution.

What else could we have done? Well, if time travel were an option, I would suggest going back to when that second ethernet port was configured to support some external device. Then we could document what was done, and why. We might even label the port as being dedicated to a specific purpose in the hopes this might prevent someone in the far future from trying to use it to connect to the network.

The moral, I guess, is that the future arrives sooner than we think, and the past -- in which all mistakes happen -- is right now.



