Monday, October 13, 2014

What A Week

Last week I was "on-call" for work.  That meant I was responsible for watching our monitoring systems and problem queue, working problems as they arose if possible, and coordinating efforts when it was something I needed help resolving.  The first couple of days were pretty slow: a couple of failed power supplies in systems with redundant power, no biggie.

Thursday I got a call that users in our Mechanicsburg office were experiencing a lot of performance degradation.  A quick check of their primary MPLS circuit (from Level3) showed a lot of packet loss.  We have BGP configured to switch them over to Centurylink if Level3 fails, but the circuit hadn't actually dropped, so we forced it - shut BGP to Level3, and opened a problem ticket with them.  

A short time later, our monitoring tools reported trouble reaching a router in Williamsport - another Level3 circuit, this time the backup circuit, normally only used when connecting to that one router.  We began thinking Level3 was having a bigger issue.  But before we could contact them to add the info to our ticket, we heard users in Harrisburg were having performance issues.  Level3 again, and the primary circuit - so we shut BGP there, forcing them over to a backup circuit from Verizon.  Finally we got the Level3 ticket updated with all the circuit information and waited for their response.

About 3:00PM a bunch of us were supposed to go out to celebrate a teammate's birthday.  Right when I got to the bar, the phone rang - network admin requested to look at an application issue.  So I went back in and launched into one of those 3-hour marathon sniffer sessions.  Fun!  I finally got out about 6:00PM and headed home.

On the way home I got a text message from Bank of America - fraudulent charge suspected on my debit card, please call or log in to online banking to check.  Peachy.  As I walked into my house around 7:00PM, my cellphone rang - a guy at work who was going to swap some potentially bad GBICs on a fiber link wanted me to make sure we had traffic off the link.

I decided to call back from my landline because cell coverage at home is spotty.  I picked up the phone, and...no dial-tone.  Luckily I still had DSL service.  I got logged in, called him from my cell, and got that one worked.

In the meantime I opened a chat session with the phone company's tech support.  They wanted me to swap phones or try the test jack outside the house.  No good - I didn't have a spare phone, and the one I did have was a cordless that requires power for the base station.  I would have to wait until I could get another phone on Friday to find out if it was my problem or the phone company's.

Finally I logged into BoA's web site.  Yep, somebody tried to access my account from a Publix supermarket down in Florida.  Of course as soon as I marked the charge fraudulent, BoA promptly canceled my debit card and notified me it would be 5 - 7 days to get a new one.  You just have to love the modern world, right?  I checked my wallet - $5 cash, maybe with that and the change I keep in the jar at work I would be able to eat on Friday.

Friday morning, we had an email from Level3 waiting for us.  They had found a problem with a core router serving a bunch of their customers in the northeast, and routed around it.  After talking it over with my director and teammates, we decided to keep Mechanicsburg and Harrisburg on their backup circuits for the day and watch the Level3 circuits.  If everything held up we would re-enable BGP over Level3 sometime Friday night.

Two hours later the Verizon circuit to Harrisburg died.  Just plain died.  And with BGP shut over the Level3 circuit, they were cut off completely.  We dialed into a modem on an emergency backup router and got BGP going again on Level3 to get them back online.  Total time of that outage was maybe 5 minutes.

Friday afternoon rolled around and I got talked into trying another social outing.  But just when it was time to leave, I got asked to look at another issue - a file transfer running over a point-to-point circuit between Florida and Pennsylvania was running slow.  In fact, it had been running slow all week, but no one had asked for help until Friday afternoon.  AAAUUUGGGHHH!  So another night not getting off until 6:00PM, not getting home until 7:00PM.  And to make it more interesting, it looked like there was packet loss going from us to the remote site - on a Level3 circuit.  Not MPLS, true, but another Level3 circuit in Pennsylvania?  They claimed to have routed around their other issue, but at this point we were getting gun-shy about putting anything else on their network if we didn't have to (Harrisburg notwithstanding).

On the way home I stopped at Target and bought a plain-old telephone that doesn't need external power.  When I got home I plugged it in inside the house - no dial-tone.  I took it out to the box outside - no dial-tone.  Ok, it's the phone company's problem.  I went in to do another online chat session with tech support, but now I had no DSL.

I got on the cell phone to call the phone company and halfway through one of the half-dozen prerecorded messages, the call dropped.  I dialed back, worked my way through the menus - and got dropped listening to the same message.  Now, they say that doing the same thing over and over and expecting a different result is one definition of insanity.  I must be insane, because I tried a third time.  And got dropped during the same message.  Finally I called in and just kept hitting "0" on every menu and eventually got a live person.  Of course, all they could tell me was they didn't see any trouble in my area, couldn't call my house phone (duh) and couldn't see any signal from my computer.  That, and they couldn't send anyone to the house to fix it during the weekend unless I paid.  Otherwise I would have to wait until Monday for a visit from a tech (I was still on-call for the weekend), and I would have to stay home from work to meet the tech or they wouldn't come (despite the fact that the issue was clearly NOT inside my house).

So today is Monday.  The tech came.  They had moved my circuits last week to a new switch and somehow failed to configure my service.  

The good news is, I'm not on-call again for about 7 weeks.  

Yeesh!
