Entire Network Down - 100% Network Utilization - Please Help!

By link470 ·
Alright guys. So as you can imagine, network goes down, I'm hoping to get this resolved soon. Here's what happened.

It's 12:30 P.M. [Do YOU know what your network's doing?]. Everything is running great. 3 main switches in the server room, 48-port managed switches [Dell PowerConnect 3348s], and all is well. Everything functions normally, all servers are online, all desktops are happily happenin'. I go out to receive a shipment of 50 new machines and start piling them outside my office. Next thing I know, I have lots of requests saying the entire network is down and nobody can access anything. I quickly head back to the server room wondering if a UPS went down, if a server restarted, if a switch turned off, anything. But what do I see? Absolutely nothing out of the ordinary. Everything is functioning great.

But it's not. I can't get on the internet. I can't ping ANY computers, I can't remote desktop into the servers, the RDCs I DO have up with servers all fail, and everything is extremely slow. So I call the school board office. They head over with their handy $18,000 Fluke meter. They plug it into one of our switches, it measures our network, and it quickly throws back at us 100% network utilization. The guy from the board office goes WHOA!!! I've never EVER seen it that high before.

So we try swapping the first switch in the stack on the suspicion that it may be bad. We put in a 3448, Dell's next model of the 48-port 10/100 PowerConnect switch, and take out the 3348. We use patch cables to link them together in a chain setup and see if that works.

In the end, the switch swap [lol] did nothing. I still can't ping any machine in the school or get out of the network. I checked our main router, and it's functioning normally. I restarted the servers and they all appear to be functioning normally. So I think to myself: what would cause 100% network utilization? I noticed that ONE ping got through, but only 1 out of 4 pings got a reply. So I knew the infrastructure itself was probably OK, but I had the assumption that something was looping back.

So off I go around the network, documenting every single wall jack and port in the school [took 7 hours] and checking EVERY switch we have for any type of loopback that might be possible, like a jack plugged into a switch and another port on that switch plugged into another jack. I also shut down every machine and every network printer I came across so the network would essentially have nothing to broadcast [except for powered NICs, but since the machines aren't on, there's less of a chance of anything malicious running on them]. Nothing in any of the labs was like that. All the jacks had either a direct connection to a PC, or a connection to a switch that contained only other connections to PCs, not back to a wall jack.
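For anyone facing the same 7-hour audit: once the cabling is written down as device pairs, a few lines of code can flag a loop automatically. This is just a hypothetical sketch (the device names are made up), using union-find to spot any cable that connects two already-connected devices:

```python
# Hypothetical loop audit: after documenting every cable as a
# (device_a, device_b) pair, union-find flags any cable that
# closes a loop between devices that are already connected.
def find_loops(cables):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    loops = []
    for a, b in cables:
        ra, rb = find(a), find(b)
        if ra == rb:
            loops.append((a, b))   # this cable creates a loop
        else:
            parent[ra] = rb        # otherwise merge the two segments
    return loops

# Made-up example: a wall jack wired back into another wall jack.
cables = [("jack-101", "lab-switch"), ("lab-switch", "pc-1"),
          ("lab-switch", "jack-102"), ("jack-102", "jack-101")]
print(find_loops(cables))  # -> [('jack-102', 'jack-101')]
```

Of course this only catches loops you can see in the documentation; a loop hiding inside an undocumented spare switch (as it turned out here) still needs eyes on the hardware.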

So here I am guys. It's the weekend, Friday night, school's not in for the weekend, so I've got 2 days to work and go in for some overtime. Anyone have any suggestions of what to try next? Thanks a ton, you guys. I really appreciate this community and hope we can get this resolved!

Network Notes:
-Windows NT-based network running Windows Server 2003 servers and 250 XP Professional client stations.
-6 Windows Server 2003 servers total.
-3 main switches in the server room, 7 others around the school. All checked thoroughly and restarted.

Take care and have a good weekend. We have multiple IPs at the school, so since I can't access the internet inside the school because of the extreme lag, if I go in I'll plug into the main external switch connected to the modem and grab an IP for my laptop so I can check in on answers.

Thanks guys!


All Answers


Not sure if this is the issue

by Jacky Howe In reply to Entire Network Down - 100 ...

but it sure sounds like it. Check out this: Identifying a Broadcast Storm (500/c8bandut.htm)


This might help you...

Please post back if you have any more problems or questions.



by Churdoo In reply to Entire Network Down - 100 ...

If you don't know how to read or manipulate the management tools of your managed switches, you can still physically isolate parts of your network to narrow down the source of the problem.

You can physically unplug all uplinks from the main switch and check to see if the utilization within the main switch returns to normal. At the same time, check the utilization at the disconnected uplinks and see where the high utilization remains.

By successively isolating each switch and checking its utilization, you can quickly isolate the switch or switches closest to the source of the problem. At the same time, you can quickly return unaffected segments to production status while you work on affected segment(s).

At a given switch, you can unplug individual ports until utilization at the switch drops, or, in the event of a virus outbreak where multiple machines may be infected, unplug all switchports and reconnect them one at a time until utilization jumps. Note that depending on the switch and many other factors, it can take 30-60 seconds or more for the host on a given switchport to reconnect and resume what it was doing before the disconnect.

You get the picture.
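The reconnect-one-at-a-time search above can be sketched in a few lines. This is only an illustration: the "utilization check" is simulated by a callback, where in real life that step is you reading the Fluke meter or the switch's port counters:

```python
# Sketch of the successive-isolation procedure described above.
# is_storming is a stand-in for "check utilization": it takes the set
# of currently connected ports and reports whether the storm is back.

def find_storm_source(ports, is_storming):
    """Reconnect ports one at a time; return the first port whose
    reconnection makes utilization jump (the likely storm source)."""
    connected = set()
    for port in ports:               # start with everything unplugged
        connected.add(port)          # reconnect one port...
        if is_storming(connected):   # ...then re-check utilization
            return port
    return None                      # storm source isn't on this switch

# Toy example: pretend port 14 is the one looped back into the switch.
storm_port = 14
ports = range(1, 49)                 # a 48-port switch
print(find_storm_source(ports, lambda c: storm_port in c))  # -> 14
```

The same idea works one level up: treat each uplinked switch as a "port" of the main switch, which is exactly the bisection Churdoo describes.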


Good ideas

by Grey Hat Geek In reply to Entire Network Down - 100 ...

The broadcast storm and isolation ideas are excellent starting points. Here is a good article on broadcast storms...

Do you have any traffic logs that may show a spike in network traffic?


Thanks! Fixed It.

by link470 In reply to Entire Network Down - 100 ...

Thank you all for your replies. Much appreciated! I ended up taking a laptop into work, ran Wireshark, and found a TON of packets, like, in the 100,000 range almost instantly. I ended up separating our switch stacks, isolated it to one switch, and that switch was looped into another switch...twice. Everything is back up and running after disconnecting just one of those cables.

What's strange is, I think it's been like that for quite a while and nothing ever happened before. I may be wrong, but does this sound possible? As of now, the entire network is up and running again, and I thank you all so much for your support and quick suggestions and replies. I'm just chillin' at home now, very happy, but still wondering if it's possible that there could have been a delay and that the broadcast storm didn't catch on till later. The setup was that the main switch [switch 1 of the 3 switches connected together via gigabit uplinks] was plugged into a spare 4th switch down below that the previous tech had used because there weren't enough places to plug things in [the patch panel had more ports than the switches in that room could support]. Only 4 things were plugged into it: 2 were from patch panel locations connecting wall jacks around the school, and 2 were the redundant connections plugged into switch 1; after removing 1 of those, everything worked again.

Any ideas if it's possible for a delay to happen, with it not really getting to this point until now? Any idea what triggered it so suddenly to become problematic?

Either way, it's up and running. Thanks a ton!
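For what it's worth, the triage Wireshark makes easy here boils down to: tally broadcast frames by source MAC, because in a storm the same few frames circle the loop endlessly and a handful of sources dominate the capture. A toy sketch with made-up MACs and a synthetic capture:

```python
from collections import Counter

# Hypothetical storm triage: count broadcast frames per source MAC.
# In a loop-driven storm, one or two sources dwarf everything else.
BROADCAST = "ff:ff:ff:ff:ff:ff"

def top_broadcast_talkers(frames, n=3):
    """frames: iterable of (src_mac, dst_mac) pairs from a capture."""
    counts = Counter(src for src, dst in frames if dst == BROADCAST)
    return counts.most_common(n)

# Synthetic capture: a single ARP request amplified 100,000 times by
# the loop, plus a little normal background chatter (MACs invented).
frames = [("00:11:22:33:44:55", BROADCAST)] * 100_000
frames += [("aa:bb:cc:dd:ee:01", BROADCAST)] * 3
print(top_broadcast_talkers(frames))
```

In Wireshark itself the equivalent move is filtering on the broadcast destination and sorting the conversation statistics, no scripting needed.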


I've never used two cables to connect a switch

by Jacky Howe In reply to Thanks! Fixed It.

I know not to now.

Really glad to see that your problem is sorted. :)


Now what you need to do is list the connections.

Write down which cable goes to which port/switch/patch panel and router/switch, and put this into a diagram so if it does happen again you will have an instant map of the connections. Also mark up the cables that go to each port (if possible), either by number or letter, so that you can identify them on the diagram. A sort of backup plan, but on paper. Nice to know everything is working well.

Please post back if you have any more problems or questions.


STP may have been originally set up

by Michael Kassner Contributor In reply to Thanks! Fixed It.

Not sure if STP could have been the reason, but Spanning Tree Protocol setups often use multiple trunks between switches to balance the load and provide redundancy. That may have been the original intent, and the configuration was altered or corrupted, which then created the loop. This Cisco article explains STP if you are interested.
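To illustrate the core of what STP computes (a toy model with invented switch names, not the school's actual topology, and ignoring real STP's bridge priorities and path costs): build a spanning tree of the switch graph and treat every link left out of the tree as blocked. That is how a redundant second cable to a spare switch gets neutralized instead of forming a loop:

```python
from collections import deque

# Toy spanning-tree sketch: given parallel links between switches,
# a BFS tree from the root uses one link per new switch; every link
# the tree doesn't use is the one STP would put in blocking state.
def blocked_links(links, root):
    """links: (link_id, switch_a, switch_b) triples.
    Returns the set of link ids left out of the spanning tree."""
    adj = {}
    for lid, a, b in links:
        adj.setdefault(a, []).append((lid, b))
        adj.setdefault(b, []).append((lid, a))
    seen, used = {root}, set()
    q = deque([root])
    while q:
        node = q.popleft()
        for lid, nxt in adj.get(node, []):
            if nxt not in seen:       # first link to reach a switch wins
                seen.add(nxt)
                used.add(lid)
                q.append(nxt)
    return {lid for lid, _, _ in links} - used

# Invented topology: switch 1 uplinked to the spare switch TWICE.
links = [("uplink-a", "sw1", "sw2"), ("uplink-b", "sw2", "sw3"),
         ("spare-1", "sw1", "spare"), ("spare-2", "sw1", "spare")]
print(blocked_links(links, "sw1"))  # -> {'spare-2'}
```

With STP running, "spare-2" would sit blocked until "spare-1" failed; with STP disabled or misconfigured, both links forward and every broadcast circulates forever.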


STP would prevent loops, obviously

by robo_dev In reply to STP may have been origina ...

I've seen loops happen at sites where STP was not being used. STP adds some delay to connection time, so some people disable it.

The biggest risk is when users plug in small Ethernet switches that have 'auto uplink' on every port.

Switches that have the old-fashioned uplink button are not a risk.

It's also possible to create LAN loops with WLAN bridge devices.
