Hi all,
We're currently facing challenges with downtime and failures in our data center, and I'm reaching out to the community to pool our collective knowledge and experiences. Your expertise could be instrumental in fortifying our data center resilience.
Issue Description:
Our data center has recently encountered downtime and failures, impacting our operations. We are committed to enhancing our infrastructure's resilience and would appreciate your insights and recommendations.
Redundancy Measures:
Power Redundancy: Dual power feeds from separate grids with automatic failover.
Cooling Redundancy: N+1 configuration for HVAC systems.
Network Redundancy: Multiple internet service providers with BGP routing for automatic failover.
Server Redundancy: VMware vSphere with High Availability (HA) configured across clusters.
Recent Downtime Incidents:
Power outage due to a grid failure. Impact: 2 hours of downtime.
Cooling system malfunction. Impact: Increased temperatures, leading to performance degradation for 4 hours.
Network switch failure. Impact: Temporary loss of connectivity for 1 hour.
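To put the incidents above in perspective, here is a rough Python sketch that tallies them. The `Incident` class, the helper function names, and the 30-day measurement window are assumptions made purely for illustration; only the three durations come from the list above. The cooling incident is counted as degraded service rather than a hard outage, since it caused performance degradation and not a full loss of service.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    cause: str
    hours: float
    full_outage: bool  # False = degraded performance, not hard downtime


# Durations taken from the incident list in this post.
INCIDENTS = [
    Incident("Power outage (grid failure)", 2.0, True),
    Incident("Cooling system malfunction", 4.0, False),
    Incident("Network switch failure", 1.0, True),
]

# ASSUMPTION: a 30-day window. The post does not say over what period
# the incidents occurred, so adjust this to the real reporting period.
WINDOW_HOURS = 30 * 24


def outage_hours(incidents):
    """Total hours of hard downtime."""
    return sum(i.hours for i in incidents if i.full_outage)


def degraded_hours(incidents):
    """Total hours of degraded (but not lost) service."""
    return sum(i.hours for i in incidents if not i.full_outage)


def availability(incidents, window_hours):
    """Fraction of the window without hard downtime."""
    return 1.0 - outage_hours(incidents) / window_hours


if __name__ == "__main__":
    print(f"Hard downtime:  {outage_hours(INCIDENTS):.1f} h")
    print(f"Degraded time:  {degraded_hours(INCIDENTS):.1f} h")
    print(f"Availability:   {availability(INCIDENTS, WINDOW_HOURS):.3%}")
```

Under the assumed 30-day window, this gives 3 hours of hard downtime and 4 hours of degraded service. The availability figure depends entirely on the window chosen, so treat it as a starting point for comparison and not as a measured SLA number.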
Thank you in advance.