Monday Morning Woes

Lost_My_Marbles · July 15, 2024, 6:12pm

Over the weekend, we noticed our site was a bit sluggish. Then Sunday evening (about 8:30pm), the site stopped responding and was giving various timeout errors. We contacted the web hosting company about the issue and were told they were working on it. Okay, we thought … maybe an extended update to a server. We went to bed.

Got up Monday around 6:30am and still had the errors. We fired off an email to the web hosting IT support and cc’d the owner of the web hosting company. A ticket email came shortly after, as if it was the first contact. We waited until the company opened at about 8:30am and called in. The person we were transferred to, who we think is the CFO of the web hosting company, seemed to be in a bit of a panic. About 11:30am, the web hosting company sent out an email to all affected.

Hello Everyone,

At approximately 8:30 PM last night, issues were reported with the Maia server. Immediate attempts to resolve the problem by rebooting the server were unsuccessful, and the server has not come back online.

Current Status:

There is a systemic issue with the Maia server.

Intensive efforts have been ongoing around the clock to restore service.

We have engaged with the support team for further assistance and are currently awaiting their response.

Impact:

All emails and websites hosted on Maia are currently down.

Next Steps:

Our engineers are actively working on resolving the issue.

We will provide updates as soon as we have more information from our support team.

There is no estimated time of resolution at this moment, but we are working diligently to get the server back online as quickly as possible.

We appreciate your patience and understanding as we work through this issue. We will keep you updated as we learn more.

Shortly after that, we got a call from the owner asking us to change the name servers for both sites, saying that the new server would have an additional firewall layer of security.

Sounds like the server was hacked and taken down.

So it could be 48 to 72 hours before the name server changes propagate.

Taking a deep breath … telling ourselves to relax … all we can do is wait …

Happy Monday

Pepper_Thine_Angus · July 15, 2024, 6:43pm

I would tend to agree.

It could also be that they are just trying to get your mail flowing again but without knowing their server setup it is hard to say.

I can say I have been in their situation. I feel for them. I had an incident where one of my contractors went down and they were telling us NOTHING, and I had to take the hit from clients. LUCKILY my clients could hear it in my voice, they knew everything that could be done was being done.

Many many sleepless nights during that time.

Dogtamer · July 15, 2024, 6:52pm

May I ask if you have any inkling as to whether or not the Web Host uses Microsoft Azure in their operations?

Lost_My_Marbles · July 15, 2024, 6:59pm

Well, for us that would not be the case. We split email out a while back and have it on Google servers, so if the site(s) went down at least we had communication lines with customers.

No … Linux setup, and now moving on to Cloudflare.

Pepper_Thine_Angus · July 15, 2024, 7:10pm

Very very smart. This is what I do, and this is also how SAS runs (mail is not through Gmail, but having them split IS THE WAY!).

Also smart!

Lost_My_Marbles · July 16, 2024, 2:37am

So in the end, this is the second time an update has taken our business down within a month. The following is what was happening …

Hello Everyone,

We have an update on the Maia Server Outage that we have been working on last night and today.

Summary:

An incident occurred that caused our CentOS 7.9 (CloudLinux) servers to hang during boot. This affected our ability to restore from backups as all backups contained the same underlying issue.

Throughout this period, file safety and data integrity were maintained, and multiple versions were backed up. However, the operating system’s stability was compromised.

Timeline

Incident Occurrence:

A kernel update was applied to the servers at an unspecified time.

A subsequent reboot did not occur immediately after the kernel update, allowing the system to continue running on the old kernel while the new kernel was present but untested.

Detection and Initial Response:

Upon rebooting the servers last night around 8:34PM, the system began to hang during the boot process.

Initial troubleshooting identified issues with the kernel and specific services.

Backup Restoration Attempts:

Attempts to restore from daily and monthly backups revealed that all backups contained the same issue.

The backups had files safe and intact but included the unstable kernel, preventing a stable operating system environment.

Resolution:

The root cause was identified as the untested kernel update.

Rebuilding specific kernel components and resolving issues with related services allowed the systems to boot successfully.

Root Cause:

The primary cause of the incident was an applied kernel update that was not followed by an immediate reboot. This led to the system running with an outdated kernel while all backups captured the state with the updated yet untested kernel. Thus, every backup had the same boot issue.

Impact Analysis:

Data Integrity: No data was lost during this incident. All files remained safe and backed up.

System Downtime: The boot issue caused prolonged system downtime while the root cause was identified and resolved.

Customer Impact: Customers experienced service disruptions due to the system hang on boot.

Preventative Measures:

Immediate Reboots Post-Kernel Updates:

Implement a policy to ensure that any kernel update is followed by an immediate reboot to validate the stability of the new kernel.

Enhanced Backup Strategy:

Include periodic backups that are validated for boot and operational stability, ensuring that at least one recent backup is in a known good state.

Monitoring and Alerts:

Enhance monitoring to alert on pending reboots required for kernel updates or other critical system changes.

Testing Procedures:

Implement a staging environment where kernel updates and other critical changes can be tested before applying to production systems.

Conclusion

The incident was caused by a kernel update that was not followed by an immediate reboot, resulting in unstable backups. The issue was resolved by rebuilding kernel components and related services. Preventative measures are being implemented to avoid such incidents in the future, ensuring system stability and reliability.

Contact Information

For any questions or further details, please contact us at:

Happy to be back up and running after 19 hours of

Pepper_Thine_Angus · July 16, 2024, 11:33am

Actually I salute your hosting company. They presented what the problem was (valid, I’ve seen it) and admitted their mistake.
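
For anyone wondering what the "alert on pending reboots" item in that write-up could look like, here is a rough sketch in Python. It is not the host's tooling, just an illustration under some assumptions: the function names `version_key` and `pending_reboot` are made up for this example, installed kernels are assumed to show up as `/boot/vmlinuz-<release>` files, and the version comparison only looks at the leading numeric fields of the release string, which is much cruder than real RPM version ordering.

```python
import glob
import platform
import re
from itertools import takewhile


def version_key(release):
    """Leading numeric fields of a kernel release string, as a tuple.

    Simplification: parsing stops at the first non-numeric field
    (for example 'el7'), so distro suffixes are ignored entirely.
    """
    tokens = re.split(r"[.-]", release)
    return tuple(int(t) for t in takewhile(str.isdigit, tokens))


def pending_reboot(running, installed):
    """Return the newest installed release if it is newer than the
    running kernel, otherwise None."""
    if not installed:
        return None
    newest = max(installed, key=version_key)
    return newest if version_key(newest) > version_key(running) else None


if __name__ == "__main__":
    # Assumption: kernels are installed as /boot/vmlinuz-<release>.
    installed = [p.split("vmlinuz-", 1)[1] for p in glob.glob("/boot/vmlinuz-*")]
    newer = pending_reboot(platform.release(), installed)
    if newer:
        print(f"Reboot pending: running {platform.release()}, installed {newer}")
    else:
        print("Running kernel is the newest one found")
```

Run from cron or a monitoring agent, something like this would flag the exact situation described: a new kernel sitting on disk that nobody has booted yet.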

Lost_My_Marbles · July 16, 2024, 12:02pm

… and that is why we have stayed with them for 20 years.