Sev 1 – Complete Downtime for our Site
In this article, we’ll take you through the painful steps of troubleshooting when there are multiple components and parties involved.
When that happens, it’s important to think logically, follow the traffic flow, and understand the root cause. Without understanding the root cause of the problem, we will not be able to fix it properly.
Funnily enough, we still do not know the root cause of the problem, and the issue is still under internal investigation. It’s possible that it was our fault (outdated plugins etc.), but unlikely, for reasons we will elaborate on further in this article.
Issue Discovered – Sat, 12 Apr at 6:36 PM
We realised that our site was loading incorrectly. On further inspection – all of the assets such as JS scripts etc., were returning 404.
Our first thoughts were:
- Was there a WordPress update?
- Was there a plugin update? (we use Elementor)
So we tried to access our wp-admin page, but even that page was failing to load:
This is a clear indication that it’s not a plugin breaking the site, at least not the Elementor plugin or a caching plugin, because we don’t cache the wp-admin page and the wp-admin page is also not built using Elementor.
Troubleshooting Incorrectly
So when we troubleshoot, we first need to know:
- The expected behaviour
- The current behaviour
- Then we can start tracing the user flow, traffic flow and check logs to see what happens
Without this logical thinking – you’re bound to incorrectly infer a fault. And then the fix will also be incorrect!
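The first two points can be written down before anyone opens a log file. Below is a minimal Python sketch, assuming the expected behaviour can be expressed as "this URL should return this status code". The function names `fetch_status` and `find_mismatches` are made up for illustration, and some servers answer HEAD requests differently from GET, so treat the result as a starting point rather than proof.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_status(url, timeout=10):
    """Return the final HTTP status for url, or None if no response arrived."""
    try:
        # HEAD is enough to learn the status; redirects are followed by default.
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # 4xx/5xx responses surface as HTTPError
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def find_mismatches(expected, fetch=fetch_status):
    """List (url, expected_status, current_status) wherever the two differ."""
    mismatches = []
    for url, expected_status in expected.items():
        current_status = fetch(url)
        if current_status != expected_status:
            mismatches.append((url, expected_status, current_status))
    return mismatches
```

Calling `find_mismatches` with a dictionary such as `{"https://petcoach.sg/wp-admin/": 200}`, plus a handful of asset URLs copied from the page source, would give a short list of what is actually broken. That list is the thing any proposed fix has to make empty.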
Incorrect assessment – Apr 12, 2025, 9:08 PM
As you can see, the support team proceeded to check the logs without first understanding what was breaking:
The error message indicated the following:
==
2025-04-12 20:43:27.501458 [NOTICE] [1523396] [T0] [127.0.0.1:42226>203.142.4.28#APVH_petcoach.sg] [STDERR] WP-Optimize: No caching took place, because the plugin location could not be found\n
==
So the support team inferred that this was the cause, and disabled the Elementor plugin. There are many things wrong with this step:
- Firstly, WP-Optimize is a caching plugin. A simple Google search would confirm that
- If the error is due to the WP-Optimize plugin, then why disable the Elementor plugin? This step makes zero sense
- Furthermore, the WP-Optimize plugin was actually already deleted from this WordPress site. So the error message, although valid, is probably not the cause of the problem, because it’s expected for the plugin not to be found
  - Sure, no caching took place
  - But why would that result in the site loading incorrectly?
  - It should result in the site loading slower
- Lastly, they proceeded to disable the Elementor plugin on the production site itself, without consulting us
  - Sure, the site is already broken, so what is there to lose? But if they did this to a working site, their fix would actually break the site
- Next, they highlighted to us that the site was now loading correctly. But obviously the site was not loading correctly, because the site was built on Elementor. With the page builder disabled, the wrong set of information is being served to the user
Mind you, all of this was happening directly in the production environment, without any consideration of how it would further impact the site.
Back and forth on troubleshooting
As the support team was handling the matter incorrectly, we had to further highlight specific findings on our end – so that they are aware that they need to thoroughly investigate:
- We highlighted that the wp-admin page was still broken, even after they disabled the Elementor plugin. Which means they did not fix the correct issue
  - We also shared screenshots with assets returning 404
- We had to highlight to them that the WP-Optimize log entry is not a valid finding. This is unacceptable, as they should have been able to clone the site and test and check on their own
- On top of that, we actually could not even log in to our wp-admin dashboard. This indicates a deeper issue, probably at the infrastructure or server level
  - Access petcoach.sg/wp-admin -> site is loaded with broken styles
  - Enter password, username -> hit the error
Further incorrect troubleshooting
Subsequently, there was further back and forth on incorrect troubleshooting, and we had to highlight to them why their approach was not correct.
Proposed to restore the site – Apr 12, 2025, 9:29 PM
Their support team was not focussed on trying to find the root cause. Instead, they were trying the approach we term deploy and pray lol.
If this was at my workplace, I’d get F**king wrecked by my boss lol. But whatever, right? The support team just wanted the latest working backup, so that they could restore it (and, I’m guessing, hope for it to work?). See the screenshot below:
Highlighted that a deploy and pray approach is poor
So, we had to explain to them further why restoring a backup is unlikely to work, hoping to point them in the correct direction of course:
- Why would restoring a backup help?
- What was the root cause of the assets returning 404?
- Why can’t we login to the wp-admin site?
- Are you planning to restore the backup to a separate server and test it? If not, what are you planning to do with the backup – just restore and hope?
These were all communicated to the support team, in the screenshot below:
Nagging a little bit, but coming from a software delivery background, the deploy and pray strategy pisses me off. It’s been drilled into us that you MUST know for a fact how to resolve the issue. At least in most cases.
Funny thing is, we had more incorrect assessments being thrown at us! It’s like they were randomly inferring what the logs say and blaming the errors in the logs for the page being incorrectly served!
Further incorrect assessment – Apr 12, 2025, 10:51 PM
As previously mentioned, more irrelevant logs were being referenced. Sure, there are errors, but not every error would cause the site to go down. It’s a terrible approach to randomly find errors in logs and say that they are the cause of the issue.
So they found this set of errors, and were damn convinced that Elementor had somehow brought my site down:
==
2025-04-12 22:45:24.764015 [NOTICE] [1644055] [T0] [127.0.0.1:52932>2001:4860:7:50d::fd#APVH_petcoach.sg] [STDERR] PHP Warning: Trying to access array offset on false in /home/petcoach.sg/httpdocs/wp-content/plugins/elementor/includes/base/widget-base.php on line 223\n
2025-04-12 22:45:24.764107 [NOTICE] [1644055] [T0] [127.0.0.1:52932>2001:4860:7:50d::fd#APVH_petcoach.sg] [STDERR] PHP Warning: Undefined array key -1 in /home/petcoach.sg/httpdocs/wp-content/plugins/elementor/includes/base/controls-stack.php on line 695\n
==
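To make the "not every error takes a site down" point concrete, here is a rough Python sketch that splits a log line like the ones quoted above into fields and sorts the message by PHP severity. The field names (`thread`, `connection`, `stream`) are my own guesses at what each bracket means, and the helper names are invented for this example. The only firm fact it relies on is that PHP fatal and parse errors stop a request, while warnings, notices and deprecations do not.

```python
import re

# Layout inferred from the quoted lines; the field names are assumptions.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[(?P<level>[A-Z]+)\] \[(?P<pid>\d+)\] \[(?P<thread>[^\]]+)\] "
    r"\[(?P<connection>[^\]]+)\] \[(?P<stream>[A-Z]+)\] "
    r"(?P<message>.*)$"
)

# PHP stops executing the request on these.
FATAL_PREFIXES = ("PHP Fatal error", "PHP Parse error")
# PHP logs these and carries on.
NON_FATAL_PREFIXES = ("PHP Warning", "PHP Notice", "PHP Deprecated")


def parse_line(line):
    """Split one log line into named fields, or return None if it does not fit."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # The quoted lines end with a literal backslash-n; drop it from the message.
    if entry["message"].endswith("\\n"):
        entry["message"] = entry["message"][:-2]
    return entry


def classify(message):
    """Label a message as 'fatal', 'non-fatal', or 'unclassified'."""
    if message.startswith(FATAL_PREFIXES):
        return "fatal"
    if message.startswith(NON_FATAL_PREFIXES):
        return "non-fatal"
    return "unclassified"


def triage(lines):
    """Return (label, message) pairs for every line that could be parsed."""
    results = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            results.append((classify(entry["message"]), entry["message"]))
    return results
```

Fed the lines quoted in this article, the two Elementor entries come out as non-fatal and the WP-Optimize entry as unclassified. None of that proves what did break the site, but it does show that these particular lines are weak candidates for a full outage.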
Ahh, another piece of advice: for major providers such as Elementor or Stripe etc., if there is an issue, the issue most likely lies with you. Because if the issue is with them, all other sites with the same configuration as yours would be facing it.
If they are, then you can easily find reports in the forums.
Remind them that their assessment did not make sense
So – we had to remind them that the assessment did not make sense:
- Elementor is not used to build the wp-admin
- The wp-admin is broken, so what caused this. The fundamental question is not even answered
To which – they replied that they will be restoring the site, and inform us once done:
Firstly, I already disagree with the approach, but since we don’t have access to their servers, and had no one better to communicate with, we’ll wait it out and see.
The least they can do is to provide a plan to update us periodically – especially given that this is a severity 1 issue.
Lack of Customer Updates
This is where they literally failed lol – no updates, no ETA. Just a vague plan to restore the backup with no clear understanding of the problem.
There isn’t much to say here – because we were simply waiting for their team to get back to us. But they didn’t
Drastic Measures Taken
We had to engage a separate hosting provider to spin up the resources. This cost us approximately SGD 3000!
Sure, the resources themselves were relatively affordable. But finding someone with the skillset to restore a full production site with minimal issues was tough. And we wanted the site up fast, so we were happy to pay a premium for it!
Steps to restore the site can actually be found in previous articles:
- First, we need to spin up a wordpress installation
- After that, we need to restore our production site
Additionally – Rubbish Error Logs Referenced
The error logs mentioned above were rubbish – because once we restored our site, it was working well!
Furthermore, we can see that the site functions well even with these errors still being logged!
And even when these errors do affect the site, they result in layout issues, not full downtime of the whole site -.-! See the screenshot below to see what the error actually looks like:
- Firstly, this is a simple fix. Background simply cannot be null – just add a classic overlay but leave it transparent?
- Secondly, it does NOT block the function of the site. This is a minor inconvenience compared to not being able to use the site at all
The error was picked up by our automated testing tool, Playwright. You can check out how we use Playwright to automate our tests here!
Root Cause was incorrect
Furthermore, the root cause was obviously incorrect – because we restored the exact same codebase and database from our production site petcoach.sg into a different server, and it worked just fine with a few minor tweaks:
- No updates of plugins
- No updates of elementor
- The error messages still show, but no major issues
- wp-admin page was loading correctly
This is key evidence that the root cause analysis was improperly done before they proceeded with changes in the production environment. Unacceptable negligence.
Escalation to a separate Team
The support team escalated the issue to a separate engineer. The engineer was more competent and had a structured way of resolving the issue:
- First, he cloned the site, so that he could investigate and test on a test site
- Next, he checked forums to see if the issues had previously surfaced
These steps provided assurance that the team was now capable of taking small steps towards figuring out the issue.
- His fix was to update the PHP version, based on the forum findings, and it fixed the page being served incorrectly (or so it seems…). I’m not fully sure yet, but I can’t say otherwise, because the results speak for themselves
- I verified it and we closed the issue
Why I’m not convinced of the PHP update
I’m not convinced that the PHP update (especially the error message) points to the correct root cause. I believe that the team has incorrectly attributed the page loading incorrectly to the error logs.
In the forum thread that was shared by the support team, the issue was with errors being printed even when WP_DEBUG was set to false.
Now, we faced this ourselves in our playwright tests – see the screenshot again here:
So this error exists sure, and the forum addresses this error.
But nowhere in the forum did I see issues where:
- assets are returning 404
- wp-admin login page is failing etc.
These are issues that are critical, compared to layout issues.
I am not saying that updating php did not fix the issue. It might have fixed the issue – but we do not know the underlying root cause for now.
Next Steps
Without knowing the root cause, the issue could very well appear again, so the team has launched an internal investigation and will update me on the issue:
Now, whilst the case has finally been resolved, it has had significant implications for our business revenue, and a downstream impact on our SEO efforts.
See how our favicon was lost due to the downtime. It was good learning for us – but costly for the business.
And more importantly, the support team should be taking (I believe they are) steps to strengthen their processes.
You cannot simply guess a root cause, and make changes in production in an attempt to fix it. This is completely unacceptable.
That lapse was below industry standard for sure, and I’m certain that they will fix it!
Learning points for me too – considering I’m in the same industry
Thank you!
That’s all – sounds like a rant. But almost like a productive rant :)
Lots of learnings for both myself and the support team. Hope we come out of this better and stronger.