Lightdash Cloud instability (slow response times)
Incident Report for Lightdash
Postmortem

Today at 13:37 UTC we noticed that all API endpoints for https://app.lightdash.cloud had started responding very slowly. This affected all users and led to API response times of over a minute.

Our first response was to greatly increase the resources available to the Lightdash server (both the number and the size of servers). After this change, all services remained stable.

We have investigated our logs to understand exactly what actions users were taking that made the server so busy. In addition to the overall volume of users, we noticed a higher-than-usual number of Databricks users. We already have a fix in testing for improved Databricks performance, which we expect to release in the coming hours. For further performance improvements you can follow this milestone in GitHub: https://github.com/lightdash/lightdash/milestone/91

Posted May 09, 2023 - 15:43 UTC

Resolved
This incident has been resolved.
Posted May 09, 2023 - 15:33 UTC
Monitoring
We are monitoring performance.
Posted May 09, 2023 - 15:22 UTC
Investigating
The Lightdash API at https://app.lightdash.cloud is experiencing very slow response times for all users, leading to some actions taking minutes or not executing.

We're currently investigating.
Posted May 09, 2023 - 13:48 UTC
This incident affected: Lightdash Cloud (Lightdash Cloud (US)).