HERE IS THE TECHNICAL EXPLANATION FROM SERVER POINT.
Hello Michael,
Here is a status report from Peter (Lead Engineer) of what is currently affecting your virtual machine.
We use a distributed storage system called Ceph to house the virtual disks of your virtual machines. Each Ceph cluster at ServerPoint is made up of 8 servers with a total of about 100 SSDs, providing about 100 TB of storage. We have many such clusters.
In this cluster, your VM's data is triplicated and distributed among many physical disks and servers, and the data is stored in virtual "buckets" called placement groups.
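To make the "buckets" idea concrete, here is a toy sketch of how a piece of VM data could be assigned to a placement group and then to three disks. It is only an illustration: Ceph's real placement uses its own hashing and the CRUSH algorithm, which also spreads copies across servers, and the names and numbers here (NUM_PGS, OSDS, pg_for_object, osds_for_pg, the object name) are hypothetical.

```python
import hashlib

# All values below are assumptions for illustration, not ServerPoint's real settings.
NUM_PGS = 1024            # hypothetical number of placement groups
OSDS = list(range(100))   # hypothetical disk IDs (about 100 SSDs, as in the email)
REPLICAS = 3              # the email says data is triplicated


def pg_for_object(name: str) -> int:
    """Hash an object name to a placement group number (toy scheme, not Ceph's)."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PGS


def osds_for_pg(pg: int) -> list[int]:
    """Pick three distinct disks for a placement group by ranking per-PG hashes.

    Unlike CRUSH, this toy version ignores which server each disk lives in.
    """
    ranked = sorted(
        OSDS,
        key=lambda osd: hashlib.sha256(f"{pg}:{osd}".encode()).digest(),
    )
    return ranked[:REPLICAS]


if __name__ == "__main__":
    obj = "vm-disk-42.chunk-0007"  # hypothetical object name
    pg = pg_for_object(obj)
    print(f"{obj} -> placement group {pg} -> disks {osds_for_pg(pg)}")
```

The point of the sketch is that every piece of data belongs to exactly one placement group, so if that group stops responding, reads and writes to the data in it stall even though the three disks holding it may be healthy.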
As of yesterday, one of these placement groups, equivalent to 0.091% of the entire cluster, started misbehaving. The problem is that any VM with data assigned to this misbehaving placement group suddenly finds itself "frozen". Currently, this appears to affect about 40-50 virtual machines out of about 2000 on the affected Ceph cluster.
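The figures above can be checked with a few lines of arithmetic. This is a rough sketch that assumes every placement group holds an equal share of the cluster, which the email does not state; the variable names are illustrative.

```python
# Figures taken from the email; equal-sized placement groups are an assumption.
pg_fraction = 0.091 / 100          # one placement group as a share of the cluster
implied_pg_count = 1 / pg_fraction # how many groups that share would imply

vms_total = 2000
affected_low = 40 / vms_total      # lower estimate of affected VMs
affected_high = 50 / vms_total     # upper estimate of affected VMs

print(f"implied placement groups: about {round(implied_pg_count)}")
print(f"affected VMs: {affected_low:.1%} to {affected_high:.1%}")
print(f"ratio to PG share: {affected_low / pg_fraction:.0f}x to {affected_high / pg_fraction:.0f}x")
```

So a single group holding roughly 0.09% of the data touches roughly 2% to 2.5% of the VMs. That is consistent with each VM's disk being spread over several placement groups: one stalled group is enough to freeze every VM that has any data in it.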
Two consulting firms, including Red Hat, the owner of Ceph, are currently analyzing why this is happening. They believe that we have hit an unknown bug and they are working to find what is causing it.
Being a weekend, it has been rough trying to get their experts online to work with us. While we have been working nonstop because, well, we can be workaholics here at ServerPoint, other companies unfortunately don't follow the same work philosophy.
We can assure you that we are pushing them hard to work rapidly. Rest assured, the problem will be resolved one way or another. Please enjoy your weekend and leave the worrying to us. That is what we are here for, to take things off your back. :)
Best regards,
Ruthy