Use case: A node in the Kubernetes cluster is replaced with a new node. The
new node gets a different `kubernetes.io/hostname`. The storage devices
that were attached to the old node are re-attached to the new node.
Fix: Instead of using the default `kubernetes.io/hostname` as the node affinity
label, this commit changes the driver to use `openebs.io/nodeid`. The ZFS LocalPV driver
will pick the value from the nodes and set the affinity.
Once the old node is removed from the cluster, the K8s scheduler will keep
trying to schedule the applications on the old node only.
The user can now set the value of `openebs.io/nodeid` on the new node to the same
value that the old node had. This makes sure the pods/volumes are
scheduled to the new node.
Note: To migrate a PV to another node, we have to move the disks to that node,
remove the old node from the cluster, and set the same label on the new node using
the same key, which lets the k8s scheduler schedule the pods to that node.
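A minimal sketch of the matching idea described above, not the driver's actual code. It assumes the volume affinity stores one `openebs.io/nodeid` value and that a node is eligible only when its label carries the same value. The node names and the helper `eligible_nodes` are hypothetical.

```python
NODE_ID_KEY = "openebs.io/nodeid"


def eligible_nodes(nodes, volume_node_id):
    """Return the names of nodes whose nodeid label matches the volume affinity."""
    return sorted(
        name
        for name, labels in nodes.items()
        if labels.get(NODE_ID_KEY) == volume_node_id
    )


# Hypothetical cluster after the old node was replaced by "new-node".
cluster = {
    "new-node": {"kubernetes.io/hostname": "new-node", NODE_ID_KEY: "new-node"},
}

# The volume still carries the affinity value of the old node.
before = eligible_nodes(cluster, "old-node")

# The user sets the label on the new node to the old node's value.
cluster["new-node"][NODE_ID_KEY] = "old-node"
after = eligible_nodes(cluster, "old-node")
```

On a real cluster the relabel would presumably be done with something like `kubectl label node new-node openebs.io/nodeid=old-node --overwrite`; the node names here are placeholders.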
Other updates:
* adding faq doc
* renaming the config variable to nodename
Signed-off-by: Pawan <pawan@mayadata.io>
Co-authored-by: Akhil Mohan <akhilerm@gmail.com>
* Update docs/faq.md
Co-authored-by: Akhil Mohan <akhilerm@gmail.com>
We set the Finalizer to nil while handling the delete event. Instead,
we should try to destroy the volume only when there are no user finalizers
set. Users might have added their own finalizers, and we should not try to destroy
the volumes until those user finalizers are removed.
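A rough sketch of the check described above, assuming the driver's own finalizer is `zfs.openebs.io/finalizer` and that any other entry counts as a user finalizer. The helper name `can_destroy` is hypothetical.

```python
DRIVER_FINALIZER = "zfs.openebs.io/finalizer"


def can_destroy(finalizers):
    """Allow the destroy only when no finalizer other than the driver's own is left."""
    user_finalizers = [f for f in finalizers if f != DRIVER_FINALIZER]
    return not user_finalizers
```

With this check, a delete event for a volume that still carries a user finalizer is simply skipped and handled again once that finalizer is removed.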
Signed-off-by: Pawan <pawan@mayadata.io>
Currently the controller picks one node and the node agent keeps on trying to
create the volume on that node. There might not be enough space available
on that node to create the volume.
The controller can instead try all the nodes sequentially and fail
the request only if volume creation fails on all the nodes that satisfy the
topology constraints.
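The sequential retry can be sketched as below. This is an illustration under assumptions, not the controller's code: `try_create` is a hypothetical callback that raises `RuntimeError` when the volume cannot be created on a node, and `nodes` is assumed to be already filtered by the topology constraints.

```python
def create_on_any_node(nodes, try_create):
    """Try each candidate node in order and return the first one that succeeds."""
    errors = {}
    for node in nodes:
        try:
            try_create(node)
            return node
        except RuntimeError as err:
            # Remember why this node failed and move on to the next one.
            errors[node] = str(err)
    raise RuntimeError(f"volume creation failed on all nodes: {errors}")
```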
Signed-off-by: Pawan <pawan@mayadata.io>
An encrypted pool does not allow the volume to be pre-created for
restore purposes. This changes the design to do the restore first
and then create the ZFSVolume object, which will bind to the volume
already created during the restore.
Signed-off-by: Pawan <pawan@mayadata.io>
This PR adds the capability to create a clone directly from a PVC:
```
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-clone
spec:
  storageClassName: openebs-snap
  dataSource:
    name: pvc-snap
    kind: PersistentVolumeClaim
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
```
The ZFS-LocalPV driver will create one internal snapshot with the same
name as the new volume and will create a clone out of it. Also,
while destroying the volume, the driver will take care of deleting
the snapshot created for the clone.
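As a sketch of the flow above, the helpers below only build the `zfs` command lines that such a clone and its cleanup would roughly correspond to. The dataset layout `<pool>/<volume>` and the helper names are assumptions for illustration; the commit does not show the exact commands the driver runs.

```python
def clone_commands(pool, src_volume, new_volume):
    """Commands to clone src_volume via an internal snapshot named after new_volume."""
    snapshot = f"{pool}/{src_volume}@{new_volume}"
    return [
        ["zfs", "snapshot", snapshot],
        ["zfs", "clone", snapshot, f"{pool}/{new_volume}"],
    ]


def destroy_commands(pool, src_volume, new_volume):
    """Commands to remove the clone first and then its internal snapshot."""
    return [
        ["zfs", "destroy", f"{pool}/{new_volume}"],
        ["zfs", "destroy", f"{pool}/{src_volume}@{new_volume}"],
    ]
```

The clone is destroyed before the snapshot because ZFS does not allow destroying a snapshot that still has a dependent clone.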
Signed-off-by: Pawan <pawan@mayadata.io>
Added schema validation for the backup and restore CRs. Also validating
the server address in the backup/restore controller.
The server address is validated against:
^([0-9]+.[0-9]+.[0-9]+.[0-9]+:[0-9]+)$
which is:
<any number>.<any number>.<any number>.<any number>:<any number>
Here we validate just the format of the address, not that the IP itself is
valid, which would be a little more complex. In any case, if the IP is not correct,
the zfs send will fail, so there is no need for complex validation of the
IP and port.
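To illustrate what the pattern accepts, here is a small sketch that applies it exactly as written above using Python's `re` module (the driver itself is not written in Python, and `is_valid_server` is a hypothetical helper). Note that, as written here, the dots are unescaped, so they match any character and not only a literal `.`.

```python
import re

# The pattern exactly as it appears in the commit message.
SERVER_ADDR = re.compile(r"^([0-9]+.[0-9]+.[0-9]+.[0-9]+:[0-9]+)$")


def is_valid_server(addr):
    """Check only the <num>.<num>.<num>.<num>:<num> shape of the address."""
    return SERVER_ADDR.match(addr) is not None


accepted = ["10.0.0.1:9000", "999.999.999.999:1", "1a2b3c4:5"]
rejected = ["10.0.0.1", "minio.velero.svc:9000", ""]
```

Every entry in `accepted` passes the check and every entry in `rejected` fails it, which shows that this is a format check only: out-of-range octets are accepted, and so is `1a2b3c4:5` because of the unescaped dots.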
Signed-off-by: Pawan <pawan@mayadata.io>
This commit adds support for the Backup and Restore controllers, which will be watching for
the events. The velero plugin will create a Backup CR with the remote location
information to create a backup, and the controller will send the data
to that remote location.
In the same way, the velero plugin will create a Restore CR to restore the
volume from the remote location, and the restore controller will restore
the data.
Steps to use the velero plugin for ZFS-LocalPV:
1. Install velero
2. Add the openebs plugin
    velero plugin add openebs/velero-plugin:latest
3. Create the VolumeSnapshotLocation:
for full backup:
```yaml
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: openebs.io/zfspv-blockstore
  config:
    bucket: velero
    prefix: zfs
    namespace: openebs
    provider: aws
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://minio.velero.svc:9000
```
for incremental backup:
```yaml
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: openebs.io/zfspv-blockstore
  config:
    bucket: velero
    prefix: zfs
    backup: incremental
    namespace: openebs
    provider: aws
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://minio.velero.svc:9000
```
4. Create a backup
    velero backup create my-backup --snapshot-volumes --include-namespaces=velero-ns --volume-snapshot-locations=aws-cloud-default --storage-location=default
5. Create a schedule
    velero create schedule newschedule --schedule="*/1 * * * *" --snapshot-volumes --include-namespaces=velero-ns --volume-snapshot-locations=aws-local-default --storage-location=default
6. Restore from the backup
    velero restore create --from-backup my-backup --restore-volumes=true --namespace-mappings velero-ns:ns1
Signed-off-by: Pawan <pawan@mayadata.io>
A few customers are using an old version of the driver where the
Status field is not present, so the mount will fail after the
upgrade to version 0.9 or later.
Reverting to checking whether the finalizer is set to decide if the
volume is ready to be mounted.
Signed-off-by: Pawan <pawan@mayadata.io>
This issue is specific to xfs only, and happens when we create a clone volume and the system takes time to create the device.
When we create a clone volume from an xfs filesystem, ZFS-LocalPV generates a new UUID for the clone volume, as a new UUID is needed to mount the cloned filesystem. To generate the new UUID, ZFS-LocalPV first replays the xfs log by mounting the device at a tmp location.
What happens here is that, since device creation is slow, we created the tmp location to mount the clone volume, but the mount failed because the device had not been created yet. On the next try, since the tmp location is already present, it keeps failing at that same step on every reconciliation.
Signed-off-by: Pawan <pawan@mayadata.io>
Instead of checking for the finalizer, checking that the
volume state is Ready before mounting it is more intuitive.
Also removed a duplicate if statement for btrfs which was added while resolving
the merge conflict in https://github.com/openebs/zfs-localpv/pull/175.
Signed-off-by: Pawan <pawan@mayadata.io>
btrfs, like xfs, needs a new UUID to be generated for the
cloned volumes. btrfs treats all devices with the same UUID as the
same device, so this generates a new UUID for the cloned volumes
using the btrfstune command.
Signed-off-by: Pawan <pawan@mayadata.io>
Applications that want to share a volume can use the storageclass below
to make their volumes shareable by multiple pods:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-zfspv
parameters:
  shared: "yes"
  fstype: "zfs"
  poolname: "zfspv-pool"
provisioner: zfs.csi.openebs.io
```
Now a volume provisioned using this storageclass can be used by multiple pods.
The pods have to ensure data consistency themselves and need their own locking mechanism.
One thing to note here is that the pods will be scheduled to the node where the volume is present,
so that all the pods can use the same volume, as it can only be accessed locally.
This way we can avoid the NFS overhead and get optimal performance.
Also fixed the log formatting in the GRPC log.
Signed-off-by: Pawan <pawan@mayadata.io>
We cannot mount a dataset at more than one path via the zfs mount command, so
this shifts to the legacy way of handling ZFS volumes, where we can mount/umount
the datasets via the legacy mount and umount commands.
This also adds a building block for the SINGLE-NODE-MULTI-WRITER capability.
Signed-off-by: Pawan <pawan@mayadata.io>
The PVC will not be bound if there are wrong parameters/poolname in the storageclass.
The ZFSVolume CR will still be created and will remain in Pending state.
Deleting the PVC will delete the PVC, but since the PVC is not bound, the ZFS-LocalPV
driver will not get the delete call and will leave the ZFSVolume CR hanging there.
Reverting the behavior introduced in https://github.com/openebs/zfs-localpv/pull/121.
Now the PVC will be bound, but the ZFSVolume will stay in Pending state until the volume is created.
Signed-off-by: Pawan <pawan@mayadata.io>
The controller does not check whether the volume has been created or not
and returns success, which in turn binds the PVC to the PV.
The PVC should not be bound until the corresponding zfs volume has been created.
Now the controller will check that the ZFSVolume CR state is "Ready" before returning
success. The CSI will retry the CreateVolume request when it gets
an error reply. Once the ZFS node agent creates the ZFS volume and sets the
ZFSVolume CR state to "Ready", the controller will return success for the
CreateVolume request and then the PVC will be bound.
Signed-off-by: Pawan <pawan@mayadata.io>
This commit adds support for creating a Raw Block Volume by setting volumeMode to Block in the PVC:
```
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: block-claim
spec:
  volumeMode: Block
  storageClassName: zfspv-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```
The driver will create a zvol for this volume and bind mount the block device at the given path.
Signed-off-by: Pawan <pawan@mayadata.io>
There are setups where the nodename is different from the hostname.
The driver uses the nodename and tries to set the "kubernetes.io/hostname"
node label to the nodename, which will fail if the nodename is not the same as the
hostname. Here, changing the key to a unique name so that the driver can set
that key as a node label and does not modify or touch the existing node labels.
From now on, the driver will use the "openebs.io/nodename" key to set the PV node affinity.
Old volumes will have "kubernetes.io/hostname" affinity, and they will keep working:
after the PR https://github.com/openebs/zfs-localpv/pull/94, the driver supports all the node
labels as topology keys, and all the nodes have the "kubernetes.io/hostname" label set. So
old volumes will work without any issue. For the same reason, old storage classes
which use "kubernetes.io/hostname" as the topology key will work, as that key is supported.
This fixes the issue where the driver was trying to create the PV on the master node.
The master node has the "kubernetes.io/hostname" label, so it was also becoming a valid
candidate for provisioning the PV. After changing the key to a unique name, since the driver
will not run on the master node, it will not set the "openebs.io/nodename" label on this node,
hence this node will never become a valid candidate for provisioning the volume.
Signed-off-by: Pawan <pawan@mayadata.io>
This looks like a bug in ZFS: when you change the mountpoint property to none,
ZFS is expected to unmount the file system automatically. When we delete the pod, we get the
unmount request for the old pod and the mount request for the new pod. The driver
unmounts by setting mountpoint to none, assumes that the
unmount is done, and proceeds to delete the mountpath, but here zfs has not unmounted
the dataset:
```
$ sudo zfs get all zfspv-pool/pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765 | grep mount
zfspv-pool/pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765 mounted yes -
zfspv-pool/pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765 mountpoint none local
zfspv-pool/pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765 canmount on
```
Here, the driver will assume that the dataset has been unmounted and proceed to delete the
mountpath, and it will delete the data as part of the cleanup for the NodeUnPublish request.
Shifting to use zfs umount instead of zfs set mountpoint=none for unmounting the dataset.
Also, the driver is using os.RemoveAll, which is very risky as it removes
children too. Since the mountpoint is not supposed to contain anything,
os.Remove is sufficient, and it will fail if there is anything there.
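The difference between the two removal styles can be shown with a Python analogue (the driver itself uses Go's os.Remove and os.RemoveAll). In this sketch `os.rmdir` plays the role of os.Remove, since it refuses to delete a non-empty directory, while `shutil.rmtree` behaves like os.RemoveAll. The helper names and the temporary paths are hypothetical.

```python
import os
import shutil
import tempfile


def remove_mountpath(path):
    """Remove the mountpath only if it is empty; raises OSError otherwise."""
    os.rmdir(path)


def demo():
    """Simulate a mountpath that still holds data because the unmount did not happen."""
    base = tempfile.mkdtemp()
    mountpath = os.path.join(base, "mount")
    os.mkdir(mountpath)
    data_file = os.path.join(mountpath, "data")
    with open(data_file, "w") as f:
        f.write("application data")
    try:
        remove_mountpath(mountpath)
        removed = True
    except OSError:
        removed = False
    data_survived = os.path.exists(data_file)
    shutil.rmtree(base)  # test cleanup only; this is the recursive, risky variant
    return removed, data_survived
```

In the demo the non-recursive removal fails and the data file is still there afterwards, which is the safety property this commit relies on.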
Signed-off-by: Pawan <pawan@mayadata.io>
There can be cases where the openebs namespace has been accidentally deleted (Optoro case: https://mdap.zendesk.com/agent/tickets/963). There, the driver attempted to destroy the dataset, which first unmounts the dataset and then tries to destroy it; the destroy fails as the volume is busy. Here, as mentioned in the steps to recover, we have to manually mount the dataset:
```
6. The driver might have attempted to destroy the volume before going down, which sets mounted to "no" (this strange behavior was seen on GKE Ubuntu 18.04). We have to mount the dataset: go to each node and check if there is any unmounted volume
zfs get mounted
if there is any dataset with this property as "no", we should do the below:
mountpath=$(zfs get -Hp -o value mountpoint <dataset name>)
zfs set mountpoint=none <dataset name>
zfs set mountpoint=$mountpath <dataset name>
this will get the dataset mounted.
```
So in this case the volume will be unmounted while mountpoint is still set to the mountpath. If the application pod is deleted later on, the driver will try to mount the zfs dataset for the new pod. Here, just setting the `mountpoint` is not sufficient: since the zfs dataset was unmounted (via zfs destroy in this case), we have to explicitly mount the dataset, **otherwise the application will start running without any persistent storage**. This commit automates the manual steps performed to resolve the problem: if the zfs dataset is not mounted after setting the mountpoint property, the driver attempts to mount it.
This is not the case with zvols, as the destroy does not attempt to unmount them, so zvols are fine.
Also, the NodeUnPublish operation MUST be idempotent. If this RPC failed, or the CO does not know if it failed or not, it can choose to call NodeUnpublishVolume again. So this is handled by returning success if the volume is not mounted. Also added descriptive error messages in a few places.
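The idempotency requirement can be sketched as below. This is only a model: the `mounted` set is a hypothetical stand-in for the node's real mount table, and `node_unpublish` is not the driver's function.

```python
def node_unpublish(target_path, mounted):
    """Unmount target_path if needed; a repeated call is a successful no-op."""
    if target_path not in mounted:
        # Nothing to do: an earlier call (or something else) already unmounted it.
        return "ok: not mounted"
    mounted.discard(target_path)
    return "ok: unmounted"


mount_table = {"/var/lib/kubelet/pods/example/mount"}
first = node_unpublish("/var/lib/kubelet/pods/example/mount", mount_table)
second = node_unpublish("/var/lib/kubelet/pods/example/mount", mount_table)
```

Both calls report success, so a CO that retries after an unknown outcome does not get stuck on an error.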
Signed-off-by: Pawan <pawan@mayadata.io>
The xfs_admin command used to generate the new UUID for the cloned
volume fails without returning an error if there is a log pending
in the filesystem:
ERROR: The filesystem has valuable metadata changes in a log that needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_admin. If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
No UUID will be generated in this case and the application cannot mount the volume.
Here we first mount the filesystem at a temp location with the "nouuid" mount option
so that it can replay the log and bring the filesystem to a clean state, then unmount it,
and after that generate the UUID with the xfs_admin command.
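The sequence above can be sketched as a list of commands. The exact flags the driver passes are not shown in this commit, so `xfs_admin -U generate` is an assumption (it is the standard way to ask xfs_admin for a new random UUID), and the helper name and paths are hypothetical.

```python
def xfs_new_uuid_commands(device, tmp_dir):
    """Commands to replay the xfs log and then generate a new UUID for a clone."""
    return [
        # Mount with nouuid so the duplicate UUID of the clone is tolerated
        # and the log gets replayed.
        ["mount", "-o", "nouuid", device, tmp_dir],
        # Unmount again so the filesystem is clean and not in use.
        ["umount", tmp_dir],
        # Assumed invocation: ask xfs_admin to generate a new random UUID.
        ["xfs_admin", "-U", "generate", device],
    ]
```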
Signed-off-by: Pawan <pawan@mayadata.io>
To mount a cloned xfs volume, a new UUID has to be generated.
We are generating a new UUID for the cloned volumes which are formatted
as xfs using the xfs_admin command.
Signed-off-by: Pawan <pawan@mayadata.io>
We can resize the volume by updating the PVC yaml to
the desired size and applying it. The ZFS driver will take care
of updating the quota in the case of a dataset. If we are using a
zvol and have mounted it as an ext4 or xfs filesystem, the driver will take
care of expanding the volume via the resize2fs/xfs_growfs binaries.
For resize, the storageclass that provisions the PVC must support
resize. We should have allowVolumeExpansion set to true in the storageclass:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-zfspv
allowVolumeExpansion: true
parameters:
  poolname: "zfspv-pool"
provisioner: zfs.csi.openebs.io
```
Signed-off-by: Pawan <pawan@mayadata.io>
Whenever a volume is provisioned or de-provisioned, we send a Google Analytics event with mainly the following details:
1. pvName (shown as the app title in Google Analytics)
2. size of the volume
3. event type: volume-provision, volume-deprovision
4. storage type: zfs-localpv
5. replicacount as 1
6. ClientId as the default namespace uuid
Apart from this, we send an event once every 24 hr, which has some info like the number of nodes, node type, kubernetes version etc.
This metric is controlled by the OPENEBS_IO_ENABLE_ANALYTICS env. We can set it to false if we don't want to send the metrics.
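A minimal sketch of such an env toggle. It assumes that analytics stay enabled when the variable is unset and that only the value "false" (case-insensitive) disables them; the commit does not spell out the default or the parsing, and `analytics_enabled` is a hypothetical helper.

```python
import os


def analytics_enabled(env=None):
    """Return False only when OPENEBS_IO_ENABLE_ANALYTICS is explicitly set to false."""
    env = os.environ if env is None else env
    # Assumption: an unset variable means analytics are enabled.
    value = env.get("OPENEBS_IO_ENABLE_ANALYTICS", "true")
    return value.strip().lower() != "false"
```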
Signed-off-by: Pawan <pawan@mayadata.io>
This commit adds support for snapshot and clone via the CSI driver. Users can create a snapshot and a clone using the following steps.
Note:
- The snapshot is created via the reconciliation CR
- The cloned volume will be on the same zpool where the snapshot is taken
- The cloned volume will have the same properties as the source volume.
-----------------------------------
Create a Snapshotclass
```
kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1beta1
metadata:
  name: zfspv-snapclass
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: zfs.csi.openebs.io
deletionPolicy: Delete
```
Once snapshotclass is created, we can use this class to create a Snapshot
```
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: zfspv-snap
spec:
  volumeSnapshotClassName: zfspv-snapclass
  source:
    persistentVolumeClaimName: csi-zfspv
```
```
$ kubectl get volumesnapshot
NAME AGE
zfspv-snap 7m52s
```
```
$ kubectl get volumesnapshot -o yaml
apiVersion: v1
items:
- apiVersion: snapshot.storage.k8s.io/v1beta1
  kind: VolumeSnapshot
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"snapshot.storage.k8s.io/v1beta1","kind":"VolumeSnapshot","metadata":{"annotations":{},"name":"zfspv-snap","namespace":"default"},"spec":{"source":{"persistentVolumeClaimName":"csi-zfspv"},"volumeSnapshotClassName":"zfspv-snapclass"}}
    creationTimestamp: "2020-01-30T10:31:24Z"
    finalizers:
    - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
    generation: 1
    name: zfspv-snap
    namespace: default
    resourceVersion: "30040"
    selfLink: /apis/snapshot.storage.k8s.io/v1beta1/namespaces/default/volumesnapshots/zfspv-snap
    uid: 1a5cf166-c599-4f58-9f3c-f1148be47fca
  spec:
    source:
      persistentVolumeClaimName: csi-zfspv
    volumeSnapshotClassName: zfspv-snapclass
  status:
    boundVolumeSnapshotContentName: snapcontent-1a5cf166-c599-4f58-9f3c-f1148be47fca
    creationTime: "2020-01-30T10:31:24Z"
    readyToUse: true
    restoreSize: "0"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
OpenEBS resource for the created snapshot:
```
$ kubectl get snap -n openebs -o yaml
apiVersion: v1
items:
- apiVersion: openebs.io/v1alpha1
  kind: ZFSSnapshot
  metadata:
    creationTimestamp: "2020-01-30T10:31:24Z"
    finalizers:
    - zfs.openebs.io/finalizer
    generation: 2
    labels:
      kubernetes.io/nodename: pawan-2
      openebs.io/persistent-volume: pvc-18cab7c3-ec5e-4264-8507-e6f7df4c789a
    name: snapshot-1a5cf166-c599-4f58-9f3c-f1148be47fca
    namespace: openebs
    resourceVersion: "30035"
    selfLink: /apis/openebs.io/v1alpha1/namespaces/openebs/zfssnapshots/snapshot-1a5cf166-c599-4f58-9f3c-f1148be47fca
    uid: e29d571c-42b5-4fb7-9110-e1cfc9b96641
  spec:
    capacity: "4294967296"
    fsType: zfs
    ownerNodeID: pawan-2
    poolName: zfspv-pool
    status: Ready
    volumeType: DATASET
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
Create a clone volume
We can provide a datasource as snapshot name to create a clone volume
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: zfspv-clone
spec:
  storageClassName: openebs-zfspv
  dataSource:
    name: zfspv-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
```
It will create a ZFS clone volume from the mentioned snapshot and create the PV on the same node where the original volume is.
Since resize is not supported yet, the clone PVC size should match the size of the snapshot.
Also, the properties from the storageclass will not be considered for the clone case; it will take the properties from the snapshot and create the clone volume. One thing to note here is that the storageclass in the clone PVC should have the same poolname as that of the original volume, as cloning across pools is not supported.
Signed-off-by: Pawan <pawan@mayadata.io>
With "zfs destroy -R" we will also delete the snapshots and clones. We should
not use that for deleting the volumes.
Signed-off-by: Pawan <pawan@mayadata.io>
Applications can now use a storageclass like the one below to create a zfs filesystem:
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: openebs-zfspv5
    allowVolumeExpansion: true
    parameters:
      blocksize: "4k"
      fstype: "zfs"
      poolname: "zfspv-pool"
    provisioner: zfs.csi.openebs.io
ZFSPV was supporting only the ext2/3/4 and xfs filesystems, which
add one extra filesystem layer on top of the ZFS filesystem. Now
we can directly write to the ZFS filesystem and get optimal performance
by creating a ZFS filesystem for storage.
Signed-off-by: Pawan <pawan@mayadata.io>
This is an initial scheduler implementation for ZFS Local PV.
* adding scheduler as a configurable option
* adding volumeWeightedScheduler as scheduling logic
The volumeWeightedScheduler will go through all the nodes as per the
topology information and pick the node which has the fewest
volumes provisioned in the given pool.
Let's say there are 2 nodes, node1 and node2, with the pool configuration below:
```
node1
|
|-----> pool1
|         |
|         |------> pvc1
|         |------> pvc2
|-----> pool2
          |------> pvc3

node2
|
|-----> pool1
|         |
|         |------> pvc4
|-----> pool2
          |------> pvc5
          |------> pvc6
```
So if the application is using pool1 as shown in the storage class below, the ZFS driver will schedule it on node2, as node2 has one volume in pool1 compared to node1, which has 2 volumes in pool1.
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: openebs-zfspv
provisioner: zfs.csi.openebs.io
parameters:
  blocksize: "4k"
  compression: "on"
  dedup: "on"
  thinprovision: "yes"
  poolname: "pool1"
```
If the application is using pool2 as shown in the storage class below, the ZFS driver will schedule it on node1, as node1 has only one volume in pool2 compared to node2, which has 2 volumes in pool2.
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: openebs-zfspv
provisioner: zfs.csi.openebs.io
parameters:
  blocksize: "4k"
  compression: "on"
  dedup: "on"
  thinprovision: "yes"
  poolname: "pool2"
```
If all the nodes have the same number of volumes in the given pool, the scheduler can pick any node and schedule the PV on it.
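The selection logic described above can be sketched as follows. This is an illustration of the idea and not the driver's implementation: volumes are modeled as hypothetical `(node, pool)` pairs, and ties are broken by candidate order, which is one valid way to "pick any node".

```python
def pick_node(volumes, candidate_nodes, pool):
    """Pick the candidate node with the fewest volumes in the given pool."""
    counts = {node: 0 for node in candidate_nodes}
    for node, volume_pool in volumes:
        if volume_pool == pool and node in counts:
            counts[node] += 1
    # min() returns the first node with the lowest count, so ties go to
    # whichever candidate is listed first.
    return min(candidate_nodes, key=lambda node: counts[node])


# The example layout from this commit message: pvc1..pvc6.
volumes = [
    ("node1", "pool1"), ("node1", "pool1"), ("node1", "pool2"),
    ("node2", "pool1"), ("node2", "pool2"), ("node2", "pool2"),
]
choice_pool1 = pick_node(volumes, ["node1", "node2"], "pool1")
choice_pool2 = pick_node(volumes, ["node1", "node2"], "pool2")
```

For the layout above this selects node2 for pool1 and node1 for pool2, matching the two storage class examples.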
Signed-off-by: Pawan <pawan@mayadata.io>