Overview
Rationale
- Containers in Kubernetes require:
- Dynamic provisioning of storage alongside the container lifecycle
- Reliability—the ability to withstand failure of nodes/disks/storage arrays
- High density of volumes per host
- Security
- Availability across availability zones
PX-Store
- Virtual storage overlay
- Deployed as containers, for containers
- Aggregates storage across nodes
- Auto-categorizes storage into pools
- Replication
- Multi-zonal clusters
- Backups, e.g. into AWS S3
- Disaster recovery
- Scalability—PX-Autopilot
Architecture
- Deployed as a container
- Design principles:
- Heterogeneous systems
- Different CPU/memory profiles
- Built-in node disks, or storage arrays (or mixed)
- Scaling—nodes and storage
- Node/availability zone/storage outages tolerable
- Application-level granularity
- Managed with Kubernetes YAML
- Extensibility—supports OpenStorage SDK and APIs
Deployment Models
- Disaggregated
- Storage nodes separate from compute—2 clusters
- Hyperconverged
- Compute and storage co-located
- Better performance
Components
- Control plane
- Cluster management
- Data plane
- Data storage
Control Plane
Clustering
Quorum
- Consensus—to elect leader
- Storageless nodes have no voting rights, so can’t form quorum—but can join an existing quorum
- Raft algorithm
- Consensus algorithm
- Each node runs a randomized election timer; when it expires, the node messages all other nodes requesting to become leader
- Other nodes respond with their votes; a candidate that receives a majority becomes leader
- Periodic heartbeat from leader—new leader elected if not received
- Raft DB—all nodes have copy, kept in sync
- Minimum number of available nodes required for quorum is floor(n/2) + 1, where n is the total number of nodes
- Table of required quorum, per total number of nodes:
Total num nodes | Quorum | Fault tolerance |
---|---|---|
1 | 1 | 0 |
2 | 2 | 0 |
3 | 2 | 1 |
4 | 3 | 1 |
5 | 3 | 2 |
6 | 4 | 2 |
7 | 4 | 3 |
- 2 nodes no better than 1—both have no fault tolerance for node outage
- Same for 4 vs 3 and 6 vs 5
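In general, for a cluster of n nodes (matching the table above):

```latex
\mathrm{quorum}(n) = \left\lfloor \frac{n}{2} \right\rfloor + 1,
\qquad
\mathrm{tolerance}(n) = n - \mathrm{quorum}(n) = \left\lceil \frac{n}{2} \right\rceil - 1
```

e.g. n = 6 gives quorum 4 and tolerance 2, the same tolerance as n = 5, which is why even node counts add no extra fault tolerance.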
Gossip Protocol
- Node health, IO, CPU, and memory stats shared between nodes; used to detect outages
KVDB
- Cluster config—single source of truth
- e.g. etcd
Storage Pools
- Disks pooled by:
- Size
- Type e.g. magnetic, SSD etc.
- IO priority: performance class, determined by Portworx benchmarking the disks
- High/medium/low
- Pool aggregated across all nodes in cluster
- RAID 0 by default, i.e. striping with no mirroring (no redundancy within the pool)
- RAID 10 (stripe of mirrors) available if the pool has >= 4 drives
- Min disk size—8 GB
- Max 32 pools per node
Pool Rebalancing
- Moves volumes from over-utilized pools to under-utilized ones
- Attempts to bring all pools as close to the mean utilization as possible
- Pools are considered unbalanced if they are more than 20% above or below the cluster's mean utilization
Volumes
- Create a new volume from a storage pool (see the sketch below)
- If no pool with the desired IO priority is available, a pool with a different IO priority is chosen and the volume specification is downgraded/upgraded accordingly
- Attach volume to a node: the volume attaches to the node the command is run on, and the command can be run on any node in the cluster
- Unmount and detach to move a volume to another node
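A minimal sketch of this lifecycle with pxctl; the volume name, size, and exact flag spellings are illustrative and may differ slightly by pxctl version:

```sh
# Create a 10 GiB volume, requesting high IO priority
# (Portworx falls back to another priority class if no matching pool exists)
pxctl volume create --size 10 --io_priority high myvol

# Attach the volume to the node this command is run on
pxctl host attach myvol

# Detach before moving the volume to another node (unmount first if mounted)
pxctl host detach myvol
```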
Shared Volumes
- Default—block volumes
- Can’t be shared between nodes
- To create a shared volume, pass the `--sharedv4 yes` flag to the `pxctl volume create` command
- The volume can then be mounted on multiple nodes
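A minimal example using the flag above; the volume name and size are placeholders:

```sh
# Create a sharedv4 volume that can be mounted on multiple nodes simultaneously
pxctl volume create --size 20 --sharedv4 yes sharedvol
```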
Aggregation Sets
- Default—volume associated to single pool on single node
- Aggregation sets—volume spans multiple nodes
- Striped across disks, pools, and nodes
- To create, pass the `--aggregation_level <1|2|3|auto>` flag to the `pxctl volume create` command
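For example, to stripe a volume across three nodes (volume name and size are placeholders):

```sh
# Aggregation level 3: the volume is striped across pools on 3 nodes
pxctl volume create --size 30 --aggregation_level 3 aggrvol
```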
Replication Sets
- Replicate volumes
- Max replicas = 3
- Pools default to RAID 0, so a disk failure is not tolerated without replication
- Replication sets provide that fault tolerance
- To create, pass the `--repl <1|2|3>` flag to the `pxctl volume create` command
- Replicas automatically distributed across:
- Pools, nodes, racks, rows, DCs, zones, regions
- Synchronous replication—replicated before acknowledgement sent to client
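For example (name and size are placeholders; `pxctl volume ha-update` is assumed here for changing the replication factor of an existing volume):

```sh
# Create a volume with 3 replicas, placed on different pools/nodes
pxctl volume create --size 10 --repl 3 replvol

# Change the replication factor of an existing volume
pxctl volume ha-update --repl 2 replvol
```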
Installation
Standalone
- Prerequisite—KVDB, e.g. etcd already installed
- Install via Docker
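A rough sketch of the Docker-based install, assuming an already-running etcd and a raw disk at /dev/sdb; the image tag, host mounts, KVDB endpoint, and device are placeholders to adapt:

```sh
# Run the Portworx container on each node, giving it a cluster name (-c),
# the external KVDB endpoint (-k), and the raw disk(s) to pool (-s)
docker run --restart=always --name px -d --net=host --privileged=true \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/osd:/var/lib/osd:shared \
  -v /etc/pwx:/etc/pwx \
  portworx/px-enterprise:latest \
  -c my-cluster -k etcd://kvdb-host:2379 -s /dev/sdb
```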
Kubernetes
- If the cluster has fewer than 25 nodes, Portworx can manage the KVDB internally
- Avoids the overhead of running a separate etcd
- Options: DaemonSet or Operator
- Generate installation spec at PX-Central
- Once deployed—scans disks attached to cluster
- Aggregates to storage pools
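A hedged sketch of the Operator-based flow; the filenames below are placeholders for the Operator manifest and the StorageCluster spec generated at PX-Central:

```sh
# Install the Portworx Operator, then apply the generated StorageCluster spec
kubectl apply -f px-operator.yaml
kubectl apply -f px-storagecluster.yaml

# Verify the Portworx pods and the StorageCluster status
kubectl get pods -A -l name=portworx
kubectl get storagecluster -A
```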
Dynamic Provisioning
- Example: HA, 2 replicas (e.g. for Cassandra DB)
- Create a Kubernetes StorageClass object and reference it from a StatefulSet's volumeClaimTemplates or a PersistentVolumeClaim spec, e.g.:
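A minimal sketch for the Cassandra example, assuming the in-tree `kubernetes.io/portworx-volume` provisioner (the CSI driver can be used instead); names, sizes, and the image are placeholders:

```yaml
# StorageClass: 2-way replicated volumes with high IO priority
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: px-repl2
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "2"
  io_priority: "high"
---
# StatefulSet referencing the StorageClass via volumeClaimTemplates,
# so each Cassandra replica gets its own dynamically provisioned volume
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:3.11
        volumeMounts:
        - name: data
          mountPath: /var/lib/cassandra
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: px-repl2
      resources:
        requests:
          storage: 10Gi
```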
PXCTL
- CLI
- Manages:
- Cluster
- Volumes
- Snapshots
- Migration
- Alerts
- Secrets
- Storage policy
- Licences
- Object stores
- Roles
- Credentials
- Authentication
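A few representative commands (assuming the `status`, `volume list`, and `cluster list` subcommands; run on any node, output omitted):

```sh
pxctl status          # cluster health, capacity, and member nodes
pxctl volume list     # volumes with size, HA level, IO priority, and state
pxctl cluster list    # nodes in the cluster and their status
```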
PX-Central
- Multi-cluster management console
- Single pane of glass
- SaaS or on-prem
PX-Autopilot
- Rules-based engine—responds to changes from monitoring source (Prometheus)
- Example rules:
- Resize volumes based on used capacity
- Resize pools based on used capacity
- Rebalance storage if provisioned or used space falls outside a predefined range (e.g. more than 20% above or below mean usage)
- When resizing—done in a rolling upgrade fashion
- Configurable back-off period—don’t trigger rule again within certain time period
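A hedged sketch of a volume-resize rule; the label selector, metric expression, and thresholds are illustrative, and field names are based on the `autopilot.libopenstorage.org/v1alpha1` AutopilotRule CRD:

```yaml
apiVersion: autopilot.libopenstorage.org/v1alpha1
kind: AutopilotRule
metadata:
  name: volume-resize
spec:
  # Only act on PVCs carrying this label
  selector:
    matchLabels:
      app: postgres
  # Fire when a volume is more than 80% full (Prometheus metrics)
  conditions:
    expressions:
    - key: "100 * (px_volume_usage_bytes / px_volume_capacity_bytes)"
      operator: Gt
      values:
      - "80"
  # Grow the volume by 50% of its current size
  actions:
  - name: openstorage.io.action.volume/resize
    params:
      scalepercentage: "50"
```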
PX-Backup
- Backs up Kubernetes apps and data across 1 or more clusters
- Prerequisites: STORK 2.5+, Helm
- Generate config at PX-Central
- Can store backups in object storage e.g. S3
- Backup rules—select Pods via labels
- Actions—trigger command in Pod at time of backup
- Schedule—periodic, daily, weekly, monthly
- Create a backup:
- What? Where? — Rules
- When? How? — Schedules