Building a PostgreSQL Operator on GKE: Database-as-a-Service for a Scrum Team


Our team's Scrum process was, for a while, bottlenecked by database provisioning. Every Sprint we needed several independent database instances to develop and test new features. The original workflow relied on running gcloud sql instances create by hand, or on maintaining an increasingly bloated Terraform module. It was slow and error-prone: one wrong parameter could mean hours of troubleshooting. With three or four feature branches in flight during a Sprint, the DBA and the DevOps engineer became the bottleneck for the entire team. That kind of friction defeats the agile intent of Scrum.

What we needed was a Database-as-a-Service capability: a developer should be able to get an isolated, correctly configured database cluster within a few minutes from a single command or a YAML file, and tear it down easily once the feature is merged. Off-the-shelf Helm charts only cover the initial deployment; for Day-2 operations such as scaling, configuration changes, or even a simple password rotation, they fall short. What we really needed was an automated controller that understands the database lifecycle, which is exactly what the Kubernetes Operator pattern provides. We decided to build a lean but practical PostgreSQL Operator for the team on Google Kubernetes Engine (GKE).

Pain Points and the Initial Design

The pain points were clear:

  1. Slow provisioning: creating a database instance manually or through IaC tooling typically took 5 to 15 minutes, which is unacceptable for development environments that need to iterate quickly.
  2. Inconsistent configuration: manual steps meant database settings (such as max_connections and shared_buffers) could drift subtly between environments, planting the seeds of future production incidents.
  3. Poor resource hygiene: databases in dev and test environments were routinely forgotten, wasting cloud spend, and teardown was just as manual as creation.

Our idea was a Kubernetes Custom Resource Definition (CRD) called PostgresCluster. A developer would only declare the desired state of the database, for example:

# config/samples/db_v1alpha1_postgrescluster.yaml
apiVersion: db.my.domain/v1alpha1
kind: PostgresCluster
metadata:
  name: feature-branch-xyz
spec:
  # PostgreSQL major version
  version: "14"
  # Number of instances in the cluster (1 primary, N-1 standbys)
  instances: 3
  # Storage configuration
  storage:
    size: "10Gi"
    storageClassName: "premium-rwo" # GCP Persistent Disk storage class
  # Compute resources for each instance
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "1"
      memory: "2Gi"

The Operator continuously watches these PostgresCluster resources and automatically creates, configures, or deletes the underlying Kubernetes objects in GKE (StatefulSet, Service, Secret, PersistentVolumeClaim) so that the actual state converges on the desired state. That process is the heart of every Operator: the reconciliation loop.

Core Implementation: From the CRD Definition to the Reconciliation Loop

We used the Kubebuilder framework to scaffold the Operator quickly. From commands such as kubebuilder init and kubebuilder create api, it generates the CRD YAML definitions, the Go API types, and the basic controller skeleton.

1. API Definition (CRD)

First we define the PostgresCluster data structure. This is the contract between the Operator and its users.

api/v1alpha1/postgrescluster_types.go:

package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PostgresClusterSpec defines the desired state of PostgresCluster
type PostgresClusterSpec struct {
	// PostgreSQL major version, e.g., "14"
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:Pattern=`^\d+$`
	Version string `json:"version"`

	// Number of instances in the cluster. Must be 1 or greater.
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:Minimum=1
	Instances int32 `json:"instances"`

	// Storage configuration for each instance
	// +kubebuilder:validation:Required
	Storage PostgresStorageSpec `json:"storage"`

	// Compute resource requirements for each instance
	// +kubebuilder:validation:Optional
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
}

// PostgresStorageSpec defines the storage configuration
type PostgresStorageSpec struct {
	// PVC size, e.g., "10Gi"
	// +kubebuilder:validation:Required
	Size string `json:"size"`

	// Storage class to use for the PVCs
	// +kubebuilder:validation:Required
	StorageClassName string `json:"storageClassName"`
}

// PostgresClusterStatus defines the observed state of PostgresCluster
type PostgresClusterStatus struct {
	// Phase indicates the current state of the cluster.
	// E.g., Creating, Ready, Failed.
	Phase string `json:"phase,omitempty"`

	// ReadyInstances is the number of database instances that are ready.
	ReadyInstances int32 `json:"readyInstances,omitempty"`

	// Conditions represent the latest available observations of an object's state
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:printcolumn:name="Version",type="string",JSONPath=".spec.version"
//+kubebuilder:printcolumn:name="Instances",type="integer",JSONPath=".spec.instances"
//+kubebuilder:printcolumn:name="Ready",type="integer",JSONPath=".status.readyInstances"
//+kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"
//+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"

// PostgresCluster is the Schema for the postgresclusters API
type PostgresCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PostgresClusterSpec   `json:"spec,omitempty"`
	Status PostgresClusterStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// PostgresClusterList contains a list of PostgresCluster
type PostgresClusterList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []PostgresCluster `json:"items"`
}

func init() {
	SchemeBuilder.Register(&PostgresCluster{}, &PostgresClusterList{})
}

The +kubebuilder markers are essential: they drive the generation of the CRD YAML, defining the validation rules, the status subresource, and the extra printer columns shown by kubectl get.

2. Controller Core Logic (the Reconciliation Loop)

This is the brain of the Operator. The Reconcile method is called whenever a PostgresCluster resource changes (create, update, or delete). Its job is to talk to the Kubernetes API server and drive the real world (the resources running in GKE) toward the state the user declared in the PostgresCluster spec.

Below is a simplified but complete skeleton of the Reconcile method that shows the core reconciliation logic.

controllers/postgrescluster_controller.go:

package controllers

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
	"sigs.k8s.io/controller-runtime/pkg/log"

	dbv1alpha1 "github.com/your-repo/postgres-operator/api/v1alpha1"
)

// ... (Reconciler struct definition)

const finalizerName = "db.my.domain/finalizer"

func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// 1. Fetch the PostgresCluster instance
	instance := &dbv1alpha1.PostgresCluster{}
	err := r.Get(ctx, req.NamespacedName, instance)
	if err != nil {
		if errors.IsNotFound(err) {
			logger.Info("PostgresCluster resource not found. Ignoring since object must be deleted.")
			return ctrl.Result{}, nil
		}
		logger.Error(err, "Failed to get PostgresCluster")
		return ctrl.Result{}, err
	}
    
	// --- Finalizer Logic for graceful deletion ---
	if instance.ObjectMeta.DeletionTimestamp.IsZero() {
		// The object is not being deleted, so if it does not have our finalizer,
		// let's add it and update the object.
		if !controllerutil.ContainsFinalizer(instance, finalizerName) {
			controllerutil.AddFinalizer(instance, finalizerName)
			if err := r.Update(ctx, instance); err != nil {
				return ctrl.Result{}, err
			}
		}
	} else {
		// The object is being deleted
		if controllerutil.ContainsFinalizer(instance, finalizerName) {
			// Our finalizer is present, so let's handle any external dependency
			logger.Info("Performing finalizer operations for PostgresCluster")

			// In a real operator, you might perform actions like taking a final backup.
			// Here, we just log it.

			// Remove our finalizer from the list and update it.
			controllerutil.RemoveFinalizer(instance, finalizerName)
			if err := r.Update(ctx, instance); err != nil {
				return ctrl.Result{}, err
			}
		}
		// Stop reconciliation as the item is being deleted
		return ctrl.Result{}, nil
	}


	// 2. Reconcile the admin password Secret
	// In a real project, password should be generated and stored securely.
	// For simplicity, we create a secret if it does not exist.
	secret := &corev1.Secret{}
	err = r.Get(ctx, types.NamespacedName{Name: instance.Name + "-pg-secret", Namespace: instance.Namespace}, secret)
	if err != nil && errors.IsNotFound(err) {
		logger.Info("Creating a new Secret for admin credentials")
		newSecret := r.secretForPostgres(instance)
		if err := r.Create(ctx, newSecret); err != nil {
			logger.Error(err, "Failed to create new Secret")
			return ctrl.Result{}, err
		}
		// Requeue to ensure the secret is available for next steps
		return ctrl.Result{Requeue: true}, nil
	} else if err != nil {
		logger.Error(err, "Failed to get Secret")
		return ctrl.Result{}, err
	}


	// 3. Reconcile the Headless Service for StatefulSet discovery
	headlessSvc := &corev1.Service{}
	err = r.Get(ctx, types.NamespacedName{Name: instance.Name + "-headless", Namespace: instance.Namespace}, headlessSvc)
	if err != nil && errors.IsNotFound(err) {
		logger.Info("Creating a new Headless Service")
		svc := r.headlessServiceForPostgres(instance)
		if err := r.Create(ctx, svc); err != nil {
			logger.Error(err, "Failed to create Headless Service")
			return ctrl.Result{}, err
		}
		return ctrl.Result{Requeue: true}, nil
	} else if err != nil {
		logger.Error(err, "Failed to get Headless Service")
		return ctrl.Result{}, err
	}


	// 4. Reconcile the StatefulSet
	foundSts := &appsv1.StatefulSet{}
	err = r.Get(ctx, types.NamespacedName{Name: instance.Name, Namespace: instance.Namespace}, foundSts)

	if err != nil && errors.IsNotFound(err) {
		// Define a new statefulset
		sts := r.statefulSetForPostgres(instance)
		logger.Info("Creating a new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
		err = r.Create(ctx, sts)
		if err != nil {
			logger.Error(err, "Failed to create new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
			return ctrl.Result{}, err
		}
		// StatefulSet created successfully - return and requeue
		instance.Status.Phase = "Creating"
		if err := r.Status().Update(ctx, instance); err != nil {
			logger.Error(err, "Failed to update PostgresCluster status")
		}
		return ctrl.Result{RequeueAfter: time.Second * 5}, nil
	} else if err != nil {
		logger.Error(err, "Failed to get StatefulSet")
		return ctrl.Result{}, err
	}

	// 5. Ensure the StatefulSet size is the same as the spec
	size := instance.Spec.Instances
	if *foundSts.Spec.Replicas != size {
		logger.Info("Updating StatefulSet replica count", "current", *foundSts.Spec.Replicas, "desired", size)
		foundSts.Spec.Replicas = &size
		err = r.Update(ctx, foundSts)
		if err != nil {
			logger.Error(err, "Failed to update StatefulSet", "StatefulSet.Namespace", foundSts.Namespace, "StatefulSet.Name", foundSts.Name)
			return ctrl.Result{}, err
		}
		// Spec updated - return and requeue
		return ctrl.Result{Requeue: true}, nil
	}

	// 6. Update the PostgresCluster status
	// In a production operator, we would check pod statuses, primary election, etc.
	// For this example, we'll just check the ready replicas of the StatefulSet.
	readyReplicas := foundSts.Status.ReadyReplicas
	if instance.Status.ReadyInstances != readyReplicas {
		instance.Status.ReadyInstances = readyReplicas
	}

	if readyReplicas == instance.Spec.Instances {
		instance.Status.Phase = "Ready"
	} else {
		instance.Status.Phase = "Reconciling"
	}

	err = r.Status().Update(ctx, instance)
	if err != nil {
		logger.Error(err, "Failed to update PostgresCluster status")
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}
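For Reconcile to be triggered not only by changes to PostgresCluster objects but also by changes to the StatefulSet, Service, and Secret it owns, the controller has to be registered with the manager. Below is a minimal sketch, assuming the standard Kubebuilder layout, of what SetupWithManager looks like for this controller; the Owns() calls rely on the owner references set via ctrl.SetControllerReference in the helper functions shown in the next section.

// SetupWithManager wires the controller into the manager and declares which
// resources it watches: the PostgresCluster CRD itself plus the secondary
// resources it owns, so changes to those also enqueue a reconcile request.
func (r *PostgresClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&dbv1alpha1.PostgresCluster{}).
		Owns(&appsv1.StatefulSet{}).
		Owns(&corev1.Service{}).
		Owns(&corev1.Secret{}).
		Complete(r)
}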

3. Helper Functions: Building the Kubernetes Objects

The Reconcile method relies on a set of helper functions that build the required Kubernetes objects on the fly. The key one is statefulSetForPostgres.

controllers/postgrescluster_helpers.go:

package controllers

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	dbv1alpha1 "github.com/your-repo/postgres-operator/api/v1alpha1"
)

// statefulSetForPostgres returns a postgres StatefulSet object
func (r *PostgresClusterReconciler) statefulSetForPostgres(p *dbv1alpha1.PostgresCluster) *appsv1.StatefulSet {
	ls := labelsForPostgres(p.Name)
	replicas := p.Spec.Instances
	storageSize := resource.MustParse(p.Spec.Storage.Size)
	
	// A common pitfall is not setting the controller reference.
	// Without it, deleting the PostgresCluster object will not delete the StatefulSet,
	// because Kubernetes garbage collection relies on this owner reference.
	dep := &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      p.Name,
			Namespace: p.Namespace,
		},
		Spec: appsv1.StatefulSetSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{
				MatchLabels: ls,
			},
			ServiceName: p.Name + "-headless",
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: ls,
				},
				Spec: corev1.PodSpec{
					// Use a proper termination grace period to allow Postgres to shut down cleanly.
					TerminationGracePeriodSeconds: func(i int64) *int64 { return &i }(30),
					Containers: []corev1.Container{{
						Image: fmt.Sprintf("postgres:%s", p.Spec.Version),
						Name:  "postgres",
						Env: []corev1.EnvVar{
							{
								Name: "POSTGRES_USER",
								Value: "admin",
							},
							{
								Name: "POSTGRES_PASSWORD",
								ValueFrom: &corev1.EnvVarSource{
									SecretKeyRef: &corev1.SecretKeySelector{
										LocalObjectReference: corev1.LocalObjectReference{Name: p.Name + "-pg-secret"},
										Key:                  "password",
									},
								},
							},
							{
								Name: "PGDATA",
								Value: "/var/lib/postgresql/data/pgdata",
							},
						},
						Ports: []corev1.ContainerPort{{
							ContainerPort: 5432,
							Name:          "postgres",
						}},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "pgdata",
							MountPath: "/var/lib/postgresql/data",
						}},
						Resources: p.Spec.Resources,
						// A robust liveness probe is critical for stateful applications.
						// It should verify the database is actually responsive.
						LivenessProbe: &corev1.Probe{
							ProbeHandler: corev1.ProbeHandler{
								Exec: &corev1.ExecAction{
									Command: []string{"pg_isready", "-U", "admin"},
								},
							},
							InitialDelaySeconds: 30,
							TimeoutSeconds:      5,
							PeriodSeconds:       10,
							FailureThreshold:    6,
						},
						// Readiness probe ensures the pod only receives traffic when it's ready.
						ReadinessProbe: &corev1.Probe{
							ProbeHandler: corev1.ProbeHandler{
								Exec: &corev1.ExecAction{
									Command: []string{"pg_isready", "-U", "admin"},
								},
							},
							InitialDelaySeconds: 5,
							TimeoutSeconds:      3,
							PeriodSeconds:       5,
						},
					}},
				},
			},
			// VolumeClaimTemplates are the key to persistent storage in StatefulSets.
			// Each pod gets its own persistent volume.
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{
					Name: "pgdata",
				},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
					StorageClassName: &p.Spec.Storage.StorageClassName,
					Resources: corev1.ResourceRequirements{
						Requests: corev1.ResourceList{
							corev1.ResourceStorage: storageSize,
						},
					},
				},
			}},
		},
	}
	// Set PostgresCluster instance as the owner and controller
	ctrl.SetControllerReference(p, dep, r.Scheme)
	return dep
}
// ... (other helper functions for Service, Secret, labels)

The most important parts of this code are:

  1. ctrl.SetControllerReference: this call establishes the owner relationship between the PostgresCluster and the StatefulSet. When the PostgresCluster is deleted, Kubernetes garbage collection automatically removes the StatefulSet it owns, which is the key to avoiding leaked resources.
  2. VolumeClaimTemplates: this is how a StatefulSet manages stateful workloads. Each Pod replica gets its own PersistentVolumeClaim, guaranteeing durable, per-instance storage.
  3. Probes: carefully designed LivenessProbe and ReadinessProbe settings are critical for a database. A failing liveness probe restarts the container, while a failing readiness probe only removes the Pod from the Service endpoints so it stops receiving traffic, without killing it.
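The secretForPostgres, headlessServiceForPostgres, and labelsForPostgres helpers called by the Reconcile loop were elided above. The following is a minimal sketch of how they could look; the label keys and the hard-coded placeholder password are illustrative assumptions, and a real implementation must generate a cryptographically random password instead.

// labelsForPostgres returns the labels used to select all resources
// belonging to a given PostgresCluster.
func labelsForPostgres(name string) map[string]string {
	return map[string]string{
		"app.kubernetes.io/name":     "postgres",
		"app.kubernetes.io/instance": name,
	}
}

// secretForPostgres builds the Secret holding the admin password.
// The password below is a placeholder for illustration only.
func (r *PostgresClusterReconciler) secretForPostgres(p *dbv1alpha1.PostgresCluster) *corev1.Secret {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      p.Name + "-pg-secret",
			Namespace: p.Namespace,
			Labels:    labelsForPostgres(p.Name),
		},
		Type: corev1.SecretTypeOpaque,
		StringData: map[string]string{
			"password": "change-me", // placeholder: generate a random value in practice
		},
	}
	ctrl.SetControllerReference(p, secret, r.Scheme)
	return secret
}

// headlessServiceForPostgres builds the headless Service that gives each
// StatefulSet pod a stable DNS name.
func (r *PostgresClusterReconciler) headlessServiceForPostgres(p *dbv1alpha1.PostgresCluster) *corev1.Service {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      p.Name + "-headless",
			Namespace: p.Namespace,
			Labels:    labelsForPostgres(p.Name),
		},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone,
			Selector:  labelsForPostgres(p.Name),
			Ports: []corev1.ServicePort{{
				Name: "postgres",
				Port: 5432,
			}},
		},
	}
	ctrl.SetControllerReference(p, svc, r.Scheme)
	return svc
}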

Visualizing the Reconciliation Flow

The whole reconciliation flow is easy to follow in the Mermaid sequence diagram below:

sequenceDiagram
    participant Dev as Developer
    participant K8s as Kubernetes API Server
    participant Op as Postgres Operator
    participant STS as StatefulSet
    participant Pod as Postgres Pods
    
    Dev->>+K8s: kubectl apply -f postgres-cluster.yaml
    K8s-->>+Op: Notify change for PostgresCluster 'feature-branch-xyz'
    
    Op->>K8s: Get PostgresCluster 'feature-branch-xyz'
    K8s-->>Op: Return object details
    
    Op->>K8s: Does Secret 'feature-branch-xyz-pg-secret' exist?
    K8s-->>Op: No, Not Found
    Op->>+K8s: Create Secret
    K8s-->>-Op: Secret created
    
    Op->>K8s: Does Service 'feature-branch-xyz-headless' exist?
    K8s-->>Op: No, Not Found
    Op->>+K8s: Create Headless Service
    K8s-->>-Op: Service created
    
    Op->>K8s: Does StatefulSet 'feature-branch-xyz' exist?
    K8s-->>Op: No, Not Found
    Op->>+K8s: Create StatefulSet (replicas=3)
    Note over K8s, STS: K8s Controller Manager sees new StatefulSet
    K8s-->>-Op: StatefulSet created
    
    STS->>+Pod: Create Pod-0, Pod-1, Pod-2
    Pod-->>-STS: Pods become Ready
    
    Op->>K8s: Reconcile again (requeued)
    Op->>K8s: Get StatefulSet 'feature-branch-xyz'
    K8s-->>Op: Return StatefulSet with Status (readyReplicas=3)
    
    Op->>K8s: Update PostgresCluster 'feature-branch-xyz' Status (Phase=Ready, ReadyInstances=3)
    K8s-->>Op: Status updated

Results and Remaining Work

With the Operator deployed, developers now manage their database environments entirely on a self-service basis. Creating a database cluster for a feature branch, from submitting the YAML file to having a usable database, takes less than three minutes. The team iterates noticeably faster, and the DevOps engineers have been freed from repetitive chores to focus on higher-value platform engineering.

Of course, the Operator in its current form is only a starting point and has clear limitations for production use:

  1. High availability and failover: the current version does not provide true high availability. A StatefulSet will recreate a Pod elsewhere when a node fails, but that does not handle PostgreSQL primary/standby failover. A production-grade Operator needs to integrate Patroni or a similar tool to manage leader election and failover.
  2. Backup and restore: database lifecycle management is incomplete without backup and restore. The next iteration will add a backup spec to the PostgresCluster CRD and introduce a separate PostgresBackup CRD, with the Operator orchestrating pg_dump or pg_basebackup runs and uploading the results to GCP Cloud Storage (a rough sketch of such a CRD follows this list).
  3. Version upgrades: a seamless major-version upgrade is a complex process that requires precise control over the Pod rollout order, running pg_upgrade, and handling potential on-disk format incompatibilities. This calls for considerably more sophisticated reconciliation logic.
  4. Configuration management: configuration is currently baked into the container image or passed through environment variables. A more flexible design would let users manage postgresql.conf via a ConfigMap and have the Operator trigger a rolling restart to apply changes.
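To make the backup idea slightly more concrete, here is a purely hypothetical sketch of what a PostgresBackupSpec could look like; every field name and the Cloud Storage destination format are assumptions, not an API we have implemented.

// PostgresBackupSpec is a hypothetical sketch of the planned backup CRD.
// Field names and semantics are illustrative assumptions only.
type PostgresBackupSpec struct {
	// ClusterRef is the name of the PostgresCluster to back up.
	ClusterRef string `json:"clusterRef"`

	// Method selects the backup tool, e.g. "pg_dump" or "pg_basebackup".
	Method string `json:"method"`

	// Destination is the Cloud Storage bucket/prefix to upload to,
	// e.g. "gs://team-db-backups/feature-branch-xyz".
	Destination string `json:"destination"`

	// Schedule is an optional cron expression for recurring backups.
	Schedule string `json:"schedule,omitempty"`
}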

Despite this backlog, the project has already shown how powerful the Operator pattern is for automating complex stateful applications. It captures operational knowledge as code and gives our Scrum team a cloud-native database platform that truly matches the spirit of agile.

