Monitoring

Available Metrics

The PeSIT Wizard server exposes Prometheus metrics at the `/actuator/prometheus` endpoint.

Key Metrics

Metric	Description
`pesitwizard_connections_active`	Active PeSIT connections
`pesitwizard_connections_total`	Total connections
`pesitwizard_transfers_total`	Number of transfers
`pesitwizard_transfers_bytes_total`	Transferred volume (bytes)
`pesitwizard_transfers_duration_seconds`	Transfer duration in seconds (exposed as `_sum` and `_count` series)
`pesitwizard_errors_total`	Number of errors
`pesitwizard_cluster_members`	Cluster members
`pesitwizard_cluster_is_leader`	1 if leader, 0 otherwise
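
To inspect these metrics without a Prometheus server, the text returned by the endpoint can be parsed directly. The following is a minimal sketch, not part of PeSIT Wizard: the function name `parse_prometheus_text` and the sample payload are hypothetical, and real output will carry different values and possibly labels.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Map each series (metric name plus optional labels) to its sample value."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labelled series: everything up to the closing brace is the key.
            head, _, rest = line.rpartition("}")
            series = head + "}"
        else:
            series, _, rest = line.partition(" ")
        # The value comes first; an optional timestamp may follow it.
        samples[series] = float(rest.split()[0])
    return samples


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# HELP pesitwizard_connections_active Active PeSIT connections
# TYPE pesitwizard_connections_active gauge
pesitwizard_connections_active 4.0
pesitwizard_cluster_members 3.0
pesitwizard_cluster_is_leader 1.0
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["pesitwizard_connections_active"])
```

In practice the text would come from an HTTP GET on `/actuator/prometheus`; the host and port depend on your deployment.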

Prometheus Integration

Prometheus Configuration

yaml

# prometheus.yml
scrape_configs:
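  - job_name: 'pesitwizard-server'
    # Prometheus scrapes /metrics by default; point it at the actuator endpoint
    metrics_path: /actuator/prometheus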
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['pesitwizard']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: pesitwizard-server
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "8080"
        action: keep

Useful Queries

promql

# Transfer rate per minute
rate(pesitwizard_transfers_total[5m]) * 60

# Transferred volume per hour
increase(pesitwizard_transfers_bytes_total[1h])

# Error ratio (errors per transfer); summing both sides avoids label mismatches
sum(rate(pesitwizard_errors_total[5m])) / sum(rate(pesitwizard_transfers_total[5m]))

# Average transfer duration
rate(pesitwizard_transfers_duration_seconds_sum[5m]) / rate(pesitwizard_transfers_duration_seconds_count[5m])
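
The average-duration query divides the growth of the `_sum` series by the growth of the `_count` series over the same window, so the window length cancels out. The arithmetic can be sketched as follows; the `Sample` class and the numbers are hypothetical, and the sketch ignores the counter-reset handling and extrapolation that Prometheus applies in `rate()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One reading of the duration series at a point in time (hypothetical type)."""
    duration_sum: float  # pesitwizard_transfers_duration_seconds_sum
    count: float         # pesitwizard_transfers_duration_seconds_count


def average_duration(earlier: Sample, later: Sample) -> Optional[float]:
    """Average seconds per transfer between two readings, or None if no transfers."""
    transfers = later.count - earlier.count
    if transfers <= 0:
        return None  # nothing completed in the window (or the counter was reset)
    return (later.duration_sum - earlier.duration_sum) / transfers


# Illustrative readings taken five minutes apart.
start = Sample(duration_sum=120.0, count=40)
end = Sample(duration_sum=150.0, count=50)
print(average_duration(start, end))  # 30 seconds spread over 10 transfers
```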

Grafana Dashboards

Main Dashboard

Import the dashboard from: /grafana/pesitwizard-dashboard.json

Included panels:

Transfers per minute
Transferred volume
Active connections
Error rate
Cluster status
Top partners

Recommended Alerts

yaml

# alerting-rules.yml
groups:
  - name: pesitwizard
    rules:
      - alert: PesitHighErrorRate
        # Fires above 0.1 errors per second (6 per minute), averaged over 5 minutes
        expr: rate(pesitwizard_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High PeSIT Wizard error rate"

      - alert: PesitNoLeader
        expr: sum(pesitwizard_cluster_is_leader) == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No PeSIT Wizard leader"

      - alert: PesitClusterDegraded
        # A threshold of 3 presumes at least three expected members; adjust to your cluster size
        expr: pesitwizard_cluster_members < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PeSIT Wizard cluster degraded"

Logs

Log Format

2025-01-10 10:30:00.123 INFO  [pesitwizard-server] [session-123] CONNECT partner=CLIENT_XYZ ip=192.168.1.100
2025-01-10 10:30:01.456 INFO  [pesitwizard-server] [session-123] CREATE file=TRANSFER.XML virtualFile=TRANSFERS
2025-01-10 10:30:05.789 INFO  [pesitwizard-server] [session-123] TRANSFER_COMPLETE bytes=15234 duration=4333ms
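
Each line carries a timestamp, a level, the application name, a session identifier, an event name, and a list of `key=value` pairs. A minimal parsing sketch based only on the three sample lines above follows; the names `LINE_RE`, `FIELD_RE`, and `parse_line` are hypothetical, and the pattern assumes values contain no spaces.

```python
import re
from datetime import datetime
from typing import Optional

# Pattern inferred from the sample lines; adjust it if your layout differs.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<app>[^\]]+)\]\s+"
    r"\[(?P<session>[^\]]+)\]\s+"
    r"(?P<event>[A-Z_]+)"
    r"(?P<fields>.*)$"
)
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_line(line: str) -> Optional[dict]:
    """Split one log line into its parts, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%d %H:%M:%S.%f"
    )
    # Turn "bytes=15234 duration=4333ms" into {"bytes": "15234", ...}.
    record["fields"] = dict(FIELD_RE.findall(record["fields"]))
    return record


line = (
    "2025-01-10 10:30:05.789 INFO  [pesitwizard-server] [session-123] "
    "TRANSFER_COMPLETE bytes=15234 duration=4333ms"
)
record = parse_line(line)
print(record["event"], record["fields"])
```

Field values are kept as strings, so `duration` still carries its `ms` suffix and needs converting before any arithmetic.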

Centralization with ELK

yaml

# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/pesitwizard-server-*.log
    processors:
      - add_kubernetes_metadata: ~

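output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "pesitwizard-%{+yyyy.MM.dd}"

# Filebeat requires these two settings when the default index name is overridden
setup.template.name: "pesitwizard"
setup.template.pattern: "pesitwizard-*"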

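Useful Kibana Queries

These queries assume that `level`, `partner`, and `status` are indexed as separate fields, for example by an ingest pipeline that parses the log line.

# Errors (set the Kibana time picker to the last 24 hours)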
level:ERROR AND kubernetes.labels.app:pesitwizard-server

# Transfers for a partner
message:"TRANSFER_COMPLETE" AND partner:CLIENT_XYZ

# Failed connections
message:"CONNECT" AND status:FAILED

Health Checks

Endpoints

Endpoint	Description
`/actuator/health`	Overall health
`/actuator/health/readiness`	Ready to receive traffic
`/actuator/health/liveness`	Application alive

Kubernetes Probes

yaml

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
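
The same endpoints can be checked from a script, for example in a deployment smoke test. Spring Boot Actuator health endpoints return a JSON body with a top-level `status` field, and a non-2xx response when the status is not `UP`. The sketch below relies on that; the function names `is_up` and `check` are hypothetical, and the URL you pass in depends on your environment.

```python
import json
import urllib.error
import urllib.request


def is_up(body: str) -> bool:
    """Return True when a health response body reports status UP."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


def check(url: str, timeout: float = 5.0) -> bool:
    """Fetch a health endpoint; errors, timeouts, and non-2xx responses count as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_up(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False


print(is_up('{"status": "UP"}'))
```

A call such as `check("http://localhost:8080/actuator/health/readiness")` would probe a locally reachable instance; the host is an assumption, and the port matches the one used in the probes above.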

Alerting

Email

Configure email alerts in the application:

yaml

pesitwizard:
  alerting:
    email:
      enabled: true
      smtp-host: smtp.example.com
      from: pesitwizard@example.com
      to: ops@example.com
    triggers:
      - type: TRANSFER_FAILED
      - type: CONNECTION_FAILED
      - type: CLUSTER_DEGRADED

Webhook

yaml

pesitwizard:
  alerting:
    webhook:
      enabled: true
      url: https://hooks.slack.com/services/xxx
      events:
        - TRANSFER_FAILED
        - CLUSTER_LEADER_CHANGED
