2 posts tagged "SRE"

One Stack for Website Analytics, Uptime, and Server Health: All‑in‑One Observability with Tianji

· 2 min read

When you put product analytics, uptime monitoring, and server health on the same observability surface, you find issues faster, iterate more confidently, and make the right calls within privacy and compliance boundaries. Tianji combines Website Analytics + Uptime Monitor + Server Status into one platform, giving teams end‑to‑end insights with a lightweight setup.

Why an all‑in‑one observability layer

  • Fewer context switches: From traffic to availability without hopping across tools.
  • Unified semantics: One set of events and dimensions; metrics connect across layers.
  • Privacy‑first: Cookie‑less by default, with IP truncation, minimization, and aggregation.
  • Self‑hosting optional: Clear boundaries to meet compliance and data residency needs.

The signals you actually need

  • Product analytics: Pageviews, sessions, referrers/UTM, conversions and drop‑offs on critical paths.
  • Uptime monitoring: Reachability, latency, error rates; sliced by region and ISP.
  • Server health: CPU/memory/disk/network essentials with threshold‑based alerts.
  • Notification & collaboration: Route via Webhook/Slack/Telegram, with noise control.

How Tianji delivers it

Tianji ships three capabilities in one platform:

  1. Website analytics: Lightweight script, cookie‑less collection; default aggregation and retention policies.
  2. Uptime monitoring: Supports both passive and active checks, with built‑in status pages and regional views.
  3. Server status: Unified reporting and visualization; open APIs for audits and export.

Privacy by design is on by default: IP truncation, geo mapping, and minimal storage, with options for self‑hosting and region‑pinned deployments.
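
To make IP truncation concrete, here is a minimal Python sketch using the standard `ipaddress` module. The function name `truncate_ip` and the /24 and /48 prefix widths are assumptions chosen for illustration; the post does not state which widths Tianji applies internally.

```python
import ipaddress


def truncate_ip(address: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero the host bits so the stored value no longer points at one client.

    The prefix widths are illustrative assumptions, not Tianji's documented defaults.
    """
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False allows building a network from an address with host bits set.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(truncate_ip("203.0.113.77"))            # 203.0.113.0
print(truncate_ip("2001:db8:abcd:12:34::1"))  # 2001:db8:abcd::
```

Truncating before storage, rather than at query time, keeps the full address out of the database entirely, which is what data minimization asks for.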

3‑minute quickstart

wget https://raw.githubusercontent.com/msgbyte/tianji/master/docker-compose.yml
docker compose up -d

The default account is admin/admin. Change the password promptly and set up your first site and monitors.

Common rollout patterns

  • Small teams/indies: Single‑host self‑deployment with out‑of‑the‑box end‑to‑end signals.
  • Mid‑size SaaS: Consolidate funnels, SLAs, and server alerts into a single alerting layer to cut false positives.
  • Open‑source self‑host: Public status pages outside, fine‑grained metrics and audit‑friendly exports inside.

Best‑practice checklist

  • Define 3–5 critical funnels and track only decision‑relevant events.
  • Enable IP truncation and set retention (e.g., 30 days for raw events, 180 days for aggregates).
  • Use referrer/UTM cohorts for growth analysis; avoid individual identification.
  • Separate public status pages from internal alerts to reduce exposure.
  • Review monthly: weigh decision value against data cost, and trim aggressively.
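
The retention item in the checklist above can be sketched as a small expiry check. This is a hedged illustration only: the `RETENTION` table and the `is_expired` helper are hypothetical names, and the 30/180-day windows simply mirror the example figures in the checklist rather than any Tianji setting.

```python
from datetime import datetime, timedelta, timezone

# Illustrative windows taken from the checklist example; adjust to your own policy.
RETENTION = {
    "raw": timedelta(days=30),
    "aggregate": timedelta(days=180),
}


def is_expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """Return True when a record of this kind is older than its retention window."""
    return now - created_at > RETENTION[kind]


now = datetime(2025, 6, 30, tzinfo=timezone.utc)
created = now - timedelta(days=45)
print(is_expired("raw", created, now))        # True
print(is_expired("aggregate", created, now))  # False
```

A scheduled cleanup job could apply a check like this to each stored record and delete the ones that have aged out.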

Closing

Seeing product and reliability on the same canvas is a more efficient way to collaborate. With Tianji, teams get lower‑noise, actionable signals, with privacy and compliance built in from the start.

Runbook Automation: Connect Detection → Diagnosis → Repair into a Closed Loop (Powered by a Unified Incident Timeline)

· 4 min read

“The alert fired—now what?” For many teams, the pain is not “Do we have monitoring?” but “How many people, tools, and context switches does it take to get from detection to repair?” This article uses a unified incident timeline as the backbone to connect detection → diagnosis → repair into an automated closed loop, so on-call SREs can focus on judgment rather than tab juggling.

Why build a closed loop

Without a unified context, three common issues plague response workflows:

  • Fragmented signals: metrics, logs, traces, and synthetic flows are split across tools.
  • Slow handoffs: alerts lack diagnostic context, causing repeated pings and evidence gathering.
  • Inconsistent actions: fixes are ad hoc; best practices don’t accumulate as reusable runbooks.

Closed-loop automation makes the “signals → decisions → actions” chain stable, auditable, and replayable by using a unified timeline as the spine.

How a unified incident timeline carries the response

Key properties of the unified timeline:

  1. Correlation rules fold multi-source signals of the same root cause into one incident, avoiding alert storms.
  2. Each incident is auto-enriched with context (recent deploys, SLO burn, dependency health, hot metrics).
  3. Response actions (diagnostic scripts, rollback, scale-out, traffic shifting) are recorded on the same timeline for review and continuous improvement.
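
The three properties above can be sketched as a small in-memory data structure. Everything here is an assumption made for illustration: the `Signal`, `Incident`, and `Timeline` classes, the `ROOT_OF` dependency map, and the five-minute window are hypothetical and do not describe Tianji's data model or API.

```python
from dataclasses import dataclass, field

# Hypothetical dependency map: which upstream component a service's failures trace back to.
ROOT_OF = {"frontend": "db", "api": "db", "db": "db"}


@dataclass
class Signal:
    source: str       # e.g. "uptime", "metrics", "browser"
    service: str
    message: str
    timestamp: float  # seconds


@dataclass
class Incident:
    key: str
    opened_at: float
    last_seen: float
    timeline: list = field(default_factory=list)


class Timeline:
    def __init__(self, window_seconds=300.0):
        self.window = window_seconds
        self.open = {}       # correlation key -> most recent incident
        self.incidents = []  # every incident, in creation order

    def ingest(self, signal):
        """Fold a signal into an open incident with the same root, or open a new one."""
        key = ROOT_OF.get(signal.service, signal.service)
        incident = self.open.get(key)
        if incident is None or signal.timestamp - incident.last_seen > self.window:
            incident = Incident(key, signal.timestamp, signal.timestamp)
            self.open[key] = incident
            self.incidents.append(incident)
        incident.last_seen = signal.timestamp
        incident.timeline.append(
            ("signal", signal.timestamp, f"{signal.source}: {signal.message}")
        )
        return incident

    def record_action(self, incident, timestamp, description):
        """Store response actions next to the signals that triggered them."""
        incident.timeline.append(("action", timestamp, description))


t = Timeline()
inc = t.ingest(Signal("uptime", "db", "connection refused", 0))
t.ingest(Signal("metrics", "api", "5xx spike", 20))
t.ingest(Signal("browser", "frontend", "fetch errors", 45))
t.record_action(inc, 60, "ran slow-query diagnostic")
print(len(t.incidents), len(inc.timeline))  # 1 4
```

Keeping signals and actions in one ordered list is what makes the later review step straightforward: the evidence and the response sit side by side.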

Five levels of runbook automation

An evolution path from prompts to autonomy:

  1. Human-in-the-loop visualization: link charts and log slices on the timeline to cut context switching.
  2. Guided semi-automation: run diagnostic scripts on incident start (dependencies, thread dumps, slow queries).
  3. Conditional actions: execute low-risk fixes (rollback/scale/shift) behind guard conditions.
  4. Policy-driven orchestration: adapt by SLO burn, release windows, and dependency health.
  5. Guardrailed autonomy: self-heal within boundaries; escalate to humans beyond limits.

Automation is not “more scripts,” it’s “better triggers”

High-quality triggers stem from high-quality signal design:

  • Anchor on SLOs: prioritize strong triggers on budget burn and user-impacting paths.
  • Adaptive sampling: sample fully on failure paths, at a lower rate in steady state, and with temporary boosts after deploys.
  • Event folding: compress cascades (DB down → API 5xx → frontend errors) into a single incident so scripts don’t compete.
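
Two of the trigger ideas above, SLO budget burn and adaptive sampling, reduce to short functions. The sketch below uses the common definition of burn rate (observed error ratio divided by the error budget); the function names, the default rates, and the 15-minute post-deploy window are hypothetical values chosen for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target).

    A value of 1.0 means the budget is being spent exactly on schedule.
    """
    budget = 1.0 - slo_target
    return error_ratio / budget


def sample_rate(is_failure: bool, seconds_since_deploy: float,
                steady: float = 0.05, post_deploy: float = 0.5,
                boost_window: float = 900.0) -> float:
    """Pick a sampling rate: everything on failures, boosted right after a deploy."""
    if is_failure:
        return 1.0
    if seconds_since_deploy < boost_window:
        return post_deploy
    return steady


# 0.5% errors against a 99.9% SLO burns the budget roughly five times too fast.
print(round(burn_rate(0.005, 0.999), 2))
print(sample_rate(is_failure=False, seconds_since_deploy=120))
```

A trigger built on burn rate fires on user impact relative to the SLO rather than on raw error counts, which is the sense in which it is a "better trigger."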

A practical Detection → Repair pattern

  1. Detect: synthetic flows or external probes fail on user-visible paths.
  2. Correlate: fold related signals on one timeline; auto-escalate when SLO thresholds are at risk.
  3. Diagnose: run scripts in parallel for dependency health, recent deploys, slow queries, hot keys, and thread stacks.
  4. Repair: if guard conditions pass, execute rollback/scale/shift/restart on scoped units; otherwise require human approval.
  5. Review: actions, evidence, and outcomes live on the same timeline to improve the next response.
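
The Repair step above hinges on guard conditions. As a hedged sketch of what such a gate might look like, the code below decides between an automatic low-risk action and a request for human approval. The `Context` fields, the thresholds, and the action names are all hypothetical; real guards would come from your own SLOs and release policy.

```python
from dataclasses import dataclass


@dataclass
class Context:
    burn_rate: float         # SLO budget burn for the affected path
    recent_deploy: bool      # a deploy landed shortly before the incident
    in_release_freeze: bool  # automation should not change production state
    affected_units: int      # how many scoped units the fix would touch


def decide_repair(ctx: Context, max_units: int = 1, burn_threshold: float = 2.0) -> str:
    """Return the next step: observe, an automatic action, or a request for approval."""
    if ctx.burn_rate < burn_threshold:
        return "observe"
    guards_pass = (not ctx.in_release_freeze) and ctx.affected_units <= max_units
    if not guards_pass:
        return "request_approval"
    # Low-risk, scoped actions only: undo the recent change, or restart the unit.
    return "rollback" if ctx.recent_deploy else "restart"


print(decide_repair(Context(5.0, True, False, 1)))   # rollback
print(decide_repair(Context(5.0, True, False, 4)))   # request_approval
print(decide_repair(Context(0.5, False, False, 1)))  # observe
```

Returning a named decision instead of executing the action directly means the decision itself can be written to the incident timeline before anything changes in production.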

Implement quickly with Tianji

Implementation checklist (check as you go)

  • Map critical user journeys and SLOs; define guard conditions for safe automation
  • Ingest checks, metrics, deploys, dependencies, and product events into a single timeline
  • Build a library of diagnostic scripts and low-risk repair actions
  • Configure incident folding and escalation to avoid alert storms
  • Switch sampling and thresholds across release windows and traffic peaks/valleys
  • After each incident, push more steps into automation with guardrails

Closing thoughts

Runbook automation is not “shipping a giant orchestrator in one go.” It starts with a unified timeline and turns common response paths into workflows that are visible, executable, verifiable, and evolvable. With Tianji’s open building blocks, you can safely delegate repetitive work to automation and keep human focus on real decisions.