How to Set Up Uptime Monitoring for Your Website
Setting up uptime monitoring properly is the difference between getting woken up at 3 AM for a real outage and getting woken up at 3 AM because a CDN edge node hiccupped for 200 milliseconds. Good monitoring catches genuine failures fast, stays quiet during transient blips, and gives your team the context they need to respond. Bad monitoring drowns your team in noise until they start ignoring every alert -- which is worse than having no monitoring at all.
This guide walks through what to monitor, how to configure checks effectively, how to set up alerting that your team will actually trust, and which features separate a basic setup from a production-grade monitoring strategy.
TL;DR
- Monitor every critical path -- not just the homepage. Use HTTP checks for web endpoints, TCP for databases, and heartbeats for cron jobs.
- Set check intervals based on criticality: 30s for revenue paths, 60s for production endpoints, 5 min for staging.
- Configure response time thresholds based on your actual P95 baseline, not arbitrary round numbers.
- Use multi-channel alerting with escalation: Slack immediately, PagerDuty after 5 minutes unresolved, manager escalation at 15 minutes.
- Add flapping detection, status pages, maintenance windows, and automated SLA reports to complete a production-grade setup.
Step 1: Decide What to Monitor
The first mistake most teams make is monitoring only the homepage. Your homepage might be cached and served from a CDN, staying online even when your application servers, database, and payment processing are all on fire.
Start by mapping every critical path in your infrastructure.
Web Endpoints (HTTP Checks)
HTTP monitoring is the foundation. At minimum, create checks for:
- Primary application URL -- the page your customers actually use, not just the marketing site.
- API health endpoint -- a dedicated /health or /api/status route that verifies the application can talk to its database and cache. A health endpoint that just returns 200 OK without checking dependencies is nearly useless.
- Authentication flows -- if your login page returns 200 but the auth service behind it is down, your users are locked out. Monitor the login endpoint specifically.
- Payment/checkout endpoints -- revenue-critical paths deserve their own monitors with shorter intervals.
- Webhook receivers -- if your application depends on incoming webhooks (Stripe, GitHub, etc.), verify those endpoints are reachable.
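To make the health endpoint item above concrete, here is a minimal sketch of a dependency-aware health check. It is not tied to any web framework: check_health and the probe callables are hypothetical names, and in a real service each probe would be something like a trivial database query or a cache ping.

```python
from typing import Callable, Dict, Tuple

# Hypothetical probe type: a zero-argument callable that raises on failure.
Probe = Callable[[], None]


def check_health(probes: Dict[str, Probe]) -> Tuple[int, dict]:
    """Run every dependency probe and return (http_status, response_body)."""
    results = {}
    for name, probe in probes.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:  # any probe failure marks that dependency as failed
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "ok" if healthy else "unhealthy", "checks": results}
    # Returning 503 lets an HTTP monitor detect the failure from the status code alone.
    return (200 if healthy else 503), body
```

A route handler would call check_health with one probe per dependency and return the status code and body unchanged, so a plain HTTP check on the endpoint fails whenever a dependency does.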
Infrastructure Services (TCP Checks)
Not everything speaks HTTP. Use TCP checks for services that listen on raw ports:
- Databases -- MySQL (3306), PostgreSQL (5432), MongoDB (27017)
- Cache layers -- Redis (6379), Memcached (11211)
- Mail servers -- SMTP (25, 587, 465)
- Message queues -- RabbitMQ (5672), custom services
- SSH access -- port 22, verify your emergency access path works
TCP checks confirm the process is running and accepting connections. They cannot verify the application logic is correct, but they catch the most common failure mode: the process crashed and nobody restarted it.
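As a rough illustration of what a TCP check does under the hood, the sketch below attempts a connection and measures how long the handshake takes. The function name tcp_check and the 5-second default timeout are assumptions made for this example, not settings from any particular monitoring tool.

```python
import socket
import time
from typing import Optional, Tuple


def tcp_check(host: str, port: int, timeout: float = 5.0) -> Tuple[bool, Optional[float]]:
    """Return (is_up, connect_time_ms); is_up is True if the port accepted a connection."""
    start = time.monotonic()
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (including timeouts and DNS failures) if it cannot connect.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000
            return True, elapsed_ms
    except OSError:
        return False, None
```

For example, tcp_check("db.internal.example", 5432) would report whether a PostgreSQL port is accepting connections; the hostname here is a placeholder.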
Background Jobs (Heartbeat Checks)
Scheduled tasks are the silent killers of reliability. A cron job that fails at 2 AM will not page anyone -- the damage accumulates until someone notices the data is stale or the backups are missing.
Heartbeat monitoring inverts the usual check direction: your job pings the monitoring service when it completes, and the monitor alerts you if it does not hear back within the expected window.
Set up heartbeat monitors for every recurring job:
- Database backups (nightly, weekly)
- Report generation (daily, monthly)
- Data synchronization jobs
- Certificate renewal processes
- Queue workers (send a heartbeat every N minutes)
- Cache warm-up scripts
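Both halves of the heartbeat pattern can be sketched in a few lines. This is an illustration under assumptions: ping_url stands in for whatever URL your monitoring service issues, and heartbeat_overdue is a hypothetical name for the comparison a monitor performs on its side.

```python
import urllib.request
from datetime import datetime, timedelta


def send_heartbeat(ping_url: str, timeout: float = 10.0) -> None:
    """Job side: call this as the last step of a successful run."""
    # ping_url is a placeholder for the URL issued by your monitoring service.
    with urllib.request.urlopen(ping_url, timeout=timeout):
        pass


def heartbeat_overdue(last_ping: datetime, period: timedelta,
                      grace: timedelta, now: datetime) -> bool:
    """Monitor side: True if no ping has arrived within period + grace."""
    return now > last_ping + period + grace
```

Sending the ping only after the job's real work has succeeded is the important design choice: a job that crashes halfway never reaches send_heartbeat, so the monitor's deadline passes and an alert fires.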
Step 2: Choose Check Intervals
Shorter intervals detect failures faster, but there are diminishing returns. Checking every 10 seconds generates six times the traffic of checking every 60 seconds, and the practical difference in mean time to detection (MTTD) is small once you factor in confirmation periods.
A reasonable starting point:
- 30 seconds for revenue-critical endpoints (payment processing, primary API)
- 60 seconds for all other production HTTP endpoints
- 120 seconds for TCP checks on infrastructure services
- 5 minutes for staging/development environments and internal tools
For heartbeat checks, the interval depends on the job's schedule. A job that runs every hour should have a grace period of 15-30 minutes beyond its expected completion time. A nightly backup that takes 45 minutes should have a window of about 2 hours.
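The traffic cost of shorter intervals is simple arithmetic. The sketch below records the starting intervals listed above and computes how many checks one monitor runs per day; the tier names are illustrative labels, not settings from a specific tool.

```python
# Tier names are illustrative; the intervals are the starting points listed above.
INTERVAL_SECONDS = {
    "revenue_critical": 30,
    "production_http": 60,
    "infrastructure_tcp": 120,
    "staging_internal": 300,
}

SECONDS_PER_DAY = 86_400


def checks_per_day(interval_seconds: int) -> int:
    """Number of checks a single monitor runs in 24 hours."""
    return SECONDS_PER_DAY // interval_seconds
```

A 10-second interval works out to 8,640 checks per day per monitor against 1,440 at 60 seconds, which is the sixfold difference mentioned earlier.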
Best Practice
Match check intervals to business criticality, not technical convenience. A payment endpoint that processes $10,000/hour deserves 30-second checks, while a staging environment can safely use 5-minute intervals.
Step 3: Configure Response Time Thresholds
A service can be technically "up" but practically unusable. Response time thresholds add a middle state -- degraded -- between fully healthy and completely down. This gives you early warning before a slow endpoint becomes an unreachable one.
Set thresholds based on your actual baseline performance, not arbitrary round numbers. If your API normally responds in 150ms, setting a degraded threshold at 10 seconds will never trigger until things are catastrophically broken. Instead:
- Measure your P95 response time over a normal week
- Set the degraded threshold at 2-3x the P95
- Set the down threshold at 5-10x the P95, or use the default timeout (typically 10-30 seconds)
For example, an endpoint with a 200ms P95 might use 500ms for degraded and 5 seconds for the timeout. When the degraded alert fires, you investigate. If nothing is done and performance continues to degrade, the down alert fires next.
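The threshold procedure can be sketched as follows. This is one possible approach, not a prescribed method: it uses a nearest-rank percentile, and the default factors of 2.5x and 10x are assumptions picked from within the ranges given above. In practice the down threshold may simply be your monitor's timeout instead.

```python
from typing import Dict, Optional, Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of response times in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based rank
    return ordered[rank - 1]


def thresholds(p95_ms: float, degraded_factor: float = 2.5,
               down_factor: float = 10.0) -> Dict[str, float]:
    """Derive degraded and down thresholds from a measured P95."""
    return {"degraded_ms": p95_ms * degraded_factor, "down_ms": p95_ms * down_factor}


def classify(response_ms: Optional[float], limits: Dict[str, float]) -> str:
    """Map one measurement to up, degraded, or down. None means the check timed out."""
    if response_ms is None or response_ms >= limits["down_ms"]:
        return "down"
    if response_ms >= limits["degraded_ms"]:
        return "degraded"
    return "up"
```

With a 200ms P95 and these default factors, the degraded threshold comes out at 500ms, matching the example above.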
Common Mistake
Setting thresholds at arbitrary round numbers (e.g., 10 seconds) instead of basing them on your actual P95 response time. A threshold that never fires until the service is already catastrophically broken provides no early warning value.
Step 4: Set Up Alerting
The goal of alerting is to reach the right person fast enough to matter. This requires two things: multiple notification channels and a clear escalation path.
A practical alerting setup:
- Immediate -- Slack/Teams channel notification and email to the engineering team. Everyone sees it; nobody is woken up yet.
- After 5 minutes unresolved -- PagerDuty or OpsGenie pages the on-call engineer. If the alert auto-resolves before this window, no page is sent.
- After 15 minutes unresolved -- Escalation to the engineering manager or secondary on-call. If the primary responder has not acknowledged, someone else takes over.
- Recovery -- All channels receive a recovery notification so the team knows the issue is resolved without checking manually.
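The escalation tiers above amount to a small lookup table keyed on how long an incident has been unresolved. The sketch below is a hypothetical illustration: the channel names are labels only, and actually delivering a notification to Slack, PagerDuty, or email is left to whatever integration you use.

```python
from typing import List, Set

# (minutes unresolved, channel label) pairs mirroring the tiers above.
ESCALATION_POLICY = [
    (0, "chat_and_email"),
    (5, "on_call_page"),
    (15, "manager_escalation"),
]


def channels_due(minutes_unresolved: float, already_notified: Set[str]) -> List[str]:
    """Channels that should be notified now and have not been notified yet."""
    return [
        channel
        for after_minutes, channel in ESCALATION_POLICY
        if minutes_unresolved >= after_minutes and channel not in already_notified
    ]
```

Because the function is only evaluated while the incident is still open, an alert that resolves itself before the 5-minute mark never reaches the paging tier.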
Step 5: Handle Flapping
Flapping occurs when a service rapidly oscillates between up and down states. This can happen with overloaded servers, network instability, or DNS resolution hiccups.
Without flapping detection, each state change triggers its own alert -- so a service that bounces up and down 20 times in 10 minutes generates 20 separate notifications, flooding every channel and guaranteeing that your team ignores all of them.
Flapping detection solves this by tracking the number of state transitions within a rolling window. If a monitor transitions more than a threshold number of times (commonly 3) within 5 minutes, it enters a "flapping" state. While flapping, individual alerts are suppressed and replaced with a single consolidated notification.
This is one of those features that seems unnecessary until you need it -- and when you need it, it is the only thing standing between your team and a 2 AM notification storm.
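A rolling-window transition counter along these lines can be sketched as below. The class name and structure are assumptions made for illustration; the defaults follow the figures above (more than 3 transitions within 5 minutes).

```python
from collections import deque
from typing import Deque, Optional


class FlapDetector:
    """Counts state transitions inside a rolling time window."""

    def __init__(self, max_transitions: int = 3, window_seconds: float = 300.0):
        self.max_transitions = max_transitions
        self.window_seconds = window_seconds
        self._transitions: Deque[float] = deque()  # timestamps of recent state changes
        self._last_state: Optional[str] = None

    def observe(self, state: str, timestamp: float) -> bool:
        """Record one check result; return True while the monitor is flapping."""
        if self._last_state is not None and state != self._last_state:
            self._transitions.append(timestamp)
        self._last_state = state
        # Drop transitions that have aged out of the rolling window.
        while self._transitions and timestamp - self._transitions[0] > self.window_seconds:
            self._transitions.popleft()
        return len(self._transitions) > self.max_transitions
```

While observe returns True, the alerting layer would hold back per-transition notifications and send one consolidated message instead. Once the endpoint stays in one state long enough for old transitions to age out, the detector returns False again.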
Step 6: Create Status Pages
A public status page is the most underrated component of an uptime monitoring setup. It reduces support ticket volume during outages by giving users a self-service answer to the question every business dreads: "Is it down?"
An effective status page should include:
- Component-level status -- break your service into components (API, Dashboard, Authentication, Payment Processing) rather than showing a single "all systems operational" indicator.
- Uptime percentages -- per-component uptime over the last 30, 60, and 90 days gives users a historical view of reliability.
- Response time data -- sparkline charts showing recent performance trends.
- Maintenance windows -- scheduled downtime should be announced in advance and reflected on the status page so users are not surprised.
- Incident history -- a timeline of past incidents with duration and resolution details builds long-term trust.
The status page should be hosted on separate infrastructure from your main application. If your application server hosts the status page, they will go down together -- which defeats the entire purpose.
Best Practice
Break your status page into individual components (API, Dashboard, Authentication, Payments) rather than showing a single "all systems operational" indicator. Users care about the specific service they are trying to use.
Step 7: Configure Maintenance Windows
Planned maintenance should not trigger alerts. Most monitoring platforms support maintenance windows: scheduled periods during which monitoring continues but alerting is suppressed. The status page shows a maintenance badge, and your team is not paged for expected downtime.
Best practices for maintenance windows:
- Announce maintenance at least 24 hours in advance via the status page
- Set the maintenance window slightly longer than you expect the work to take
- Continue monitoring during maintenance so you have baseline data when the window closes
- Automatically end the maintenance window at the scheduled time (do not leave it open indefinitely)
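The suppression logic can be sketched as a check that runs just before an alert is sent. MaintenanceWindow and should_alert are hypothetical names for this example; real platforms expose the same idea through their own configuration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime  # a fixed end time means the window closes on its own

    def contains(self, moment: datetime) -> bool:
        return self.start <= moment < self.end


def should_alert(check_failed: bool, moment: datetime,
                 windows: Iterable[MaintenanceWindow]) -> bool:
    """Checks still run and are recorded; only the alert is suppressed."""
    if not check_failed:
        return False
    return not any(window.contains(moment) for window in windows)
```

Keeping the suppression decision separate from the check itself is what lets monitoring data keep flowing during maintenance, and the explicit end time ensures alerting resumes without anyone having to remember to re-enable it.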
Step 8: Set Up Reporting
Monitoring data is valuable beyond incident response. Weekly or monthly uptime reports give stakeholders visibility into reliability trends and SLA compliance.
A useful uptime report includes:
- Uptime percentage per monitor, compared against SLA targets (99.9%, 99.95%, 99.99%)
- Incident count and total downtime duration
- Longest outage -- identifies the single worst incident in the period
- Average and P95 response times -- reveals performance trends before they become outages
- Degraded event count -- how often services entered the degraded state
Automate these reports. If someone has to remember to pull the data and compile it, it will not happen consistently. Weekly reports on Monday morning and monthly reports on the 1st are a reasonable cadence.
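The core numbers in such a report are straightforward to compute. The sketch below is a minimal illustration with assumed function names: it converts recorded downtime into an uptime percentage and shows how much downtime a given SLA target permits over a period.

```python
def uptime_percent(period_seconds: float, downtime_seconds: float) -> float:
    """Share of the period during which the monitor was up, as a percentage."""
    return 100.0 * (period_seconds - downtime_seconds) / period_seconds


def allowed_downtime_seconds(sla_percent: float, period_seconds: float) -> float:
    """Downtime budget implied by an SLA target over the given period."""
    return period_seconds * (1 - sla_percent / 100.0)


def meets_sla(period_seconds: float, downtime_seconds: float, sla_percent: float) -> bool:
    return uptime_percent(period_seconds, downtime_seconds) >= sla_percent
```

Over a 30-day period (2,592,000 seconds), a 99.9% target leaves a budget of about 2,592 seconds, or 43.2 minutes, so a single one-hour outage is enough to miss it.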
How Metric Tower Handles Uptime Monitoring
Metric Tower's uptime monitoring supports all three check types -- HTTP, TCP, and heartbeat -- in a single interface. Each check type integrates with the same alerting, status page, and reporting systems, so you do not need separate tools for web monitoring and infrastructure monitoring.
Features worth highlighting:
- HTTP checks with custom expected status codes, response body validation, and TLS verification
- TCP checks for database ports, caches, and custom services with response time measurement
- Heartbeat endpoints that accept enriched payloads: exit codes, job duration, and custom metrics from your cron jobs
- Flapping detection that suppresses alert storms from unstable endpoints
- Public status pages with per-component uptime, sparkline graphs, and maintenance windows
- Automated SLA reports (weekly or monthly) delivered as PDFs to your inbox
- Response time thresholds with a degraded state that catches performance regressions before they become outages
- Content and IP change detection for catching defacements or DNS hijacking
Because Metric Tower is also a security scanning platform, your uptime monitors and vulnerability scans share the same dashboard and alerting integrations. You get operational and security visibility in one place rather than stitching together data from four different tools.
Key Takeaways
- Map every critical path in your infrastructure before configuring monitors -- homepage-only monitoring misses the majority of real failures.
- Base response time thresholds on your actual P95 measurements, and use a degraded state to catch performance regressions before they become outages.
- Invest in flapping detection, status pages, and automated reporting -- these features separate a basic monitoring setup from a production-grade strategy your team can trust.
For background on monitoring concepts, see our explainer on what uptime monitoring is and why it matters. If you want to compare platforms before committing, our top uptime monitoring tools comparison covers the field.