At Codecafé, we've been building a product that will be unveiled soon. I'm the sole backend engineer on it, and I've spent the last few months doing some interesting engineering work.
Unfortunately, this blog post isn't going to be very detailed. It's basically a summary I just wanted to write. Hopefully, someday when I've buttressed my knowledge, I can write more engineering-focused articles.
Development of our product is distributed among the founders, and since I happen to have a lot of free time, my code churn has been really high. A few weeks ago, I got curious and decided to build a monitoring dashboard to see what's going on in the backend, memory-wise.
The original idea was a dashboard where I could see the average time per request. Locally, requests are fast, but what happens when we launch and have several users? We'll scale, I know.
I had help from Claude (my first time using Claude, and I think it's way better than GPT?) for the dashboard because I can't write frontend code to save my life.
Server watch
The dashboard monitors the application's metrics, such as request counts (and which endpoints they hit) and memory usage.
Two classes do the heavy lifting, or should I call them the main utility classes: ApplicationMetrics and MemoryAnalytics. They're both used in another class, EnhancedHealthCheck, where the requests are tracked and the memory is properly analysed.
The memory consumption being monitored mostly comes from database operations.
What tool?
Well, the key tool here is psutil, a library for retrieving information on running processes and system utilization. I also made use of motor for the MongoDB database connection, and FastAPI for defining a route to preview the dashboard.
The first time I ran the code, I was really worried when I saw 82% usage. Apparently, this had little to do with the app; it was the combined footprint of everything else running on my laptop. Huge relief.
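For anyone curious about that distinction, here's a minimal sketch (not from the project code) of the difference between the system-wide number and what the app process itself is using:

import psutil

# System-wide memory usage: this is where the scary 82% came from.
print(f"System memory used: {psutil.virtual_memory().percent}%")

# The current process only: usually a much smaller number.
proc = psutil.Process()
print(f"This process: {proc.memory_percent():.2f}% "
      f"({proc.memory_info().rss / 1024 ** 2:.1f} MiB RSS)")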
That's that; more on it later.
Application metrics
For the application metrics, I wanted to see which endpoints were visited the most. During development, I have a rough mental picture of the routes I hit a lot. I just wanted to have fun and see this in real time.
So, the class for managing the metrics contains three methods:
- add_error, which maintains a list of errors that occurred while using the application: 400s, 500s, etc.
- get_average_response_time, which is the key method for me. It returns the average response time by dividing the sum of the response times by the number of response times... I think this is called the mean. Oh, the average.
- get_error_rate, which tells me, "Hey, be ashamed of yourself for all these errors!"
Put all of that together and we have the class:
import time
from datetime import datetime, timezone
from typing import Any, Dict

import psutil
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from motor.motor_asyncio import AsyncIOMotorClient


class ApplicationMetrics:
    def __init__(self):
        self.start_time = time.time()
        self.request_count = 0
        self.error_count = 0
        self.response_times = []
        self.endpoints_usage = {}
        self.status_codes = {}
        self.last_errors = []
        self.max_stored_errors = 10

    def add_error(self, error: Dict[str, Any]):
        self.last_errors.insert(0, error)
        if len(self.last_errors) > self.max_stored_errors:
            self.last_errors.pop()

    def get_average_response_time(self) -> float:
        if not self.response_times:
            return 0
        return sum(self.response_times) / len(self.response_times)

    def get_error_rate(self) -> float:
        if self.request_count == 0:
            return 0
        return (self.error_count / self.request_count) * 100
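In isolation, the behaviour is easy to sanity-check. Here's a quick throwaway sketch (not part of the app, values made up) of what the two calculation methods return:

metrics = ApplicationMetrics()
metrics.request_count = 4
metrics.error_count = 1
metrics.response_times = [0.12, 0.08, 0.20, 0.10]

print(metrics.get_average_response_time())  # 0.125 -> mean response time in seconds
print(metrics.get_error_rate())             # 25.0 -> percentage of requests that errored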
Memory analytics
I made use of psutil here. The methods:
- get_process_memory_details() does what its name says: it returns detailed memory usage for the current process, i.e. the process actually running the app.
- get_system_memory_details() is the method that made me panic, because it reads system-wide info rather than anything specific to the app.
class MemoryAnalytics:
    @staticmethod
    def get_process_memory_details() -> Dict:
        process = psutil.Process()
        memory_info = process.memory_info()

        threads_memory = []
        for thread in process.threads():
            try:
                thread_proc = psutil.Process(thread.id)
                threads_memory.append({
                    'thread_id': thread.id,
                    'memory_info': thread_proc.memory_info()._asdict()
                })
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue

        return {
            'process': {
                'pid': process.pid,
                'memory_percent': process.memory_percent(),
                'memory_info': memory_info._asdict(),
                'num_threads': process.num_threads(),
                'open_files': len(process.open_files()),
                'connections': len(process.connections())
            },
            'threads': sorted(
                threads_memory,
                key=lambda x: x['memory_info']['rss'],
                reverse=True
            )[:5]  # Top 5 memory consuming threads
        }

    @staticmethod
    def get_system_memory_details() -> Dict:
        vm = psutil.virtual_memory()
        swap = psutil.swap_memory()

        return {
            'virtual_memory': {
                'total': vm.total,
                'available': vm.available,
                'used': vm.used,
                'free': vm.free,
                'percent': vm.percent,
                'active': getattr(vm, 'active', None),
                'inactive': getattr(vm, 'inactive', None),
                'buffers': getattr(vm, 'buffers', None),
                'cached': getattr(vm, 'cached', None),
                'shared': getattr(vm, 'shared', None)
            },
            'swap_memory': {
                'total': swap.total,
                'used': swap.used,
                'free': swap.free,
                'percent': swap.percent,
                'sin': swap.sin,
                'sout': swap.sout
            }
        }
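Calling these directly is a good way to get a feel for the shape of the data. A minimal sketch of how I'd poke at it in a Python shell (the actual numbers will differ per machine):

details = MemoryAnalytics.get_process_memory_details()
print(details['process']['pid'], details['process']['memory_percent'])
print(details['process']['memory_info']['rss'])  # resident set size in bytes

system = MemoryAnalytics.get_system_memory_details()
print(system['virtual_memory']['percent'], system['swap_memory']['percent'])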
Now, what?
Well, these are skeletal classes that need to be plugged in somewhere or made use of. The EnhancedHealthCheck class contains the methods that handle time formatting, listen to requests in order to track them, take the sources to monitor (the __init__ method), etc.
class EnhancedHealthCheck:
    def __init__(self, app: FastAPI, db_client: AsyncIOMotorClient):
        self.app = app
        self.db_client = db_client
        self.start_time = time.time()
        self.metrics = ApplicationMetrics()
        self.memory_analytics = MemoryAnalytics()

    def format_uptime(self, seconds: float) -> str:
        days = int(seconds // (24 * 3600))
        seconds = seconds % (24 * 3600)
        hours = int(seconds // 3600)
        seconds %= 3600
        minutes = int(seconds // 60)
        seconds = int(seconds % 60)

        parts = []
        if days > 0:
            parts.append(f"{days}d")
        if hours > 0:
            parts.append(f"{hours}h")
        if minutes > 0:
            parts.append(f"{minutes}m")
        parts.append(f"{seconds}s")
        return " ".join(parts)

    async def track_request_metrics(self, request: Request, call_next):
        start_time = time.time()
        try:
            response = await call_next(request)

            # Update metrics
            self.metrics.request_count += 1
            response_time = time.time() - start_time
            self.metrics.response_times.append(response_time)

            # Track endpoint usage
            endpoint = f"{request.method} {request.url.path}"
            self.metrics.endpoints_usage[endpoint] = (
                self.metrics.endpoints_usage.get(endpoint, 0) + 1
            )

            # Track status codes
            self.metrics.status_codes[response.status_code] = (
                self.metrics.status_codes.get(response.status_code, 0) + 1
            )

            return response
        except Exception as e:
            self.metrics.error_count += 1
            self.metrics.add_error({
                "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC"),
                "error": str(e),
                "endpoint": f"{request.method} {request.url.path}"
            })
            raise

    async def get_detailed_health(self, request: Request) -> Dict[str, Any]:
        uptime_seconds = time.time() - self.start_time

        # Check database connection
        try:
            await self.db_client.server_info()
            db_status = "healthy"
        except Exception as e:
            db_status = f"error: {str(e)}"

        memory_analysis = {
            'process': self.memory_analytics.get_process_memory_details(),
            'system': self.memory_analytics.get_system_memory_details()
        }

        return {
            "status": "healthy" if db_status == "healthy" else "unhealthy",
            "uptime": {
                "seconds": uptime_seconds,
                "formatted": self.format_uptime(uptime_seconds)
            },
            "database": {
                "status": db_status
            },
            "memory_analysis": memory_analysis,
            "application": {
                "total_requests": self.metrics.request_count,
                "error_count": self.metrics.error_count,
                "error_rate": self.metrics.get_error_rate(),
                "average_response_time": round(self.metrics.get_average_response_time(), 3),
                "status_codes": self.metrics.status_codes,
                "endpoints": self.metrics.endpoints_usage,
                "recent_errors": self.metrics.last_errors
            }
        }
The last method is get_dashboard_html(), courtesy of Claude. I can't explain it, but it's a beautiful one. AI is a good companion o. Can they work on one that'll generate mockups or something in Figma? I need that urgently!
The get_dashboard_html()
is defined as:
def get_dashboard_html(self) -> str:
return """
<!DOCTYPE html>
<html>
<head>
<title>Enhanced Server Status Dashboard</title>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<style>
body {
font-family: Arial, sans-serif;
margin: 20px;
background-color: #f5f5f5;
padding-bottom: 40px;
}
.grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(350px, 1fr));
gap: 20px;
}
.card {
border: 1px solid #ddd;
border-radius: 8px;
padding: 15px;
background: white;
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
}
.metric-value {
font-size: 24px;
font-weight: bold;
color: #2c3e50;
}
.metric-label {
color: #7f8c8d;
font-size: 14px;
}
.healthy { color: #27ae60; }
.warning { color: #f39c12; }
.error { color: #c0392b; }
.refresh-btn {
position: fixed;
top: 20px;
right: 20px;
padding: 10px 20px;
background-color: #4CAF50;
color: white;
border: none;
border-radius: 4px;
cursor: pointer;
z-index: 1000;
transition: background-color 0.3s ease;
}
.refresh-btn:hover {
background-color: #45a049;
}
.refresh-btn:active {
transform: scale(0.98);
}
.last-updated {
position: fixed;
top: 60px;
right: 20px;
color: #666;
font-size: 12px;
z-index: 1000;
}
.header {
margin-bottom: 20px;
padding-right: 200px;
}
.memory-bar {
width: 100%;
height: 24px;
background: #f0f0f0;
border-radius: 4px;
overflow: hidden;
margin: 10px 0;
display: flex;
}
.bar-segment {
height: 100%;
display: inline-block;
text-align: center;
color: white;
font-size: 12px;
line-height: 24px;
}
.bar-segment.used {
background-color: #ff6b6b;
}
.bar-segment.cached {
background-color: #4ecdc4;
}
.bar-segment.free {
background-color: #95a5a6;
}
.memory-details, .memory-stats {
font-size: 14px;
margin: 10px 0;
}
.thread-list {
max-height: 200px;
overflow-y: auto;
}
.thread-item {
padding: 5px;
border-bottom: 1px solid #eee;
}
.error-item {
margin-bottom: 10px;
padding: 8px;
border-left: 4px solid #c0392b;
background-color: #fef2f2;
}
.error-item strong {
color: #c0392b;
}
</style>
</head>
<body>
<div class="header">
<h1>Enhanced Server Status Dashboard</h1>
</div>
<button class="refresh-btn" id="refresh-dashboard">
Refresh Dashboard
</button>
<div class="last-updated" id="last-updated"></div>
<div class="grid">
<div class="card">
<h2>System Overview</h2>
<div id="system-overview"></div>
</div>
<div class="card">
<h2>Resource Usage</h2>
<canvas id="resourceChart"></canvas>
</div>
<div class="card">
<h2>Memory Analysis</h2>
<div id="memory-analysis"></div>
</div>
<div class="card">
<h2>Application Metrics</h2>
<div id="app-metrics"></div>
</div>
<div class="card">
<h2>Recent Errors</h2>
<div id="recent-errors"></div>
</div>
<div class="card">
<h2>Endpoint Usage</h2>
<canvas id="endpointChart"></canvas>
</div>
</div>
<script>
let dashboardActive = true;
let updateInterval;
let isUpdating = false; // Flag to prevent multiple simultaneous updates
function formatBytes(bytes, decimals = 2) {
if (bytes === 0) return '0 Bytes';
const k = 1024;
const dm = decimals < 0 ? 0 : decimals;
const sizes = ['Bytes', 'KB', 'MB', 'GB', 'TB'];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i];
}
async function updateDashboard() {
if (isUpdating) return; // Prevent multiple simultaneous updates
isUpdating = true;
try {
const response = await fetch('/api/v1/health');
const data = await response.json();
// Update system overview
document.getElementById('system-overview').innerHTML = `
<div class="metric">
<div class="metric-value ${data.status}">${data.status}</div>
<div class="metric-label">Status</div>
</div>
<div class="metric">
<div class="metric-value">${data.uptime.formatted}</div>
<div class="metric-label">Uptime</div>
</div>
`;
// Update application metrics
document.getElementById('app-metrics').innerHTML = `
<div class="metric">
<div class="metric-value">${data.application.total_requests}</div>
<div class="metric-label">Total Requests</div>
</div>
<div class="metric">
<div class="metric-value">${data.application.error_rate.toFixed(2)}%</div>
<div class="metric-label">Error Rate</div>
</div>
<div class="metric">
<div class="metric-value">${data.application.average_response_time}s</div>
<div class="metric-label">Avg Response Time</div>
</div>
`;
// Update memory analysis
const memoryAnalysis = document.getElementById('memory-analysis');
const memData = data.memory_analysis;
memoryAnalysis.innerHTML = `
<h3>Process Memory</h3>
<div class="memory-metric">
<div class="metric-value">${memData.process.process.memory_percent.toFixed(2)}%</div>
<div class="metric-label">Process Memory Usage</div>
</div>
<div class="memory-details">
<p>RSS: ${formatBytes(memData.process.process.memory_info.rss)}</p>
<p>VMS: ${formatBytes(memData.process.process.memory_info.vms)}</p>
<p>Threads: ${memData.process.process.num_threads}</p>
<p>Open Files: ${memData.process.process.open_files}</p>
</div>
<h3>System Memory</h3>
<div class="memory-system">
<div class="memory-bar">
<div class="bar-segment used"
style="width: ${memData.system.virtual_memory.percent}%">
Used (${memData.system.virtual_memory.percent}%)
</div>
<div class="bar-segment cached"
style="width: ${(memData.system.virtual_memory.cached || 0) / memData.system.virtual_memory.total * 100}%">
Cached
</div>
<div class="bar-segment free"
style="width: ${100 - memData.system.virtual_memory.percent}%">
Free
</div>
</div>
<div class="memory-stats">
<p>Total: ${formatBytes(memData.system.virtual_memory.total)}</p>
<p>Available: ${formatBytes(memData.system.virtual_memory.available)}</p>
<p>Used: ${formatBytes(memData.system.virtual_memory.used)}</p>
<p>Cached: ${formatBytes(memData.system.virtual_memory.cached || 0)}</p>
</div>
</div>
<h3>Top Memory Threads</h3>
<div class="thread-list">
${memData.process.threads.map(thread => `
<div class="thread-item">
Thread ${thread.thread_id}: ${formatBytes(thread.memory_info.rss)}
</div>
`).join('')}
</div>
`;
// Update recent errors
document.getElementById('recent-errors').innerHTML =
data.application.recent_errors
.map(error => `
<div class="error-item">
<strong>${error.timestamp}</strong>
<p>${error.error}</p>
<small>${error.endpoint}</small>
</div>
`)
.join('');
// Update charts
updateResourceChart(data);
updateEndpointChart(data);
// Update last updated timestamp
updateLastUpdated();
} catch (error) {
console.error('Failed to update dashboard:', error);
} finally {
isUpdating = false; // Reset the update flag
}
}
function updateResourceChart(data) {
const ctx = document.getElementById('resourceChart').getContext('2d');
const processData = data.memory_analysis.process.process;
new Chart(ctx, {
type: 'bar',
data: {
labels: ['Memory Usage', 'Threads', 'Open Files', 'Connections'],
datasets: [{
label: 'Mercury Process Metrics',
data: [
processData.memory_percent,
processData.num_threads,
processData.open_files,
processData.connections
],
backgroundColor: [
'rgba(54, 162, 235, 0.2)', // Memory
'rgba(255, 99, 132, 0.2)', // Threads
'rgba(75, 192, 192, 0.2)', // Files
'rgba(255, 206, 86, 0.2)' // Connections
],
borderColor: [
'rgba(54, 162, 235, 1)',
'rgba(255, 99, 132, 1)',
'rgba(75, 192, 192, 1)',
'rgba(255, 206, 86, 1)'
],
borderWidth: 1
}]
},
options: {
responsive: true,
scales: {
y: {
beginAtZero: true,
title: {
display: true,
text: 'Count/Percentage'
}
}
},
plugins: {
title: {
display: true,
text: 'Mercury Process Resources'
},
tooltip: {
callbacks: {
label: function(context) {
const label = context.dataset.label || '';
const value = context.parsed.y;
if (context.dataIndex === 0) {
return `Memory: ${value.toFixed(2)}%`;
}
return `${context.label}: ${value}`;
}
}
}
}
}
});
}
function updateEndpointChart(data) {
const ctx = document.getElementById('endpointChart').getContext('2d');
const endpoints = Object.keys(data.application.endpoints);
const counts = Object.values(data.application.endpoints);
new Chart(ctx, {
type: 'pie',
data: {
labels: endpoints,
datasets: [{
data: counts,
backgroundColor: endpoints.map(() =>
`hsl(${Math.random() * 360}, 70%, 50%)`
)
}]
}
});
}
function updateLastUpdated() {
const now = new Date().toLocaleString();
document.getElementById('last-updated').textContent = `Last updated: ${now}`;
}
function startDashboardUpdates() {
dashboardActive = true;
updateDashboard();
updateInterval = setInterval(updateDashboard, 30000); // 30 seconds
}
function stopDashboardUpdates() {
dashboardActive = false;
if (updateInterval) {
clearInterval(updateInterval);
}
}
// Handle page visibility
document.addEventListener('visibilitychange', function() {
if (document.hidden) {
stopDashboardUpdates();
} else {
startDashboardUpdates();
}
});
// Handle refresh button click
document.addEventListener('DOMContentLoaded', function() {
document.getElementById('refresh-dashboard').addEventListener('click', function() {
updateDashboard();
});
});
// Initial load
startDashboardUpdates();
</script>
</body>
</html>
"""
Wrap it up!
Since I needed this to work in my FastAPI application, I wrapped it all up in a function so I can just use it in my main app:
def setup_enhanced_health_routes(app: FastAPI, db_client: AsyncIOMotorClient) -> None:
    health_checker = EnhancedHealthCheck(app, db_client)

    @app.middleware("http")
    async def metrics_middleware(request: Request, call_next):
        return await health_checker.track_request_metrics(request, call_next)

    @app.get("/health")
    async def health(request: Request):
        return await health_checker.get_detailed_health(request)

    @app.get("/dashboard", response_class=HTMLResponse)
    async def dashboard():
        return health_checker.get_dashboard_html()
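For context, here's a minimal sketch of how this gets wired into the main app. The Mongo URI and app setup below are placeholders for illustration, not our actual configuration:

from fastapi import FastAPI
from motor.motor_asyncio import AsyncIOMotorClient

# Hypothetical app and connection string, for illustration only.
app = FastAPI()
db_client = AsyncIOMotorClient("mongodb://localhost:27017")

# Registers the metrics middleware plus the /health and /dashboard routes.
setup_enhanced_health_routes(app, db_client)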
Here's how the dashboard looks:
A database client is attached so the health check can ping MongoDB, which is a starting point for keeping an eye on database operations and knowing when to panic and when not to.
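If I ever want actual I/O numbers rather than a connectivity ping, MongoDB's serverStatus command exposes operation counters. A hedged sketch of what that could look like (this isn't in the current code):

async def get_db_opcounters(db_client: AsyncIOMotorClient) -> dict:
    # serverStatus needs sufficient permissions on the deployment.
    status = await db_client.admin.command("serverStatus")
    # opcounters holds insert/query/update/delete counts since the server started.
    return status.get("opcounters", {})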
Thanks
Thanks to Claude, truly. I can't do it all, and that's why we have them (LLMs).