How We Score MCP Servers: A Deep Dive into the SpiderScore Model
The 3-layer SpiderScore model
SpiderRating evaluates MCP servers through three independent lenses, each capturing a different dimension of quality and trustworthiness.
Layer 1: Description Quality (35%)
Tool descriptions are the interface between an MCP server and the AI agent using it. A vague or misleading description can cause an agent to misuse a tool — or worse, skip a safer alternative in favor of a dangerous one.
We evaluate descriptions against five criteria:
- Intent clarity — Does the description clearly state what the tool does?
- Permission scope — Does it disclose what resources the tool accesses?
- Side effects — Does it mention modifications, deletions, or external calls?
- Capability disclosure — Does it explain the full range of the tool's capabilities?
- Operational boundaries — Does it define when NOT to use the tool?
A server whose descriptions show none of these quality signals scores 0-2/10; one showing all five scores 8-10/10. Anchoring the scale at both ends keeps scores comparable across servers.
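As a rough sketch of the calibration above, the five criteria can be treated as boolean signals mapped linearly onto the 0-10 scale. The criterion names are taken from the list above; the linear mapping itself is an assumption for illustration, consistent with the 0-2 and 8-10 bands.

```python
from dataclasses import dataclass


@dataclass
class DescriptionSignals:
    """One boolean per quality criterion from the list above."""
    intent_clarity: bool
    permission_scope: bool
    side_effects: bool
    capability_disclosure: bool
    operational_boundaries: bool


def description_score(signals: DescriptionSignals) -> float:
    """Map the number of present signals (0-5) onto a 0-10 scale."""
    present = sum(vars(signals).values())
    return present * 2.0  # no signals -> 0/10, all five -> 10/10


# A description that states intent and scope but omits boundaries:
partial = DescriptionSignals(True, True, True, False, False)
score = description_score(partial)  # 6.0
```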
Layer 2: Security Analysis (35%)
Our static analyzer checks against 46 standardized rules (TS-E001 through TS-P002):
- 15 error codes (TS-E) — Malicious patterns: reverse shells, C2 beacons, credential theft, prompt injection, code execution, data exfiltration
- 11 warning codes (TS-W) — Suspicious patterns: typosquatting, toxic data flows, excessive permissions
- 18 config codes (TS-C) — Agent configuration: missing auth, weak sandboxing, SSRF risks
- 2 pin codes (TS-P) — Rug pull detection via SHA-256 content hashing
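The TS-P rug-pull check can be sketched as follows: hash a canonical serialization of the tool definition at review time, then compare against the current hash on later fetches. A drifted hash means the server's advertised behavior may have silently changed. The serialization scheme shown is an assumption; only the SHA-256 hashing is stated in the rule description.

```python
import hashlib
import json


def content_hash(tool_definition: dict) -> str:
    """SHA-256 over a canonical (key-sorted) JSON serialization."""
    canonical = json.dumps(tool_definition, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


# Hash pinned at review time vs. the definition served today:
pinned = content_hash({"name": "read_file", "description": "Reads a file."})
current = content_hash({"name": "read_file",
                        "description": "Reads a file and uploads it."})
rug_pull_suspected = pinned != current  # True: definition drifted from pin
```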
The security score starts at 10.0 and decreases with each finding. Critical issues also trigger hard constraints that cap the maximum grade.
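A minimal sketch of that deduction model, assuming a fixed penalty per severity class (the penalty values here are illustrative, not the published ones):

```python
# Assumed per-finding penalties by severity class; the real values
# used by the analyzer are not stated in this post.
SEVERITY_PENALTY = {"error": 3.0, "warning": 1.0, "config": 0.5}


def security_score(findings: list[str]) -> float:
    """Start at 10.0 and subtract a penalty per finding, floored at 0."""
    score = 10.0
    for severity in findings:
        score -= SEVERITY_PENALTY.get(severity, 0.0)
    return max(score, 0.0)


security_score(["error", "warning", "config"])  # 10 - 3 - 1 - 0.5 = 5.5
```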
Layer 3: Metadata Health (30%)
Metadata signals help assess the trustworthiness of the project itself:
- Provenance — Is the source code available? Is the license OSI-approved?
- Maintenance — When was the last commit? Are issues being addressed?
- Popularity — Stars, forks, and download counts as social proof
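The three metadata signals above could be combined along these lines. The point allocation, freshness window, and star cap are all assumptions chosen so the signals sum to a 0-10 scale.

```python
def metadata_score(has_repo: bool, osi_license: bool,
                   days_since_commit: int, stars: int) -> float:
    """Combine provenance, maintenance, and popularity into 0-10.

    Weights are illustrative: provenance up to 5, maintenance 3,
    popularity capped at 2 so social proof cannot dominate.
    """
    provenance = (4.0 if has_repo else 0.0) + (1.0 if osi_license else 0.0)
    maintenance = 3.0 if days_since_commit <= 180 else 0.0
    popularity = min(stars / 500, 1.0) * 2.0
    return provenance + maintenance + popularity


metadata_score(has_repo=True, osi_license=True,
               days_since_commit=30, stars=1000)  # 10.0
```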
Calibration philosophy
We follow two principles:
- Conservative scoring — When uncertain, we score lower. A false sense of security is worse than an unnecessarily harsh rating.
- Minimize false positives — Every security finding should be actionable. A false positive erodes trust in the entire system.
The overall formula: security × 0.35 + descriptions × 0.35 + metadata × 0.30
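Applying the three layer weights from the section headings (35% security, 35% descriptions, 30% metadata), with all layers on the same 0-10 scale:

```python
def spider_score(security: float, descriptions: float,
                 metadata: float) -> float:
    """Weighted average of the three layer scores, each on 0-10."""
    return security * 0.35 + descriptions * 0.35 + metadata * 0.30


# A server with perfect security, good descriptions, thin metadata:
spider_score(security=10.0, descriptions=8.0, metadata=4.0)
```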
Hard constraints
Some findings are so severe that they override the normal scoring:
| Constraint | Max Grade |
|---|---|
| Critical security issue | F |
| No source repository | D |
| Reverse shell detected | F |
| Credential exfiltration | F |
These constraints exist because no amount of good descriptions can compensate for a backdoor in the code.
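The capping logic amounts to taking the worst of the computed grade and every triggered constraint's maximum, as in this sketch (the grade ladder is standard; the mapping from numeric score to letter is outside this post's scope):

```python
# Letter grades ordered worst to best.
GRADE_ORDER = ["F", "D", "C", "B", "A"]


def apply_constraints(grade: str, caps: list[str]) -> str:
    """Return the worst of the computed grade and all constraint caps."""
    return min([grade] + caps, key=GRADE_ORDER.index)


apply_constraints("A", ["D"])  # no source repo caps an A at D
apply_constraints("B", ["F"])  # reverse shell forces F regardless of score
```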