How We Score MCP Servers: A Deep Dive into the SpiderScore Model

SpiderShield Team · 6 min read
Methodology · Security · Scoring

The 3-layer SpiderScore model

SpiderRating evaluates MCP servers through three independent lenses, each capturing a different dimension of quality and trustworthiness.

Layer 1: Description Quality (35%)

Tool descriptions are the interface between an MCP server and the AI agent using it. A vague or misleading description can cause an agent to misuse a tool — or worse, skip a safer alternative in favor of a dangerous one.

We evaluate descriptions across 5 criteria:

  • Intent clarity — Does the description clearly state what the tool does?
  • Permission scope — Does it disclose what resources the tool accesses?
  • Side effects — Does it mention modifications, deletions, or external calls?
  • Capability disclosure — Does it explain the full range of the tool's capabilities?
  • Operational boundaries — Does it define when NOT to use the tool?

A server whose descriptions carry none of these quality signals scores 0-2/10; one that carries all of them scores 8-10/10, with partial coverage landing in between. This calibration keeps the score meaningful across its full range.
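To make the five criteria concrete, here is a minimal sketch of a signal-counting check. The keyword patterns and the linear 0-10 mapping are illustrative assumptions, not the production SpiderScore heuristics:

```python
import re

# Hypothetical keyword heuristics for the five description criteria
# (illustrative only -- not the actual SpiderScore rules).
CRITERIA = {
    "intent_clarity": re.compile(r"\b(returns|fetches|creates|deletes|lists|searches)\b", re.I),
    "permission_scope": re.compile(r"\b(reads|writes|accesses|requires access to)\b", re.I),
    "side_effects": re.compile(r"\b(modifies|deletes|sends|overwrites|external)\b", re.I),
    "capability_disclosure": re.compile(r"\b(can also|supports|including)\b", re.I),
    "operational_boundaries": re.compile(r"\b(do not use|only use|not intended|avoid)\b", re.I),
}

def description_score(description: str) -> float:
    """Score a tool description 0-10 by counting quality signals present."""
    hits = sum(1 for pattern in CRITERIA.values() if pattern.search(description))
    return round(10 * hits / len(CRITERIA), 1)

description_score("")  # no signals at all -> 0.0
description_score(
    "Deletes the given file. Requires access to the workspace directory. "
    "Do not use on system paths."
)  # four of five signals -> 8.0
```

An empty description bottoms out at 0, and each additional disclosed signal moves the score up one band, matching the 0-2 / 8-10 calibration described above.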

Layer 2: Security Analysis (35%)

Our static analyzer checks against 46 standardized rules (TS-E001 through TS-P002):

  • 15 error codes (TS-E) — Malicious patterns: reverse shells, C2 beacons, credential theft, prompt injection, code execution, data exfiltration
  • 11 warning codes (TS-W) — Suspicious patterns: typosquatting, toxic data flows, excessive permissions
  • 18 config codes (TS-C) — Agent configuration: missing auth, weak sandboxing, SSRF risks
  • 2 pin codes (TS-P) — Rug pull detection via SHA-256 content hashing
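The pin checks hinge on content hashing: a tool whose description silently changes after review no longer matches its recorded hash. A minimal sketch of that idea, with hypothetical function names and hashing granularity:

```python
import hashlib

def content_hash(tool_description: str) -> str:
    """SHA-256 over the tool's description text (illustrative granularity)."""
    return hashlib.sha256(tool_description.encode("utf-8")).hexdigest()

def check_pin(current_description: str, pinned_hash: str) -> bool:
    """Return True if the tool still matches the hash recorded at review time."""
    return content_hash(current_description) == pinned_hash

pinned = content_hash("Reads a file from the workspace.")
# A later, silently swapped description no longer matches the pin:
check_pin("Reads a file and uploads it to example.com.", pinned)  # → False
```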

The security score starts at 10.0 and decreases with each finding. Critical issues also trigger hard constraints that cap the maximum grade.
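The deduction mechanic can be sketched as follows; the per-severity deduction values are assumptions for illustration, not SpiderScore's actual weights:

```python
# Hypothetical deductions per finding severity (illustrative values).
SEVERITY_DEDUCTION = {"error": 3.0, "warning": 1.0, "config": 0.5}

def security_score(findings: list[str]) -> float:
    """Start at 10.0, deduct per finding by severity, clamp at 0."""
    score = 10.0
    for severity in findings:
        score -= SEVERITY_DEDUCTION[severity]
    return max(score, 0.0)

security_score([])                              # clean server → 10.0
security_score(["error", "warning", "config"])  # 10 - 3 - 1 - 0.5 → 5.5
```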

Layer 3: Metadata Health (30%)

Metadata signals help assess the trustworthiness of the project itself:

  • Provenance — Is the source code available? Is the license OSI-approved?
  • Maintenance — When was the last commit? Are issues being addressed?
  • Popularity — Stars, forks, and download counts as social proof

Calibration philosophy

We follow two principles:

  1. Conservative scoring — When uncertain, we score lower. A false sense of security is worse than an unnecessarily harsh rating.
  2. Minimize false positives — Every security finding should be actionable. A false positive erodes trust in the entire system.

The overall formula: descriptions × 0.35 + security × 0.35 + metadata × 0.30
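As a worked example of the weighted sum, using the per-layer weights from the section headers (35% descriptions, 35% security, 30% metadata) and hypothetical layer scores:

```python
def overall_score(descriptions: float, security: float, metadata: float) -> float:
    """Weighted sum of the three layer scores, each on a 0-10 scale."""
    return round(descriptions * 0.35 + security * 0.35 + metadata * 0.30, 2)

# 8.0 * 0.35 + 6.0 * 0.35 + 7.0 * 0.30 = 2.8 + 2.1 + 2.1
overall_score(descriptions=8.0, security=6.0, metadata=7.0)  # → 7.0
```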

Hard constraints

Some findings are so severe that they override the normal scoring:

  Constraint                 Max Grade
  Critical security issue    F
  No source repository       D
  Reverse shell detected     F
  Credential exfiltration    F

These constraints exist because no amount of good descriptions can compensate for a backdoor in the code.
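Applying a hard constraint is simply a matter of lowering the computed grade to the strictest triggered cap. A minimal sketch, with an assumed F-to-A grade scale:

```python
# Grade scale from worst to best (illustrative assumption).
GRADE_ORDER = ["F", "D", "C", "B", "A"]

def apply_caps(computed_grade: str, caps: list[str]) -> str:
    """Return the computed grade, lowered to the strictest triggered cap."""
    return min([computed_grade] + caps, key=GRADE_ORDER.index)

apply_caps("A", ["F"])  # reverse shell detected → "F"
apply_caps("B", ["D"])  # no source repository → "D"
apply_caps("B", [])     # no caps triggered → "B"
```

Because the cap is a `min` over the grade order, a server with a detected backdoor gets an F no matter how well its other layers score.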