How We Score MCP Servers: A Deep Dive into the SpiderScore Model

SpiderShield Team · 6 min read
Methodology · Security · Scoring

The 3-layer SpiderScore model

SpiderRating evaluates MCP servers through three independent lenses, each capturing a different dimension of quality and trustworthiness.

Layer 1: Description Quality (35%)

Tool descriptions are the interface between an MCP server and the AI agent using it. A vague or misleading description can cause an agent to misuse a tool — or worse, skip a safer alternative in favor of a dangerous one.

We evaluate descriptions across 5 criteria:

  • Intent clarity — Does the description clearly state what the tool does?
  • Permission scope — Does it disclose what resources the tool accesses?
  • Side effects — Does it mention modifications, deletions, or external calls?
  • Capability disclosure — Does it explain the full range of the tool's capabilities?
  • Operational boundaries — Does it define when NOT to use the tool?

A server whose descriptions carry none of these quality signals scores 0-2/10; one that carries all of them scores 8-10/10, with partial coverage landing in between. This calibration keeps the score meaningful across its full range.
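To make the five criteria concrete, here is a minimal sketch of a signal-counting check. The keyword patterns and the linear 0-10 mapping are illustrative assumptions, not the production SpiderScore heuristics:

```python
import re

# Hypothetical keyword heuristics for the five description criteria
# (illustrative only -- not the actual SpiderScore rules).
CRITERIA = {
    "intent_clarity": re.compile(r"\b(returns|fetches|creates|deletes|lists|searches)\b", re.I),
    "permission_scope": re.compile(r"\b(reads|writes|accesses|requires access to)\b", re.I),
    "side_effects": re.compile(r"\b(modifies|deletes|sends|overwrites|external)\b", re.I),
    "capability_disclosure": re.compile(r"\b(can also|supports|including)\b", re.I),
    "operational_boundaries": re.compile(r"\b(do not use|only use|not intended|avoid)\b", re.I),
}

def description_score(description: str) -> float:
    """Score a tool description 0-10 by counting quality signals present."""
    hits = sum(1 for pattern in CRITERIA.values() if pattern.search(description))
    return round(10 * hits / len(CRITERIA), 1)

description_score("")  # no signals at all -> 0.0
description_score(
    "Deletes the given file. Requires access to the workspace directory. "
    "Do not use on system paths."
)  # four of five signals -> 8.0
```

An empty description bottoms out at 0, and each additional disclosed signal moves the score up one band, matching the 0-2 / 8-10 calibration described above.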

Layer 2: Security Analysis (35%)

Our static analyzer checks against 46 standardized rules (TS-E001 through TS-P002):

  • 15 error codes (TS-E) — Malicious patterns: reverse shells, C2 beacons, credential theft, prompt injection, code execution, data exfiltration
  • 11 warning codes (TS-W) — Suspicious patterns: typosquatting, toxic data flows, excessive permissions
  • 18 config codes (TS-C) — Agent configuration: missing auth, weak sandboxing, SSRF risks
  • 2 pin codes (TS-P) — Rug pull detection via SHA-256 content hashing
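The pin checks hinge on content hashing: a tool whose description silently changes after review no longer matches its recorded hash. A minimal sketch of that idea, with hypothetical function names and hashing granularity:

```python
import hashlib

def content_hash(tool_description: str) -> str:
    """SHA-256 over the tool's description text (illustrative granularity)."""
    return hashlib.sha256(tool_description.encode("utf-8")).hexdigest()

def check_pin(current_description: str, pinned_hash: str) -> bool:
    """Return True if the tool still matches the hash recorded at review time."""
    return content_hash(current_description) == pinned_hash

pinned = content_hash("Reads a file from the workspace.")
# A later, silently swapped description no longer matches the pin:
check_pin("Reads a file and uploads it to example.com.", pinned)  # → False
```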

The security score starts at 10.0 and decreases with each finding. Critical issues also trigger hard constraints that cap the maximum grade.
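The deduction mechanic can be sketched as follows; the per-severity deduction values are assumptions for illustration, not SpiderScore's actual weights:

```python
# Hypothetical deductions per finding severity (illustrative values).
SEVERITY_DEDUCTION = {"error": 3.0, "warning": 1.0, "config": 0.5}

def security_score(findings: list[str]) -> float:
    """Start at 10.0, deduct per finding by severity, clamp at 0."""
    score = 10.0
    for severity in findings:
        score -= SEVERITY_DEDUCTION[severity]
    return max(score, 0.0)

security_score([])                              # clean server → 10.0
security_score(["error", "warning", "config"])  # 10 - 3 - 1 - 0.5 → 5.5
```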

Layer 3: Metadata Health (30%)

Metadata signals help assess the trustworthiness of the project itself:

  • Provenance — Is the source code available? Is the license OSI-approved?
  • Maintenance — When was the last commit? Are issues being addressed?
  • Popularity — Stars, forks, and download counts as social proof

Calibration philosophy

We follow two principles:

  1. Conservative scoring — When uncertain, we score lower. A false sense of security is worse than an unnecessarily harsh rating.
  2. Minimize false positives — Every security finding should be actionable. A false positive erodes trust in the entire system.

The overall formula: descriptions × 0.35 + security × 0.35 + metadata × 0.30
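As a worked example of the weighted sum, using the per-layer weights from the section headers (35% descriptions, 35% security, 30% metadata) and hypothetical layer scores:

```python
def overall_score(descriptions: float, security: float, metadata: float) -> float:
    """Weighted sum of the three layer scores, each on a 0-10 scale."""
    return round(descriptions * 0.35 + security * 0.35 + metadata * 0.30, 2)

# 8.0 * 0.35 + 6.0 * 0.35 + 7.0 * 0.30 = 2.8 + 2.1 + 2.1
overall_score(descriptions=8.0, security=6.0, metadata=7.0)  # → 7.0
```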

Hard constraints

Some findings are so severe that they override the normal scoring:

  Constraint                 Max Grade
  Critical security issue    F
  No source repository       D
  Reverse shell detected     F
  Credential exfiltration    F

These constraints exist because no amount of good descriptions can compensate for a backdoor in the code.
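Applying a hard constraint is simply a matter of lowering the computed grade to the strictest triggered cap. A minimal sketch, with an assumed F-to-A grade scale:

```python
# Grade scale from worst to best (illustrative assumption).
GRADE_ORDER = ["F", "D", "C", "B", "A"]

def apply_caps(computed_grade: str, caps: list[str]) -> str:
    """Return the computed grade, lowered to the strictest triggered cap."""
    return min([computed_grade] + caps, key=GRADE_ORDER.index)

apply_caps("A", ["F"])  # reverse shell detected → "F"
apply_caps("B", ["D"])  # no source repository → "D"
apply_caps("B", [])     # no caps triggered → "B"
```

Because the cap is a `min` over the grade order, a server with a detected backdoor gets an F no matter how well its other layers score.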