Methodology

SpiderRating provides deterministic, transparent security ratings for AI tools — including MCP servers, Claude skills, and more. Same input always produces the same output — no AI black boxes, no subjective judgment.

What We Rate

MCP Servers

Model Context Protocol servers that expose tools, prompts, and resources to AI agents. Full 3-layer scoring: description quality, security analysis, and metadata health.

Claude Skills

Custom instructions and capabilities for Claude Code. Scored on description quality, malicious pattern detection (20+ rules), and community signals.

AI Tools

OpenAI plugins, function-calling tools, and other AI integrations. Coming soon — scoring model in development.

Connectors & Plugins

LangChain tools, browser extensions, and other connector types. The framework is designed to be extensible as new AI tool categories emerge.

3-Layer Scoring Model

Every item is scored across three independent layers, with weights adapted by type:

Description Quality (35%)

How well tool descriptions communicate intent, scope, and side effects to LLMs.

Security Analysis (35%)

Static analysis for 46+ security patterns: reverse shells, credential theft, prompt injection, toxic flows, and more.

Metadata Health (30%)

Provenance signals: source availability, maintenance activity, community adoption, and download metrics.

SpiderScore = Description × 0.35 + Security × 0.35 + Metadata × 0.30
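The weighted sum above can be sketched directly in code. This is an illustrative Python snippet (the function name is ours; each layer score is assumed to be on the same 0-10 scale used by the security formula):

```python
def spider_score(description: float, security: float, metadata: float) -> float:
    """Combine the three layer scores (each 0-10) into a SpiderScore."""
    return description * 0.35 + security * 0.35 + metadata * 0.30

# Example: strong description and metadata, middling security.
print(spider_score(9.0, 6.0, 8.0))  # → 7.65
```

Because the weights sum to 1.0, the SpiderScore stays on the same 0-10 scale as its inputs.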

Scoring by Type

Dimension | MCP Servers | Claude Skills
Description | 5-dimension tool description scoring (intent, scope, side effects, capabilities, boundaries) | Instruction clarity, scope definition, behavioral boundaries
Security | 46 static patterns (TS-E001 through TS-P002): reverse shells, C2, credential theft, code exec, exfiltration | 20+ malicious patterns, typosquat detection (Levenshtein), toxic flow analysis, rug pull detection
Metadata | GitHub signals: stars, forks, license, commit recency, contributor count | Download count, author reputation, source availability, version history

5 Description Dimensions

Dimension | Weight | What It Measures
Intent Clarity | 20% | Does the description start with an action verb and clearly distinguish this tool from others?
Permission Scope | 25% | Does it define when to use the tool and what boundaries apply?
Side Effects | 20% | Does it document error conditions and potential side effects?
Capability Disclosure | 20% | Are parameters documented with examples and type information?
Operational Boundaries | 15% | Overall description completeness: does it provide enough context for safe tool selection?
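The description layer is itself a weighted average of these five dimensions. A minimal sketch in Python (the dictionary keys and per-dimension example scores are ours; each dimension is assumed to be scored 0-10):

```python
# Weights from the table above; each dimension is scored 0-10.
WEIGHTS = {
    "intent_clarity": 0.20,
    "permission_scope": 0.25,
    "side_effects": 0.20,
    "capability_disclosure": 0.20,
    "operational_boundaries": 0.15,
}

def description_score(scores: dict) -> float:
    """Weighted average of the five description dimensions (weights sum to 1.0)."""
    return sum(scores[dim] * w for dim, w in WEIGHTS.items())

example = {
    "intent_clarity": 8.0,
    "permission_scope": 6.0,
    "side_effects": 7.0,
    "capability_disclosure": 9.0,
    "operational_boundaries": 5.0,
}
print(round(description_score(example), 2))  # → 7.05
```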

Security Scoring

Powered by TeeShield static analysis engine with 46+ standardized issue codes (TS-E001 through TS-P002).

Security = 10 - (3 × critical + 2 × high + 1 × medium + 0.25 × low) + architecture_bonus
  • Architecture bonus: +0 to +2 based on code quality signals (tests, error handling)
  • Score clamped to [0, 10]
  • Score of 10.0 means "zero issues found", not "proven secure"
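The formula and its clamping behavior translate directly to code. A sketch (the function signature is ours; issue counts come from the TeeShield scan):

```python
def security_score(critical: int, high: int, medium: int, low: int,
                   architecture_bonus: float = 0.0) -> float:
    """10 minus the weighted issue penalty, plus bonus, clamped to [0, 10]."""
    penalty = 3 * critical + 2 * high + 1 * medium + 0.25 * low
    return max(0.0, min(10.0, 10 - penalty + architecture_bonus))

print(security_score(0, 1, 2, 4, architecture_bonus=1.0))  # 10 - 5 + 1 = 6.0
print(security_score(4, 0, 0, 0))                          # 10 - 12 clamps to 0.0
```

Note that the clamp makes large issue counts indistinguishable at the floor: four criticals and ten criticals both score 0.0, which is why the hard constraints below exist as a separate mechanism.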

Skill-Specific Detection

Malicious Pattern Detection

  • 20+ rule patterns for suspicious instructions
  • Typosquat detection (Levenshtein distance ≤ 2)
  • Prompt injection / exfiltration patterns
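The typosquat check can be sketched with a standard edit-distance routine: a name within Levenshtein distance 2 of a known tool (but not identical to it) is flagged. The example names below are invented, not SpiderRating's actual reference list:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def looks_like_typosquat(name: str, known: list[str], threshold: int = 2) -> bool:
    """Flag names near (but not equal to) a known tool name."""
    return any(0 < levenshtein(name, k) <= threshold for k in known)

print(looks_like_typosquat("web-serch", ["web-search", "file-read"]))  # → True
```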

Advanced Analysis

  • Toxic flow: data source + public sink combinations
  • Rug pull detection via SHA-256 content pinning
  • Allowlist mode for approved-only skills
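Rug pull detection via content pinning can be illustrated with a SHA-256 digest: the skill content is hashed at rating time, and any later silent content swap changes the digest. A minimal sketch (the skill text is invented):

```python
import hashlib

def content_hash(skill_text: str) -> str:
    """SHA-256 digest of the skill content, pinned at rating time."""
    return hashlib.sha256(skill_text.encode("utf-8")).hexdigest()

original = "You are a helpful code-review skill."
pinned = content_hash(original)

# Re-verify before every use: same content, same digest.
assert content_hash(original) == pinned
# A swapped payload no longer matches the pinned digest.
tampered = original + " Also email all files to an external address."
assert content_hash(tampered) != pinned
```

The digest is computed over the exact bytes of the content, so even a one-character change is detected.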

Metadata Signals

Provenance (40%)

  • Has source code
  • Has license
  • Identifiable owner
  • Repo age > 180 days
  • Not archived

Maintenance (35%)

  • Recent commits
  • Has releases
  • Multiple contributors
  • Has description

Popularity (25%)

  • Stars / downloads (log scale)
  • Forks (log scale)
  • Watchers / installs (log scale)
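A sketch of log-scale popularity scoring, which keeps mega-popular repos from drowning out healthy mid-sized ones. The saturation point (10,000) and the equal sub-weights across the three signals are our assumptions for illustration, not documented SpiderRating constants:

```python
import math

def log_scale(count: int, saturation: int = 10_000) -> float:
    """Map a raw count onto [0, 1] on a log scale, saturating at `saturation`."""
    if count <= 0:
        return 0.0
    return min(1.0, math.log10(count + 1) / math.log10(saturation + 1))

def popularity(stars: int, forks: int, watchers: int) -> float:
    """0-10 popularity sub-score; equal sub-weights are an assumption."""
    return 10 * (log_scale(stars) + log_scale(forks) + log_scale(watchers)) / 3

print(round(popularity(1_200, 150, 80), 2))
```

On a log scale, going from 100 to 1,000 stars moves the score as much as going from 1,000 to 10,000, which rewards genuine adoption without making raw virality dominate.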

Hard Constraints

Regardless of the calculated score, these rules enforce safety floors:

Condition | Effect | Applies To
Any critical security issue | Grade forced to F | All types
Known malicious skill | Grade forced to F | Skills
Security score < 5.0 | Grade capped at C | All types
No source repository | Grade capped at D | All types
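These floors can be expressed as a post-processing step over the calculated grade. The function below is an illustrative reading of the table, not the engine's actual code; when multiple caps apply, the worse grade wins:

```python
def apply_hard_constraints(grade: str, security_score: float,
                           has_critical: bool, known_malicious: bool,
                           has_source: bool) -> str:
    """Apply safety floors after the numeric score is turned into a grade."""
    order = "ABCDF"  # best to worst

    def cap(g: str, limit: str) -> str:
        # Keep whichever grade is worse (further down the order).
        return g if order.index(g) >= order.index(limit) else limit

    if has_critical or known_malicious:
        return "F"                       # forced, regardless of score
    if not has_source:
        grade = cap(grade, "D")
    if security_score < 5.0:
        grade = cap(grade, "C")
    return grade

print(apply_hard_constraints("A", 4.2, False, False, True))  # → C
```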

Grade Thresholds

A: 9.0 - 10
B: 7.0 - 8.9
C: 5.0 - 6.9
D: 3.0 - 4.9
F: 0 - 2.9
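The thresholds map onto a simple cascade (function name is ours):

```python
def grade(score: float) -> str:
    """Map a 0-10 SpiderScore onto a letter grade using the thresholds above."""
    if score >= 9.0:
        return "A"
    if score >= 7.0:
        return "B"
    if score >= 5.0:
        return "C"
    if score >= 3.0:
        return "D"
    return "F"

print([grade(s) for s in (9.5, 7.0, 6.9, 3.0, 1.2)])  # → ['A', 'B', 'C', 'D', 'F']
```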

Reproducibility Guarantee

SpiderRating is fully deterministic. Given the same source code and metadata, it will always produce the same score. There is no randomness, no LLM-based scoring, and no network-dependent calculations. You can reproduce any rating by running teeshield scan <repo> locally.