Methodology

SpiderRating provides deterministic, transparent security ratings for AI tools — including MCP servers, Claude skills, and more. Same input always produces the same output — no AI black boxes, no subjective judgment.

What We Rate

MCP Servers

Model Context Protocol servers that expose tools, prompts, and resources to AI agents. Full 3-layer scoring: description quality, security analysis, and metadata health.

Claude Skills

Custom instructions and capabilities for Claude Code. Scored on description quality, malicious pattern detection (20+ rules), and community signals.

AI Tools

OpenAI plugins, function-calling tools, and other AI integrations. Coming soon — scoring model in development.

Connectors & Plugins

LangChain tools, browser extensions, and other connector types. The framework is designed to be extensible as new AI tool categories emerge.

3-Layer Scoring Model

Every item is scored across three independent layers, with weights adapted by type:

Description Quality (35%)

How well tool descriptions communicate intent, scope, and side effects to LLMs.

Security Analysis (35%)

Static analysis for 46+ security patterns: reverse shells, credential theft, prompt injection, toxic flows, and more.

Metadata Health (30%)

Provenance signals: source availability, maintenance activity, community adoption, and download metrics.

SpiderScore = Description × 0.35 + Security × 0.35 + Metadata × 0.30
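The weighted sum above can be sketched directly in code. This is an illustrative Python snippet (the function name is ours; each layer score is assumed to be on the same 0-10 scale used by the security formula):

```python
def spider_score(description: float, security: float, metadata: float) -> float:
    """Combine the three layer scores (each 0-10) into a SpiderScore."""
    return description * 0.35 + security * 0.35 + metadata * 0.30

# Example: strong description and metadata, middling security.
print(spider_score(9.0, 6.0, 8.0))  # → 7.65
```

Because the weights sum to 1.0, the SpiderScore stays on the same 0-10 scale as its inputs.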

Scoring by Type

Dimension | MCP Servers | Claude Skills
Description | 5-dimension tool description scoring (intent, scope, side effects, capabilities, boundaries) | Instruction clarity, scope definition, behavioral boundaries
Security | 46 static patterns (TS-E001 through TS-P002): reverse shells, C2, credential theft, code exec, exfiltration | 20+ malicious patterns, typosquat detection (Levenshtein), toxic flow analysis, rug pull detection
Metadata | GitHub signals: stars, forks, license, commit recency, contributor count | Download count, author reputation, source availability, version history

5 Description Dimensions

Dimension | Weight | What It Measures
Intent Clarity | 20% | Does the description start with an action verb and clearly distinguish this tool from others?
Permission Scope | 25% | Does it define when to use the tool and what boundaries apply?
Side Effects | 20% | Does it document error conditions and potential side effects?
Capability Disclosure | 20% | Are parameters documented with examples and type information?
Operational Boundaries | 15% | Overall description completeness: does it provide enough context for safe tool selection?
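The description layer is itself a weighted average of these five dimensions. A minimal sketch in Python (the dictionary keys and per-dimension example scores are ours; each dimension is assumed to be scored 0-10):

```python
# Weights from the table above; each dimension is scored 0-10.
WEIGHTS = {
    "intent_clarity": 0.20,
    "permission_scope": 0.25,
    "side_effects": 0.20,
    "capability_disclosure": 0.20,
    "operational_boundaries": 0.15,
}

def description_score(scores: dict) -> float:
    """Weighted average of the five description dimensions (weights sum to 1.0)."""
    return sum(scores[dim] * w for dim, w in WEIGHTS.items())

example = {
    "intent_clarity": 8.0,
    "permission_scope": 6.0,
    "side_effects": 7.0,
    "capability_disclosure": 9.0,
    "operational_boundaries": 5.0,
}
print(round(description_score(example), 2))  # → 7.05
```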

Security Scoring

Powered by TeeShield static analysis engine with 46+ standardized issue codes (TS-E001 through TS-P002).

Security = 10 - (3 × critical + 2 × high + 1 × medium + 0.25 × low) + architecture_bonus
  • Architecture bonus: +0 to +2 based on code quality signals (tests, error handling)
  • Score clamped to [0, 10]
  • Score of 10.0 means "zero issues found", not "proven secure"
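The formula and its clamping behavior translate directly to code. A sketch (the function signature is ours; issue counts come from the TeeShield scan):

```python
def security_score(critical: int, high: int, medium: int, low: int,
                   architecture_bonus: float = 0.0) -> float:
    """10 minus the weighted issue penalty, plus bonus, clamped to [0, 10]."""
    penalty = 3 * critical + 2 * high + 1 * medium + 0.25 * low
    return max(0.0, min(10.0, 10 - penalty + architecture_bonus))

print(security_score(0, 1, 2, 4, architecture_bonus=1.0))  # 10 - 5 + 1 = 6.0
print(security_score(4, 0, 0, 0))                          # 10 - 12 clamps to 0.0
```

Note that the clamp makes large issue counts indistinguishable at the floor: four criticals and ten criticals both score 0.0, which is why the hard constraints below exist as a separate mechanism.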

Skill-Specific Detection

Malicious Pattern Detection

  • 20+ rule patterns for suspicious instructions
  • Typosquat detection (Levenshtein distance ≤ 2)
  • Prompt injection / exfiltration patterns
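The typosquat check can be sketched with a standard edit-distance routine: a name within Levenshtein distance 2 of a known tool (but not identical to it) is flagged. The example names below are invented, not SpiderRating's actual reference list:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def looks_like_typosquat(name: str, known: list[str], threshold: int = 2) -> bool:
    """Flag names near (but not equal to) a known tool name."""
    return any(0 < levenshtein(name, k) <= threshold for k in known)

print(looks_like_typosquat("web-serch", ["web-search", "file-read"]))  # → True
```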

Advanced Analysis

  • Toxic flow: data source + public sink combinations
  • Rug pull detection via SHA-256 content pinning
  • Allowlist mode for approved-only skills
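Rug pull detection via content pinning can be illustrated with a SHA-256 digest: the skill content is hashed at rating time, and any later silent content swap changes the digest. A minimal sketch (the skill text is invented):

```python
import hashlib

def content_hash(skill_text: str) -> str:
    """SHA-256 digest of the skill content, pinned at rating time."""
    return hashlib.sha256(skill_text.encode("utf-8")).hexdigest()

original = "You are a helpful code-review skill."
pinned = content_hash(original)

# Re-verify before every use: same content, same digest.
assert content_hash(original) == pinned
# A swapped payload no longer matches the pinned digest.
tampered = original + " Also email all files to an external address."
assert content_hash(tampered) != pinned
```

The digest is computed over the exact bytes of the content, so even a one-character change is detected.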

Metadata Signals

Provenance (40%)

  • Has source code
  • Has license
  • Identifiable owner
  • Repo age > 180 days
  • Not archived

Maintenance (35%)

  • Recent commits
  • Has releases
  • Multiple contributors
  • Has description

Popularity (25%)

  • Stars / downloads (log scale)
  • Forks (log scale)
  • Watchers / installs (log scale)
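A sketch of log-scale popularity scoring, which keeps mega-popular repos from drowning out healthy mid-sized ones. The saturation point (10,000) and the equal sub-weights across the three signals are our assumptions for illustration, not documented SpiderRating constants:

```python
import math

def log_scale(count: int, saturation: int = 10_000) -> float:
    """Map a raw count onto [0, 1] on a log scale, saturating at `saturation`."""
    if count <= 0:
        return 0.0
    return min(1.0, math.log10(count + 1) / math.log10(saturation + 1))

def popularity(stars: int, forks: int, watchers: int) -> float:
    """0-10 popularity sub-score; equal sub-weights are an assumption."""
    return 10 * (log_scale(stars) + log_scale(forks) + log_scale(watchers)) / 3

print(round(popularity(1_200, 150, 80), 2))
```

On a log scale, going from 100 to 1,000 stars moves the score as much as going from 1,000 to 10,000, which rewards genuine adoption without making raw virality dominate.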

Hard Constraints

Regardless of the calculated score, these rules enforce safety floors:

Condition | Effect | Applies To
Any critical security issue | Grade forced to F | All types
Known malicious skill | Grade forced to F | Skills
Security score < 5.0 | Grade capped at C | All types
No source repository | Grade capped at D | All types
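These floors can be expressed as a post-processing step over the calculated grade. The function below is an illustrative reading of the table, not the engine's actual code; when multiple caps apply, the worse grade wins:

```python
def apply_hard_constraints(grade: str, security_score: float,
                           has_critical: bool, known_malicious: bool,
                           has_source: bool) -> str:
    """Apply safety floors after the numeric score is turned into a grade."""
    order = "ABCDF"  # best to worst

    def cap(g: str, limit: str) -> str:
        # Keep whichever grade is worse (further down the order).
        return g if order.index(g) >= order.index(limit) else limit

    if has_critical or known_malicious:
        return "F"                       # forced, regardless of score
    if not has_source:
        grade = cap(grade, "D")
    if security_score < 5.0:
        grade = cap(grade, "C")
    return grade

print(apply_hard_constraints("A", 4.2, False, False, True))  # → C
```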

Grade Thresholds

A: 9.0 - 10
B: 7.0 - 8.9
C: 5.0 - 6.9
D: 3.0 - 4.9
F: 0 - 2.9
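The thresholds map onto a simple cascade (function name is ours):

```python
def grade(score: float) -> str:
    """Map a 0-10 SpiderScore onto a letter grade using the thresholds above."""
    if score >= 9.0:
        return "A"
    if score >= 7.0:
        return "B"
    if score >= 5.0:
        return "C"
    if score >= 3.0:
        return "D"
    return "F"

print([grade(s) for s in (9.5, 7.0, 6.9, 3.0, 1.2)])  # → ['A', 'B', 'C', 'D', 'F']
```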

Reproducibility Guarantee

SpiderRating is fully deterministic. Given the same source code and metadata, it will always produce the same score. There is no randomness, no LLM-based scoring, and no network-dependent calculations. You can reproduce any rating by running teeshield scan <repo> locally.