# Methodology

SpiderRating provides deterministic, transparent security ratings for AI tools, including MCP servers, Claude skills, and more. The same input always produces the same output: no AI black boxes, no subjective judgment.
## What We Rate

### MCP Servers

Model Context Protocol servers that expose tools, prompts, and resources to AI agents. Full 3-layer scoring: description quality, security analysis, and metadata health.

### Claude Skills

Custom instructions and capabilities for Claude Code. Scored on description quality, malicious pattern detection (20+ rules), and community signals.

### AI Tools

OpenAI plugins, function-calling tools, and other AI integrations. Coming soon; the scoring model is in development.

### Connectors & Plugins

LangChain tools, browser extensions, and other connector types. The framework is designed to be extensible as new AI tool categories emerge.
## 3-Layer Scoring Model

Every item is scored across three independent layers, with weights adapted by type:

- **Description Quality (35%):** How well tool descriptions communicate intent, scope, and side effects to LLMs.
- **Security Analysis (35%):** Static analysis for 46+ security patterns: reverse shells, credential theft, prompt injection, toxic flows, and more.
- **Metadata Health (30%):** Provenance signals: source availability, maintenance activity, community adoption, and download metrics.
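The three layer scores can be combined as a weighted average. A minimal sketch, assuming each layer produces a score on a 0–10 scale (the 35/35/30 weights come from this section; the function name and per-layer scale are illustrative assumptions):

```python
# Weights from the 3-layer scoring model; the 0-10 per-layer scale
# and function name are illustrative assumptions.
LAYER_WEIGHTS = {
    "description": 0.35,
    "security": 0.35,
    "metadata": 0.30,
}

def overall_score(layer_scores: dict[str, float]) -> float:
    """Weighted average of per-layer scores, each on a 0-10 scale."""
    return sum(LAYER_WEIGHTS[layer] * score
               for layer, score in layer_scores.items())

print(overall_score({"description": 8.0, "security": 10.0, "metadata": 6.0}))  # ~8.1
```

Because the weights sum to 1.0, the overall score stays on the same 0–10 scale as the individual layers.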
## Scoring by Type
| Dimension | MCP Servers | Claude Skills |
|---|---|---|
| Description | 5-dimension tool description scoring (intent, scope, side effects, capabilities, boundaries) | Instruction clarity, scope definition, behavioral boundaries |
| Security | 46+ static patterns (TS-E001 through TS-P002): reverse shells, C2, credential theft, code exec, exfiltration | 20+ malicious patterns, typosquat detection (Levenshtein), toxic flow analysis, rug pull detection |
| Metadata | GitHub signals: stars, forks, license, commit recency, contributor count | Download count, author reputation, source availability, version history |
## 5 Description Dimensions
| Dimension | Weight | What It Measures |
|---|---|---|
| Intent Clarity | 20% | Does the description start with an action verb and clearly distinguish this tool from others? |
| Permission Scope | 25% | Does it define when to use the tool and what boundaries apply? |
| Side Effects | 20% | Does it document error conditions and potential side effects? |
| Capability Disclosure | 20% | Are parameters documented with examples and type information? |
| Operational Boundaries | 15% | Overall description completeness — does it provide enough context for safe tool selection? |
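As a concrete illustration of the Intent Clarity dimension above, here is a minimal sketch of an action-verb heuristic. The verb list and function name are hypothetical; the scorer's actual vocabulary and rules are not published in this section:

```python
import re

# Hypothetical action-verb vocabulary; the real scorer's word list is unknown.
ACTION_VERBS = {"create", "delete", "fetch", "list", "search", "update",
                "read", "write", "send", "query", "convert", "validate"}

def starts_with_action_verb(description: str) -> bool:
    """One possible Intent Clarity signal: does the description lead
    with an action verb (base or third-person form)?"""
    match = re.match(r"\s*([A-Za-z]+)", description)
    if not match:
        return False
    word = match.group(1).lower()
    return any(word in (verb, verb + "s", verb + "es") for verb in ACTION_VERBS)

print(starts_with_action_verb("Fetches the current weather for a city"))  # True
print(starts_with_action_verb("A useful tool for many things"))           # False
```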
## Security Scoring

Powered by the TeeShield static analysis engine with 46+ standardized issue codes (TS-E001 through TS-P002).

- Architecture bonus: +0 to +2 based on code quality signals (tests, error handling)
- Score clamped to [0, 10]
- A score of 10.0 means "zero issues found", not "proven secure"
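The bullets above describe the bonus and clamping steps but not the base deduction. A hedged sketch, assuming each finding subtracts a severity-based penalty from a starting score of 10 (the penalty values are illustrative assumptions, not TeeShield's actual numbers):

```python
# Assumed severity penalties; TeeShield's actual deductions are not
# published in this section, so these numbers are illustrative.
PENALTY = {"critical": 10.0, "high": 3.0, "medium": 1.5, "low": 0.5}

def security_score(findings: list[str], architecture_bonus: float = 0.0) -> float:
    """Start at 10, subtract per-finding penalties, add the +0..+2
    architecture bonus, and clamp the result to [0, 10]."""
    bonus = min(max(architecture_bonus, 0.0), 2.0)
    raw = 10.0 - sum(PENALTY[severity] for severity in findings) + bonus
    return max(0.0, min(10.0, raw))

print(security_score([]))                       # 10.0 -> "zero issues found"
print(security_score(["high", "medium"], 1.0))  # 10 - 4.5 + 1 = 6.5
```

Note that any single critical finding drives the score to 0 under these assumed penalties, consistent with the hard constraint that a critical issue forces a grade of F.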
## Skill-Specific Detection

### Malicious Pattern Detection

- 20+ rule patterns for suspicious instructions
- Typosquat detection (Levenshtein distance ≤ 2)
- Prompt injection / exfiltration patterns
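The typosquat check above can be sketched with a standard Levenshtein edit distance and the ≤ 2 threshold from this list (function names are illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def is_typosquat(name: str, known_names: list[str]) -> bool:
    """Flag names within edit distance 2 of a known name, excluding exact matches."""
    return any(0 < levenshtein(name, known) <= 2 for known in known_names)

print(is_typosquat("web-serch", ["web-search", "code-review"]))  # True
```

Excluding distance 0 matters: the legitimate skill itself must not be flagged as a typosquat of its own name.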
### Advanced Analysis

- Toxic flow: data source + public sink combinations
- Rug pull detection via SHA-256 content pinning
- Allowlist mode for approved-only skills
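Rug pull detection via SHA-256 content pinning can be sketched as follows: hash the skill content at rating time, then flag any later mismatch. A simplified illustration, not the actual implementation:

```python
import hashlib

def content_hash(skill_text: str) -> str:
    """SHA-256 digest of the skill content, recorded (pinned) at rating time."""
    return hashlib.sha256(skill_text.encode("utf-8")).hexdigest()

def detect_rug_pull(current_text: str, pinned_hash: str) -> bool:
    """True when the content no longer matches the pinned hash,
    i.e. the skill changed after it was rated."""
    return content_hash(current_text) != pinned_hash

pinned = content_hash("Summarize the selected text.")
print(detect_rug_pull("Summarize the selected text.", pinned))               # False
print(detect_rug_pull("Summarize it and send it to evil.example.", pinned))  # True
```

Any single-byte change to the content produces a different digest, so even a subtle post-rating edit trips the check.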
## Metadata Signals

### Provenance (40%)

- Has source code
- Has license
- Identifiable owner
- Repo age > 180 days
- Not archived

### Maintenance (35%)

- Recent commits
- Has releases
- Multiple contributors
- Has description

### Popularity (25%)

- Stars / downloads (log scale)
- Forks (log scale)
- Watchers / installs (log scale)
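The log-scale treatment of popularity counts can be sketched as a normalized `log1p` curve, so that going from 10 to 100 stars matters more than going from 10,000 to 10,090. The saturation point below is an illustrative assumption, not a documented constant:

```python
import math

def log_scale(count: int, saturation: int = 10_000) -> float:
    """Map a raw count onto [0, 1] with diminishing returns.
    log1p keeps 0 -> 0; `saturation` (an assumed value) is the count
    at which the signal maxes out."""
    return min(1.0, math.log1p(count) / math.log1p(saturation))

print(round(log_scale(0), 2))       # 0.0
print(round(log_scale(100), 2))     # 0.5 -- halfway on the log curve
print(round(log_scale(10_000), 2))  # 1.0
```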
## Hard Constraints

Regardless of the calculated score, these rules enforce safety floors:
| Condition | Effect | Applies To |
|---|---|---|
| Any critical security issue | Grade forced to F | All types |
| Known malicious skill | Grade forced to F | Skills |
| Security score < 5.0 | Grade capped at C | All types |
| No source repository | Grade capped at D | All types |
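The table above can be sketched as a grade-capping pass applied after scoring. This is a simplified illustration: for example, it collapses the skills-only "known malicious" rule into a single flag, and all names are hypothetical:

```python
GRADE_ORDER = ["A", "B", "C", "D", "F"]  # best to worst

def cap(grade: str, ceiling: str) -> str:
    """Return the worse of the computed grade and the ceiling."""
    return max(grade, ceiling, key=GRADE_ORDER.index)

def apply_hard_constraints(grade: str, *, has_critical_issue: bool,
                           known_malicious: bool, security_score: float,
                           has_source_repo: bool) -> str:
    if has_critical_issue or known_malicious:
        return "F"                       # forced, regardless of score
    if not has_source_repo:
        grade = cap(grade, "D")
    if security_score < 5.0:
        grade = cap(grade, "C")
    return grade

print(apply_hard_constraints("A", has_critical_issue=False, known_malicious=False,
                             security_score=4.0, has_source_repo=True))  # "C"
```

Because each cap takes the worse of the two grades, the order in which the caps are applied does not matter.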
## Grade Thresholds
## Reproducibility Guarantee

SpiderRating is fully deterministic. Given the same source code and metadata, it will always produce the same score. There is no randomness, no LLM-based scoring, and no network-dependent calculations. You can reproduce any rating by running `teeshield scan <repo>` locally.