SkillBench Research Preview

Skills Ontology Bridge

SkillBench discovers skills from work products — repos, commits, PRs, dependencies, review comments. MIND-tech-ontology defines 3,333 skills from job market data. Here's what happens when you connect them: the taxonomy gets evidence, and the evidence reveals what no taxonomy captures. And this is just the public signal — with SkillBench telemetry (chatlogs, diffs, thought traces), these insights go far deeper.

3,333 MIND-tech skills
9,896 developers profiled
189 skills mapped to the taxonomy
65+ behavioral skills the taxonomy misses

Look up any GitHub developer

Enter a GitHub username to generate their character sheet in real time. Takes ~15 seconds.

Top Languages

Value Dimensions

Top Domains

Capabilities

How It Works — From Work Products to Taxonomy Alignment

📡
Harvest
Public work products: repos, commits, PRs, reviews, dependency graphs, READMEs, contribution calendars
Synthesize
LLM + deterministic scoring produces character sheets with evidence-backed skill assessments
🔗
Align
Map discovered skills to MIND-tech-ontology categories, domains, and implied-knowledge chains
💎
Enrich
Surface behavioral signals no taxonomy captures — and tee up what SkillBench telemetry (chatlogs, diffs, thought traces) will reveal next
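
To make the pipeline concrete, here is a toy end-to-end pass in Python. Every function body, field name, and data value below is invented for illustration — a sketch of the four stages, not the actual SkillBench implementation:

MIND_ONTOLOGY = {"Python": "Programming Languages", "React": "Frameworks"}

def harvest(username):
    # Stand-in for fetching public work products (repos, commits, PRs).
    return {"repos": [{"lang": "Python", "bytes": 1_200_000},
                      {"lang": "GLSL", "bytes": 40_000}]}

def synthesize(products):
    # Stand-in for LLM + deterministic scoring over harvested evidence.
    return {repo["lang"]: {"bytes": repo["bytes"]} for repo in products["repos"]}

def align(skills):
    # Map each discovered skill to a MIND category; None marks a taxonomy gap.
    for name, record in skills.items():
        record["mind_category"] = MIND_ONTOLOGY.get(name)

def enrich(skills):
    # Attach behavioral provenance the static taxonomy cannot hold.
    for record in skills.values():
        record["provenance"] = ["repos", "commits", "PR reviews"]

def character_sheet(username):
    products = harvest(username)   # 1. Harvest
    skills = synthesize(products)  # 2. Synthesize
    align(skills)                  # 3. Align
    enrich(skills)                 # 4. Enrich
    return skills

print(character_sheet("octocat"))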
📚

MIND-tech-ontology

Static Taxonomy
Programming Languages · 200+ skills
Python JavaScript TypeScript Rust Go C C++ Java Kotlin Ruby Swift PHP Scala Haskell Solidity
Frameworks · 500+ skills
React Vue Angular Django Spring Next.js Express Flutter Rails FastAPI PyTorch
Tools & Services · 800+ skills
Docker Kubernetes Git AWS GCP Terraform PostgreSQL Redis Neovim VS Code
Concepts & Patterns · 974 concepts
Microservices Event-Driven REST API GraphQL CI/CD Serverless MVC
19 Technical Domains
Backend Frontend Mobile DevOps ML/AI Data Science Cybersecurity Game Dev QA/Testing Embedded Blockchain IoT
What it knows:
Tool dependencies (Next.js → React → JavaScript), synonym resolution (PostgreSQL = Postgres), domain-to-skill mapping. 10,897 relationships total.
What it can't know:
Whether someone actually uses a skill. How well. How they collaborate. Whether they teach, lead, debug creatively, or verify their work. No behavioral evidence.
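
For the curious, here is what those three relationship kinds look like in miniature — a Python sketch over a few of the nodes named above, not the real MIND-tech-ontology schema:

SYNONYMS = {"Postgres": "PostgreSQL"}
IMPLIES = {"Next.js": ["React"], "React": ["JavaScript"]}
DOMAINS = {"Docker": "DevOps", "PyTorch": "ML/AI"}

def canonical(skill):
    # Synonym resolution: PostgreSQL = Postgres.
    return SYNONYMS.get(skill, skill)

def implied_chain(skill, seen=None):
    # Walk impliesKnowingSkills edges: Next.js -> React -> JavaScript.
    seen = [] if seen is None else seen
    for dep in IMPLIES.get(canonical(skill), []):
        if dep not in seen:
            seen.append(dep)
            implied_chain(dep, seen)
    return seen

print(implied_chain("Next.js"))  # ['React', 'JavaScript']
print(DOMAINS["Docker"])         # domain-to-skill mapping: 'DevOps'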
💡

SkillBench Discoveries

From Work Products
Languages — Evidence-Scored from repos, commit history + byte counts
Python ↔ MIND
heavy presence
TypeScript ↔ MIND
heavy presence
C ↔ MIND
heavy presence
GLSL SB Only
working
SourcePawn SB Only
foundational
Domains — Multi-Signal from repos + dependencies + activity
deep-learning ↔ ML/AI
0.88
developer-education SB Only
0.90
mathematical-visualization SB Only
0.85
open-source-leadership SB Only
0.95
Capabilities — Inferred from PR reviews, contribution patterns + project structure
code-review SB Only
0.60
mentoring SB Only
0.75
systems-architecture SB Only
0.90
cross-team-collaboration SB Only
0.45
ecosystem-building SB Only
0.80
Evidence from work products, not self-reports:
Every skill has a provenance chain: bytes written → repos → dependencies → PR reviews → contribution patterns. Presence levels are computed from public work products — not proficiency claims. With SkillBench telemetry (chatlogs, diffs, thought traces), we go much deeper.
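
As a sketch of how a coarse presence tier could fall out of that evidence — the weights and thresholds below are invented for illustration; SkillBench's real scoring is richer:

def presence_tier(total_bytes, repo_count, dependency_hits):
    # Blend volume, breadth, and dependency validation into one score.
    score = (min(total_bytes / 1_000_000, 1.0) * 0.5
             + min(repo_count / 10, 1.0) * 0.3
             + min(dependency_hits / 5, 1.0) * 0.2)
    for threshold, tier in [(0.75, "heavy"), (0.5, "moderate"),
                            (0.25, "light"), (0.0, "trace")]:
        if score >= threshold:
            return tier, round(score, 2)

print(presence_tier(1_200_000, 12, 6))  # ('heavy', 1.0)  -- a Python-like profile
print(presence_tier(40_000, 1, 0))      # ('trace', 0.05) -- a one-off experiment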
🔗

The Bridge — Where They Meet

SkillBench skills derived from work products map directly onto MIND-tech-ontology categories — now with evidence scores and behavioral context the taxonomy alone cannot provide.

Python · heavy presence · 0.95
1.2M bytes across 12 repos, pre-AI foundation
Python (Programming Language)
MIND category: Programming Languages → impliesKnowingSkills: pip, virtualenv
React · moderate presence · 0.82
4 repos, 340K bytes, dependency-validated
React (Framework)
MIND chain: React → JavaScript → HTML/CSS
Docker · moderate presence · 0.70
Dockerfiles in 8 repos, compose patterns
Docker (Tool)
MIND domain: DevOps → solvesApplicationTasks: containerization
machine-learning · 0.88
PyTorch imports, model training repos, paper implementations
ML/AI (Domain)
MIND domain: ML/AI, Data Science
open-source-leadership · 0.95
287K followers, 220K-star project, 14yr tenure
No MIND equivalent
Taxonomy gap — behavioral skill with no static category
code-review · 0.78
18,976 review comments analyzed across the cohort
No MIND equivalent
Practice-based skill — tone, depth, and turnaround patterns
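
In miniature, a bridge record pairs SkillBench evidence with the MIND node it maps to — or with nothing, which is itself signal. The field names below are illustrative, drawn from the examples above:

python_bridge = {
    "skill": "Python",
    "skillbench": {"presence": "heavy", "score": 0.95,
                   "evidence": "1.2M bytes across 12 repos, pre-AI foundation"},
    "mind": {"category": "Programming Languages",
             "implies_knowing": ["pip", "virtualenv"]},
}

leadership_bridge = {
    "skill": "open-source-leadership",
    "skillbench": {"score": 0.95,
                   "evidence": "287K followers, 220K-star project, 14yr tenure"},
    "mind": None,  # taxonomy gap: behavioral skill with no static category
}

def taxonomy_gaps(records):
    return [r["skill"] for r in records if r["mind"] is None]

print(taxonomy_gaps([python_bridge, leadership_bridge]))  # ['open-source-leadership']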
🎯

Developer Spotlight — What a Character Sheet Reveals

Real Data

Linus Torvalds

Kernel Immortan
Level 14 · Settler · 20,427 XP
C heavy
1.39B bytes across 7 repos, 14+ years. Creator of the Linux kernel.
Assembly moderate
9.7M bytes in the kernel. Systems-level architecture.
Operating Systems heavy
220K+ star repo. Domain score: 0.95
Hardware Design light
Recent: guitar pedals, OpenSCAD, analog circuits
Quality 0.95 · Reliability 0.90 · Creativity 0.70 · Teaching 0.30 · Collaboration 0.60 · Consistency 0.85
Source: Public GitHub work products → 83K READMEs, 90K commits, 41K dependency files, 19K review comments across 9,896 developers

Andrej Karpathy

Neural Architect
Level 13 · Settler · 15,200 XP
Python heavy
Deep learning repos, nanoGPT, educational implementations.
Deep Learning heavy
Former Tesla AI Director. Published research implementations.
Teaching heavy
Educational repos, video courses, clear documentation. Score: 0.85
C moderate
llm.c — LLM training in pure C. Systems-level ML.
Quality 0.90 · Reliability 0.75 · Creativity 0.80 · Teaching 0.85 · Collaboration 0.40 · Consistency 0.80
Carrol's insight validated: Karpathy's "Python" is research Python — not business analyst Python. SkillBench knows this from work products alone — imagine what telemetry reveals.
💎

What No Taxonomy Captures — Even From Public Work Products

Andela's MIND-tech-ontology is excellent for tool-stack matching. But "Python as a software engineer vs. Python as a business analyst" — that requires evidence from actual work. These are the skill dimensions SkillBench discovers from public work products alone — dimensions no taxonomy contains. With telemetry (chatlogs, diffs, thought traces), we go deeper still.

Value Dimensions (6-axis)

Quality, reliability, creativity, teaching, collaboration, consistency — scored from contribution patterns, not self-assessment. Linus scores 0.30 on teaching; Karpathy scores 0.85. Same "Python expert" — completely different profiles.

torvalds: {quality: 0.95, teaching: 0.30}
karpathy: {quality: 0.90, teaching: 0.85}
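
A quick sketch of that comparison using the scores above — the gap metric is an illustrative choice, not SkillBench's actual comparison logic:

AXES = ["quality", "reliability", "creativity", "teaching",
        "collaboration", "consistency"]
torvalds = dict(zip(AXES, [0.95, 0.90, 0.70, 0.30, 0.60, 0.85]))
karpathy = dict(zip(AXES, [0.90, 0.75, 0.80, 0.85, 0.40, 0.80]))

# Per-axis gap between the two profiles; teaching is the widest.
gaps = {axis: round(karpathy[axis] - torvalds[axis], 2) for axis in AXES}
print(max(gaps, key=lambda axis: abs(gaps[axis])), gaps["teaching"])  # teaching 0.55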

AI-Era Signals

Pre-AI commit ratio, copilot-era activity shifts. Did this developer's patterns change post-2023? Are they adapting or coasting? No taxonomy tracks temporal skill evolution.

pre_ai_commit_ratio: 0.87
copilot_era_change: "stable"
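
In sketch form, the ratio is just the share of lifetime commits dated before the copilot era — the 2023 cutoff follows the text above, and the commit data below is toy:

from datetime import date

AI_ERA_START = date(2023, 1, 1)

def pre_ai_commit_ratio(commit_dates):
    pre = sum(1 for d in commit_dates if d < AI_ERA_START)
    return round(pre / len(commit_dates), 2)

commits = [date(2015, 3, 1)] * 87 + [date(2024, 6, 1)] * 13
print(pre_ai_commit_ratio(commits))  # 0.87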

Collaboration Patterns

Review ratio, collaboration breadth, PR review depth. 18,976 review comments analyzed across our cohort reveal how developers actually work with others — not whether they list "teamwork" on a resume.

review_ratio: 0.001 (Torvalds — commits, not reviews)
collaboration_breadth: 2 (focused maintainer)
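
A minimal sketch of both counts, with numbers chosen to reproduce the Torvalds example above — the exact ratio definition is one plausible reading, not SkillBench's published formula:

def review_ratio(reviews, commits):
    # One plausible definition: reviews as a share of all public activity.
    return round(reviews / (reviews + commits), 3)

def collaboration_breadth(review_events):
    # Distinct repos reviewed in -- a coarse breadth signal.
    return len({event["repo"] for event in review_events})

events = [{"repo": "torvalds/linux"}, {"repo": "torvalds/subsurface"}]
print(review_ratio(reviews=90, commits=90_000))  # 0.001 -- commits, not reviews
print(collaboration_breadth(events))             # 2 -- focused maintainer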

Skill-in-Context

Python in ML research vs. Python in web dev vs. Python in education. SkillBench links language presence to domain evidence. This is exactly the "Python as a software engineer ≠ Python as a business analyst" insight Carrol identified.

karpathy: Python → {deep-learning, education}
ruanyf: Python → {web-development, teaching}
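
The join itself is simple — link each repo's language to its domain evidence. The repos and tags below are illustrative:

repos = [
    {"lang": "Python", "domains": ["deep-learning"]},
    {"lang": "Python", "domains": ["education"]},
    {"lang": "TypeScript", "domains": ["web-development"]},
]

def skill_in_context(lang):
    contexts = set()
    for repo in repos:
        if repo["lang"] == lang:
            contexts.update(repo["domains"])
    return sorted(contexts)

print("Python ->", skill_in_context("Python"))  # ['deep-learning', 'education']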

Character Classes & Archetypes

LLM-synthesized archetypes from behavioral evidence: "Kernel Immortan", "Neural Architect", "Vim War Chief". These capture what a developer is — not just what tools they use.

ThePrimeagen → "Vim War Chief"
gaearon → "Ecosystem Architect"
3b1b → "Mathematical Sage"

Learning Velocity & Growth

New domain exploration over time. Torvalds recently started hardware design (OpenSCAD, guitar pedals). Taxonomies are snapshots; SkillBench tracks trajectories.

torvalds/new_domain: hardware-design (2025+)
evidence_score: 0.30 (early exploration)
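
In sketch form, new-domain exploration is a set difference between domains seen in recent repos and the long-term baseline — the domains below are toy data:

baseline_domains = {"operating-systems", "version-control"}
recent_domains = {"operating-systems", "hardware-design"}

for domain in sorted(recent_domains - baseline_domains):
    print(f"new domain: {domain} (early exploration)")  # hardware-design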
Now imagine adding real telemetry

Everything above is from public work products alone.

Character sheets, value dimensions, skill-in-context — all derived from what's visible on GitHub. SkillBench telemetry captures what happens during the work: chatlogs with AI assistants, diffs as they're authored, thought traces, iteration patterns. Here's what changes.
🔒 Developer-Owned
Telemetry data stays on the developer's machine. Nothing leaves until they explicitly review and push it. The developer decides what to share — not their employer, not the platform.
🛡️ Privacy by Architecture
Proprietary code is excluded by default. Open-source and permissively-licensed projects are auto-classified. Employers and marketplaces receive skill signals — never raw session data, never code.
✊ Trust = Adoption
Developers are the customer, not the product. Community features, ladder levels, and social profiles are opt-in with granular visibility controls. No surveillance. No surprises.
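
As a sketch of what that boundary could look like in code — raw session data stays local, and only the skill signal crosses it. Field names are illustrative, not the actual SkillBench export format:

session_local = {      # stays on the developer's machine
    "chatlog": "...",  # raw AI-assistant conversation
    "diffs": "...",    # code as it was authored
}

exported_signal = {    # the only thing an employer or marketplace receives
    "skill": "Python",
    "proficiency": "verified proficient",
    "authorship_ratio": 0.83,
}

def export(payload):
    # Guard: raw session data never crosses the boundary.
    assert not {"chatlog", "diffs"} & payload.keys(), "raw session data never leaves"
    return payload

print(export(exported_signal))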
📄 From work products (what you see above)
Skill Presence
"Python: heavy presence" — based on byte counts, repo counts, and dependency signals
Coarse tier: trace / light / moderate / heavy. Presence, not proficiency.
AI Adoption
"copilot_era_change: stable" — inferred from contribution calendar shifts pre/post 2023
Binary: changed or didn't. No visibility into how.
Collaboration
"review_ratio: 0.001" — count of PR reviews vs. commits
Quantity only. No insight into review quality or mentoring style.
Learning
"New domain: hardware-design (2025+)" — a new repo appeared
We know they started. We don't know if they're actually learning.
Debugging Skill
Not observable
Work products show the fix. Not the process of finding it.
⚡ With SkillBench telemetry
Skill Proficiency (actual)
"Python: verified proficient — with 83% autonomous authorship, type hints in 91% of functions, consistently uses generators over list comprehensions for large datasets"
authorship_ratio: 0.83
delegation_to_ai: 0.17
pattern_sophistication: "advanced"
AI Delegation Patterns
"Delegates boilerplate (tests, docstrings) to copilot. Writes core algorithms manually. Rejects 40% of AI suggestions — edits another 30% before accepting."
copilot_accept_rate: 0.30
copilot_edit_rate: 0.30
copilot_reject_rate: 0.40
delegation_style: "selective"
Collaboration Quality
"Reviews average 4.2 comments, catches logic errors 3x more than style issues. Response time: 2.1hrs. Mentoring signal: explains why, not just what."
review_depth: "substantive"
mentoring_signal: 0.82
review_turnaround: "2.1h"
Learning Velocity
"Hardware-design: 3 iteration cycles visible in diffs. Error rate dropping 15% per week. Asking AI increasingly specific questions — moving from 'how do I' to 'optimize this for'."
learning_curve: "accelerating"
question_sophistication: ↑ 0.4 → 0.7
error_rate_trend: -15%/week
Debugging Approach
"Systematic: adds logging first, reproduces before fixing, writes regression test. Avg time-to-root-cause: 12min. Rarely uses AI for debugging — prefers to reason through it."
debug_strategy: "systematic"
ai_for_debug: 0.15
regression_test_rate: 0.89
This is the difference between a snapshot and a movie.
Model-Agnostic by Design
SkillBench captures developer behavior across Claude, Codex, Copilot, Gemini, and Cursor — simultaneously. No model company can be the neutral arbiter of training effectiveness across competing platforms. As Andela scales training partnerships with OpenAI, Anthropic, NVIDIA, and GitHub, SkillBench is the only measurement infrastructure that works across all of them.