From Machine Learning

Karpathy dropped a 200-line GPT, so I used the math to turn pandas DataFrames into searchable context windows and open sourced it (and automated my stats pipeline). [P]

TL;DR: I got tired of manually running Shapiro-Wilk tests and copy-pasting p-values at 2 AM. I built an open-source, async Python pipeline called StatForge that automates the statistical decision layer, writes APA-style methods text, and lets you chat with your dataset using a microgpt-inspired retrieval system.

Hey everyone,

The hardest part of data analysis isn't the computation (we all have scipy and statsmodels). It's the plumbing: the sequence of choices between loading a CSV and having a defensible result.

I built StatForge to handle the plumbing.

How the pipeline works:

  • Lazy Loading: Detects 15+ formats (CSV, Parquet, SPSS, SQLite) and lazily imports dependencies so you don't pay for bloat.
  • Autonomous Assumption Checks: It doesn't just pass/fail normality. If a Shapiro-Wilk test returns a borderline p = 0.048, it flags the result, runs both the parametric and the non-parametric test, and compares them to see whether the conclusion holds either way.
  • The Plugin Registry: Uses a register decorator pattern for easy custom model injection.
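
To make the borderline-p logic and the register decorator concrete, here is a minimal sketch of how the two could fit together. It is not StatForge's actual code: `REGISTRY`, `register`, `plan_tests`, `run_plan`, and the `margin` threshold are all hypothetical names chosen for illustration, and the two test functions are placeholders that only return a label. In a real pipeline the normality p-value would come from something like `scipy.stats.shapiro`.

```python
from typing import Callable, Dict, List

# Hypothetical registry keyed by test family; not StatForge's real API.
REGISTRY: Dict[str, Callable] = {}


def register(family: str) -> Callable:
    """Decorator that stores a test function under the given family name."""
    def wrap(func: Callable) -> Callable:
        REGISTRY[family] = func
        return func
    return wrap


@register("parametric")
def welch_t_test(a, b):
    # Placeholder: a real plugin might call scipy.stats.ttest_ind(a, b, equal_var=False).
    return "welch_t_test"


@register("nonparametric")
def mann_whitney(a, b):
    # Placeholder: a real plugin might call scipy.stats.mannwhitneyu(a, b).
    return "mann_whitney"


def plan_tests(normality_p: float, alpha: float = 0.05, margin: float = 0.02) -> List[str]:
    """Choose test families from a normality p-value.

    A p-value within `margin` of `alpha` is treated as borderline (an assumed
    rule), so both families are run and their conclusions can be compared.
    """
    if abs(normality_p - alpha) <= margin:
        return ["parametric", "nonparametric"]
    if normality_p < alpha:
        return ["nonparametric"]
    return ["parametric"]


def run_plan(normality_p: float, a, b) -> Dict[str, object]:
    """Run every registered test the plan calls for; returns {family: result}."""
    return {family: REGISTRY[family](a, b) for family in plan_tests(normality_p)}
```

With these assumed defaults, `plan_tests(0.048)` selects both families, while a clearly non-normal sample such as `plan_tests(0.001)` goes straight to the non-parametric test. Adding a custom model is then a matter of decorating one more function.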

The microgpt Chat Mode: When Karpathy released his 200-line GPT, the way he loaded a corpus (docs: list[str]) changed how I looked at DataFrames. What if each row is a document? StatForge converts datasets into this format, scores rows against plain-English queries, pulls the top-k most relevant rows into a context window, and hits the Anthropic API (or a built-in rule engine). No vector DBs, no FAISS, just clean strings.
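
Here is a rough sketch of the row-as-document idea, assuming plain token overlap as the scoring rule. The post doesn't say which scoring function StatForge uses, so treat that choice as a stand-in. The helper names (`row_to_doc`, `tokenize`, `top_k_rows`, `build_context`) and the sample rows are made up for illustration. The rows are plain dicts, which is what `df.to_dict("records")` gives you from a pandas DataFrame.

```python
import re
from typing import Dict, List, Set


def row_to_doc(row: Dict[str, object]) -> str:
    """Flatten one row into a 'col: value' string, so each row is one document."""
    return ", ".join(f"{col}: {val}" for col, val in row.items())


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_rows(docs: List[str], query: str, k: int = 3) -> List[str]:
    """Rank docs by shared query tokens (assumed scoring); ties keep row order."""
    q = tokenize(query)
    scored = [(len(q & tokenize(doc)), i) for i, doc in enumerate(docs)]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    # Drop rows that share no tokens with the query.
    return [docs[i] for score, i in ranked[:k] if score > 0]


def build_context(docs: List[str], query: str, k: int = 3) -> str:
    """Join the top-k rows into one string to place in the prompt."""
    return "\n".join(top_k_rows(docs, query, k))


# Made-up sample rows, standing in for df.to_dict("records").
rows = [
    {"species": "setosa", "petal_length": 1.4},
    {"species": "virginica", "petal_length": 6.0},
    {"species": "versicolor", "petal_length": 4.5},
]
docs = [row_to_doc(r) for r in rows]
context = build_context(docs, "what is the petal length of virginica", k=2)
```

The resulting `context` string is what would be prepended to the question before calling the LLM or the rule engine. Token overlap is crude (no stemming, no weighting), but it keeps the whole thing dependency-free, which is the point of skipping vector DBs.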

You can run a full analysis with one command!

I wrote a deep-dive on the architecture and the philosophy behind it here: https://shekhawatsamvardhan.medium.com/andrej-karpathy-dropped-a-200-line-gpt-d153e9557463

Repo is here if you want to break it or contribute: https://github.com/samvardhan03/statforge

Would love to hear how you handle your own stats plumbing, or if there are specific edge cases the decision tree should catch!

submitted by /u/Weary_Possible8913