A framework to evaluate reasoning capabilities in video generation models at scale.
VMEvalKit is meant to be a permissively open-source shared playground for everyone. If you're interested in machine cognition, video models, evaluation, or anything in between, we'd love to build with you:
💬 Join us on Slack to ask questions, propose ideas, or start a collab: Slack Invite
Here we track papers spun off from this code infrastructure, along with some works in progress.
This paper applies our experimental framework and demonstrates that leading video generation models (e.g., Sora-2) can perform visual reasoning tasks with success rates above 60%. See results.
Apache 2.0
If you find VMEvalKit useful in your research, please cite:
@misc{VMEvalKit,
  author       = {VMEvalKit Team},
  title        = {VMEvalKit: A framework for evaluating reasoning abilities in foundational video models},
  year         = {2025},
  howpublished = {\url{https://github.com/Video-Reason/VMEvalKit}}
}