A new state-of-the-art LLM (at least for creative writing and basic reasoning), but what lies behind the numbers that were published?
Is it for real, and are AI agents about to grab your mouse and shake your cursor? Plus, results on my own SimpleBench, and new tools from Runway (Act-One), HeyGen (Zoom Calls) and an updated NotebookLM. AI, without the hype.
00:00 – Introduction
00:57 – Claude 3.5 Sonnet (New) Paper
02:06 – Demo
02:58 – OSWorld
04:29 – Benchmarks compared + OpenAI Response
08:30 – Tau-Bench
13:09 – SimpleBench Results
17:05 – Yellowstone Detour
17:29 – Runway Act-One
18:44 – HeyGen Interactive Avatars + Demo
21:06 – NotebookLM Update