Episodes

  • 📆 🎂 - ThursdAI #52 - Moshi Voice, Qwen2 finetunes, GraphRAG deep dive and more AI news on this celebratory 1yr ThursdAI
    Jul 4 2024
    Hey everyone! Happy 4th of July to everyone who celebrates! I celebrated today by having an intimate conversation with 600 of my closest X friends 😂 Joking aside, today is a celebratory episode, the 52nd consecutive weekly ThursdAI show! I've been doing this as a podcast for a year now! Which means there are some of you who've been subscribed for a year 😮 Thank you! Couldn't have done this without you. In the middle of my talk at AI Engineer (I still don't have the video!) I had to plug ThursdAI, and I asked the 300+ audience who is a listener of ThursdAI, and I saw a LOT of hands go up, which is honestly still quite humbling. So again, thank you for tuning in, listening, subscribing, learning together with me and sharing with your friends!
    This week, we covered a new (soon to be) open source voice model from KyutAI, a LOT of open source LLMs, from InternLM, Cognitive Computations (Eric Hartford joined us), Arcee AI (Lukas Atkins joined as well), and we have a deep dive into GraphRAG with Emil Eifrem, CEO of Neo4j (who shares why it was called Neo4j in the first place, and that he's a ThursdAI listener, whaaat? 🤯). This is definitely a conversation you don't want to miss, so tune in, and read a breakdown below:
    TL;DR of all topics covered:
    * Voice & Audio
    * KyutAI releases Moshi - first ever 7B end to end voice capable model (Try it)
    * Open Source LLMs
    * Microsoft updated Phi-3-mini - almost a new model
    * InternLM 2.5 - best open source model under 12B on Hugging Face (HF, Github)
    * Microsoft open sources GraphRAG (Announcement, Github, Paper)
    * OpenAutoCoder-Agentless - SOTA on SWE Bench - 27.33% (Code, Paper)
    * Arcee AI - Arcee Agent 7B - from Qwen2 - Function / Tool use finetune (HF)
    * LMsys announces RouteLLM - a new Open Source LLM Router (Github)
    * DeepSeek Chat got a significant upgrade (Announcement)
    * Nomic GPT4all 3.0 - Local LLM (Download, Github)
    * This week's Buzz
    * New free Prompts course from WandB in 4 days (pre sign up)
    * Big CO LLMs + APIs
    * Perplexity announces their new pro research mode (Announcement)
    * X is rolling out a "Grok Analysis" button, it's BAD in "fun mode", and they then paused the rollout
    * Figma pauses the rollout of their AI text to design tool "Make Design" (X)
    * Vision & Video
    * Cognitive Computations drops DolphinVision-72b - VLM (HF)
    * Chat with Emil Eifrem - CEO Neo4j about GraphRAG, AI Engineer
    Voice & Audio
    KyutAI Moshi - a 7B end to end voice model (Try It, See Announcement)
    Seemingly out of nowhere, another French AI juggernaut decided to drop a major announcement. The company, called KyutAI, backed by Eric Schmidt, called itself "the first European private-initiative laboratory dedicated to open research in artificial intelligence" in a press release back in November of 2023, has quite a few rockstar co-founders (ex DeepMind, Meta AI) and has Yann LeCun on its science committee. This week they showed their first, and honestly quite mind-blowing, release, called Moshi (Japanese for Hello, Moshi Moshi), which is an end to end voice and text model, similar to the GPT-4o demos we've seen, except this one is 7B parameters and can run on your Mac! While the utility of the model right now is not the greatest, not remotely close to anything resembling the amazing GPT-4o (which was demoed live to me and all of AI Engineer by Romain Huet), Moshi shows very very impressive stats! Built by a small team during only 6 months or so of work, they have trained an LLM (Helium 7B), an audio codec (Mimi), a Rust inference stack and a lot more, to give insane performance.
    Model latency is 160ms and mic-to-speakers latency is 200ms, which is so fast it almost seems too fast. The demo often responds before I'm able to finish my sentence, and it results in an uncanny, "reading my thoughts" type feeling. The most important part is this though, a quote from KyutAI's post after the announcement: Developing Moshi required significant contributions to audio codecs, multimodal LLMs, multimodal instruction-tuning and much more. We believe the main impact of the project will be sharing all Moshi's secrets with the upcoming paper and open-source of the model.
    I'm really looking forward to how this tech can be applied to the incredible open source models we already have out there! Speaking to our LLMs is now officially here in the open source, way before we got GPT-4o, and it's exciting!
    Open Source LLMs
    Microsoft stealth updates Phi-3 Mini to make it almost a new model
    So stealthy in fact, that I didn't even have this update in my notes for the show, but thanks to the incredible community (Bartowski, Akshay Gautam) who made sure we didn't miss it, because it's so huge. The model used additional post-training data leading to substantial gains on instruction following and structured output. They also improved multi-turn conversation quality, explicitly support the <|system|> tag, and significantly improved reasoning capability (a quick chat-template sketch follows at the end of this recap). The Phi-3 June update is quite significant across the board, just look at some of these scores, 354.78% ...
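    To make the <|system|> tag support concrete, here's a minimal sketch of rendering a system-prompted conversation with the Hugging Face tokenizer for Phi-3 mini. The model id and the exact tags the template emits are assumptions based on the public instruct checkpoint, so verify against the updated model card.

```python
# Minimal sketch: render a Phi-3 mini chat prompt that uses a <|system|> turn.
# Assumes the public "microsoft/Phi-3-mini-4k-instruct" checkpoint and that its
# bundled chat template reflects the June update; verify against the model card.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Summarize ThursdAI in one sentence."},
]

# apply_chat_template turns the message list into the model's expected
# <|system|> / <|user|> / <|assistant|> tagged prompt string.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```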
    1 hr 50 min
  • 📅 ThursdAI - Gemma 2, AI Engineer '24, AI Wearables, New LLM leaderboard
    Jun 27 2024
    Hey everyone, sending a quick one today, no deep dive, as I'm still in the middle of AI Engineer World's Fair 2024 in San Francisco (in fact, I'm writing this from the incredible floor 32 presidential suite that the team here got for interviews, media and podcasting, and hey to all the new folks who I've just met during the last two days!) It's been an incredible few days meeting so many ThursdAI community members, listeners and folks who came on the pod! The list honestly is too long, but I got to meet friends of the pod Maxime Labonne, Wing Lian, João Moura (CrewAI), Vik from Moondream and Stefania Druga, not to mention the countless folks who came up, gave high fives and introduced themselves. It was honestly a LOT of fun. (And it's still not over, if you're here, please come and say hi, and let's take an LLM judge selfie together!)
    On today's show, we recorded extra early because I had to run and play dress up, and boy am I relieved now that both the show and the talk are behind me, and I can go and enjoy the rest of the conference 🔥 (which I will bring you here in full once I get the recording!)
    On today's show, we had the awesome pleasure to have Surya Bhupatiraju, a research engineer at Google DeepMind, talk to us about their newly released amazing Gemma 2 models! It was very technical, and a super great conversation to check out! Gemma 2 came out in 2 sizes, 9B and 27B parameter models, with 8K context (we addressed this on the show), and the 27B model's incredible performance is beating LLama-3 70B on several benchmarks and is even beating Nemotron 340B from NVIDIA! This model is also now available on Google AI Studio to play with, but also on the hub!
    We also covered the renewal of the HuggingFace open LLM leaderboard with their new benchmarks in the mix and normalization of scores (see the short normalization sketch after this recap), and how Qwen 2 is again the best model that's tested! It was a very insightful conversation that's worth listening to if you're interested in benchmarks, definitely give it a listen.
    Last but not least, we had a conversation with Ethan Sutin, the co-founder of Bee Computer. At the AI Engineer speakers dinner, all the speakers received a wearable AI device as a gift, and I onboarded (cause Swyx asked me) and kinda forgot about it. On the way back to my hotel I walked with a friend and chatted about my life. When I got back to my hotel, the app prompted me with "hey, I now know 7 new facts about you" and it was incredible to see how much of the conversation it was able to pick up, and extract facts and even TODOs! So I had to have Ethan on the show to try and dig a little bit into the privacy and the use-cases of these hardware AI devices, and it was a great chat!
    Sorry for the quick one today, if this is the first newsletter after you just met me and registered, usually there's a deeper dive here; expect more in-depth write-ups in the next sessions, as now I have to run down and enjoy the rest of the conference! Here's the TL;DR and my RAW show notes for the full show, in case it's helpful!
    * AI Engineer is happening right now in SF
    * Tracks include Multimodality, Open Models, RAG & LLM Frameworks, Agents, AI Leadership, Evals & LLM Ops, CodeGen & Dev Tools, AI in the Fortune 500, GPUs & Inference
    * Open Source LLMs
    * HuggingFace - LLM Leaderboard v2 - (Blog)
    * Old benchmarks sucked and it's time to renew
    * New benchmarks:
    * MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper)
    * GPQA (Google-Proof Q&A Benchmark, paper). GPQA is an extremely hard knowledge dataset
    * MuSR (Multistep Soft Reasoning, paper)
    * MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper)
    * IFEval (Instruction Following Evaluation, paper)
    * 🤝 BBH (Big Bench Hard, paper). BBH is a subset of 23 challenging tasks from the BigBench dataset
    * The community will be able to vote for models, and models with the most votes will be run first
    * Mozilla announces Builders Accelerator @ AI Engineer (X)
    * Theme: Local AI
    * 100K non dilutive funding
    * Google releases Gemma 2 (X, Blog)
    * Big CO LLMs + APIs
    * UMG, Sony, Warner sue Udio and Suno for copyright (X)
    * were able to recreate some songs
    * sue both companies
    * have 10 unnamed individuals who are also on the suit
    * Google Chrome Canary has Gemini Nano (X)
    * Super easy to use window.ai.createTextSession()
    * Nano 1 and 2, at 4-bit quantized 1.8B and 3.25B parameters, have decent performance relative to Gemini Pro
    * Behind a feature flag
    * Most text gen under 500ms
    * Unclear re: hardware requirements
    * Someone already built extensions
    * someone already posted this on HuggingFace
    * Anthropic Claude share-able projects (X)
    * Snapshots of Claude conversations shared with your team
    * Can share custom instructions
    * Anthropic has released new "Projects" feature for Claude AI to enable collaboration and enhanced workflows
    * Projects allow users to ground Claude's outputs in their own internal knowledge and documents
    * Projects can be customized with instructions to tailor ...
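    Since the leaderboard v2 write-up talks about normalizing scores, here's the short sketch referenced above: the general idea of rescaling a raw accuracy between the random-guess floor and a perfect score. The baseline numbers and function name are illustrative assumptions, not the leaderboard's actual code.

```python
# Illustrative sketch of score normalization as used by leaderboards like HF's v2:
# map a raw accuracy onto 0-100 between the random-guess floor and a perfect score.
# The baseline numbers below are assumptions for illustration, not official values.
def normalize_score(raw: float, random_baseline: float, max_score: float = 1.0) -> float:
    """Rescale so that random guessing maps to 0 and a perfect score maps to 100."""
    if raw <= random_baseline:
        return 0.0
    return 100.0 * (raw - random_baseline) / (max_score - random_baseline)

# Example: a 4-choice MCQ benchmark has a 25% random baseline, so a raw 62%
# accuracy becomes ~49.3 after normalization instead of looking like "62".
print(round(normalize_score(0.62, random_baseline=0.25), 1))  # ~49.3
```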
    1 hr 21 min
  • 📅 ThursdAI - June 20th - 👑 Claude Sonnet 3.5 new LLM king, DeepSeek new OSS code king, Runway Gen-3 SORA competitor, Ilya's back & more AI news from this crazy week
    Jun 20 2024
    Hey, this is Alex. Don't you just love when assumptions about LLMs hitting a wall get shattered left and right, and we get new incredible tools that leapfrog previous state of the art models we barely got used to from just a few months ago? I SURE DO! Today is one such day. This week was already busy enough, I had a whole 2 hour show packed with releases, and then Anthropic decided to give me a reason to use the #breakingNews button (the one that does the news show like sound on the live show, you should join next time!) and announced Claude Sonnet 3.5, which is their best model, beating Opus while being 2x faster and 5x cheaper! (Also beating GPT-4o and Turbo, so... new king! For how long? ¯\_(ツ)_/¯)
    Critics are already raving, it's been half a day and they are raving! Ok, let's get to the TL;DR and then dive into Claude 3.5 and a few other incredible things that happened this week in AI! 👇
    TL;DR of all topics covered:
    * Open Source LLMs
    * NVIDIA - Nemotron 340B - Base, Instruct and Reward model (X)
    * DeepSeek Coder V2 (230B MoE, 16B) (X, HF)
    * Meta FAIR - Chameleon MMIO models (X)
    * HF + BigCodeProject are deprecating HumanEval with BigCodeBench (X, Bench)
    * NousResearch - Hermes 2 LLama3 Theta 70B - GPT-4 level OSS on MT-Bench (X, HF)
    * Big CO LLMs + APIs
    * Gemini Context Caching is available
    * Anthropic releases Sonnet 3.5 - beating GPT-4o (X, Claude.ai)
    * Ilya Sutskever starting SSI.inc - safe super intelligence (X)
    * Nvidia is the biggest company in the world by market cap
    * This week's Buzz
    * Alex in SF next week for AIQCon, AI Engineer. ThursdAI will be sporadic but will happen!
    * W&B Weave now has support for tokens and cost + Anthropic SDK out of the box (Weave Docs)
    * Vision & Video
    * Microsoft open sources Florence 230M & 800M Vision Models (X, HF)
    * Runway Gen-3 - (t2v, i2v, v2v) Video Model (X)
    * Voice & Audio
    * Google Deepmind teases V2A video-to-audio model (Blog)
    * AI Art & Diffusion & 3D
    * Flash Diffusion for SD3 is out - Stable Diffusion 3 in 4 steps! (X)
    🦀 New king of LLMs in town - Claude 3.5 Sonnet 👑
    Ok so first things first, Claude Sonnet, the previously forgotten middle child of the Claude 3 family, has now received a brain upgrade! Achieving incredible performance on many benchmarks, this new model is 5 times cheaper than Opus at $3/1Mtok on input and $15/1Mtok on output (a quick cost sketch follows at the end of this recap). It's also competitive against GPT-4o and Turbo on the standard benchmarks, achieving incredible scores on MMLU, HumanEval etc., but we know that those are already behind us.
    Sonnet 3.5, aka Claw'd (which is a great marketing push by the Anthropic folks, I love to see it), is beating all other models on the Aider.chat code editing leaderboard, winning on the new livebench.ai leaderboard, and getting top scores on MixEval Hard, which has a 96% correlation with the LMsys arena.
    While benchmarks are great and all, real folks are reporting real findings of their own. Here's what Friend of the Pod Pietro Schirano had to say after playing with it: "there's like a lot of things that I saw that I had never seen before in terms of like creativity and like how much of the model, you know, actually put some of his own understanding into your request" - @Skirano
    A notable capability boost is this quote from the Anthropic release blog: In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%. One detail that Alex Albert from Anthropic pointed out from this release was that on the GPQA (Graduate-Level Google-Proof Q&A) benchmark, they achieved 67% with various prompting techniques, beating PhD experts in their respective fields, who average 65% on this benchmark. This... this is crazy.
    Beyond just the benchmarks
    This to me is a ridiculous jump, because Opus was just so so good already, and Sonnet 3.5 is jumping over it with agentic solving capabilities, and also vision capabilities. Anthropic also announced that vision wise, Claw'd is significantly better than Opus at vision tasks (which, again, Opus was already great at!), and lastly, Claw'd now has a very recent knowledge cutoff: it knows about events that happened in February 2024!
    Additionally, claude.ai got a new capability which significantly improves the use of Claude, which they call Artifacts. It needs to be turned on in settings, and then Claude will have access to files and will show you, in an aside, rendered HTML, SVG files, Markdown docs, and a bunch more stuff, and it'll be able to reference different files it creates, to create assets and then a game with these assets, for example!
    1 Ilya x 2 Daniels to build Safe SuperIntelligence
    Ilya Sutskever, co-founder and failed board coup participant (leader?) at OpenAI, has resurfaced after a long time of people wondering "where's Ilya" with one hell of an announcement. ...
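    Here's the cost sketch referenced above: a minimal call with the Anthropic Python SDK against the new model, plus a back-of-the-envelope cost from the $3 / $15 per million token prices quoted in the recap. The model id string and the usage field names are assumptions to double-check against Anthropic's docs.

```python
# Minimal sketch (assumes the anthropic Python SDK and an ANTHROPIC_API_KEY env var).
# The model id and usage field names below should be checked against Anthropic's docs.
import anthropic

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-3-5-sonnet-20240620",   # assumed id for the new Sonnet 3.5
    max_tokens=512,
    messages=[{"role": "user", "content": "Give me three uses for Artifacts."}],
)

# Back-of-the-envelope cost using the prices quoted above: $3 / $15 per 1M tokens.
cost = resp.usage.input_tokens * 3 / 1e6 + resp.usage.output_tokens * 15 / 1e6
print(resp.content[0].text)
print(f"~${cost:.5f} for this call")
```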
    1 hr 9 min
  • ThursdAI - June 13th, 2024 - Apple Intelligence recap, Elon's reaction, Luma's Dream Machine, AI Engineer invite, SD3 & more AI news from this past week
    Jun 13 2024
    Happy Apple AI week everyone (well, those of us who celebrate, some don't), as this week we finally got told what Apple is planning to do with this whole generative AI wave and were presented with Apple Intelligence (which is AI, get it? they are trying to rebrand AI!)
    This week's pod and newsletter main focus will be Apple Intelligence of course, as it was for most people, judging by how the market reacted ($AAPL grew over $360B in a few days after this announcement) and how many people watched each live stream (10M at the time of this writing watched the WWDC keynote on YouTube, compared to 4.5M for the OpenAI GPT-4o event and 1.8M for Google IO).
    On the pod we also geeked out on new eval frameworks and benchmarks, including a chat with the authors of MixEval, which I wrote about last week, and a new benchmark called LiveBench from Abacus and Yann LeCun. Plus a new video model from Luma and finally SD3, let's go! 👇
    TL;DR of all topics covered:
    * Apple WWDC recap and Apple Intelligence (X)
    * This Week's Buzz
    * AI Engineer expo in SF (June 25-27) come see my talk, it's going to be Epic (X, Schedule)
    * Open Source LLMs
    * Microsoft Samba - 3.8B MAMBA + Sliding Window Attention beating Phi 3 (X, Paper)
    * Sakana AI releases LLM squared - LLMs coming up with preference algorithms to train better LLMs (X, Blog)
    * Abacus + Yann LeCun release LiveBench.AI - impossible to game benchmark (X, Bench)
    * Interview with MixEval folks about achieving 96% arena accuracy at 5000x less price
    * Big CO LLMs + APIs
    * Mistral announced a 600M series B round
    * Revenue at OpenAI DOUBLED in the last 6 months and is now at $3.4B annualized (source)
    * Elon drops lawsuit vs OpenAI
    * Vision & Video
    * Luma drops Dream Machine - SORA-like short video generation in free access (X, TRY IT)
    * AI Art & Diffusion & 3D
    * Stable Diffusion Medium weights are here (X, HF, FAL)
    * Tools
    * Google releases GenType - create an alphabet with diffusion models (X, Try It)
    Apple Intelligence
    Technical LLM details
    Let's dive right into what wasn't shown in the keynote: in a 6 minute deep dive video from the State of the Union for developers, and in a follow up post on their machine learning blog, Apple shared some very exciting technical details about the on-device models and orchestration that will become Apple Intelligence.
    Namely, on device they have trained a bespoke 3B parameter LLM, which was trained on licensed data and uses a bunch of very cutting edge modern techniques to achieve quite incredible on-device performance. Stuff like GQA, speculative decoding, and a very unique type of quantization (which they claim is almost lossless): To maintain model quality, we developed a new framework using LoRA adapters that incorporates a mixed 2-bit and 4-bit configuration strategy — averaging 3.5 bits-per-weight — to achieve the same accuracy as the uncompressed models [...] on iPhone 15 Pro we are able to reach time-to-first-token latency of about 0.6 millisecond per prompt token, and a generation rate of 30 tokens per second (a quick back-of-the-envelope sketch of these numbers follows at the end of this recap).
    These small models (they also have a bespoke image diffusion model) are going to be finetuned with a lot of LoRA adapters for specific tasks like summarization, query handling, mail replies, urgency and more, which gives their foundational models the ability to specialize themselves on the fly to the task at hand, and to be cached in memory as well for optimal performance.
    Personal and Private (including in the cloud)
    While these models are small, they will also benefit from 2 more things on device: a vector store of your stuff (contacts, recent chats, calendar, photos) they call the semantic index, and a new thing Apple is calling App Intents, which developers can expose (and the OS apps already do) and which allow the LLM to use tools like moving files, extracting data across apps, and doing actions. This already makes the AI much more personal and helpful, as it has in its context things about me and what my apps can do on my phone.
    Handoff to the Private Cloud (and then to OpenAI)
    What the local 3B LLM + context can't do, it'll hand off to the cloud, in what Apple claims is a very secure way, called Private Cloud, in which they have built new inference techniques in the cloud, on Apple Silicon, with Secure Enclave and Secure Boot, ensuring that the LLM sessions that run inference on your data are never stored, and that even Apple can't access those sessions, not to mention train their LLMs on your data. Here are some benchmarks Apple posted for their on-device 3B model and an unknown size server model, comparing them to GPT-4-Turbo (not 4o!) on unnamed benchmarks they came up with.
    In cases where Apple Intelligence cannot help you with a request (I'm still unclear when this actually would happen), iOS will now show you a dialog suggesting you use ChatGPT from OpenAI, marking a deal with OpenAI (in which apparently nobody pays anybody, so neither is Apple getting paid by OpenAI to be placed there, nor does Apple pay OpenAI for the additional compute, tokens, and inference).
    Implementations across...
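    To put those on-device numbers in perspective, here's the small back-of-the-envelope sketch referenced above, using only the figures quoted in the post (3B parameters at ~3.5 bits per weight, ~0.6 ms time-to-first-token per prompt token, ~30 tokens/second generation). The prompt and output lengths are made-up examples.

```python
# Back-of-the-envelope numbers for Apple's on-device 3B model, using only the
# figures quoted in the post; prompt/output lengths below are illustrative.
params = 3e9
bits_per_weight = 3.5                      # mixed 2-bit / 4-bit quantization, averaged
weight_bytes = params * bits_per_weight / 8
print(f"Approx. weight footprint: {weight_bytes / 1e9:.2f} GB")   # ~1.31 GB

prompt_tokens, output_tokens = 800, 150    # hypothetical email-summarization request
ttft = 0.6e-3 * prompt_tokens              # ~0.6 ms per prompt token on iPhone 15 Pro
gen_time = output_tokens / 30              # ~30 tokens/second generation
print(f"Time to first token: ~{ttft:.2f} s, total: ~{ttft + gen_time:.2f} s")
```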
    1 hr 46 min
  • 📅 ThursdAI - Jun 6th - 👑 Qwen2 Beats Llama-3! Jina vs. Nomic for Multimodal Supremacy, new Chinese SORA, Suno & Udio user uploads & more AI news
    Jun 7 2024
    Hey hey! This is Alex! 👋 Some podcasts have 1 or maaaybe 2 guests an episode; we had 6! guests today, and each had an announcement, an open source release, or a breaking news story that we covered! (PS, this edition is very multimodal so click into the Substack as videos don't play in your inbox)
    As you know, my favorite thing is to host the folks who make the news and let them do their own announcements, but also hitting that BREAKING NEWS button when something is actually breaking (as in, happened just before or during the show), and I've actually used it 3 times this show! It's not every week that we get to announce a NEW SOTA open model with the team that worked on it. Junyang (Justin) Lin from Qwen is a friend of the pod, a frequent co-host, and today gave us the breaking news of this month, as Qwen2 72B is beating LLama-3 70B on most benchmarks! That's right, a new state of the art open LLM was announced on the show, and Justin went deep into details 👏 (so don't miss this conversation, listen to it wherever you get your podcasts).
    We also chatted about SOTA multimodal embeddings with the Jina folks (Bo Wang and Han Xiao) and Zach from Nomic (a short embedding sketch follows at the end of this recap), dove into an open source compute grant with FAL's Batuhan Taskaya, and much more!
    TL;DR of all topics covered:
    * Open Source LLMs
    * Alibaba announces Qwen 2 - 5 model suite (X, HF)
    * Jina announces Jina-Clip V1 - multimodal embeddings beating CLIP from OAI (X, Blog, Web Demo)
    * Nomic announces Nomic-Embed-Vision (X, Blog)
    * MixEval - arena style rankings that match Chatbot Arena model rankings with 2000× less time (5 minutes) and 5000× less cost ($0.6) (X, Blog)
    * Vision & Video
    * Kling - open access video model SORA competitor from China (X)
    * This Week's Buzz
    * WandB supports Mistral's new finetuning service (X)
    * Register for my June 12 workshop on building Evals with Weave HERE
    * Voice & Audio
    * StableAudio Open - X, Blog, TRY IT
    * Suno launches "upload your audio" feature to a select few - X
    * Udio - upload your own audio feature - X
    * AI Art & Diffusion & 3D
    * Stable Diffusion 3 weights are coming on June 12th (Blog)
    * JasperAI releases Flash Diffusion (X, TRY IT, Blog)
    * Big CO LLMs + APIs
    * Group of ex-OpenAI employees sign a new letter - righttowarn.ai
    * A hacker releases TotalRecall - a tool to extract all the info from the MS Recall feature (Github)
    Open Source LLMs
    Qwen 2 - new SOTA open model from Alibaba (X, HF)
    This is definitely the biggest news for this week, as the folks at Alibaba released a very surprising and super high quality suite of models, spanning from a tiny 0.5B model to a new leader in open models, Qwen 2 72B. To add to the distance from Llama-3, these new models support a wide range of context lengths, all large, with the 7B and 72B supporting up to 128K context. Justin mentioned on stage that actually finding sequences of longer context lengths is challenging, and this is why they are only at 128K.
    In terms of advancements, the highlight is advanced code and math capabilities, which are likely to contribute to overall model advancements across other benchmarks as well. It's also important to note that all models (besides the 72B) are now released with an Apache 2 license to help folks actually use them globally, and speaking of globality, these models have been natively trained with 27 additional languages, making them considerably better at multilingual prompts!
    One additional amazing thing was that a finetune was released by Eric Hartford and the Cognitive Computations team, and AFAIK this is the first time a new model drops with an external finetune. Justin literally said: "It is quite amazing. I don't know how they did that. Well, our teammates don't know how they did that, but, uh, it is really amazing when they use the Dolphin dataset to train it."
    Here are the Dolphin finetune metrics, and you can try it out here.
    Jina-Clip V1 and Nomic-Embed-Vision: SOTA multimodal embeddings
    It's quite remarkable that we got 2 separate SOTAs of a similar thing during the same week, and even cooler that both companies came to talk about it on ThursdAI! First we welcomed back Bo Wang from Jina (joined by Han Xiao, the CEO), and Bo talked about multimodal embeddings that beat OpenAI CLIP (which both conceded was a very low bar).
    Jina Clip V1 is Apache 2 open source, while Nomic Embed is beating it on benchmarks but is CC-BY-NC (non-commercially) licensed; in most cases though, if you're embedding, you'd likely use an API, and both companies offer these embeddings via their respective APIs.
    One thing to note about Nomic is that they have mentioned that these new embeddings are backwards compatible with the awesome Nomic Embed endpoints and embeddings, so if you've used those, now you've gone multimodal! Because these models are fairly small, there are now web versions, thanks to transformers.js, of Jina and Nomic Embed (caution, this ...
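    Here's the short embedding sketch referenced above: scoring a caption against an image with a multimodal embedding model via cosine similarity. The repo id and the encode_text/encode_image method names follow my recollection of the jina-clip-v1 model card, and the image path is hypothetical, so treat all of them as assumptions and check the card before using.

```python
# Hedged sketch: text-image similarity with a multimodal embedding model.
# Repo id and encode_text/encode_image names are assumptions from the jina-clip-v1
# model card; the image filename is a hypothetical local file.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

text_vec = model.encode_text(["a photo of a crab wearing a crown"])[0]
img_vec = model.encode_image(["crowned_crab.jpg"])[0]

# Cosine similarity: higher means the caption matches the image better.
cos = np.dot(text_vec, img_vec) / (np.linalg.norm(text_vec) * np.linalg.norm(img_vec))
print(f"text-image cosine similarity: {cos:.3f}")
```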
    1 hr 44 min
  • 📅 ThursdAI - May 30 - 1000 T/s inference w/ SambaNova, <135ms TTS with Cartesia, SEAL leaderboard from Scale & more AI news
    May 31 2024
    Hey everyone, Alex here! Can you believe it's already the end of May? And that 2 huge AI company conferences are behind us (Google IO, MSFT Build) and Apple's WWDC is just ahead in 10 days! Exciting! I was really looking forward to today's show and had quite a few guests; I'll add all their socials below the TL;DR, so please give them a follow, and if you're only in reading mode of the newsletter, why don't you give the podcast a try 🙂 It's impossible for me to capture here in the newsletter the density of knowledge that's being shared on stage for 2 hours!
    Also, before we dive in, I'm hosting a free workshop soon about building evaluations from scratch; if you're building anything with LLMs in production, you're more than welcome to join us on June 12th (it'll be virtual).
    TL;DR of all topics covered:
    * Open Source LLMs
    * Mistral open weights Codestral - 22B dense coding model (X, Blog)
    * Nvidia open sources NV-Embed-v1 - Mistral based SOTA embeddings (X, HF)
    * HuggingFace Chat with tool support (X, demo)
    * Aider beats SOTA on SWE-Bench with 26% (X, Blog, Github)
    * OpenChat - SOTA finetune of Llama3 (X, HF, Try It)
    * LLM 360 - K2 65B - fully transparent and reproducible (X, Paper, HF, WandB)
    * Big CO LLMs + APIs
    * Scale announces SEAL Leaderboards - with private evals (X, leaderboard)
    * SambaNova achieves >1000T/s on Llama-3 full precision
    * Groq hits back, breaking 1200T/s on Llama-3
    * Anthropic tool support in GA (X, Blogpost)
    * OpenAI adds GPT-4o, Web Search, Vision, Code Interpreter & more for free users (X)
    * Google Gemini & Gemini Flash are topping the evals leaderboards, in GA (X)
    * Gemini Flash finetuning coming soon
    * This week's Buzz (What I learned at WandB this week)
    * Sponsored a Mistral hackathon in Paris
    * We have an upcoming workshop in 2 parts - come learn with me
    * Vision & Video
    * LLama3-V - SOTA OSS VLM (X, Github)
    * Voice & Audio
    * Cartesia AI - super fast SSM based TTS with very good sounding voices (X, Demo)
    * Tools & Hardware
    * Jina Reader (https://jina.ai/reader/)
    * Co-Hosts and Guests
    * Rodrigo Liang (@RodrigoLiang) & Anton McGonnell (@aton2006) from SambaNova
    * Itamar Friedman (@itamar_mar) Codium
    * Arjun Desai (@jundesai) - Cartesia
    * Nisten Tahiraj (@nisten) - Cohost
    * Wolfram Ravenwolf (@WolframRvnwlf)
    * Eric Hartford (@erhartford)
    * Maziyar Panahi (@MaziyarPanahi)
    Scale SEAL leaderboards (Leaderboard)
    Scale AI has announced their new initiative, called SEAL leaderboards, which aims to provide yet another point of reference in how we understand frontier models and their performance against each other. We've of course been sharing LMSys arena rankings here, and the openLLM leaderboard from HuggingFace; however, there are issues with both these approaches, and Scale is approaching the measuring in a different way, focusing on very private benchmarks and datasets curated by their experts (like Riley Goodside).
    The focus of SEAL is private and novel assessments across Coding, Instruction Following, Math, Spanish and more, and the main reason they keep this private is so that models won't be able to train on these benchmarks if they leak to the web, and thus show better performance due to data contamination. They are also using ELO scores (Bradley-Terry; a tiny sketch of this follows at the end of this recap), and I love this footnote from the actual website: "To ensure leaderboard integrity, we require that models can only be featured the FIRST TIME when an organization encounters the prompts"
    This means they are taking the contamination thing very seriously, and it's great to see such dedication to being a trusted source in this space. Specifically interesting: on their benchmarks, GPT-4o is not better than Turbo at coding, and definitely not by 100 points like it was announced by LMSys and OpenAI when they released it!
    Gemini 1.5 Flash (and Pro) in GA and showing impressive performance
    As you may remember from my Google IO recap, I was really impressed with Gemini Flash, and I felt that it went under the radar for many folks. Given its throughput speed, 1M context window, multimodality and price tier, I strongly believed that Google was onto something here. Well, this week not only was I proven right, I didn't actually realize how right I was 🙂 as we heard breaking news from Logan Kilpatrick during the show that the models are now in GA, that Gemini Flash gets upgraded to 1000 RPM (requests per minute), and that finetuning is coming and will be free of charge! Not only will finetuning not cost you anything, inference on your tuned model is going to cost the same, which is very impressive.
    There was a sneaky price adjustment from the announced pricing to the GA pricing that upped the pricing by 2x on output tokens, but even despite that, Gemini Flash at $0.35/1MTok for input and $1.05/1MTok on output is probably the best deal there is right now for LLMs of this level. This week it was also confirmed, both on LMsys and on the Scale SEAL leaderboards, that Gemini Flash is a very good coding LLM, beating Claude Sonnet and LLama-3 70B!
    SambaNova ...
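    Since SEAL scores models with Bradley-Terry / Elo-style ratings, here's the tiny sketch referenced above: the probability that model A beats model B is a logistic function of their rating difference. The ratings below are made-up numbers for illustration, not SEAL's actual scores.

```python
# Tiny Bradley-Terry / Elo sketch: win probability from a rating difference.
# Ratings below are invented for illustration; they are not SEAL's actual scores.
def win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Elo-style Bradley-Terry: P(A beats B) from the rating gap."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# A 100-point gap corresponds to roughly a 64% win rate for the higher-rated model,
# which is why a "100 points better" claim on an arena leaderboard is a big deal.
print(round(win_probability(1300, 1200), 3))  # ~0.64
```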
    1 hr 53 min
  • 📅 ThursdAI - May 23 - OpenAI troubles, Microsoft Build, Phi-3 small/large, new Mistral & more AI news
    May 23 2024
    Hello hello everyone, this is Alex, typing these words from beautiful Seattle (really, it only rained once while I was here!) where I'm attending Microsoft's biggest developer conference, BUILD. This week we saw OpenAI get in the news from multiple angles, none of them positive, while Microsoft clapped back at Google from last week with tons of new AI product announcements (Copilot vs Gemini) and a few new PCs with NPUs (Neural Processing Units) that run alongside the CPU/GPU combo we're familiar with. Those NPUs allow local AI to run on these devices, making them AI native devices!
    While I'm here I also had the pleasure to participate in the original AI Tinkerers, thanks to my friend Joe Heitzberg who operates and runs aitinkerers.org (of which we are a local branch in Denver), and it was amazing to see tons of folks who listen to ThursdAI + read the newsletter, and to talk about Weave and evaluations with all of them! (Btw, on the left is Vik from Moondream, which we covered multiple times.) Ok, let's get to the news:
    TL;DR of all topics covered:
    * Open Source LLMs
    * HuggingFace commits 10M in ZeroGPU (X)
    * Microsoft open sources Phi-3 mini, Phi-3 small (7B), medium (14B) and vision models w/ 128K context (Blog, Demo)
    * Mistral 7B 0.3 - Base + Instruct (HF)
    * LMSys created a "hard prompts" category (X)
    * Cohere for AI releases Aya 23 - 3 models, 101 languages (X)
    * Big CO LLMs + APIs
    * Microsoft Build recap - new AI native PCs, Recall functionality, Copilot everywhere
    * Will post a dedicated episode on this on Sunday
    * OpenAI pauses the GPT-4o Sky voice because Scarlett Johansson complained
    * Microsoft AI PCs - Copilot+ PCs (Blog)
    * Anthropic - Scaling Monosemanticity paper - about mapping the features of an LLM (X, Paper)
    * Vision & Video
    * OpenBNB - MiniCPM-Llama3-V 2.5 (X, HuggingFace)
    * Voice & Audio
    * OpenAI pauses the Sky voice due to ScarJo hiring legal counsel
    * Tools & Hardware
    * Humane is looking to sell (blog)
    Open Source LLMs
    Microsoft open sources Phi-3 mini, Phi-3 small (7B), medium (14B) and vision models w/ 128K context (Blog, Demo)
    Just in time for Build, Microsoft has open sourced the rest of the Phi family of models, specifically the small (7B) and the medium (14B) models, on top of the mini one we already knew as Phi-3. All the models have a small context version (4K and 8K) and a large one that goes up to 128K (though they recommend using the small one if you don't need that whole context), and all can run on device super quick. Those models have an MIT license, so use them as you will, and they deliver incredible performance relative to their size on benchmarks.
    Phi-3 mini received an interesting split in the vibes: it was really good for reasoning tasks, but not very creative in its writing, so some folks dismissed it. But it's hard to dismiss these new releases, especially when the benchmarks are that great! LMsys just updated their arena to include a hard prompts category (X), which selects for complex, specific and knowledge based prompts and scores the models on those. Phi-3 mini actually gets a big boost in ELO ranking when filtered on hard prompts and beats GPT-3.5 😮 Can't wait to see how the small and medium versions perform on the arena.
    Mistral gives us function calling in the Mistral 0.3 update (HF)
    Just in time for the Mistral hackathon in Paris, Mistral has released an update to the 7B model (and will likely update the MoE 8x7B and 8x22B Mixtrals) with function calling and a new vocab.
    This is awesome all around, because function calling is important for agentic capabilities, and it's about time all companies have it. Apparently the way Mistral has it built in matches the Cohere Command R way, and it's already supported in Ollama, using raw mode (a hedged raw-mode sketch follows at the end of this recap).
    Big CO LLMs + APIs
    OpenAI is not having a good week - Sky voice is paused, employees complain
    OpenAI is in hot water this week, starting with pausing the Sky voice (arguably the best, most natural sounding voice out of the ones that launched) due to complaints from Scarlett Johansson about this voice being similar to hers. Scarlett's appearance in the movie Her, and Sam Altman tweeting "her" to celebrate the release of the incredible GPT-4o voice mode, were all talked about when ScarJo released a statement saying she was shocked when her friends and family told her that OpenAI's new voice mode sounds just like her. Spoiler: it doesn't really, and they hired an actress and have had this voice out since September last year, as they outlined in their blog following ScarJo's complaint.
    Now, whether or not there's legal precedent here, given that Sam Altman reached out to Scarlett twice, including once a few days before the event, I won't speculate, but for me personally, not only does Sky not sound like ScarJo, it was my favorite voice even before they demoed it, and I'm really sad that it's paused, and I think it's unfair to the actress who was hired for her voice. See her own statement:
    Microsoft Build - Copilot all the things
    I have recorded a Build recap with Ryan Carson from...
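    Here's the hedged raw-mode sketch referenced above: sending a hand-built tool prompt to a local Mistral 0.3 through Ollama's raw mode, which bypasses Ollama's own prompt template so control tokens pass through untouched. The model tag and the [AVAILABLE_TOOLS]/[TOOL_CALLS] token spelling are assumptions from memory; check the Mistral tokenizer docs before relying on them.

```python
# Hedged sketch: poking at Mistral 0.3's function-calling tokens through Ollama's
# raw mode. The model tag and the [AVAILABLE_TOOLS]/[TOOL_CALLS] token spelling
# are assumptions; verify against Mistral's tokenizer docs and Ollama's model tags.
import json
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # hypothetical tool
        "description": "Get the current weather for a city",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]

prompt = (
    f"[AVAILABLE_TOOLS]{json.dumps(tools)}[/AVAILABLE_TOOLS]"
    "[INST] What's the weather in Seattle? [/INST]"
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral:7b-instruct-v0.3", "prompt": prompt, "raw": True, "stream": False},
)
print(resp.json()["response"])   # hoping for a [TOOL_CALLS] block naming get_weather
```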
    1 hr 43 min
  • 📅 ThursdAI - May 16 - OpenAI GPT-4o, Google IO recap, LLama3 hackathon, Yi 1.5, Nous Hermes Merge & more AI news
    May 17 2024
    Wow, holy s**t, insane, overwhelming, incredible, "the future is here!", "still not there", there are many more words to describe this past week. (TL;DR at the end of the blogpost)
    I had a feeling it was going to be a big week, and the companies did NOT disappoint, so this is going to be a very big newsletter as well. As you may have read last week, I was very lucky to be in San Francisco the weekend before Google IO, to co-host a hackathon with the Meta LLama-3 team, and it was a blast. I will add my notes on that in the This week's Buzz section.
    Then on Monday, we all got to watch the crazy announcements from OpenAI, namely a new flagship model called GPT-4o (we were right, it previously was im-also-a-good-gpt2-chatbot) that's twice as fast, 50% cheaper (in English; significantly more so in other languages, more on that later) and is Omni (that's the o), which means it is end to end trained with voice, vision and text on inputs, and can generate text, voice and images on the output. A true MMIO (multimodal on inputs and outputs, that's not the official term) is here, and it has some very very surprising capabilities that blew us all away, namely the ability to ask the model to "talk faster", put "more sarcasm in your voice" or "sing like a pirate". Though we didn't yet get that functionality with the GPT-4o model, it is absolutely and incredibly exciting. Oh, and it's available to everyone for free! That's GPT-4 level intelligence, for free, for everyone, without having to log in!
    What's also exciting was how immediate it was. Apparently, not only is the model itself faster (unclear if it's due to newer GPUs or distillation or some other crazy advancements, or all of the above), but training an end to end omnimodel reduces the latency, making it an incredibly immediate conversation partner, one that you can interrupt, ask to recover from a mistake, and that can hold a conversation very very well. So well that it indeed seemed like the Waifu future (digital girlfriends/wives) is very close for some folks who would want it, though we didn't get to try it (we got GPT-4o but not the new voice mode, as Sam confirmed). OpenAI released a bunch of videos of their employees chatting with Omni (that's my nickname, use it if you'd like) and many online highlighted how thirsty / flirty it sounded. I downloaded all the videos for an X thread and I named one girlfriend.mp4, and well, just judge for yourself why:
    Ok, that's not all that OpenAI updated or shipped. They also updated the tokenizer, which is incredible news to folks all around, specifically the rest of the world. The new tokenizer reduces the previous "foreign language tax" by a LOT, making the model way way cheaper for the rest of the world as well (a quick tokenizer comparison sketch follows at the end of this recap).
    One last announcement from OpenAI was the desktop app experience, and this one I actually got to use a bit, and it's incredible. MacOS only for now, this app comes with a launcher shortcut (kind of like Raycast) that lets you talk to ChatGPT right then and there, without opening a new tab, without additional interruptions, and it can even understand what you see on the screen, help you understand code, or jokes, or look up information. Here's just one example I just had over at X. And sure, you could always do this with another tab, but the ability to do it without a context switch is a huge win.
    OpenAI had to do their demo 1 day before Google IO, but even during the excitement about Google IO, they announced that Ilya is not only alive, but is also departing from OpenAI, which was followed by an announcement from Jan Leike (who co-headed the superalignment team together with Ilya) that he left as well. This to me seemed like well executed timing to dampen the Google news a bit.
    Google is BACK, backer than ever: Alex's Google IO recap
    On Tuesday morning I showed up to Shoreline theater in Mountain View, together with a creators/influencers delegation, as we all watched the incredible firehose of announcements that Google had prepared for us. TL;DR - Google is adding Gemini and AI into all its products across Workspace (Gmail, Chat, Docs) and into other cloud services like Photos, where you'll now be able to ask your photo library for specific moments. They introduced over 50 product updates, and I don't think it makes sense to cover all of them here, so I'll focus on what we do best. "Google will do the Googling for you"
    Gemini 1.5 Pro is now their flagship model (remember Ultra? where is that? 🤔) and has been extended to 2M tokens in the context window! Additionally, we got a new model called Gemini Flash, which is way faster and very cheap (up to 128K, then it becomes 2x more expensive). Gemini Flash is multimodal as well and has a 1M context window, making it an incredible deal if you have any types of videos to process, for example.
    Kind of hidden but important was a caching announcement, which IMO is a big deal, big enough it could pose a serious risk to RAG based companies. Google has claimed they have a way to introduce caching of the LLM ...
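    To see the tokenizer change for yourself, here's the quick comparison sketch referenced above, using the tiktoken library to count tokens with GPT-4o's new encoding ("o200k_base") versus the older GPT-4 one ("cl100k_base"). The sample sentence is just an example; run it on your own language to see the difference.

```python
# Compare token counts between GPT-4o's new tokenizer and the older GPT-4 one.
# Fewer tokens on non-English text is the "foreign language tax" reduction.
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 / GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")    # GPT-4o

sample = "שלום חברים, מה קורה השבוע בעולם הבינה המלאכותית?"  # example Hebrew sentence
print("cl100k_base tokens:", len(old_enc.encode(sample)))
print("o200k_base tokens:", len(new_enc.encode(sample)))
```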
    1 hr 54 min