Just links

Training superhuman coding models at Cursor

I stumbled upon a video where people from Cursor discuss all sorts of things about LLMs. In podcasts like this the statements are usually very superficial, so as not to accidentally give away any secrets. Here, surprisingly, they mentioned quite a lot of technical detail.

A short list of the topics covered:
- How do you do RL when there is no single correct answer?
- What do you do if the probability of getting a "correct" answer is very small?
- How do you make the model able to navigate a large project?
- How do you support long context?
- How do you do credit assignment for a memory tool?
- How Cursor can train on user data.
- Why looking at likes/dislikes on answers is a bad idea.
- What infrastructure is needed for large RL training runs.

Judging by the view count, it is not very interesting to watch unless you work on this yourself. But I liked it!

https://mlcommons.org/benchmarks/training/

MLCommons

Benchmark MLPerf Training | MLCommons Version 2.0 Results

The MLPerf Benchmark Suites measures how fast machine learning systems can train models to a target quality metric using v2.0 results.

How to factor 2048 bit RSA integers with less than a million noisy qubits https://arxiv.org/abs/2505.15917

Planning the transition to quantum-safe cryptosystems requires understanding the cost of quantum attacks on vulnerable cryptosystems. In Gidney+Ekerå 2019, I co-published an estimate stating...

Forwarded from Axis of Ordinary

We made Claude, Gemini, o3 battle each other for world domination.

We taught them Diplomacy—the strategy game where winning requires alliances, negotiation, and betrayal.

Here's what happened:

DeepSeek turned warmongering tyrant. Claude couldn't lie—everyone exploited it ruthlessly. Gemini 2.5 Pro nearly conquered Europe with brilliant tactics. Then o3 orchestrated a secret coalition, backstabbed every ally, and won.

More: https://every.to/diplomacy

Scene-Centric Unsupervised Panoptic Segmentation https://openaccess.thecvf.com/content/CVPR2025/html/Hahn_Scene-Centric_Unsupervised_Panoptic_Segmentation_CVPR_2025_paper.html

A 2D-CFT Factory: Critical Lattice Models from Competing Anyon Condensation Processes in SymTO/SymTFT https://arxiv.org/abs/2506.05324

In this paper, we introduce a "CFT factory": a novel algorithm of methodically generating 2D lattice models that would flow to 2D conformal fixed points in the infrared. These 2D models are...

Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet) https://crfm.stanford.edu/2025/05/28/fast-kernels.html

Bulk Excitations of Invertible Phases https://arxiv.org/abs/2506.11288

Recent developments in the study of topological defects highlight the importance of understanding the multi-dimensional structure of bulk excitations inside a quantum system. When the bulk ground...

Forwarded from Vladislav 🇺🇸🚜

https://livecodebenchpro.com/

LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? https://arxiv.org/abs/2506.11928

AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions https://arxiv.org/abs/2506.09038

For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries,...

LLM-First Search: Self-Guided Exploration of the Solution Space https://arxiv.org/abs/2506.05213

Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute, often by framing problem-solving as a search process. While...

Is there a Half-Life for the Success Rates of AI Agents? https://www.tobyord.com/writing/half-life

Toby Ord

Building on the recent empirical work of Kwa et al. (2025), I show that within their suite of research-engineering tasks the performance of AI agents on longer-duration tasks can be explained by an extremely simple mathematical model — a constant rate of…

Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models https://arxiv.org/abs/2506.11487

Recent advancements, such as DeepSeek-Prover-V2-671B and Kimina-Prover-Preview-72B, demonstrate a prevailing trend in leveraging reinforcement learning (RL)-based large-scale training for...

CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale https://www.cybergym.io/

CyberGym is a large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 historical vulnerabilities from 188 large software projects.
