Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions https://arxiv.org/abs/2505.18492
arXiv.org
Enumerate-Conjecture-Prove: Formally Solving Answer-Construction...
Mathematical reasoning lies at the heart of artificial intelligence, underpinning applications in education, program verification, and research-level mathematical discovery. Mathematical...
Quantized Transport of Disordered Superconducting Fractional Quantum Hall Edges https://arxiv.org/abs/2505.20398
arXiv.org
Quantized Transport of Disordered Superconducting $ν=2/3$...
The $ν=2/3$ fractional quantum Hall (FQH) edge states, which have counter-propagating modes, are known to flow under relevant neutral disorders into a stable Kane-Fisher-Polchinski (KFP)...
Generalization Bias in Large Language Model Summarizati https://arxiv.org/abs/2504.00025
arXiv.org
Generalization Bias in Large Language Model Summarization of...
Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize...
❤1👎1
Forwarded from Hacker News
A Lean companion to Analysis I (Score: 150+ in 6 hours)
Link: https://readhacker.news/s/6vp2P
Comments: https://readhacker.news/c/6vp2P
Link: https://readhacker.news/s/6vp2P
Comments: https://readhacker.news/c/6vp2P
What's new
A Lean companion to “Analysis I”
Almost 20 years ago, I wrote a textbook in real analysis called “Analysis I”. It was intended to complement the many good available analysis textbooks out there by focusing more on foun…
❤3
Forwarded from Боря программирует
Training superhuman coding models at Cursor
Случайно наткнулся на видео, где ребята из Cursor обсуждают всякое разное про LLM. Обычно в подобных подкастах все высказывания очень поверхносные, чтобы случайно не выдать каких-нибудь секретов. А тут на удивление упомянули довольно много технических деталей.
Краткий список затронутых тем:
- Как делать RL, когда нет одного правильного ответа?
- Что делать, если вероятность получить "правильный" ответ очень маленькая?
- Как сделать, чтобы модель могла ориентироваться в большом проекте?
- Как поддерживать long context?
- Как делать credit assignment для memory tool?
- Как cursor может обучаться на пользовательских данных.
- Почему плохо смотреть на лайки/дизлайки ответов.
- Какая инфра нужна для больших RL тренировок.
Судя по количеству просмотров, если сам этим не занимаешься, то смотреть не очень интересно. Но мне понравилось!
Случайно наткнулся на видео, где ребята из Cursor обсуждают всякое разное про LLM. Обычно в подобных подкастах все высказывания очень поверхносные, чтобы случайно не выдать каких-нибудь секретов. А тут на удивление упомянули довольно много технических деталей.
Краткий список затронутых тем:
- Как делать RL, когда нет одного правильного ответа?
- Что делать, если вероятность получить "правильный" ответ очень маленькая?
- Как сделать, чтобы модель могла ориентироваться в большом проекте?
- Как поддерживать long context?
- Как делать credit assignment для memory tool?
- Как cursor может обучаться на пользовательских данных.
- Почему плохо смотреть на лайки/дизлайки ответов.
- Какая инфра нужна для больших RL тренировок.
Судя по количеству просмотров, если сам этим не занимаешься, то смотреть не очень интересно. Но мне понравилось!
🔥12❤2😱1
How to factor 2048 bit RSA integers with less than a million noisy qubits https://arxiv.org/abs/2505.15917
arXiv.org
How to factor 2048 bit RSA integers with less than a million noisy qubits
Planning the transition to quantum-safe cryptosystems requires understanding the cost of quantum attacks on vulnerable cryptosystems. In Gidney+Ekerå 2019, I co-published an estimate stating...
Forwarded from Axis of Ordinary
This media is not supported in your browser
VIEW IN TELEGRAM
We made Claude, Gemini, o3 battle each other for world domination.
We taught them Diplomacy—the strategy game where winning requires alliances, negotiation, and betrayal.
Here's what happened:
DeepSeek turned warmongering tyrant. Claude couldn't lie—everyone exploited it ruthlessly. Gemini 2.5 Pro nearly conquered Europe with brilliant tactics. Then o3 orchestrated a secret coalition, backstabbed every ally, and won.
More: https://every.to/diplomacy
❤3👍2
Scene-Centric Unsupervised Panoptic Segmentation https://openaccess.thecvf.com/content/CVPR2025/html/Hahn_Scene-Centric_Unsupervised_Panoptic_Segmentation_CVPR_2025_paper.html
❤3
A 2D-CFT Factory: Critical Lattice Models from Competing Anyon Condensation Processes in SymTO/SymTFT https://arxiv.org/abs/2506.05324
arXiv.org
A 2D-CFT Factory: Critical Lattice Models from Competing Anyon...
In this paper, we introduce a ``CFT factory'' : a novel algorithm of methodically generating 2D lattice models that would flow to 2D conformal fixed points in the infrared. These 2D models are...
Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet) https://crfm.stanford.edu/2025/05/28/fast-kernels.html
👍3
Bulk Excitations of Invertible Phases https://arxiv.org/abs/2506.11288
arXiv.org
Bulk Excitations of Invertible Phases
Recent developments in the study of topological defects highlight the importance of understanding the multi-dimensional structure of bulk excitations inside a quantum system. When the bulk ground...
❤1
Just links
https://livecodebenchpro.com/
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? https://arxiv.org/abs/2506.11928
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions https://arxiv.org/abs/2506.09038
arXiv.org
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries,...
😁7👍3
LLM-First Search: Self-Guided Exploration of the Solution Space https://arxiv.org/abs/2506.05213
arXiv.org
LLM-First Search: Self-Guided Exploration of the Solution Space
Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute, often by framing problem-solving as a search process. While...
❤1
Is there a Half-Life for the Success Rates of AI Agents? https://www.tobyord.com/writing/half-life
Toby Ord
Is there a Half-Life for the Success Rates of AI Agents? — Toby Ord
Building on the recent empirical work of Kwa et al. (2025), I show that within their suite of research-engineering tasks the performance of AI agents on longer-duration tasks can be explained by an extremely simple mathematical model — a constant rate of…
👍2🔥1
Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models https://arxiv.org/abs/2506.11487
arXiv.org
Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models
Recent advancements, such as DeepSeek-Prover-V2-671B and Kimina-Prover-Preview-72B, demonstrate a prevailing trend in leveraging reinforcement learning (RL)-based large-scale training for...
🔥1
CyberGym Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale https://www.cybergym.io/
www.cybergym.io
CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale
CyberGym is a large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 historical vulnerabilities from 188 large software projects.