Causally Reliable Concept Bottleneck Models. Giovanni De Felice, Arianna Casanova Flores, et al. 2025. NeurIPS 2025.
Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods. Isha Puri, Shivchander Sudalairaj, et al. 2025. NeurIPS 2025.
BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks. Anna Sokol, Elizabeth Daly, et al. 2025. NeurIPS 2025.
Musings on AI Muses: Support for Human Creativity. John Richards, Jacquelyn Martino, et al. 2025. NeurIPS 2025.
MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation. Basel Shbita, Farhan Ahmed, et al. 2025. NeurIPS 2025.
Verifiable Chemical Reasoning through Tool-Calling Agentic Workflow. Gabrielle Gaudeau, Shinnosuke Tanaka, et al. 2025. NeurIPS 2025.
Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation. Jung koo Kang. 2025. NeurIPS 2025.
Uncertainty-Aware Prediction of Climate Extremes Using Fine-Tuned Time-Series Foundation Models. Imran Nasim, Joao Lucas de Sousa Almeida. 2025. NeurIPS 2025.
SafeCOMM: Investigating Safety Degradation in Fine-Tuned Telecom Large Language Models. Aladin Djuhera, Swanand Ravindra Kadhe, et al. 2025. NeurIPS 2025.