IBM at ICSE 2025

About

ICSE, the IEEE/ACM International Conference on Software Engineering, is the premier software engineering conference. IBM Research is excited to be a Platinum sponsor of ICSE this year. We invite all attendees to visit our booth during the event, from Wednesday, April 30 to Friday, May 2.

We look forward to meeting you and telling you more about our latest work and career opportunities at IBM Research. At our booth we’ll be demoing projects on a broad range of AI topics.

Presentation times for conference workshops, demos, papers, and tutorials can be found in the agenda section at the bottom of this page. Note: all times are displayed in your local time.

Congratulations to the IBM team on winning the Distinguished Paper Award at ICSE SEIP for ASTER: Natural and Multi-language Unit Test Generation with LLMs.

Learn more about our work in AI for Code.

Read our accepted papers at ICSE 2025

Career opportunities

Visit us at the IBM booth to meet IBM researchers and learn what it’s like to work at IBM and about future job opportunities.

Explore all current IBM Research openings

Agenda

  • Abstract

    Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal to or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.

    (See the illustrative model-model agreement sketch after the agenda below.)

  • Abstract

    One of the central tasks in software maintenance is understanding and developing code changes. For example, given a natural language description of the desired new operation of a function, a (human or AI) agent might be asked to generate the set of edits to that function that implement the desired new operation; likewise, given a set of edits to a function, an agent might be asked to generate a changed description of that function’s new workings. Thus, there is an incentive to train a neural model for change-related tasks. Motivated by this, we offer a new, “natural”, large dataset of coupled changes to code and documentation mined from actual high-quality GitHub projects, where each sample represents a single commit in which the code and the associated docstring were changed together. We present the methodology for gathering the dataset, and some sample, challenging (but realistic) tasks where our dataset provides opportunities for both learning and evaluation. We find that current models (specifically Llama 3.1 405B and Mixtral 8x22B) find these maintenance-related tasks challenging.

    (See the illustrative dataset-record sketch after the agenda below.)

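The first abstract above rests on inter-rater agreement: model-model agreement is proposed as a predictor of whether a task is suitable for LLM annotation at all. As a rough illustration of that idea (a minimal sketch, not the paper's code), the snippet below computes Cohen's kappa between two hypothetical sets of LLM labels; the label values and samples are invented for the example.

```python
# Minimal sketch: Cohen's kappa between two LLM "annotators" on the same items,
# used as a model-model agreement score. The data below is made up.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical binary annotations ("correct"/"incorrect") from two models
# on ten samples; in the paper's setting these would come from LLM queries.
model_1 = ["correct", "correct", "incorrect", "correct", "incorrect",
           "correct", "correct", "incorrect", "correct", "correct"]
model_2 = ["correct", "incorrect", "incorrect", "correct", "incorrect",
           "correct", "correct", "correct", "correct", "correct"]

print(f"Model-model agreement (Cohen's kappa): {cohen_kappa(model_1, model_2):.2f}")
# Low agreement between models would suggest the task may not be a good fit
# for LLM annotation; high agreement supports substituting some human effort.
```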
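
The second abstract describes a dataset of coupled code and docstring changes mined from single commits. The sketch below shows one plausible way such a sample and the docstring-update task could be represented; the `CoupledChange` fields, the prompt wording, and the example values are assumptions for illustration, not the dataset's actual schema.

```python
# Minimal sketch (assumed format, not the released dataset schema): one record
# coupling a commit's code edit with its docstring edit, plus a helper that
# frames the "update the docstring to match the code change" task as a prompt.
from dataclasses import dataclass

@dataclass
class CoupledChange:
    repo: str            # GitHub project the commit came from
    commit_sha: str      # commit in which code and docstring changed together
    old_code: str        # function body before the change
    new_code: str        # function body after the change
    old_docstring: str   # docstring before the change
    new_docstring: str   # docstring after the change (reference for evaluation)

def docstring_update_prompt(sample: CoupledChange) -> str:
    """Build a prompt asking a model to produce the updated docstring."""
    return (
        "The following function was changed.\n"
        f"Old code:\n{sample.old_code}\n\n"
        f"New code:\n{sample.new_code}\n\n"
        f"Old docstring:\n{sample.old_docstring}\n\n"
        "Write the updated docstring that describes the new behavior."
    )

# Hypothetical sample for illustration only.
sample = CoupledChange(
    repo="example/project",
    commit_sha="abc123",
    old_code="def area(r):\n    return 3.14 * r * r",
    new_code="import math\n\ndef area(r):\n    return math.pi * r * r",
    old_docstring="Approximate circle area using 3.14.",
    new_docstring="Circle area using math.pi.",
)
print(docstring_update_prompt(sample))
```

The reverse task in the abstract (given a description of the new behavior, generate the code edits) could be framed symmetrically from the same record, with `new_docstring` as input and `new_code` as the reference output.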
