Short paper

Measuring What Matters: An Aggregate Metric for Assessing Enterprise Code Summaries

Abstract

Evaluating the quality of code summaries is essential for enterprise software, where the complexity and scale of codebases introduce unique challenges that are inadequately addressed by existing public code datasets and evaluation methods. These methods, typically designed for small and straightforward code snippets, often overlook critical issues such as repetitiveness, verbosity, and incompleteness—issues that are particularly prominent in enterprise-level code summaries. While correctness has been extensively studied, other dimensions critical to enterprise contexts, such as distinctiveness and completeness, remain underexplored. To address these gaps, we propose a novel evaluation framework built on aggregated metrics tailored to enterprise needs, prioritizing both distinctiveness and completeness. The framework introduces metrics designed to penalize verbosity and redundancy while rewarding informativeness and alignment with the underlying code. Initial experiments on human-annotated enterprise Java datasets demonstrate the effectiveness of our approach, improving RMSE by 7.4% over the baselines. Correlation studies of our distinctiveness and completeness metrics with human ratings also show improvements of 32% and 5.3%, respectively, over the baselines.
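To make the intended behavior of such an aggregate metric concrete, the sketch below combines a distinctiveness term, a completeness term, and a length-based verbosity penalty using simple token-overlap proxies. This is a minimal illustrative sketch only: the function names, weights, and formulas are assumptions for exposition and are not the metric definitions proposed in this paper.

```python
# Illustrative sketch: a toy aggregate score with distinctiveness,
# completeness, and a verbosity penalty based on token overlap.
# All names, weights, and formulas here are hypothetical.
import re


def _tokens(text: str) -> set:
    """Lowercased word tokens, used as a crude content proxy."""
    return set(re.findall(r"[a-zA-Z_]\w*", text.lower()))


def distinctiveness(summary: str, sibling_summaries: list) -> float:
    """1 minus the maximum Jaccard overlap with sibling summaries,
    penalizing boilerplate text repeated across methods or classes."""
    s = _tokens(summary)
    if not s or not sibling_summaries:
        return 1.0
    overlaps = [
        len(s & _tokens(o)) / len(s | _tokens(o))
        for o in sibling_summaries if _tokens(o)
    ]
    return 1.0 - (max(overlaps) if overlaps else 0.0)


def completeness(summary: str, code: str) -> float:
    """Fraction of code identifiers mentioned by the summary, a rough
    stand-in for covering the code's salient behavior."""
    code_ids = _tokens(code)
    if not code_ids:
        return 1.0
    return len(_tokens(summary) & code_ids) / len(code_ids)


def aggregate_score(summary: str, code: str, siblings: list,
                    w_dist: float = 0.5, w_comp: float = 0.5,
                    max_len: int = 40) -> float:
    """Weighted combination with a soft penalty for overly long summaries."""
    length_penalty = min(1.0, max_len / max(len(summary.split()), 1))
    base = (w_dist * distinctiveness(summary, siblings)
            + w_comp * completeness(summary, code))
    return base * length_penalty


if __name__ == "__main__":
    code = "public int addItem(Cart cart, Item item) { cart.add(item); return cart.size(); }"
    summary = "Adds an item to the cart and returns the new cart size."
    siblings = ["Removes an item from the cart.", "Empties the cart."]
    print(round(aggregate_score(summary, code, siblings), 3))
```

In this toy formulation, a verbose or copy-pasted summary scores lower because the length penalty and the distinctiveness term both shrink, while a summary that names more of the code's entities scores higher on completeness; the paper's actual metrics may use different signals and weighting.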

Related