Rui Zhang, Alan J. Bivens
HPDC 2007
In distributed, service-oriented environments, performance problem localization is required to provide self-healing capabilities and deliver the desired quality of service (QoS). This paper presents an automated approach to identifying system elements causing performance problems. Applying probabilistic inference to collected response time and elapsed time data, the approach 1) infers elapsed time for services where data is missing, 2) estimates the response time degradation caused by different services using the duration, abnormality and response time correlation of their elapsed times, and 3) identifies the services that are the most important causes of slow response time and yield the most benefit if recovered. The approach has been used to localize a performance problem on the test bed of a real-world service-oriented Grid. Evaluation using simulations shows that the approach consistently achieves better accuracy than traditional techniques in various service-oriented settings. Copyright 2007 ACM.
Rui Zhang, Bruno C. D. S. Oliveira, et al.
INFOSCALE 2007
Zhang Rui, Steve Moyle, et al.
CCGrid 2005
Pradeep Varma
SAC 2007