
Xin Li, Kai Shen, Michael C. Huang, and Lingkun Chu
We conducted measurements in three distinct system environments: a rack-mounted server farm for a popular Internet service (the Ask.com search engine), a set of office desktop computers (Univ. of Rochester), and a geographically distributed network testbed (PlanetLab). Our preliminary measurement on over 300 machines for varying multi-month periods finds 2 suspected soft errors. In particular, our result on the Internet servers indicates that, with high probability, the soft error rate is at least two orders of magnitude lower than those reported previously. We provide discussions that attribute the low error rate to several factors in today's production system environments. In contrast, our measurement unintentionally discovered permanent (or hard) memory faults on 9 out of 212 Ask.com machines, suggesting the relative commonness of hard memory faults.
Environmental noise can affect the operation of microelectronics and create soft errors. As opposed to a "hard" error, a soft error does not leave lasting effects once it is corrected or the machine restarts. A primary noise mechanism in today's machines is particle strikes. Particles hitting the silicon chip create electron-hole pairs which, through diffusion, can collect at circuit nodes; if the collected charge outweighs the charge stored, it flips the logical state, resulting in an error. The soft error problem at sea level was first discovered by Intel in 1978 [9].
Understanding the memory soft error rate is an important part of assessing whole-system reliability. In the presence of inexplicable system failures, software developers and system administrators sometimes point to possible occurrences of soft errors without solid evidence. As another motivating example, recent studies have investigated the influence of soft errors on software systems [10] and parallel applications [5], based on presumably known soft error rates and occurrence patterns. Understanding realistic error occurrences would help quantify the results of such studies.
A number of soft error measurement studies have been performed in the past. Probably the most extensive test results published were from IBM [12,14-16]. In particular, in a 1992 test, IBM reported an error rate of 5950 FIT (Failures In Time; specifically, errors per 10^9 hours of operation) for a vendor 4 Mbit DRAM. The most recently published results that we are aware of were based on tests in 2001 at Sony and Osaka University [8]. They tested 0.18 um and 0.25 um SRAM devices to study the influence of altitude, technology, and different sources of particles on the soft error rate, though the paper does not report any absolute error rate. To the best of our knowledge, Normand's 1996 paper [11] reported the only field test on production systems. In one 4-month test, they found 4 errors on 4 machines with 8.8 Gbit of memory in total. In another 30-week test, they found 2 errors on 1 machine with 1 Gbit of memory. More recently, Tezzaron [13] collected error rates reported by various sources and concluded that 1000-5000 FIT per Mbit would be a reasonable error rate for modern memory devices. In summary, these studies all suggest soft error rates in the range of 200-5000 FIT per Mbit.
Most of the earlier measurements (except [8]) were over a decade old and most of them (except [11]) were conducted in artificial computing environments where the target devices are dedicated to the measurement. Given the scaling of technology and the countermeasures deployed at different levels of system design, the trends of error rate in real-world systems are not clear. Less obvious environmental factors may also play a role. For example, the way a machine is assembled and packaged, as well as the memory chip layout on the main computer board, can affect the chance of particle strikes and consequently the error rate.
We believe it is desirable to measure memory soft errors in today's representative production system environments. Measurement on production systems poses significant challenges. The infrequent nature of soft errors demands long-term monitoring. As such, our measurement must not introduce any noticeable performance impact on the existing running applications. Additionally, to achieve wide deployment of such measurements, we must consider cases where we do not have administrative control over the measured machines. In such cases, we cannot perform any task requiring privileged access and our measurement tool can run only at user level. The rest of this paper describes our measurement methodology, the deployed measurements in production systems, our preliminary results, and the result analysis.
We present two soft error measurement approaches targeting different production system environments. The first approach, memory controller direct checking, requires administrative control on the machine and works only with ECC memory. The second approach, non-intrusive user-level monitoring, does not require administrative control and works best with non-ECC memory. For each approach, we describe its methodology and implementation, and analyze its performance impact on applications already running in the system.
An ECC memory module contains extra circuitry storing redundant information. Typically it implements single error correction and double error detection (SEC-DED). When an error is encountered, the memory controller hub (a.k.a. the Northbridge) records the necessary error information in special-purpose registers. Meanwhile, if the error involves a single bit, it is corrected automatically by the controller. The memory controller typically signals the BIOS firmware when an error is discovered. BIOS error-recording policies vary significantly from machine to machine. In most cases, single-bit errors are ignored and never recorded. The BIOS typically clears the error information in the memory controller registers on receiving error signals. Due to this BIOS error handling, the operating system would not be directly informed of memory errors without reconfiguring the memory controller.
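To make the SEC-DED behavior concrete, here is a toy extended-Hamming (8,4) code; this is an illustrative sketch only, since real ECC DIMMs use a (72,64) code, but the correct-one-bit/detect-two-bits logic is the same:

```python
# Toy SEC-DED code: extended Hamming (8,4) -- 4 data bits, 3 Hamming parity
# bits, plus one overall parity bit for double-error detection.

def encode(data):
    """data: 4 bits -> 8-bit codeword (7-bit Hamming word + overall parity)."""
    code = [0] * 8                       # positions 1..7 used, index 0 unused
    for pos, bit in zip((3, 5, 6, 7), data):
        code[pos] = bit                  # data bits go to non-power-of-2 slots
    for p in (1, 2, 4):                  # parity bit p covers positions i with i & p
        code[p] = 0
        for i in range(1, 8):
            if (i & p) and i != p:
                code[p] ^= code[i]
    word = code[1:8]
    word.append(sum(word) % 2)           # overall parity makes total weight even
    return word

def decode(word):
    """Return ('ok' | 'corrected' | 'detected', codeword or None)."""
    syndrome = 0
    for i in range(1, 8):
        if word[i - 1]:
            syndrome ^= i                # XOR of positions holding a 1
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        return "ok", word
    if overall == 1:                     # odd error weight => single-bit error
        fixed = list(word)
        fixed[syndrome - 1 if syndrome else 7] ^= 1
        return "corrected", fixed
    return "detected", None              # even weight, nonzero syndrome => double error

codeword = encode([1, 0, 1, 1])
flipped = list(codeword)
flipped[2] ^= 1                          # simulate a single-bit upset
print(decode(flipped)[0])                # -> corrected
```

Flipping any single bit (including the parity bits themselves) is corrected back to the original codeword, while any two flips are reported as detected-but-uncorrectable, mirroring the controller behavior described above.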
Our memory controller direct checking of soft errors includes two components: hardware configuration, which enables error detection (and, where available, memory scrubbing) in the memory controller, and software probing, which periodically reads and then clears the controller's error-recording registers.
Both hardware configuration and software probing in this approach require administrative privilege. The implementation involves modifications to the memory controller driver inside the OS kernel. The functionality of our implementation is similar to that of the Bluesmoke tool [3] for Linux. The main difference concerns exposing additional error information for our monitoring purpose.
In this approach, the potential performance impact on existing running applications includes the software overhead of controller register probing and the memory bandwidth consumed by scrubbing. With a low frequency of memory scrubbing and software probing, this measurement approach has a negligible impact on running applications.
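For reference, modern Linux kernels expose the same controller error counts to user level through the EDAC driver (the successor of the Bluesmoke tool cited above). A minimal polling sketch, assuming the standard `/sys/devices/system/edac/mc` sysfs layout; the root directory is parameterized so the scan can be pointed at a test tree:

```python
# Sketch: polling corrected/uncorrected error counters exported by the
# Linux EDAC driver. Each memory controller appears as a directory mcN
# containing ce_count (corrected errors) and ue_count (uncorrected errors).
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller name: (corrected, uncorrected)} per memory controller."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        ce = int((mc / "ce_count").read_text())
        ue = int((mc / "ue_count").read_text())
        counts[mc.name] = (ce, ue)
    return counts
```

A monitoring loop would call `read_edac_counts()` periodically and log any increase in the counters.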
Our second approach employs a user-level tool that transparently recruits memory on the target machine and periodically checks for any unexpected bit flips. Since our monitoring program competes for memory with running applications, the primary issue in this approach is to determine an appropriate amount of memory for monitoring. Recruiting more memory makes the monitoring more effective. However, we must leave enough memory so that the performance impact on other running applications is limited. This is important since we target production systems hosting real live applications and our monitoring must be long-running to be effective.
This approach does not require administrative control on the target machine. At the same time, it works best with non-ECC memory since the common SEC-DED feature in ECC memory would automatically correct single-bit errors and consequently our user-level tool cannot observe them.
Earlier studies like Acharya and Setia [1] and Cipar et al. [4] have proposed techniques to transparently steal idle memory from non-dedicated computing facilities. Unlike many of the earlier studies, we do not have administrative control of the target system and our tool must function completely at user level. The system statistics that we can use are limited to those explicitly exposed by the OS (e.g., through the Linux /proc file system) and those that can be measured by user-level micro-benchmarks [2].
Our tool touches every recruited page once in each time period of length T (we call T the periodic memory touching interval). Under this policy, within any time period of duration T, every physical memory page can be recruited no more than once (otherwise the second recruitment would have recruited a page that was recently allocated and used). Therefore, within any time period of duration T, the additional application page evictions induced by our monitoring are bounded by the physical memory size M. If we know the page fault I/O throughput B, we can then bound the application slowdown induced by our monitoring to M/(T*B). Below we present a detailed design of our approach, which includes three components.
Each recruited page is touched once in every periodic touching interval. Under the LRU replacement order, application pages that are accessed more often are unlikely to be evicted before the recruited pages in our monitoring tool. The touching also serves the purpose of error checking: we read every single word of the page and examine whether the pattern written initially still remains. If not, it indicates that an error occurred in the most recent period.
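The touch-and-check pass can be sketched as follows; the 0xAA pattern and 4 KB page size are illustrative assumptions, not necessarily the tool's actual choices:

```python
# Sketch of the per-page touch-and-check pass: each recruited page is filled
# with a known bit pattern at recruitment time, and every touching interval
# we re-read it to see whether the pattern survived.
PAGE_SIZE = 4096
PATTERN = 0xAA  # 10101010: alternating bits

def init_page():
    """Fill a freshly recruited page with the check pattern."""
    return bytearray([PATTERN] * PAGE_SIZE)

def check_page(page):
    """Return byte offsets whose contents no longer match the pattern."""
    return [i for i, b in enumerate(page) if b != PATTERN]

page = init_page()
page[1234] ^= 0x04           # simulate a single-bit flip in the page
print(check_page(page))      # -> [1234]
```

Reading every byte also serves the "touching" role: the access keeps the page recently used in the OS's LRU ordering.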
Recruited pages may later be evicted by the OS (e.g., when memory pressure from applications rises). We should detect these evicted pages and release them from the monitoring pool. We discuss some implementation issues in practice. First, the OS typically attempts to maintain a certain minimum amount of free memory (e.g., to avoid deadlocks when reclaiming pages), and page reclamation is triggered when the free memory amount falls below this threshold. We can measure the threshold on a particular system by running a simple user-level micro-benchmark. At memory recruitment time, we are aware that the practical free memory in the system is the nominal free amount minus this threshold.
Second, it may not be straightforward to detect evicted pages in the monitoring pool. Some systems provide a direct interface to check the in-core status of memory pages (e.g., the mincore system call on Linux). Without such a direct interface, we can tell the in-core status of a memory page by simply measuring the time of accessing any data on the page. Note that due to OS prefetching, the access to a single page might result in the swap-in of multiple contiguous out-of-core pages. To address this, each time we detect an out-of-core recruited page, we discard several adjacent pages (up to the OS prefetching limit) along with it.
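The direct interface on Linux is the mincore(2) system call; the following user-level sketch calls it through ctypes (Linux-specific; the anonymous four-page mapping is an illustrative stand-in for the monitoring pool):

```python
# Sketch: querying per-page residency of a page-aligned buffer via mincore(2).
import ctypes
import mmap

libc = ctypes.CDLL(None, use_errno=True)  # Linux: resolve mincore from libc

def incore_map(buf):
    """Return a per-page residency list (True = resident) for an mmap buffer."""
    page = mmap.PAGESIZE
    npages = (len(buf) + page - 1) // page
    vec = (ctypes.c_ubyte * npages)()
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))  # mmap is page-aligned
    if libc.mincore(ctypes.c_void_p(addr), ctypes.c_size_t(len(buf)), vec) != 0:
        raise OSError(ctypes.get_errno(), "mincore failed")
    return [bool(b & 1) for b in vec]     # low bit of each vec entry = resident

m = mmap.mmap(-1, 4 * mmap.PAGESIZE)      # anonymous, page-aligned mapping
m.write(b"\xaa" * len(m))                 # touching every page makes it resident
print(incore_map(m))                      # typically all True after the writes
```

Pages reported non-resident would then be dropped from the monitoring pool, together with their prefetch-adjacent neighbors as described above.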
We measure the performance impact of our user-level monitoring tool on existing running applications. Our tests are done on a machine with a 2.8 GHz Pentium 4 processor and 1 GB of main memory, running the Linux 2.6.18 kernel. We examine three applications in our test: 1) the Apache web server running the static request portion of the SPECweb99 benchmark; 2) MCF from SPEC CPU2000 -- a memory-intensive vehicle scheduling program for mass transportation; and 3) compilation and linking of the Linux 2.6.18 kernel. The first is a typical server workload while the other two are representative workstation workloads.
We set the periodic memory touching interval T according to a desired application slowdown bound. An accurate setting requires knowledge of the page fault I/O throughput B. Here we use a simple estimate of B as half the peak sequential disk access throughput (around 57 MB/s for our disk). Under this estimate, a periodic memory touching interval of T = 30.48 minutes is needed to achieve a 2% application slowdown bound. Our program was able to recruit 376.70 MB, 619.24 MB, and 722.77 MB on average (out of the total 1 GB) when running with Apache, MCF, and the Linux compilation respectively. At the same time, the slowdown is 0.52%, 0.26%, and 1.20% for the three applications respectively. The slowdown for Apache is calculated as the relative degradation in request throughput, while the slowdown for the other two applications is calculated as the relative increase in execution time. The monitoring-induced slowdown can be reduced by increasing the periodic memory touching interval T, though such an adjustment may at the same time reduce the amount of recruited memory.
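The interval setting follows directly from the slowdown bound M/(T*B): solving for T gives T = M/(B * slowdown). A numeric sketch, assuming 1 GB of physical memory and an effective page-fault I/O throughput of about 28 MB/s (roughly half the 57 MB/s peak):

```python
# Sketch: deriving the periodic memory touching interval T from the
# slowdown bound  slowdown <= M / (T * B).
def touching_interval_minutes(mem_mb, io_mb_per_s, slowdown_bound):
    """T = M / (B * slowdown), converted from seconds to minutes."""
    seconds = mem_mb / (io_mb_per_s * slowdown_bound)
    return seconds / 60.0

# M = 1024 MB, assumed B = 28 MB/s, 2% slowdown bound
T = touching_interval_minutes(1024, 28.0, 0.02)
print(round(T, 2))  # -> 30.48
```

With these assumed figures the computation reproduces the roughly 30-minute interval used in the experiments; a larger T trades monitoring freshness for a smaller slowdown bound.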
To validate the effectiveness of our measurement approaches and the resulting implementation, we carried out a set of accelerated tests with guaranteed error occurrences. To generate soft errors, we heated the memory chip using a heat gun. The machine under test contains 1 GB of DDR2 memory with ECC, and the memory controller is an Intel E7525 with ECC support. The ECC feature has a large effect on our two approaches -- the controller direct checking requires ECC memory while the user-level monitoring works best with non-ECC memory. To consider both scenarios, we provide results for two tests -- one with the ECC hardware enabled and the other with ECC disabled. The results on error discovery are shown in Table 1.
Overall, the results suggest that both controller direct probing and user-level monitoring can discover soft errors in their respective target environments. With ECC enabled, all single-bit errors are automatically corrected by the ECC hardware and thus the user-level monitoring cannot observe them. We also noticed that when ECC was enabled, the user-level monitoring found fewer multi-bit errors than the controller direct checking did. This is because the user-level approach was only able to monitor part of the physical memory space (approximately 830 MB out of 1 GB).
We have deployed our measurement in three distinct production system environments: a rack-mounted server farm, a set of office desktop computers, and a geographically distributed network testbed. We believe these measurement targets represent many of today's production computer system environments.
Aside from the respective pre-deployment test periods, we received no complaints about application slowdown in any of the three measurement environments.
So far we have detected no errors on the UR desktop computers and the PlanetLab machines. At the same time, our measurements on the Ask.com servers logged 8288 memory errors concentrated on 11 (out of 212) servers. These errors on the Ask.com servers warrant more explanation: the errors on 9 of the 11 servers are attributed to permanent (hard) memory faults rather than soft errors, leaving 2 suspected soft errors on the remaining servers.
In Table 2, we list the overall time-memory extent (defined as the product of time and the average amount of considered memory over time) and the discovered errors for all deployed measurement environments.
Assume that soft error occurrences follow a Poisson process. For a measurement over time-memory extent T, the probability that k errors happen is:

    Pr_{λ,T}[N = k] = e^(−λT) · (λT)^k / k!        (1)

where λ is the average error rate (i.e., the error occurs λ times on average for every unit of time-memory extent). In particular, the probability of no error occurrence (k = 0) during a measurement over time-memory extent T is:

    Pr_{λ,T}[N = 0] = e^(−λT)        (2)

For a given T and an observed number of error occurrences k, let us call Λ a p-probability upper-bound of the average error occurrence rate if:

    ∀ λ > Λ:  Pr_{λ,T}[N = k] < 1 − p

In other words, if Λ is a p-probability upper-bound, then for any error rate larger than Λ, the chance of observing k error occurrences during a measurement of time-memory extent T is always less than 1 − p.

We apply the above analysis and metric definition to the error measurement results of our deployed measurements. We first look at the UR desktop measurement, in which no error is reported. According to Equation (2), Λ = ln(1/(1−p))/T is a p-probability upper-bound of the average error occurrence rate. Consequently, since T ≈ 428 GB·day for the UR desktop measurement, we can calculate that 54.73 FIT per Mbit is a 99%-probability upper-bound of the average error occurrence rate for this environment.
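The closed-form zero-error bound from Equation (2) is easy to check numerically (unit conversions assumed here: 1 GB·day = 8 × 1024 × 24 Mbit·hours, and 1 FIT = one error per 10^9 hours):

```python
# Sketch: the p-probability upper-bound for the zero-error case.
# From Equation (2), e^(-Lambda*T) = 1 - p gives Lambda = ln(1/(1-p)) / T.
import math

MBIT_HOURS_PER_GB_DAY = 8 * 1024 * 24   # convert GB*day to Mbit*hours

def fit_upper_bound_zero_errors(extent_gb_days, p=0.99):
    t = extent_gb_days * MBIT_HOURS_PER_GB_DAY   # time-memory extent in Mbit*hours
    lam = math.log(1.0 / (1.0 - p)) / t          # errors per Mbit*hour
    return lam * 1e9                             # FIT per Mbit

print(round(fit_upper_bound_zero_errors(428), 2))  # -> 54.73
```

Plugging in the 428 GB·day extent of the UR desktop measurement reproduces the 54.73 FIT per Mbit figure quoted above.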
We then examine the Ask.com environment, excluding the 9 servers with hard errors. In this environment, 2 (or fewer) soft errors over T ≈ 73,571 GB·day yields a 99%-probability upper-bound of the average error occurrence rate of 0.56 FIT per Mbit. This is much lower than the previously reported error rates (200-5000 FIT per Mbit) summarized in Section 1.
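For k observed errors there is no closed form, but the bound can be recovered numerically from the definition ∀λ > Λ: Pr_{λ,T}[N = k] < 1 − p by solving e^(−m) · m^k / k! = 1 − p for m = ΛT on the decreasing side m > k. A sketch, with the same unit conversions as before:

```python
# Sketch: numerically solving for the p-probability upper-bound given k
# observed errors, via bisection on m = Lambda * T.
import math

def fit_upper_bound(extent_gb_days, k, p=0.99):
    target = 1.0 - p
    pmf = lambda m: math.exp(-m) * m**k / math.factorial(k)
    lo, hi = float(k), 100.0            # pmf is decreasing for m > k
    for _ in range(200):                # bisection to high precision
        mid = (lo + hi) / 2.0
        if pmf(mid) > target:
            lo = mid
        else:
            hi = mid
    t = extent_gb_days * 8 * 1024 * 24  # GB*day -> Mbit*hours
    return (lo / t) * 1e9               # FIT per Mbit

print(round(fit_upper_bound(73571, 2), 2))  # -> 0.56
```

With k = 2 and the 73,571 GB·day extent this reproduces the 0.56 FIT per Mbit bound, and with k = 0 it recovers the closed-form 54.73 FIT per Mbit result for the UR desktop measurement.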
Our preliminary result suggests that the memory soft error rate in two real production systems (a rack-mounted server environment and a desktop PC environment) is much lower than what previous studies concluded. In particular, in the server environment, with high probability, the soft error rate is at least two orders of magnitude lower than those reported previously. We discuss several potential causes for this result.
An understanding of the memory soft error rate demystifies an important part of whole-system reliability in today's production computer systems. It also provides a basis for evaluating whether software-level countermeasures against memory soft errors are urgently needed. Our results are still preliminary and our measurements are ongoing. We hope to draw more complete conclusions from future measurement results. Additionally, soft errors can occur in components other than memory, which may affect system reliability in different ways. In the future, we also plan to devise methodologies to measure soft errors in other computer system components such as CPU registers, SRAM caches, and system buses.
[1] A. Acharya and S. Setia. Availability and utility of idle memory in workstation clusters. In SIGMETRICS, pages 35-46, 1999.

[2] A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau. Information and control in gray-box systems. In SOSP, pages 43-56, 2001.

[3] EDAC Project. http://bluesmoke.sourceforge.net

[4] J. Cipar, M. D. Corner, and E. D. Berger. Transparent contribution of memory. In USENIX, 2006.

[5] C. da Lu and D. A. Reed. Assessing fault sensitivity in MPI applications. In Supercomputing, 2004.

[6] J. Douceur and W. Bolosky. Progress-based regulation of low-importance processes. In SOSP, pages 247-260, Kiawah Island, SC, Dec. 1999.

[7] A. H. Johnston. Scaling and technology issues for soft error rates. In 4th Annual Research Conf. on Reliability, 2000.

[8] H. Kobayashi, K. Shiraishi, H. Tsuchiya, H. Usuki, Y. Nagai, and K. Takahisa. Evaluation of LSI soft errors induced by terrestrial cosmic rays and alpha particles. Technical report, Sony Corporation and RCNP Osaka University, 2001.

[9] T. C. May and M. H. Woods. Alpha-particle-induced soft errors in dynamic memories. IEEE Trans. on Electron Devices, 26(1):2-9, 1979.

[10] A. Messer, P. Bernadat, G. Fu, D. Chen, Z. Dimitrijevic, D. J. F. Lie, D. Mannaru, A. Riska, and D. S. Milojicic. Susceptibility of commodity systems and software to memory soft errors. IEEE Trans. on Computers, 53(12):1557-1568, 2004.

[11] E. Normand. Single event upset at ground level. IEEE Trans. on Nuclear Science, 43(6):2742-2750, 1996.

[12] T. J. O'Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, M. W. Curtis, and J. L. Walsh. Field testing for cosmic ray soft errors in semiconductor memories. IBM J. of Research and Development, 40(1):41-50, 1996.

[13] Tezzaron Semiconductor. Soft errors in electronic memory. White paper, 2004. http://www.tezzaron.com/about/papers/papers.html

[14] J. F. Ziegler. Terrestrial cosmic rays. IBM J. of Research and Development, 40(1):19-39, 1996.

[15] J. F. Ziegler et al. IBM experiments in soft fails in computer electronics (1978-1994). IBM J. of Research and Development, 40(1):3-18, 1996.

[16] J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O'Gorman, and J. M. Ross. Accelerated testing for cosmic soft-error rate. IBM J. of Research and Development, 40(1):51-72, 1996.

[17] J. F. Ziegler, M. E. Nelson, J. D. Shell, R. J. Peterson, C. J. Gelderloos, H. P. Muhlfeld, and C. J. Montrose. Cosmic ray soft error rates of 16-Mb DRAM memory chips. IEEE J. of Solid-State Circuits, 33(2), 1998.