Smp machine check error


A new custom setting is available (starting with 8.0 HF6) to force the connection string to resolve the real machine:
key="AgentPushPreferFqdn"
Default value = 0

This core setting forces the use of the FQDN instead of the NetBIOS name for access.

To enable it, set AgentPushPreferFqdn to "1" in CoreSettings.config (under c:\programdata\symantec\smp\settings) on your SMP server.
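The edit described above can be sketched in Python. This is only an illustration: the element layout in SAMPLE_CONFIG (the configuration and customSettings wrappers and the type attribute) is an assumption and may not match your CoreSettings.config, and set_custom_setting is a hypothetical helper name. Back up the real file before changing it.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; the real CoreSettings.config layout may differ.
SAMPLE_CONFIG = """<configuration>
  <customSettings>
    <customSetting key="AgentPushPreferFqdn" type="local" value="0" />
  </customSettings>
</configuration>"""


def set_custom_setting(xml_text, key, value):
    """Return xml_text with the customSetting named `key` set to `value`."""
    root = ET.fromstring(xml_text)
    for node in root.iter("customSetting"):
        if node.get("key") == key:
            node.set("value", value)
            break
    else:
        # Do not guess where a missing setting belongs; report it instead.
        raise KeyError(key)
    return ET.tostring(root, encoding="unicode")


updated = set_custom_setting(SAMPLE_CONFIG, "AgentPushPreferFqdn", "1")
print(updated)
```

The sketch works on a string so that it can be tried without touching the server; applying it to the real file would mean reading and writing that path yourself.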

 

Note:
Please understand that agent push relies on the RPC server being available and properly configured in your environment. In some situations you may need to troubleshoot the error "The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)" with your network team.

Even after enabling the mentioned coresetting above, the agent push may still fail the same way.
In the NS logs you may see this warning:

(ClientMachine316813.domain.com) Intermediate discovery failures: WMI, NetAPI, Registry
(NetAPI) Failed to retrieve name/domain: API error with HRESULT: 0x00000035
(WMI) Failed to retrieve name/domain: The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
(Registry) Failed to retrieve name/domain: The network path was not found.
-----------------------------------------------------------------------------------------------------
Date: 4/9/2019 10:42:36 AM, Tick Count: 37212234 (10:20:12.2340000), Size: 577 B
Process: AeXSvc (5068), Thread ID: 69, Module: Altiris.NS.dll
Priority: 2, Source: DiscoverMachines.All

To confirm that you have a network or configuration issue, try connecting to 'Services' on one of the affected machines from your SMP server (use the "Connect to another computer..." option in the Services console). If you get "Error 1722: The RPC server is unavailable" on the same machines where the push fails, then you need to troubleshoot the RPC service on those client machines.
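The HRESULT in the log and the "Error 1722" seen in the Services console are the same failure. A minimal sketch, assuming the standard HRESULT layout (severity in the top bit, an 11-bit facility, a 16-bit code), shows how to split the value; decode_hresult is a hypothetical helper name.

```python
def decode_hresult(hresult):
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    return {
        "failure": bool(hresult & 0x80000000),   # top bit set means failure
        "facility": (hresult >> 16) & 0x7FF,     # 7 is the Win32 facility
        "code": hresult & 0xFFFF,                # Win32 error code when facility is 7
    }


rpc = decode_hresult(0x800706BA)
netapi = decode_hresult(0x00000035)
print(rpc, netapi)
```

For 0x800706BA the code field is 1722 (RPC server unavailable). The NetAPI value 0x00000035 is a plain Win32 error, 53, which corresponds to "The network path was not found", the same message the Registry probe reports.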

Some suggestions around this issue are the following:

Make sure that the following services are running on the Target Machines:

  • Remote Procedure Call (RPC)
  • Computer Browser
  • Server
  • Remote Registry
  • Windows Management Instrumentation
  • Netlogon
  • Remote Desktop Services
  • Windows Remote Management (WS-Management)
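One way to check the list above is to run "sc \\<target> query <service>" for each service from the SMP server and look at the STATE line. The sketch below only parses captured output; it assumes the usual English output format of sc, and the short service names in REQUIRED_SERVICES are the common Windows names for the services listed, which you should verify on your own systems. The helper names are hypothetical.

```python
import re

# Short service name -> display name from the list above.
REQUIRED_SERVICES = {
    "RpcSs": "Remote Procedure Call (RPC)",
    "Browser": "Computer Browser",
    "LanmanServer": "Server",
    "RemoteRegistry": "Remote Registry",
    "Winmgmt": "Windows Management Instrumentation",
    "Netlogon": "Netlogon",
    "TermService": "Remote Desktop Services",
    "WinRM": "Windows Remote Management (WS-Management)",
}


def parse_sc_state(output):
    """Return the state word (e.g. RUNNING) from `sc query` output, or None."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", output)
    return match.group(1) if match else None


def services_not_running(outputs):
    """outputs maps short service name -> captured `sc query` text."""
    return sorted(
        display
        for name, display in REQUIRED_SERVICES.items()
        if parse_sc_state(outputs.get(name, "")) != "RUNNING"
    )
```

A service with no captured output is reported as not running, which errs on the side of flagging machines for a closer look.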

Also, you could look at these pages:

http://support-uk.avanquest.com/en/support/solutions/articles/17000070208-error-1722-the-rpc-server-is-unavailable-
https://www.techjunkie.com/rpc-server-is-unavailable/

 

Additional monitoring - ECC errors? #1508

I did some quick googling to see if it was possible to monitor ECC errors, as this seems like a no-brainer benefit to netdata. I haven't found any documentation or pull request adding this, and I believe it would be very valuable to sysadmins who monitor bare metal.

The only results I found were for the actual kernel module, EDAC:

http://bluesmoke.sourceforge.net/

It seems this was put in upstream back in kernel 2.6. Is this still a thing? If so, how can netdata properly monitor this while maintaining its low memory and resource footprint?

Here are entries in the syslog that show EDAC finding and correcting errors (this actually crashes the system for some reason; however, it is detectable in some way):

More examples of errors on a system of mine:

I'm sure there are a lot of sysadmins out there that have to look after old systems. This would greatly benefit us as well on any future system if this kernel module is still supported on newer hardware containing ECC.

I wouldn't mind doing more research on this to see if there is anything in memory that netdata can quickly query. I am at work right now and about to head home.

Quick thought: Maybe a plugin would suffice for now to parse the output of dmesg and just report on failures?
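As an alternative to parsing dmesg, the EDAC driver also exposes per-memory-controller counters in sysfs, which are cheap to poll. The sketch below is an illustration, not a netdata plugin: it assumes the usual /sys/devices/system/edac/mc/mc*/ce_count and ue_count files are present, which depends on an EDAC driver being loaded for the hardware, and read_edac_counts is a hypothetical helper name.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}} from EDAC sysfs."""
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux host
    for controller in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((controller / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[controller.name] = entry
    return counts


if __name__ == "__main__":
    print(read_edac_counts())
```

Reading two small files per controller keeps the footprint low; a collector would report the difference between successive polls rather than the raw totals.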

I am starting to seriously learn python and would not mind writing the plugin but it may take me a lot of trial and error.

Will report back if I find anything.

sudo ras-mc-ctl --errors

No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:

1 2019-07-15 20:41:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x942000460082110a, addr=0x243e9f840, tsc=0x8b99a7f84108, walltime=0x5d2c8276, cpuid=0x000706a1, bank=0x00000001

2 2019-07-16 01:34:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x942000460082110a, addr=0x24b9df840, tsc=0xa38afb430944, walltime=0x5d2cc722, cpuid=0x000706a1, bank=0x00000001

3 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d95741ee28, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

4 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d957436320, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

5 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d957451d82, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

6 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d957456482, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

7 2019-07-16 03:20:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000400082110a, tsc=0xac3468f91976, walltime=0x5d2cdffa, cpuid=0x000706a1, bank=0x00000001

8 2019-07-16 03:20:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000400082110a, tsc=0xac3468fb7a3a, walltime=0x5d2cdffa, cpuid=0x000706a1, bank=0x00000001

9 2019-07-16 15:08:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000460082110a, tsc=0xe60f3181c782, walltime=0x5d2d85ea, cpuid=0x000706a1, bank=0x00000001

10 2019-07-16 15:08:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000460082110a, tsc=0xe60f31852002, walltime=0x5d2d85ea, cpuid=0x000706a1, bank=0x00000001

11 2019-07-17 02:52:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x942000460082110a, addr=0x249c5f840, tsc=0x11f964ae442b2, walltime=0x5d2e2aea, cpuid=0x000706a1, bank=0x00000001

12 2019-07-17 15:24:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000440082110a, tsc=0x15d0984e5de54, walltime=0x5d2edb2a, cpuid=0x000706a1, bank=0x00000001
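The key=value fields at the end of each MCE event can be pulled apart mechanically, and the status value carries several flags. The following is a rough sketch, assuming the Intel IA32_MCi_STATUS bit layout (VAL in bit 63 down to PCC in bit 57, MCA error code in the low 16 bits); the function names are hypothetical and this is not part of ras-mc-ctl.

```python
import re

# Assumed Intel IA32_MCi_STATUS flag positions.
STATUS_FLAGS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def parse_mce_fields(line):
    """Collect every name=number pair from one MCE event line."""
    pairs = re.findall(r"(\w+)=(0x[0-9a-fA-F]+|\d+)", line)
    return {name: int(value, 0) for name, value in pairs}


def decode_status(status):
    """Decode the flag bits and the MCA error code from a status value."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    decoded["mca_error_code"] = status & 0xFFFF
    return decoded


sample = ("mcgcap=0x00000c07, status=0x942000460082110a, addr=0x243e9f840, "
          "tsc=0x8b99a7f84108, walltime=0x5d2c8276, cpuid=0x000706a1, bank=0x00000001")
fields = parse_mce_fields(sample)
print(decode_status(fields["status"]))
```

Under that assumed layout, the events whose status begins with 0x94 have ADDRV set and the ones beginning with 0x90 do not, which matches the presence or absence of the addr field in the listing. UC is clear in all of them, consistent with "corrected".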

Machine Check Error occurs when installing Ubuntu (include image of log)

Here is the image for the logs: Image here. The installation process hung at this point. The message was not displayed.

The MCEs (at bottom of image) occurred soon after I selected "install Ubuntu" from the menu. I don't have any idea what , , , or mean. Can someone explain them? And, based on your experience or expertise, what might have triggered these messages: RAM, CPU, PSU, or something else?

Also, the log mentions . Where can I run any command like this in this situation?

Here are the specs for my setup:

  • USB stick for Ubuntu 16.04, created with UNetBootin;
  • Processor: Xeon E5-1650 v3;
  • Motherboard: ASRock X99 WS-E;
  • Power supply: EVGA SUPERNOVA 1600 G2 120-G2-1600-X1;
  • RAM: 16GB 288-Pin SDRAM DDR4 2400 ECC Registered;
  • GPU: EVGA GTX 680;

If any more information is helpful, please let me know. I really appreciate your help!

Edit: Just to be clear, my computer does not have any OS installed yet. I am building it from scratch. I encountered this problem when I was trying to install Ubuntu. Later, I made a Windows USB stick, but it didn't work either. After the Windows logo was displayed for 5 seconds, the screen went black and nothing happened.

US6912670B2 - Processor internal error handling in an SMP server - Google Patents


Info

Publication number
US6912670B2
Authority
US
United States
Prior art keywords
error
logic
internal
processor
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US10/054,017
Other versions
US20030140285A1 (en)
Inventor
Bruce James Wilkie
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo International Ltd
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US10/054,017
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Assignment of assignors interest (see document for details). Assignors: WILKIE, BRUCE J.
Publication of US20030140285A1
Application granted
Publication of US6912670B2
Assigned to LENOVO INTERNATIONAL LIMITED. Assignment of assignors interest (see document for details). Assignors: INTERNATIONAL BUSINESS MACHINES CORPORATION


Classifications

    • G—PHYSICS
    • G06—COMPUTING; CALCULATING; COUNTING
    • G06F—ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00—Error detection; Error correction; Monitoring
    • G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766—Error or fault reporting or storing
    • G06F11/0772—Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
    • G—PHYSICS
    • G06—COMPUTING; CALCULATING; COUNTING
    • G06F—ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00—Error detection; Error correction; Monitoring
    • G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0721—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
    • G06F11/0724—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU] in a multiprocessor or a multi-core unit
    • G—PHYSICS
    • G06—COMPUTING; CALCULATING; COUNTING
    • G06F—ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00—Error detection; Error correction; Monitoring
    • G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793—Remedial or corrective actions

Abstract

Description

1. Field of the Present Invention

The present invention generally relates to the field of microprocessor-based data processing systems and more particularly to a system and method for efficient handling of processor internal errors in a symmetric multiprocessor server system.

2. History of Related Art

Interrupt handling is well known in the field of microprocessors and microprocessor-based data processing devices. Traditionally, the handling of processor internal errors (IERRs) in a symmetric multiprocessor (SMP) system has been the responsibility of a system management interrupt (SMI) handler. The SMI typically performs the tasks of logging the error condition and setting the appropriate controls to remove the faulty processor from the available resources.

Unfortunately, delegating processor internal error handling to the SMI is problematic. More specifically, the SMI is not immediately available when a server is powered-on. The SMI is usually installed as part of the power on self test (POST). If an internal error occurs before the SMI is installed and functioning, status cannot be reported and the system will probably halt. In addition, relying on the SMI to handle IERRs assumes that at least one of the processors is sufficiently operable to execute the SMI. If this assumption is not met, system behavior is unpredictable and the system will more than likely abort operation with little information to indicate the reason for the failure. Moreover, while it might be tempting to use the service processor found on many server blades to respond to the error and execute the SMI, the response latency of conventional service processors relative to high end SMP servers is too great to ensure that erroneous data is not propagated thereby possibly contaminating stored data records.

It would therefore be highly desirable to implement a data processing system in which processor internal errors are handled expeditiously. It would be further desirable if the implemented solution did not rely on the main processors to handle processor internal errors. It would be still further desirable if the response performance of the implemented solution was compatible with the requirements of high end multiprocessor systems.

The problem identified above is in large part addressed by a system and method for handling processor internal errors in a data processing system. The data processing system typically includes a set of main microprocessors that have access to a common system memory via a system bus. The system may further include a service processor that is connected to at least one of the main processors. In addition, the system includes internal error handling hardware configured to log and process internal errors generated by one or more of the main processors. The internal error hardware may include error detection logic configured to receive internal error signals from the main processors. In response to receiving one or more IERR signals, the error detection logic is configured to assert an error detected signal that is received by error logging logic. The error logging logic is configured to update one or more error status register entries when the error detected signal is asserted. When the error logging logic has updated the status register entries, it is configured to assert an error logging complete signal that is received by processor control logic and by any external service processor, for purposes of maintaining system error logs. The processor control logic is configured to de-assert one or more processor enable signals based on the state of the error status registers. In addition, upon completion of the error status update by the error logging logic, the status register is configured to assert an error status updated signal that ultimately produces a system reset. By incorporating error logging and handling into dedicated hardware tied directly to the processor internal error signals, the invention provides a low cost, low response latency mechanism for handling processor internal errors in high performance multiprocessor systems.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description presented herein are not intended to limit the invention to the particular embodiment disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

Generally speaking, the present invention contemplates a system and method for handling processor internal errors in a multiprocessor system such as a high end SMP server system. The system incorporates internal error control logic to detect, log, and respond to processor internal errors generated by one or more of the system's multiple main processors. The control logic is configured to determine which processor(s) have issued an internal error, update an error status register to log the error, notify any external service processor of the error, and restart the system with any remaining functional processor(s).

Turning now to the drawings, is a block diagram of selected features of a data processing system according to one embodiment of the present invention. In the depicted embodiment, system 100 includes a set of main processors 102A through 102N (generically or collectively referred to as processor(s) 102) that are connected to a system bus 104. A common system memory 106 is accessible to each processor 102 via system bus 104. System memory 106 is typically implemented with a volatile storage medium such as an array of dynamic random access memory (DRAM) devices. Because each processor 102 has substantially equal access to system memory 106 (i.e., the memory access time is substantially independent of the processor), the depicted architecture of system 100 is commonly referred to as a symmetric multiprocessor system.

In the depicted embodiment of system 100, a bus bridge 108 provides an interface between system bus 104 and an I/O bus 110 to which one or more peripheral devices 114A through 114N (generically or collectively referred to as peripheral device(s) 114) are connected. I/O bus 110 is typically compliant with one of several industry standard I/O bus specifications including, as an example, the Peripheral Component Interconnect (PCI) bus as specified in PCI Local Bus Specification Rev 2.2 by the PCI Special Interest Group (www.pcisig.com). Peripheral devices 114 may include devices such as a graphics adapter, high-speed network adapter, hard-disk controller, and the like.

The depicted embodiment of system 100 further includes a general purpose I/O (GPIO) port 112 connected to I/O bus 110 and to which a service processor 116 is connected. Service processor 116 is used to provide support for low-level system functions such as power monitoring, cooling fan control, hardware error logging, and so forth.

System 100 according to the present invention further includes error logic 120. Error logic 120 is connected to the set of main processors 102 to provide a fast response to a processor internal error. The use of dedicated hardware to respond to processor internal errors beneficially eliminates dependence on error handling software that may or may not be available at the time a processor internal error is issued.

Referring now to , a block diagram of selected elements of error logic 120 according to one embodiment of the present invention is depicted. Error logic 120 comprises dedicated hardware that is integrated into the main system control logic and connected directly to system power. As such, error logic 120 is functional as soon as power is applied to system 100, in contrast to system management interrupt (SMI) software modules, which are installed as part of the POST. In the depicted embodiment, error logic 120 includes an error detection unit 122 that is configured to receive processor internal error signals from each of the main processors 102. Processor internal error signals are generally asserted when a processor detects an error unrelated to processor bus operation. If, for example, a processor with an internal cache memory detects a parity error in the cache, the error may result in the assertion of the internal error signal. The internal error signal may be referred to herein as the IERR signal consistent with the notation commonly in use for the Pentium® family of processors from Intel Corporation.

Error detection unit 122 is further configured to assert an error detect signal 124 upon determining that one or more of the processor IERR signals has been asserted by its corresponding processor. Error detection unit 122 may include suitable latching circuitry to prevent an asserted IERR signal from being reset prematurely and additional logic to produce a pulse on error detect signal 124 in response to an IERR signal such that error detect 124 is pulsed once and only once for each internal error “event” where an event lasts from the assertion of any IERR signal until a system reset is initiated.

Error detect signal 124 provides an input to error logging unit 126. Error logging unit 126 is configured to document an internal error by capturing the identity of the offending processor. Because the error detection logic is not resident on the processor bus, it does not have visibility to the internal registers of the processors. In most cases when a processor asserts IERR, the processor has experienced an internal fatal error rendering most of its information unusable.

Error logging unit 126 is configured to record and preserve IERR information in an Error Status Register 128. Error status register 128 is configured to store internal error status for each processor 102 of system 100. Referring to , a selected portion of one embodiment of error status register 128 is depicted. In this embodiment, error status register 128 includes a set of bit pairs 140A through 140N (generically or collectively referred to as bit pair(s) 140) for each processor 102. A first bit 142 of each bit pair 140 is a “current” bit that indicates whether the corresponding processor 102 is currently asserting its internal error signal while a second bit 144 of each bit pair 140 is a “cumulative” bit that indicates whether the corresponding processor has previously asserted its internal error signal. Whereas the current bits 142 are cleared each time a system reset occurs, the cumulative bits 144 are preserved. Thus, the set of cumulative bits 144 indicates the cumulative set of processors 102 that have internal error problems.
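The bit-pair scheme can be modeled in a few lines of software. This is a sketch for illustration only: the description does not give bit positions, so the placement of the current bit at position 2n and the cumulative bit at position 2n+1 for processor n is an assumption, as is setting both bits at the moment an IERR is logged. The class and method names are hypothetical.

```python
class ErrorStatusRegister:
    """Software model of a register holding one (current, cumulative) bit pair per CPU."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.value = 0

    @staticmethod
    def _current_mask(cpu):
        return 1 << (2 * cpu)        # assumed position of the "current" bit

    @staticmethod
    def _cumulative_mask(cpu):
        return 1 << (2 * cpu + 1)    # assumed position of the "cumulative" bit

    def log_ierr(self, cpu):
        """Record that `cpu` is asserting its internal error signal."""
        self.value |= self._current_mask(cpu) | self._cumulative_mask(cpu)

    def system_reset(self):
        """Clear every current bit; cumulative bits survive the reset."""
        for cpu in range(self.num_processors):
            self.value &= ~self._current_mask(cpu)

    def is_current(self, cpu):
        return bool(self.value & self._current_mask(cpu))

    def has_failed(self, cpu):
        return bool(self.value & self._cumulative_mask(cpu))


reg = ErrorStatusRegister(4)
reg.log_ierr(2)
reg.system_reset()
print(reg.is_current(2), reg.has_failed(2))
```

After the reset, processor 2 no longer shows a current error but is still marked in the cumulative history, which is the property the processor control logic relies on when deciding which processors to enable.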

In the depicted embodiment, error status register 128 is accessible to the other components in system 100 through a system interface, such as an Industry Standard Architecture (ISA) bus, identified by reference numeral 130. System interface 130 may include sufficient data, address, and control signals to permit processors 102 to read the contents of status register 128. In addition, error status register 128 may include one or more bits set in response to an internal error event that provide an interrupt signal to service processor (SP) 116 such that service processor 116 is interrupted in response to a main processor internal error event. In response to an interrupt from error logging unit 128, service processor 116 may be programmed to take specific actions with respect to system power such as powering down and so forth. In addition, service processor 116 may be programmed to log or record additional information regarding the internal error. This additional information may include, for example, the time at which an internal error signal was asserted.

The depicted embodiment of error status register 128 further includes an I2C interface for connecting to an I2C bus thereby enabling communication between error status register 128 and an external device in the event that it becomes desirable to access the contents of register 128 externally.

Error logging unit 126, in addition to providing logged information to error status register 128, is configured to generate an error log complete signal 129 when the logging unit has completed its documentation of an internal error event. Error log complete signal 129 is provided to a system reset unit 132 and a processor control unit 134. System reset logic 132 is configured to generate a system reset that is provided to each processor 102 following an internal error event. System reset logic 132 may be further controlled by an error status updated signal 131 produced by error status register 128 indicating completion of a status register update following an internal error event. Processor control logic 134 is configured to generate a unique processor enable signal for each processor 102 in system 100 following an internal error event. The processor enable signals are de-asserted if the corresponding processor was responsible for the internal error event and the cause of the internal error could not be corrected. The combination of system reset unit 132 and processor control logic 134 provides means for initiating a system reset and enabling only those processors 102 that are functional following an internal error.

Turning now to , a flow diagram representing selected elements of a method 150 of responding to internal error signals in a data processing system according to one embodiment of the invention is presented. Initially, the data processing system is executing (block 151) in a normal operating mode. For purposes of this disclosure, the normal operating mode represents any state following the application of power to the system in which the internal error signals are not asserted. Accordingly, normal operating mode does not imply that an operating system has been installed and application programs are executing or capable of being executed. Instead, the normal operating mode could be achieved substantially immediately following the application of power to the system if none of the IERR signals is asserted.

The data processing system and, more particularly, the internal error logic of the system, monitors (block 152) for the assertion of an IERR signal by one or more of the main processors. As long as the main processors do not issue any internal error signals, the system remains in its normal operating mode. During this time, an operating system may be installed and one or more application programs may be executing. If an internal error is detected, the error logic logs (block 154) the error and updates (block 156) the error status register as described in greater detail above. After updating the status register, the system disables (block 158) any nonfunctional main processors. The disabled processors would typically include any processors currently asserting their internal error signals as well as any processors that asserted their error signals previously. After disabling the appropriate main processors, the system determines (block 160) whether any functional processors remain in the system. If all processors are currently asserting or have previously asserted their internal error signals, the error logic generates a system halt (block 164). If there are one or more functional processors remaining, the error logic initiates a reset (block 162) to restart the system with the functional processors. In this manner, the data processing system is able to respond to internal errors without relying on any error handling software or operating system code.
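The decision flow of method 150 can be summarized as a short software model. This is only a sketch of the logic in blocks 152 through 164: the invention performs these steps in dedicated hardware, and the function name, argument shapes and return values here are hypothetical.

```python
def respond_to_ierr(ierr_signals, cumulative_failed):
    """Model one pass through the flow.

    ierr_signals: list of booleans, one per processor (True = IERR asserted).
    cumulative_failed: set of processor indices that asserted IERR earlier.
    Returns (action, enabled_processors, updated_cumulative_failed).
    """
    all_cpus = set(range(len(ierr_signals)))
    asserting = {cpu for cpu, raised in enumerate(ierr_signals) if raised}

    # Block 152: nothing asserted, so the system stays in normal operating mode.
    if not asserting:
        return "normal", all_cpus - cumulative_failed, set(cumulative_failed)

    # Blocks 154 and 156: log the error and update the status register.
    failed = set(cumulative_failed) | asserting

    # Block 158: disable current and previous offenders.
    enabled = all_cpus - failed

    # Blocks 160 to 164: reset with the survivors, or halt if none remain.
    action = "reset" if enabled else "halt"
    return action, enabled, failed


print(respond_to_ierr([False, True, False, False], set()))
```

Carrying the cumulative set from one call to the next mirrors the way the cumulative bits survive a system reset, so a processor that failed once stays disabled on later passes.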

It will be apparent to those skilled in the art having the benefit of this disclosure that the present invention contemplates a system and method for responding to processor internal errors in a data processing system. It is understood that the form of the invention shown and described in the detailed description and the drawings are to be taken merely as presently preferred examples. It is intended that the following claims be interpreted broadly to embrace all the variations of the preferred embodiments disclosed.

Claims (14)

1. A data processing system, comprising:

multiple main processors connected to a system bus;

a system memory connected to the system bus and accessible to the main processors;

error logic, external to the main processors, and configured to receive internal error signals asserted by the main processors and to respond to an internal error signal by disabling a main processor asserting an internal error signal and restarting the system with any remaining functional main processors, wherein the error logic includes an error status register accessible via an I2C bus; and

a service processor configured to receive a service processor interrupt generated by the error logic.

2. The system of , wherein the error logic is further configured to record the internal error signal in an error status register of the error logic.

3. The system of , wherein the error status register includes at least a pair of bits corresponding to each of the main processors, wherein a first bit of each pair is indicative of whether the corresponding main processor is currently asserting its internal error signal and a second bit of each pair is indicative of whether the corresponding main processor has asserted its internal error signal previously.

4. The system of , wherein the error logic is functional substantially immediately following the application of power to the data processing system.

5. The system of , wherein the error logic includes an error detection unit configured to receive an internal error signal from each of the main processors and further configured to generate an error detect signal responsive to assertion of an internal error signal by any of the processors.

6. The system of , wherein the error logic further includes error logging logic configured to receive the error detect signal and, responsive thereto, to update an error status register to reflect the internal error signal.

7. The system of , wherein responsive to the service processor interrupt, the service processor is configured to power down the system.

8. Error detection logic suitable for use in a data processing system having multiple main processors, wherein the error detection logic is external to the main processors and is configured to receive internal error signals asserted by the main processors and further configured to respond to an internal error signal by disabling a processor asserting signal, generating a service processor interrupt, and restarting the system with any remaining functional processors and further wherein the error detection logic includes an error status register externally accessible via an I2C bus.

9. The error logic of , wherein the error logic is further configured to record the internal error signal in the error status register of the error logic.

10. The error logic of , wherein the error status register includes at least a pair of bits corresponding to each of the main processors, wherein a first bit of each pair is indicative of whether the corresponding main processor is currently asserting its internal error signal and a second bit of each pair is indicative of whether the corresponding main processor has asserted its internal error signal previously.

11. The error logic of , wherein the error logic is functional substantially immediately following the application of power to the data processing system.

12. The error logic of , wherein the error logic includes an error detection unit configured to receive an internal error signal from each of the main processors and further configured to generate an error detect signal responsive to assertion of an internal error signal by any of the processors.

13. The error logic of , wherein the error logic further includes error logging logic configured to receive the error detect signal and, responsive thereto, to update the error status register to reflect the internal error signal.

14. The error logic of , wherein the error logic is further configured to generate the service processor interrupt responsive to error status register update.
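
The pair-of-bits error status register recited in claims 3 and 10 can be pictured with a small model. The claims do not specify bit positions or how the "previously asserted" bit is maintained, so the layout below (bit 2n for "currently asserting", bit 2n+1 for "asserted previously", with the current bit folded into the previous bit on the next update) is an assumption made only for illustration, as are the function names.

```python
def update_status(register, asserting_now, num_cpus):
    """Return a new register value for the set of CPUs asserting IERR now.

    Assumed layout (not taken from the claims): for CPU n, bit 2n means
    "currently asserting" and bit 2n+1 means "asserted previously".
    """
    new = register
    for cpu in range(num_cpus):
        current_bit = 1 << (2 * cpu)
        previous_bit = 1 << (2 * cpu + 1)
        if cpu in asserting_now:
            new |= current_bit
        else:
            if new & current_bit:
                new |= previous_bit      # remember the earlier assertion
            new &= ~current_bit          # no longer asserting
    return new


def functional_cpus(register, num_cpus):
    """CPUs with neither bit of their pair set."""
    return [cpu for cpu in range(num_cpus)
            if ((register >> (2 * cpu)) & 0b11) == 0]


if __name__ == "__main__":
    reg = update_status(0, {1}, num_cpus=4)
    print(bin(reg), functional_cpus(reg, 4))
```

With this layout, a processor is treated as nonfunctional if either bit of its pair is set, which matches the description that both currently and previously failing processors are disabled.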

Application number: US10/054,017
Priority date: 2002-01-22
Filing date: 2002-01-22
Title: Processor internal error handling in an SMP server
Status: Active, expires 2023-06-23
Publication: US6912670B2 (en)

Legal Events

Code: Title (Description)

AS: Assignment
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: WILKIE, BRUCE J.; REEL/FRAME: 012605/0947
Effective date: 20020111

STCF: Information on status: patent grant
Free format text: PATENTED CASE

FPAY: Fee payment
Year of fee payment: 4

REMI: Maintenance fee reminder mailed

FPAY: Fee payment
Year of fee payment: 8

SULP: Surcharge for late payment
Year of fee payment: 7

AS: Assignment
Owner name: LENOVO INTERNATIONAL LIMITED, HONG KONG
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: INTERNATIONAL BUSINESS MACHINES CORPORATION; REEL/FRAME: 034194/0291
Effective date: 20140926

FPAY: Fee payment
Year of fee payment: 12

sudo ras-mc-ctl --errors

No Memory errors.

 

No PCIe AER errors.

 

No Extlog errors.

 

MCE events:

1 2019-07-15 20:41:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x942000460082110a, addr=0x243e9f840, tsc=0x8b99a7f84108, walltime=0x5d2c8276, cpuid=0x000706a1, bank=0x00000001

2 2019-07-16 01:34:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x942000460082110a, addr=0x24b9df840, tsc=0xa38afb430944, walltime=0x5d2cc722, cpuid=0x000706a1, bank=0x00000001

3 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d95741ee28, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

4 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d957436320, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

5 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d957451d82, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

6 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d957456482, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

7 2019-07-16 03:20:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000400082110a, tsc=0xac3468f91976, walltime=0x5d2cdffa, cpuid=0x000706a1, bank=0x00000001

8 2019-07-16 03:20:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000400082110a, tsc=0xac3468fb7a3a, walltime=0x5d2cdffa, cpuid=0x000706a1, bank=0x00000001

9 2019-07-16 15:08:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000460082110a, tsc=0xe60f3181c782, walltime=0x5d2d85ea, cpuid=0x000706a1, bank=0x00000001

10 2019-07-16 15:08:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000460082110a, tsc=0xe60f31852002, walltime=0x5d2d85ea, cpuid=0x000706a1, bank=0x00000001

11 2019-07-17 02:52:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x942000460082110a, addr=0x249c5f840, tsc=0x11f964ae442b2, walltime=0x5d2e2aea, cpuid=0x000706a1, bank=0x00000001

12 2019-07-17 15:24:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000440082110a, tsc=0x15d0984e5de54, walltime=0x5d2edb2a, cpuid=0x000706a1, bank=0x00000001
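
The status= values in these events can be unpacked by hand to see where the decoded text comes from. The sketch below assumes the Intel IA32_MCi_STATUS bit layout as I understand it from the Intel Software Developer's Manual (VAL in bit 63, UC in bit 61, EN in bit 60, ADDRV in bit 58, threshold status in bits 54:53, corrected error count in bits 52:38, MCA error code in bits 15:0). The function name `decode_mci_status` is made up for this example, and some fields are only meaningful when the matching capability bits in mcgcap are set, so treat the output as a rough reading rather than an authoritative decode.

```python
# Threshold-based error status encoding (assumed from the Intel SDM).
THRESHOLD_STATUS = {0: "no tracking", 1: "green", 2: "yellow", 3: "reserved"}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into a few commonly read fields."""
    return {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC
        "enabled": bool((status >> 60) & 1),      # EN
        "misc_valid": bool((status >> 59) & 1),   # MISCV
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "threshold_status": THRESHOLD_STATUS[(status >> 53) & 0x3],
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    for raw in (0x942000460082110A, 0x902000420082110A):
        fields = decode_mci_status(raw)
        print(hex(raw), fields)
```

Under these assumptions, the events that carry an addr= field are the ones whose status has the address-valid bit set, and all of them decode as corrected (UC clear) with a "green" threshold status, which agrees with the text ras-mc-ctl printed.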

Additional monitoring - ECC errors? #1508

I did some quick googling to see if it was possible to monitor ECC errors as this seems like a no brainer benefit to netdata. I haven't found any documentation/pull request adding this and I believe it would be very valuable to sysadmins who monitor bare metal.

The only results I found were for the actual kernel module, EDAC:

http://bluesmoke.sourceforge.net/

It seems this was put in upstream back in kernel 2.6. Is this still a thing? If so, how can netdata properly monitor this while maintaining its low-memory and resource footprint?

Here are entries in the syslog that show EDAC finding and correcting errors (this actually crashes the system for some reason; however, it is detectable in some way):

More examples of errors on a system of mine:

I'm sure there are a lot of sysadmins out there that have to look after old systems. This would greatly benefit us as well on any future system if this kernel module is still supported on newer hardware containing ECC.

I wouldn't mind doing more research on this to see if there is anything in memory I could find that netdata can quickly query. I am at work right now and about to head home right now.
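
One cheap thing to query might be the EDAC counters in sysfs. As far as I know, the kernel EDAC subsystem exposes per-memory-controller `ce_count` (corrected) and `ue_count` (uncorrected) files under /sys/devices/system/edac/mc/mcN/ when an EDAC driver is loaded. Below is a rough sketch of reading them; the function name `read_edac_counts` is made up, this is not netdata code, and it assumes those files are present on the machine.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": int, "ue_count": int}, ...}.

    A counter that cannot be read or parsed is reported as None.
    Returns an empty dict when the EDAC directory does not exist.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for mc_dir in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for controller, values in read_edac_counts().items():
        print(controller, values)
```

Reading two small files per memory controller should keep the footprint low compared with parsing dmesg on every collection interval.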

Quick thought: Maybe a plugin would suffice for now to parse the output of dmesg and just report on failures?

I am starting to seriously learn python and would not mind writing the plugin but it may take me a lot of trial and error.

Will report back if I find anything.

US6912670B2 - Processor internal error handling in an SMP server - Google Patents

Processor internal error handling in an SMP server

Info

Publication number: US6912670B2
Authority: US (United States)
Prior art keywords: error, logic, internal, processor, signal
Legal status: Active
Application number: US10/054,017
Other versions: US20030140285A1 (en)
Inventor: Bruce James Wilkie
Current Assignee: Lenovo International Ltd
Original Assignee: International Business Machines Corp

Application filed by International Business Machines Corp
Priority to US10/054,017
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION (assignment of assignors interest; assignor: WILKIE, BRUCE J.)
Publication of US20030140285A1
Application granted
Publication of US6912670B2
Assigned to LENOVO INTERNATIONAL LIMITED (assignment of assignors interest; assignor: INTERNATIONAL BUSINESS MACHINES CORPORATION)

Classifications

    • G—PHYSICS
    • G06—COMPUTING; CALCULATING; COUNTING
    • G06F—ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00—Error detection; Error correction; Monitoring
    • G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766—Error or fault reporting or storing
    • G06F11/0772—Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
    • G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0721—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
    • G06F11/0724—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU] in a multiprocessor or a multi-core unit
    • G06F11/0793—Remedial or corrective actions

Abstract

Description

1. Field of the Present Invention

The present invention generally relates to the field of microprocessor-based data processing systems and more particularly to a system and method for efficient handling of processor internal errors in a symmetric multiprocessor server system.

2. History of Related Art

Interrupt handling is well known in the field of microprocessors and microprocessor-based data processing devices. Traditionally, the handling of processor internal errors (IERRs) in a symmetric multiprocessor (SMP) system has been the responsibility of a system management interrupt (SMI) handler. The SMI typically performs the tasks of logging the error condition and setting the appropriate controls to remove the faulty processor from the available resources.

Unfortunately, delegating processor internal error handling to the SMI is problematic. More specifically, the SMI is not immediately available when a server is powered on. The SMI is usually installed as part of the power on self test (POST). If an internal error occurs before the SMI is installed and functioning, status cannot be reported and the system will probably halt. In addition, relying on the SMI to handle IERRs assumes that at least one of the processors is sufficiently operable to execute the SMI. If this assumption is not met, system behavior is unpredictable and the system will more than likely abort operation with little information to indicate the reason for the failure. Moreover, while it might be tempting to use the service processor found on many server blades to respond to the error and execute the SMI, the response latency of conventional service processors relative to high end SMP servers is too great to ensure that erroneous data is not propagated, thereby possibly contaminating stored data records.

It would therefore be highly desirable to implement a data processing system in which processor internal errors are handled expeditiously. It would be further desirable if the implemented solution did not rely on the main processors to handle processor internal errors. It would be still further desirable if the response performance of the implemented solution was compatible with the requirements of high end multiprocessor systems.

The problem identified above is in large part addressed by a system and method for handling processor internal errors in a data processing system. The data processing system typically includes a set of main microprocessors that have access to a common system memory via a system bus. The system may further include a service processor that is connected to at least one of the main processors. In addition, the system includes internal error handling hardware configured to log and process internal errors generated by one or more of the main processors. The internal error hardware may include error detection logic configured to receive internal error signals from the main processors. In response to receiving one or more IERR signals, the error detection logic is configured to assert an error detected signal that is received by error logging logic. The error logging logic is configured to update one or more error status register entries when the error detected signal is asserted. When the error logging logic has updated the status register entries, it is configured to assert an error logging complete signal that is received by processor control logic and by any external service processor, for purposes of maintaining system error logs. The processor control logic is configured to de-assert one or more processor enable signals based on the state of the error status registers. In addition, upon completion of the error status update by the error logging logic, the status register is configured to assert an error status updated signal that ultimately produces a system reset. By incorporating error logging and handling into dedicated hardware tied directly to the processor internal error signals, the invention provides a low cost, low response latency mechanism for handling processor internal errors in high performance multiprocessor systems.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description presented herein are not intended to limit the invention to the particular embodiment disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

Generally speaking, the present invention contemplates a system and method for handling processor internal errors in a multiprocessor system such as a high end SMP server system. The system incorporates internal error control logic to detect, log, and respond to processor internal errors generated by one or more of the system's multiple main processors. The control logic is configured to determine which processor(s) have issued an internal error, update an error status register to log the error, notify any external service processor of the error, and restart the system with any remaining functional processor(s).

Turning now to the drawings, a block diagram of selected features of a data processing system according to one embodiment of the present invention is depicted. In the depicted embodiment, system 100 includes a set of main processors 102A through 102N (generically or collectively referred to as processor(s) 102) that are connected to a system bus 104. A common system memory 106 is accessible to each processor 102 via system bus 104. System memory 106 is typically implemented with a volatile storage medium such as an array of dynamic random access memory (DRAM) devices. Because each processor 102 has substantially equal access to system memory 106 (i.e., the memory access time is substantially independent of the processor), the depicted architecture of system 100 is commonly referred to as a symmetric multiprocessor system.

In the depicted embodiment of system 100, a bus bridge 108 provides an interface between system bus 104 and an I/O bus 110 to which one or more peripheral devices 114A through 114N (generically or collectively referred to as peripheral device(s) 114) are connected. I/O bus 110 is typically compliant with one of several industry standard I/O bus specifications including, as an example, the Peripheral Components Interface (PCI) bus as specified in PCI Local Bus Specification Rev 2.2 by the PCI Special Interest Group (www.pcisig.com). Peripheral devices 114 may include devices such as a graphics adapter, high-speed network adapter, hard-disk controller, and the like.

The depicted embodiment of system 100 further includes a general purpose I/O (GPIO) port 112 connected to I/O bus 110 and to which a service processor 116 is connected. Service processor 116 is used to provide support for low-level system functions such as power monitoring, cooling fan control, hardware error logging, and so forth.

System 100 according to the present invention further includes error logic 120. Error logic 120 is connected to the set of main processors 102 to provide a fast response to a processor internal error. The use of dedicated hardware to respond to processor internal errors beneficially eliminates dependence on error handling software that may or may not be available at the time a processor internal error is issued.

Referring now to the drawings, a block diagram of selected elements of error logic 120 according to one embodiment of the present invention is depicted. Error logic 120 comprises dedicated hardware that is integrated into the main system control logic and connected directly to system power. As such, error logic 120 is functional as soon as power is applied to system 100, in contrast to system management interrupt (SMI) software modules, which are installed as part of the POST. In the depicted embodiment, error logic 120 includes an error detection unit 122 that is configured to receive processor internal error signals from each of the main processors 102. Processor internal error signals are generally asserted when a processor detects an error unrelated to processor bus operation. If, for example, a processor with an internal cache memory detects a parity error in the cache, the error may result in the assertion of the internal error signal. The internal error signal may be referred to herein as the IERR signal, consistent with the notation commonly in use for the Pentium® family of processors from Intel Corporation.

Error detection unit 122 is further configured to assert an error detect signal 124 upon determining that one or more of the processor IERR signals has been asserted by its corresponding processor. Error detection unit 122 may include suitable latching circuitry to prevent an asserted IERR signal from being reset prematurely and additional logic to produce a pulse on error detect signal 124 in response to an IERR signal such that error detect 124 is pulsed once and only once for each internal error “event” where an event lasts from the assertion of any IERR signal until a system reset is initiated.
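The latch-and-pulse behavior described above can be illustrated with a small software model. This is only a sketch of the described behavior, not the patented circuit: the class name `ErrorDetectionUnit`, its methods, and the idea of sampling the IERR lines as a list of booleans are all assumptions made for illustration.

```python
class ErrorDetectionUnit:
    """Toy model of error detection unit 122 (hypothetical names throughout)."""

    def __init__(self, num_processors):
        # One latch per processor IERR line.
        self.latched = [False] * num_processors
        # True from the first asserted IERR until a system reset.
        self.event_active = False

    def sample(self, ierr_inputs):
        """Sample the IERR lines; return True only for the pulse that opens an event."""
        for cpu, asserted in enumerate(ierr_inputs):
            if asserted:
                # Latch the signal so it cannot be cleared prematurely.
                self.latched[cpu] = True
        if any(self.latched) and not self.event_active:
            self.event_active = True
            return True   # error detect is pulsed once for this event
        return False      # no error, or the event was already reported

    def system_reset(self):
        """A system reset ends the event and clears the latches."""
        self.latched = [False] * len(self.latched)
        self.event_active = False


unit = ErrorDetectionUnit(4)
samples = [
    [False, False, False, False],  # normal operation
    [False, True, False, False],   # processor 1 asserts IERR
    [False, True, True, False],    # processor 2 follows within the same event
]
pulses = [unit.sample(s) for s in samples]
```

In this model a second IERR arriving before the reset is still latched, but it does not produce a second pulse, which matches the "once and only once for each event" wording.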

Error detect signal 124 provides an input to error logging unit 126. Error logging unit 126 is configured to document an internal error by capturing the identity of the offending processor. Because the error detection logic is not resident on the processor bus, it does not have visibility to the internal registers of the processors. In most cases when a processor asserts IERR, the processor has experienced an internal fatal error rendering most of its information unusable.

Error logging unit 126 is configured to record and preserve IERR information in an error status register 128. Error status register 128 is configured to store internal error status for each processor 102 of system 100. Referring to the drawings, a selected portion of one embodiment of error status register 128 is depicted. In this embodiment, error status register 128 includes a set of bit pairs 140A through 140N (generically or collectively referred to as bit pair(s) 140), one for each processor 102. A first bit 142 of each bit pair 140 is a “current” bit that indicates whether the corresponding processor 102 is currently asserting its internal error signal, while a second bit 144 of each bit pair 140 is a “cumulative” bit that indicates whether the corresponding processor has previously asserted its internal error signal. Whereas the current bits 142 are cleared each time a system reset occurs, the cumulative bits 144 are preserved. Thus, the set of cumulative bits 144 indicates the cumulative set of processors 102 that have internal error problems.
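The bit-pair scheme can be sketched in a few lines of Python. The bit positions used here (current bit at position 2n, cumulative bit at position 2n+1 for processor n) and the choice to set the cumulative bit at logging time are assumptions for illustration; the text does not specify either, and the class and method names are hypothetical.

```python
class ErrorStatusRegister:
    """Toy model of error status register 128 with one bit pair per processor."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.value = 0

    def log_ierr(self, cpu):
        """Record an internal error: set both bits of the processor's pair."""
        current_bit = 1 << (2 * cpu)         # assumed position of "current" bit 142
        cumulative_bit = 1 << (2 * cpu + 1)  # assumed position of "cumulative" bit 144
        self.value |= current_bit | cumulative_bit

    def current(self, cpu):
        return bool((self.value >> (2 * cpu)) & 1)

    def cumulative(self, cpu):
        return bool((self.value >> (2 * cpu + 1)) & 1)

    def system_reset(self):
        """Clear every current bit while preserving every cumulative bit."""
        cumulative_mask = 0
        for cpu in range(self.num_processors):
            cumulative_mask |= 1 << (2 * cpu + 1)
        self.value &= cumulative_mask


reg = ErrorStatusRegister(4)
reg.log_ierr(1)        # processor 1 asserts IERR
before_reset = reg.value
reg.system_reset()     # current bit cleared, cumulative bit kept
after_reset = reg.value
```

After the reset, `reg.current(1)` is false but `reg.cumulative(1)` remains true, so the history of the faulty processor survives the restart.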

In the depicted embodiment, error status register 128 is accessible to the other components in system 100 through a system interface, such as an Industry Standard Architecture (ISA) bus, identified by reference numeral 130. System interface 130 may include sufficient data, address, and control signals to permit processors 102 to read the contents of status register 128. In addition, error status register 128 may include one or more bits set in response to an internal error event that provide an interrupt signal to service processor (SP) 116 such that service processor 116 is interrupted in response to a main processor internal error event. In response to an interrupt from error logging unit 126, service processor 116 may be programmed to take specific actions with respect to system power such as powering down and so forth. In addition, service processor 116 may be programmed to log or record additional information regarding the internal error. This additional information may include, for example, the time at which an internal error signal was asserted.

The depicted embodiment of error status register 128 further includes an I2C interface for connecting to an I2C bus thereby enabling communication between error status register 128 and an external device in the event that it becomes desirable to access the contents of register 128 externally.

Error logging unit 126, in addition to providing logged information to error status register 128, is configured to generate an error log complete signal 129 when the logging unit has completed its documentation of an internal error event. Error log complete signal 129 is provided to a system reset unit 132 and a processor control unit 134. System reset logic 132 is configured to generate a system reset that is provided to each processor 102 following an internal error event. System reset logic 132 may be further controlled by an error status updated signal 131 produced by error status register 128 indicating completion of a status register update following an internal error event. Processor control logic 134 is configured to generate a unique processor enable signal for each processor 102 in system 100 following an internal error event. The processor enable signals are de-asserted if the corresponding processor was responsible for the internal error event and the cause of the internal error could not be corrected. The combination of system reset unit 132 and processor control logic 134 provides means for initiating a system reset and enabling only those processors 102 that are functional following an internal error.

Turning now to the drawings, a flow diagram representing selected elements of a method 150 of responding to internal error signals in a data processing system according to one embodiment of the invention is presented. Initially, the data processing system is executing (block 151) in a normal operating mode. For purposes of this disclosure, the normal operating mode represents any state following the application of power to the system in which the internal error signals are not asserted. Accordingly, normal operating mode does not imply that an operating system has been installed and application programs are executing or capable of being executed. Instead, the normal operating mode could be achieved substantially immediately following the application of power to the system if none of the IERR signals is asserted.

The data processing system and, more particularly, the internal error logic of the system, monitors (block 152) for the assertion of an IERR signal by one or more of the main processors. As long as the main processors do not issue any internal error signals, the system remains in its normal operating mode. During this time, an operating system may be installed and one or more application programs may be executing. If an internal error is detected, the error logic logs (block 154) the error and updates (block 156) the error status register as described in greater detail above. After updating the status register, the system disables (block 158) any nonfunctional main processors. The disabled processors would typically include any processors currently asserting their internal error signals as well as any processors that asserted their error signals previously. After disabling the appropriate main processors, the system determines (block 160) whether any functional processors remain in the system. If all processors are currently asserting or have previously asserted their internal error signals, the error logic generates a system halt (block 164). If there are one or more functional processors remaining, the error logic initiates a reset (block 162) to restart the system with the functional processors. In this manner, the data processing system is able to respond to internal errors without relying on any error handling software or operating system code.
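The decision flow of method 150 can be summarized as a short function. This is a software sketch of logic that the text describes as dedicated hardware; the function name, the return values, and the representation of processors as list indices are assumptions made for illustration.

```python
def respond_to_internal_error(ierr_asserted, previously_failed=frozenset()):
    """Model blocks 152-164 of method 150 (hypothetical software rendering).

    ierr_asserted: list of booleans, one per main processor (block 152).
    previously_failed: indices of processors that asserted IERR earlier.
    Returns (action, failed, enables), where action is "normal", "reset",
    or "halt", failed is the updated cumulative set, and enables holds
    one processor enable flag per main processor.
    """
    current = {cpu for cpu, asserted in enumerate(ierr_asserted) if asserted}

    # Blocks 154 and 156: log the error and update the cumulative status.
    failed = set(previously_failed) | current

    # Block 158: de-assert the enable of every current or previous offender.
    enables = [cpu not in failed for cpu in range(len(ierr_asserted))]

    if not current:
        return "normal", failed, enables   # no IERR: stay in normal mode

    # Block 160: does any functional processor remain?
    if any(enables):
        return "reset", failed, enables    # block 162: restart with survivors
    return "halt", failed, enables         # block 164: nothing left to run


# Processor 1 of 4 fails: the system resets and runs on processors 0, 2, and 3.
first = respond_to_internal_error([False, True, False, False])

# On a two-processor system where processor 1 failed earlier, a new error on
# processor 0 leaves no functional processor, so the system halts.
second = respond_to_internal_error([True, False], previously_failed={1})
```

Passing the cumulative set back in on each call plays the role of the preserved cumulative bits, so a processor that failed before an earlier reset stays disabled.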

It will be apparent to those skilled in the art having the benefit of this disclosure that the present invention contemplates a system and method for responding to processor internal errors in a data processing system. It is understood that the form of the invention shown and described in the detailed description and the drawings are to be taken merely as presently preferred examples. It is intended that the following claims be interpreted broadly to embrace all the variations of the preferred embodiments disclosed.

Claims (14)

1. A data processing system, comprising:

multiple main processors connected to a system bus;

a system memory connected to the system bus and accessible to the main processors;

error logic, external to the main processors, and configured to receive internal error signals asserted by the main processors and to respond to an internal error signal by disabling a main processor asserting an internal error signal and restarting the system with any remaining functional main processors, wherein the error logic includes an error status register accessible via an I2C bus; and

a service processor configured to receive a service processor interrupt generated by the error logic.

2. The system of , wherein the error logic is further configured to record the internal error signal in an error status register of the error logic.

3. The system of , wherein the error status register includes at least a pair of bits corresponding to each of the main processors, wherein a first bit of each pair is indicative of whether the corresponding main processor is currently asserting its internal error signal and a second bit of each pair is indicative of whether the corresponding main processor has asserted its internal error signal previously.

4. The system of , wherein the error logic is functional substantially immediately following the application of power to the data processing system.

5. The system of , wherein the error logic includes an error detection unit configured to receive an internal error signal from each of the main processors and further configured to generate an error detect signal responsive to assertion of an internal error signal by any of the processors.

6. The system of , wherein the error logic further includes error logging logic configured to receive the error detect signal and, responsive thereto, to update an error status register to reflect the internal error signal.

7. The system of , wherein responsive to the service processor interrupt, the service processor is configured to power down the system.

8. Error detection logic suitable for use in a data processing system having multiple main processors, wherein the error detection logic is external to the main processors and is configured to receive internal error signals asserted by the main processors and further configured to respond to an internal error signal by disabling a processor asserting signal, generating a service processor interrupt, and restarting the system with any remaining functional processors and further wherein the error detection logic includes an error status register externally accessible via an I2C bus.

9. The error logic of , wherein the error logic is further configured to record the internal error signal in the error status register of the error logic.

10. The error logic of , wherein the error status register includes at least a pair of bits corresponding to each of the main processors, wherein a first bit of each pair is indicative of whether the corresponding main processor is currently asserting its internal error signal and a second bit of each pair is indicative of whether the corresponding main processor has asserted its internal error signal previously.

11. The error logic of , wherein the error logic is functional substantially immediately following the application of power to the data processing system.

12. The error logic of , wherein the error logic includes an error detection unit configured to receive an internal error signal from each of the main processors and further configured to generate an error detect signal responsive to assertion of an internal error signal by any of the processors.

13. The error logic of , wherein the error logic further includes error logging logic configured to receive the error detect signal and, responsive thereto, to update the error status register to reflect the internal error signal.

14. The error logic of , wherein the error logic is further configured to generate the service processor interrupt responsive to error status register update.

US10/054,017 | 2002-01-22 | 2002-01-22 | Processor internal error handling in an SMP server | Active 2023-06-23 | US6912670B2 (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
US10/054,017 (US6912670B2 (en)) | 2002-01-22 | 2002-01-22 | Processor internal error handling in an SMP server


Similar Documents

Publication | Publication Date | Title
US6912670B2 (en) Processor internal error handling in an SMP server
US7409580B2 (en) System and method for recovering from errors in a data processing system
TWI337707B (en) System and method for logging recoverable errors
US7260749B2 (en) Hot plug interfaces and failure handling
US6880113B2 (en) Conditional hardware scan dump data capture
US7685476B2 (en) Early notification of error via software interrupt and shared memory write
US7197670B2 (en) Methods and apparatuses for reducing infant mortality in semiconductor devices utilizing static random access memory (SRAM)
US7281171B2 (en) System and method of checking a computer system for proper operation
US6934879B2 (en) Method and apparatus for backing up and restoring data from nonvolatile memory
US7447943B2 (en) Handling memory errors in response to adding new memory to a system
US7430683B2 (en) Method and apparatus for enabling run-time recovery of a failed platform
US6615374B1 (en) First and next error identification for integrated circuit devices
US20040225831A1 (en) Methods and systems for preserving dynamic random access memory contents responsive to hung processor condition
US7877643B2 (en) Method, system, and product for providing extended error handling capability in host bridges
US4593391A (en) Machine check processing system
US7290128B2 (en) Fault resilient boot method for multi-rail processors in a computer system by disabling processor with the failed voltage regulator to control rebooting of the processors
US20170149925A1 (en) Processing cache data
US20030023932A1 (en) Method and apparatus for parity error recovery
US11068360B2 (en) Error recovery method and apparatus based on a lockup mechanism
US6904546B2 (en) System and method for interface isolation and operating system notification during bus errors
US6463492B1 (en) Technique to automatically notify an operating system level application of a system management event
JPH1165898A (en) Maintenance system for electronic computer
JP2005070993A (en) Device having transfer mode abnormality detection function and storage controller, and interface module for the controller
US11360839B1 (en) Systems and methods for storing error data from a crash dump in a computer system
JP3757407B2 (en) Control device

Legal Events

Code | Title

AS | Assignment
    Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
    Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:WILKIE, BRUCE J.;REEL/FRAME:012605/0947
    Effective date: 20020111

STCF | Information on status: patent grant
    Free format text: PATENTED CASE

FPAY | Fee payment
    Year of fee payment: 4

REMI | Maintenance fee reminder mailed

FPAY | Fee payment
    Year of fee payment: 8

SULP | Surcharge for late payment
    Year of fee payment: 7

AS | Assignment
    Owner name: LENOVO INTERNATIONAL LIMITED, HONG KONG
    Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:034194/0291
    Effective date: 20140926

FPAY | Fee payment
    Year of fee payment: 12

Machine Check Error occurs when installing Ubuntu (include image of log)

Here is the image for the logs: Image here. The installation process hung at this point. The message was not displayed.

The MCEs (at bottom of image) occurred soon after I selected "install Ubuntu" from the menu. I don't have any idea what , , , or mean. Can someone explain them ? And, based on your experience or expertise, what may be the problem that triggered these messages? RAM, CPU, PSU or something else?

Also, the log mentions . Where can I run any command like this in this situation?

Here are some specs for my setup:

  • USB stick for Ubuntu 16.04, created with UNetBootin;
  • Processor: Xeon E5-1650 v3;
  • Motherboard: ASRock X99 WS-E;
  • Power supply: EVGA SUPERNOVA 1600 G2 120-G2-1600-X1;
  • RAM: 16GB 288-Pin SDRAM DDR4 2400 ECC Registered;
  • GPU: EVGA GTX 680;

If any more information is helpful, please let me know. I really appreciate your help!

Edit: Just to be clear, my computer does not have any OS installed yet. I am building it from scratch. I encountered this problem when I was trying to install Ubuntu. Later, I made a Windows USB stick, but it didn't work either. After the Windows logo was displayed for 5 seconds, the screen went black and nothing happened.

A new customSetting has been created (starting with 8.0 HF6 and later) to enforce the connection string to resolve the real machine:
key="AgentPushPreferFqdn".
Default value = 0

The core setting is used to force FQDN instead of netbios for access.

Just set it to be "1" (AgentPushPreferFqdn) in the CoreSettings.config  (under c:\programdata\symantec\smp\settings) on your SMP.

 

Note:
Please understand that the agent push relies on the RPC server being available and properly configured in your environment. In some situations you may need to troubleshoot the error "The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)" with your network team.

Even after enabling the mentioned coresetting above, the agent push may still fail the same way.
In the NS logs you may see this warning:

(ClientMachine316813.domain.com) Intermediate discovery failures: WMI, NetAPI, Registry
(NetAPI) Failed to retrieve name/domain: API error with HRESULT: 0x00000035
(WMI) Failed to retrieve name/domain: The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
(Registry) Failed to retrieve name/domain: The network path was not found.
-----------------------------------------------------------------------------------------------------
Date: 4/9/2019 10:42:36 AM, Tick Count: 37212234 (10:20:12.2340000), Size: 577 B
Process: AeXSvc (5068), Thread ID: 69, Module: Altiris.NS.dll
Priority: 2, Source: DiscoverMachines.All

To validate that you have a network/configuration issue, you could, for example, try to connect to 'Services' (using the "Connect to another computer ..." option under the Services Console) from your SMP server to one of the affected machines. If you get "Error 1722: The RPC server is unavailable" on the same machines where the agent push fails, then you need to troubleshoot the RPC service on those client machines.
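As a quick scripted complement to the Services Console check, the sketch below tests whether the RPC endpoint mapper port (TCP 135) on a target machine accepts connections from the SMP server. It is only a minimal reachability probe: the function name `rpc_port_reachable` is made up for this illustration, and a successful connection does not prove that WMI or the Remote Registry will work, because RPC also uses dynamically assigned ports that a firewall may block.

```python
import socket

# TCP 135 is the well-known port of the Windows RPC endpoint mapper.
RPC_ENDPOINT_MAPPER_PORT = 135


def rpc_port_reachable(host, port=RPC_ENDPOINT_MAPPER_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it is not part of any Symantec tool.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling `rpc_port_reachable("ClientMachine316813.domain.com")` from the SMP server (using the host name from the log excerpt above) and getting `False` would point to a name-resolution, firewall, or service problem on that client rather than to the agent push itself.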

Some suggestions around this issue are the following:

Make sure that the following services are running on the Target Machines:

  • Remote Procedure Call (RPC)
  • Computer Browser
  • Server
  • Remote Registry
  • Windows Management Instrumentation
  • Netlogon
  • Remote Desktop Services
  • Windows Remote Management (WS-Management)

Also, you could look at these pages:

http://support-uk.avanquest.com/en/support/solutions/articles/17000070208-error-1722-the-rpc-server-is-unavailable-
https://www.techjunkie.com/rpc-server-is-unavailable/

 

Additional monitoring - ECC errors? #1508

I did some quick googling to see if it was possible to monitor ECC errors as this seems like a no brainer benefit to netdata. I haven't found any documentation/pull request adding this and I believe it would be very valuable to sysadmins who monitor bare metal.

The only results I found were for the actual kernel module, EDAC:

http://bluesmoke.sourceforge.net/

It seems this was put in upstream back in kernel 2.6. Is this still a thing? If so, how can netdata properly monitor this while maintaining its low memory and resource footprint?

Here are entries in the syslog that show EDAC finding and correcting errors (this actually crashes the system for some reason; however, it is detectable in some way):

More examples of errors on a system of mine:

I'm sure there are a lot of sysadmins out there that have to look after old systems. This would greatly benefit us as well on any future system if this kernel module is still supported on newer hardware containing ECC.

I wouldn't mind doing more research on this to see if there is anything in memory that netdata can quickly query. I am at work and about to head home right now.
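One cheap source that might fit is the EDAC sysfs interface: on Linux systems with an EDAC driver loaded, each memory controller directory under `/sys/devices/system/edac/mc/` exposes `ce_count` (corrected errors) and `ue_count` (uncorrected errors). The sketch below reads those counters; the function name `read_edac_counts` and the returned dictionary layout are assumptions for illustration, not an existing netdata API.

```python
from pathlib import Path

# Standard location of the EDAC memory-controller counters on Linux.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: {"ce": int or None, "ue": int or None}}.

    Hypothetical helper: 'root' is a parameter so the function can be
    pointed at a test directory instead of the real sysfs tree.
    """
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key in ("ce", "ue"):
            counter_file = mc_dir / f"{key}_count"
            try:
                entry[key] = int(counter_file.read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc_dir.name] = entry
    return counts
```

Reading two small files per controller on each collection interval should keep the footprint low, since nothing has to be parsed out of the kernel log.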

Quick thought: Maybe a plugin would suffice for now to parse the output of dmesg and just report on failures?
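A rough sketch of that dmesg-parsing idea could look like the following. The regular expression is an assumption based on typical EDAC kernel messages such as `EDAC MC0: 1 CE ...`; the exact wording varies between kernel versions and drivers, so the pattern would need checking against real logs. The name `count_edac_events` is made up for this example.

```python
import re
from collections import Counter

# Assumed message shape: "EDAC MC<n>: [<count>] CE|UE ..."; adjust for your kernel.
EDAC_LINE = re.compile(
    r"EDAC\s+MC(?P<mc>\d+):\s+(?:(?P<count>\d+)\s+)?(?P<kind>CE|UE)\b"
)


def count_edac_events(log_text):
    """Sum corrected (CE) and uncorrected (UE) events per memory controller."""
    totals = Counter()
    for line in log_text.splitlines():
        match = EDAC_LINE.search(line)
        if match is None:
            continue  # not an EDAC error line
        events = int(match.group("count") or 1)  # older messages carry no count
        totals[(int(match.group("mc")), match.group("kind"))] += events
    return totals
```

The input could come from `dmesg` output or a syslog file; the function only takes text, so how it is collected is left open.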

I am starting to seriously learn python and would not mind writing the plugin but it may take me a lot of trial and error.

Will report back if I find anything.

US6912670B2 - Processor internal error handling in an SMP server - Google Patents

Processor internal error handling in an SMP server

Info

Publication number: US6912670B2
Authority: US (United States)
Prior art keywords: error, logic, internal, processor, signal
Prior art date
Legal status: Active, expires
(The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: US10/054,017
Other versions: US20030140285A1 (en)
Inventor: Bruce James Wilkie
Current Assignee: Lenovo International Ltd
(The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: International Business Machines Corp
Priority date
(The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Application filed by International Business Machines Corp
Priority to US10/054,017
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: WILKIE, BRUCE J.
Publication of US20030140285A1
Application granted
Publication of US6912670B2
Assigned to LENOVO INTERNATIONAL LIMITED. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: INTERNATIONAL BUSINESS MACHINES CORPORATION
Active
Adjusted expiration

Classifications

    • G—PHYSICS
    • G06—COMPUTING; CALCULATING; COUNTING
    • G06F—ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00—Error detection; Error correction; Monitoring
    • G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766—Error or fault reporting or storing
    • G06F11/0772—Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
    • G—PHYSICS
    • G06—COMPUTING; CALCULATING; COUNTING
    • G06F—ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00—Error detection; Error correction; Monitoring
    • G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0721—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
    • G06F11/0724—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU] in a multiprocessor or a multi-core unit
    • G—PHYSICS
    • G06—COMPUTING; CALCULATING; COUNTING
    • G06F—ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00—Error detection; Error correction; Monitoring
    • G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793—Remedial or corrective actions

Abstract

Description

1. Field of the Present Invention

The present invention generally relates to the field of microprocessor-based data processing systems and more particularly to a system and method for efficient handling of processor internal errors in a symmetric multiprocessor server system.

2. History of Related Art

Interrupt handling is well known in the field of microprocessors and microprocessor-based data processing devices. Traditionally, the handling of processor internal errors (IERRs) in a symmetric multiprocessor (SMP) system has been the responsibility of a System management interrupt (SMI) handler. The SMI typically performs the tasks of logging the error condition and setting the appropriate controls to remove the faulty processor from the available resources.

Unfortunately, delegating processor internal error handling to the SMI is problematic. More specifically, the SMI is not immediately available when a server is powered-on. The SMI is usually installed as part of the power on self test (POST). If an internal error occurs before the SMI is installed and functioning, status cannot be reported and the system will probably halt. In addition, relying on the SMI to handle IERRs assumes that at least one of the processors is sufficiently operable to execute the SMI. If this assumption is not met, system behavior is unpredictable and the system will more than likely abort operation with little information to indicate the reason for the failure. Moreover, while it might be tempting to use the service processor found on many server blades to respond to the error and execute the SMI, the response latency of conventional service processors relative to high end SMP servers is too slow to ensure that erroneous data is not propagated thereby possibly contaminating stored data records.

It would therefore be highly desirable to implement a data processing system in which processor internal errors are handled expeditiously. It would be further desirable if the implemented solution did not rely on the main processors to handle processor internal errors. It would be still further desirable if the response performance of the implemented solution was compatible with the requirements of high end multiprocessor systems.

The problem identified above is in large part addressed by a system and method for handling processor internal errors in a data processing system. The data processing system typically includes a set of main microprocessors that have access to a common system memory via a system bus. The system may further include a service processor that is connected to at least one of the main processors. In addition, the system includes internal error handling hardware configured to log and process internal errors generated by one or more of the main processors. The internal error handling hardware may include error detection logic configured to receive internal error signals from the main processors. In response to receiving one or more IERR signals, the error detection logic is configured to assert an error detected signal that is received by error logging logic. The error logging logic is configured to update one or more error status register entries when the error detected signal is asserted. When the error logging logic has updated the status register entries, it is configured to assert an error logging complete signal that is received by processor control logic and by any external service processor, for purposes of maintaining system error logs. The processor control logic is configured to de-assert one or more processor enable signals based on the state of the error status registers. In addition, upon completion of the error status update by the error logging logic, the status register is configured to assert an error status updated signal that ultimately produces a system reset. By incorporating error logging and handling into dedicated hardware tied directly to the processor internal error signals, the invention provides a low cost, low response latency mechanism for handling processor internal errors in high performance multiprocessor systems.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description presented herein are not intended to limit the invention to the particular embodiment disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

Generally speaking, the present invention contemplates a system and method for handling processor internal errors in a multiprocessor system such as a high end SMP server system. The system incorporates internal error control logic to detect, log, and respond to processor internal errors generated by one or more of the system's multiple main processors. The control logic is configured to determine which processor(s) have issued an internal error, update an error status register to log the error, notify any external service processor of the error, and restart the system with any remaining functional processor(s).

Turning now to the drawings, a block diagram of selected features of a data processing system according to one embodiment of the present invention is presented. In the depicted embodiment, system 100 includes a set of main processors 102A through 102N (generically or collectively referred to as processor(s) 102) that are connected to a system bus 104. A common system memory 106 is accessible to each processor 102 via system bus 104. System memory 106 is typically implemented with a volatile storage medium such as an array of dynamic random access memory (DRAM) devices. Because each processor 102 has substantially equal access to system memory 106 (i.e., the memory access time is substantially independent of the processor), the depicted architecture of system 100 is commonly referred to as a symmetric multiprocessor system.

In the depicted embodiment of system 100, a bus bridge 108 provides an interface between system bus 104 and an I/O bus 110 to which one or more peripheral devices 114A through 114N (generically or collectively referred to as peripheral device(s) 114) are connected. I/O bus 110 is typically compliant with one of several industry standard I/O bus specifications including, as an example, the Peripheral Components Interface (PCI) bus as specified in PCI Local Bus Specification Rev 2.2 by the PCI Special Interest Group (www.pcisig.com). Peripheral devices 114 may include devices such as a graphics adapter, high-speed network adapter, hard-disk controller, and the like.

The depicted embodiment of system 100 further includes a general purpose I/O (GPIO) port 112 connected to I/O bus 110 and to which a service processor 116 is connected. Service processor 116 is used to provide support for low-level system functions such as power monitoring, cooling fan control, hardware error logging, and so forth.

System 100 according to the present invention further includes error logic 120. Error logic 120 is connected to the set of main processors 102 to provide a fast response to a processor internal error. The use of dedicated hardware to respond to processor internal errors beneficially eliminates dependence on error handling software that may or may not be available at the time a processor internal error is issued.

Referring now to the drawings, a block diagram of selected elements of error logic 120 according to one embodiment of the present invention is depicted. Error logic 120 comprises dedicated hardware that is integrated into the main system control logic and connected directly to system power. As such, error logic 120 is functional as soon as power is applied to system 100, in contrast to system management interrupt (SMI) software modules, which are installed as part of the POST. In the depicted embodiment, error logic 120 includes an error detection unit 122 that is configured to receive processor internal error signals from each of the main processors 102. Processor internal error signals are generally asserted when a processor detects an error unrelated to processor bus operation. If, for example, a processor with an internal cache memory detects a parity error in the cache, the error may result in the assertion of the internal error signal. The internal error signal may be referred to herein as the IERR signal consistent with the notation commonly in use for the Pentium® family of processors from Intel Corporation.

Error detection unit 122 is further configured to assert an error detect signal 124 upon determining that one or more of the processor IERR signals has been asserted by its corresponding processor. Error detection unit 122 may include suitable latching circuitry to prevent an asserted IERR signal from being reset prematurely and additional logic to produce a pulse on error detect signal 124 in response to an IERR signal such that error detect 124 is pulsed once and only once for each internal error “event” where an event lasts from the assertion of any IERR signal until a system reset is initiated.
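The latch-and-pulse behavior described for error detection unit 122 can be illustrated with a small software model. This is only a sketch under stated assumptions: the patent describes dedicated hardware, and the class name `ErrorDetectUnit` and its methods are invented here for illustration.

```python
class ErrorDetectUnit:
    """Toy model: latches IERR inputs and pulses 'error detect' once per event."""

    def __init__(self, num_processors):
        self.latched = [False] * num_processors  # one latch per IERR line
        self.event_active = False                # True from first IERR until reset

    def sample(self, ierr_signals):
        """Feed the current IERR lines; return True only when a new event starts."""
        for index, asserted in enumerate(ierr_signals):
            if asserted:
                self.latched[index] = True  # latch so a brief assertion is not lost
        if any(self.latched) and not self.event_active:
            self.event_active = True
            return True  # the single pulse on error detect for this event
        return False

    def system_reset(self):
        """A system reset ends the event and clears the latches."""
        self.latched = [False] * len(self.latched)
        self.event_active = False
```

In this model a second processor asserting IERR during the same event is latched but produces no second pulse, which matches the "once and only once" wording above.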

Error detect signal 124 provides an input to error logging unit 126. Error logging unit 126 is configured to document an internal error by capturing the identity of the offending processor. Because the error detection logic is not resident on the processor bus, it does not have visibility to the internal registers of the processors. In most cases when a processor asserts IERR, the processor has experienced an internal fatal error rendering most of its information unusable.

Error logging unit 126 is configured to record and preserve IERR information in an Error Status Register 128. Error status register 128 is configured to store internal error status for each processor 102 of system 100. Referring to the drawings, a selected portion of one embodiment of error status register 128 is depicted. In this embodiment, error status register 128 includes a set of bit pairs 140A through 140N (generically or collectively referred to as bit pair(s) 140) for each processor 102. A first bit 142 of each bit pair 140 is a “current” bit that indicates whether the corresponding processor 102 is currently asserting its internal error signal while a second bit 144 of each bit pair 140 is a “cumulative” bit that indicates whether the corresponding processor has previously asserted its internal error signal. Whereas the current bits 142 are cleared each time a system reset occurs, the cumulative bits 144 are preserved. Thus, the set of cumulative bits 144 indicate the cumulative set of processors 102 that have internal error problems.
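The bit-pair scheme can be sketched as a small model with one "current" and one "cumulative" flag per processor. The class and method names below are hypothetical, and setting the cumulative bit at the moment the error is logged is an assumption; the text only states that cumulative bits survive a reset while current bits do not.

```python
class ErrorStatusRegister:
    """Toy model of the per-processor (current, cumulative) bit pairs."""

    def __init__(self, num_processors):
        self.current = [False] * num_processors     # cleared on every system reset
        self.cumulative = [False] * num_processors  # preserved across resets

    def log_ierr(self, processor):
        """Record that 'processor' is asserting its internal error signal."""
        self.current[processor] = True
        self.cumulative[processor] = True  # assumption: set together with current

    def system_reset(self):
        """Clear only the current bits; the error history is kept."""
        self.current = [False] * len(self.current)

    def failed_processors(self):
        """Indices of processors that have ever asserted IERR."""
        return [i for i, failed in enumerate(self.cumulative) if failed]
```

After `log_ierr(1)` followed by `system_reset()`, processor 1 no longer shows a current error but still appears in `failed_processors()`, which is the information needed to keep it disabled on restart.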

In the depicted embodiment, error status register 128 is accessible to the other components in system 100 through a system interface, such as an Industry Standard Architecture (ISA) bus, identified by reference numeral 130. System interface 130 may include sufficient data, address, and control signals to permit processors 102 to read the contents of status register 128. In addition, error status register 128 may include one or more bits set in response to an internal error event that provide an interrupt signal to service processor (SP) 116 such that service processor 116 is interrupted in response to a main processor internal error event. In response to an interrupt from error logging unit 128, service processor 116 may be programmed to take specific actions with respect to system power such as powering down and so forth. In addition, service processor 116 may be programmed to log or record additional information regarding the internal error. This additional information may include, for example, the time at which an internal error signal was asserted.

The depicted embodiment of error status register 128 further includes an I2C interface for connecting to an I2C bus thereby enabling communication between error status register 128 and an external device in the event that it becomes desirable to access the contents of register 128 externally.

Error logging unit 126, in addition to providing logged information to error status register 128, is configured to generate an error log complete signal 129 when the logging unit has completed its documentation of an internal error event. Error log complete signal 129 is provided to a system reset unit 132 and a processor control unit 134. System reset logic 132 is configured to generate a system reset that is provided to each processor 102 following an internal error event. System reset logic 132 may be further controlled by an error status updated signal 131 produced by error status register 128 indicating completion of a status register update following an internal error event. Processor control logic 134 is configured to generate a unique processor enable signal for each processor 102 in system 100 following an internal error event. The processor enable signals are de-asserted if the corresponding processor was responsible for the internal error event and the cause of the internal error could not be corrected. The combination of system reset unit 132 and processor control logic 134 provides means for initiating a system reset and enabling only those processors 102 that are functional following an internal error.

Turning now to the drawings, a flow diagram representing selected elements of a method 150 of responding to internal error signals in a data processing system according to one embodiment of the invention is presented. Initially, the data processing system is executing (block 151) in a normal operating mode. For purposes of this disclosure, the normal operating mode represents any state following the application of power to the system in which the internal error signals are not asserted. Accordingly, normal operating mode does not imply that an operating system has been installed and application programs are executing or capable of being executed. Instead, the normal operating mode could be achieved substantially immediately following the application of power to the system if none of the IERR signals is asserted.

The data processing system and, more particularly, the internal error logic of the system, monitors (block 152) for the assertion of an IERR signal by one or more of the main processors. As long as the main processors do not issue any internal error signals, the system remains in its normal operating mode. During this time, an operating system may be installed and one or more application programs may be executing. If an internal error is detected, the error logic logs (block 154) the error and updates (block 156) the error status register as described in greater detail above. After updating the status register, the system disables (block 158) any nonfunctional main processors. The disabled processors would typically include any processors currently asserting their internal error signals as well as any processors that asserted their error signals previously. After disabling the appropriate main processors, the system determines (block 160) whether any functional processors remain in the system. If all processors are currently asserting or have previously asserted their internal error signals, the error logic generates a system halt (block 164). If there are one or more functional processors remaining, the error logic initiates a reset (block 162) to restart the system with the functional processors. In this manner, the data processing system is able to respond to internal errors without relying on any error handling software or operating system code.
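The decision flow of method 150 (blocks 152 through 164) can be summarized in a short software sketch. The patent implements this in hardware; the function name `respond_to_ierr`, its argument layout, and the string return values are assumptions made purely for illustration.

```python
def respond_to_ierr(ierr_now, previously_failed):
    """Model one pass through blocks 152-164.

    ierr_now[i]          -- True if processor i is asserting IERR right now
    previously_failed[i] -- True if processor i asserted IERR in the past
    Returns (action, enabled, cumulative), where action is "normal",
    "reset" or "halt" and enabled[i] is the processor enable signal.
    """
    if not any(ierr_now):
        # Block 152: no IERR asserted, stay in the normal operating mode.
        enabled = [not failed for failed in previously_failed]
        return "normal", enabled, list(previously_failed)

    # Blocks 154/156: log the error and update the cumulative status.
    cumulative = [old or new for old, new in zip(previously_failed, ierr_now)]

    # Block 158: disable every processor with a current or previous error.
    enabled = [not failed for failed in cumulative]

    # Block 160: reset with the survivors (162), or halt if none remain (164).
    action = "reset" if any(enabled) else "halt"
    return action, enabled, cumulative
```

For a three-processor system where only processor 1 asserts IERR, the call returns a reset with processors 0 and 2 enabled; if those two later fail as well, the next call returns a halt.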

It will be apparent to those skilled in the art having the benefit of this disclosure that the present invention contemplates a system and method for responding to processor internal errors in a data processing system. It is understood that the form of the invention shown and described in the detailed description and the drawings are to be taken merely as presently preferred examples. It is intended that the following claims be interpreted broadly to embrace all the variations of the preferred embodiments disclosed.

Claims (14)

1. A data processing system, comprising:

multiple main processors connected to a system bus;

a system memory connected to the system bus and accessible to the main processors;

error logic, external to the main processors, and configured to receive internal error signals asserted by the main processors and to respond to an internal error signal by disabling a main processor asserting an internal error signal and restarting the system with any remaining functional main processors, wherein the error logic includes an error status register accessible via an I2C bus; and

a service processor configured to receive a service processor interrupt generated by the error logic.

2. The system ofwherein the error logic is further configured to record the internal error signal in an error status register of the error logic.

3. The system ofwherein the error status register includes at least a pair of bits corresponding to each of the main processors, wherein a first bit of each pair is indicative of whether the corresponding main processor is currently asserting its internal error signal and a second bit of each pair is indicative of whether the corresponding main processor has asserted its internal error signal previously.

4. The system ofwherein the error logic is functional substantially immediately following the application of power to the data processing system.

5. The system ofwherein the error logic includes an error detection unit configured to receive an internal error signal from each of the main processors and further configured to generate an error detect signal responsive to assertion of an internal error signal by any of the processors.

6. The system ofwherein the error logic further includes error logging logic configured to receive the error detect signal and, responsive thereto, to update an error status register to reflect the internal error signal.

7. The system ofwherein responsive to the service processor interrupt, the service processor is configured to power down the system.

8. Error detection logic suitable for use in a data processing system having multiple main processors, wherein the error detection logic is external to the main processors and is configured to receive internal error signals asserted by the main processors and further configured to respond to an internal error signal by disabling a processor asserting signal, generating a service processor interrupt, and restarting the system with any remaining functional processors and further wherein the error detection logic includes an error status register externally accessible via an I2C bus.

9. The error logic ofwherein the error logic is further configured to record the internal error signal in the error status register of the error logic.

10. The error logic ofwherein the error status register includes at least a pair of bits corresponding to each of the main processors, wherein a first bit of each pair is indicative of whether the corresponding main processor is currently asserting its internal error signal and a second bit of each pair is indicative of whether the corresponding main processor has asserted its internal error signal previously.

11. The error logic ofwherein the error logic is functional substantially immediately following the application of power to the data processing system.

12. The error logic ofwherein the error logic includes an error detection unit configured to receive an internal error signal from each of the main processors and further configured to generate an error detect signal responsive to assertion of an internal error signal by any of the processors.

13. The error logic ofwherein the error logic further includes error logging logic configured to receive the error detect signal and, responsive thereto, to update the error status register to reflect the internal error signal.

14. The error logic ofwherein the error logic is further configured to generate the service processor interrupt responsive to error status register update.

US10/054,017 | 2002-01-22 | 2002-01-22 | Processor internal error handling in an SMP server | Active 2023-06-23 | US6912670B2 (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
US10/054,017US6912670B2 (en) 2002-01-222002-01-22Processor internal error handling in an SMP server

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
US10/054,017US6912670B2 (en) 2002-01-222002-01-22Processor internal error handling in an SMP server

Publications (2)

ID=21988209

Family Applications (1)

Application NumberTitlePriority DateFiling Date
US10/054,017Active2023-06-23US6912670B2 smp machine check error (en) 2002-01-222002-01-22Processor internal error handling in an SMP server

Country Status (1)


Families Citing this family (15)

Publication number | Priority date | Publication date | Assignee | Title
US7065599B2 (en) * | 2001-08-10 | 2006-06-20 | Sun Microsystems, Inc. | Multiprocessor systems
US8302111B2 (en) | 2003-11-24 | 2012-10-30 | Time Warner Cable Inc. | Methods and apparatus for hardware registration in a network device
US7266726B1 (en) | 2003-11-24 | 2007-09-04 | Time Warner Cable Inc. | Methods and apparatus for event logging in an information network
US9213538B1 (en) | 2004-02-06 | 2015-12-15 | Time Warner Cable Enterprises Llc | Methods and apparatus for display element management in an information network
US8078669B2 (en) | 2004-02-18 | 2011-12-13 | Time Warner Cable Inc. | Media extension apparatus and methods for use in an information network
US7426657B2 (en) * | 2004-07-09 | 2008-09-16 | International Business Machines Corporation | System and method for predictive processor failure recovery
US7650539B2 (en) * | 2005-06-30 | 2010-01-19 | Microsoft Corporation | Observing debug counter values during system operation
US8370818B2 (en) | 2006-12-02 | 2013-02-05 | Time Warner Cable Inc. | Methods and apparatus for analyzing software interface usage
US7689567B2 (en) * | 2006-12-28 | 2010-03-30 | Sap Ag | Error handling for intermittently connected mobile applications
US20080256400A1 (en) * | 2007-04-16 | 2008-10-16 | Chih-Cheng Yang | System and Method for Information Handling System Error Handling
TWI363962B (en) * | 2007-06-01 | 2012-05-11 | Holtek Semiconductor Inc |
TWI369608B (en) * | 2008-02-15 | 2012-08-01 | Mstar Semiconductor Inc | Multi-microprocessor system and control method therefor
WO2010007469A1 (en) * | 2008-07-16 | 2010-01-21 | Freescale Semiconductor, Inc. | Micro controller unit including an error indicator module
US9229843B2 (en) * | 2010-04-28 | 2016-01-05 | International Business Machines Corporation | Predictively managing failover in high availability systems
CN102567177B (en) * | 2010-12-25 | 2014-12-10 | 鸿富锦精密工业(深圳)有限公司 | System and method for detecting error of computer system


Patent Citations (21)

Publication number | Priority date | Publication date | Assignee | Title
US4415973A (en) | 1980-03-28 | 1983-11-15 | International Computers Limited | Array processor with stand-by for replacing failed section
US4860196A (en) | 1986-12-01 | 1989-08-22 | Siemens Aktiengesellschaft | High-availability computer system with a support logic for a warm start
US5325517A (en) | 1989-05-17 | 1994-06-28 | International Business Machines Corporation | Fault tolerant data processing system
US5280606A (en) | 1990-03-08 | 1994-01-18 | Nec Corporation | Fault recovery processing for supercomputer
US5335471A (en) | 1993-03-08 | 1994-08-09 | Kupiec Daniel J | Column enclosing kit
US5491788A (en) | 1993-09-10 | 1996-02-13 | Compaq Computer Corp. | Method of booting a multiprocessor computer where execution is transferring from a first processor to a second processor based on the first processor having had a critical error
US5583987A (en) | 1994-06-29 | 1996-12-10 | Mitsubishi Denki Kabushiki Kaisha | Method and apparatus for initializing a multiprocessor system while resetting defective CPU's detected during operation thereof
US5530946A (en) * | 1994-10-28 | 1996-06-25 | Dell Usa, L.P. | Processor failure detection and recovery circuit in a dual processor computer system and method of operation thereof
US5884019A (en) | 1995-08-07 | 1999-03-16 | Fujitsu Limited | System and method for collecting dump information in a multi-processor data processing system
US5933614A (en) | 1996-12-31 | 1999-08-03 | Compaq Computer Corporation | Isolation of PCI and EISA masters by masking control and interrupt lines
US5864653A (en) | 1996-12-31 | 1999-01-26 | Compaq Computer Corporation | PCI hot spare capability for failed components
US6081865A (en) | 1996-12-31 | 2000-06-27 | Compaq Computer Corporation | Isolation of PCI and EISA masters by masking control and interrupt lines
US6158015A (en) | 1998-03-30 | 2000-12-05 | Micron Electronics, Inc. | Apparatus for swapping, adding or removing a processor in an operating computer system
US6233680B1 (en) * | 1998-10-02 | 2001-05-15 | International Business Machines Corporation | Method and system for boot-time deconfiguration of a processor in a symmetrical multi-processing system
US6378027B1 (en) * | 1999-03-30 | 2002-04-23 | International Business Machines Corporation | System upgrade and processor service
US6536000B1 (en) * | 1999-10-15 | 2003-03-18 | Sun Microsystems, Inc. | Communication error reporting mechanism in a multiprocessing computer system
US6516429B1 (en) * | 1999-11-04 | 2003-02-04 | International Business Machines Corporation | Method and apparatus for run-time deconfiguration of a processor in a symmetrical multi-processing system
US6550019B1 (en) * | 1999-11-04 | 2003-04-15 | International Business Machines Corporation | Method and apparatus for problem identification during initial program load in a multiprocessor system
US6574748B1 (en) * | 2000-06-16 | 2003-06-03 | Bull Hn Information Systems Inc. | Fast relief swapping of processors in a data processing system
US6742139B1 (en) * | 2000-10-19 | 2004-05-25 | International Business Machines Corporation | Service processor reset/reload
US6708297B1 (en) * | 2000-12-29 | 2004-03-16 | Emc Corporation | Method and system for monitoring errors on field replaceable units

Cited By (8)

Publication number | Priority date | Publication date | Assignee | Title
US20070150713A1 (en) * | 2005-12-22 | 2007-06-28 | International Business Machines Corporation | Methods and arrangements to dynamically modify the number of active processors in a multi-node system
US20080005538A1 (en) * | 2006-06-30 | 2008-01-03 | Apparao Padmashree K | Dynamic configuration of processor core banks
US20090013221A1 (en) * | 2007-06-25 | 2009-01-08 | Hitachi Industrial Equipment System Co., Ltd. | Multi-component system
US7861115B2 (en) * | 2007-06-25 | 2010-12-28 | Hitachi Industrial Equipment Systems Co., Ltd. | Multi-component system
US20110145662A1 (en) * | 2009-12-16 | 2011-06-16 | Microsoft Corporation | Coordination of error reporting among multiple managed runtimes in the same process
US8429454B2 (en) | 2009-12-16 | 2013-04-23 | Microsoft Corporation | Coordination of error reporting among multiple managed runtimes in the same process
US20130152081A1 (en) * | 2011-12-13 | 2013-06-13 | International Business Machines Corporation | Selectable event reporting for highly virtualized partitioned systems
US8924971B2 (en) | 2011-12-13 | 2014-12-30 | International Business Machines Corporation | Selectable event reporting for highly virtualized partitioned systems

Similar Documents

Publication | Title
US6912670B2 (en) | Processor internal error handling in an SMP server
US7409580B2 (en) | System and method for recovering from errors in a data processing system
TWI337707B (en) | System and method for logging recoverable errors
US7260749B2 (en) | Hot plug interfaces and failure handling
US6880113B2 (en) | Conditional hardware scan dump data capture
US7685476B2 (en) | Early notification of error via software interrupt and shared memory write
US7197670B2 (en) | Methods and apparatuses for reducing infant mortality in semiconductor devices utilizing static random access memory (SRAM)
US7281171B2 (en) | System and method of checking a computer system for proper operation
US6934879B2 (en) | Method and apparatus for backing up and restoring data from nonvolatile memory
US7447943B2 (en) | Handling memory errors in response to adding new memory to a system
US7430683B2 (en) | Method and apparatus for enabling run-time recovery of a failed platform
US6615374B1 (en) | First and next error identification for integrated circuit devices
US20040225831A1 (en) | Methods and systems for preserving dynamic random access memory contents responsive to hung processor condition
US7877643B2 (en) | Method, system, and product for providing extended error handling capability in host bridges
US4593391A (en) | Machine check processing system
US7290128B2 (en) | Fault resilient boot method for multi-rail processors in a computer system by disabling processor with the failed voltage regulator to control rebooting of the processors
US20170149925A1 (en) | Processing cache data
US20030023932A1 (en) | Method and apparatus for parity error recovery
US11068360B2 (en) | Error recovery method and apparatus based on a lockup mechanism
US6904546B2 (en) | System and method for interface isolation and operating system notification during bus errors
US6463492B1 (en) | Technique to automatically notify an operating system level application of a system management event
JPH1165898A (en) | Maintenance system for electronic computer
JP2005070993A (en) | Device having transfer mode abnormality detection function and storage controller, and interface module for the controller
US11360839B1 (en) | Systems and methods for storing error data from a crash dump in a computer system
JP3757407B2 (en) | Control device

Legal Events

Code | Title | Description
AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:WILKIE, BRUCE J.;REEL/FRAME:012605/0947. Effective date: 20020111
STCF | Information on status: patent grant | Free format text: PATENTED CASE
FPAY | Fee payment | Year of fee payment: 4
REMI | Maintenance fee reminder mailed |
FPAY | Fee payment | Year of fee payment: 8
SULP | Surcharge for late payment | Year of fee payment: 7
AS | Assignment | Owner name: LENOVO INTERNATIONAL LIMITED, HONG KONG. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:034194/0291. Effective date: 20140926
FPAY | Fee payment | Year of fee payment: 12

sudo ras-mc-ctl --errors

No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:

1 2019-07-15 20:41:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x942000460082110a, addr=0x243e9f840, tsc=0x8b99a7f84108, walltime=0x5d2c8276, cpuid=0x000706a1, bank=0x00000001

2 2019-07-16 01:34:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x942000460082110a, addr=0x24b9df840, tsc=0xa38afb430944, walltime=0x5d2cc722, cpuid=0x000706a1, bank=0x00000001

3 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d95741ee28, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

4 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d957436320, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

5 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d957451d82, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

6 2019-07-16 01:50:08 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000420082110a, tsc=0xa4d957456482, walltime=0x5d2ccae1, cpuid=0x000706a1, bank=0x00000001

7 2019-07-16 03:20:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000400082110a, tsc=0xac3468f91976, walltime=0x5d2cdffa, cpuid=0x000706a1, bank=0x00000001

8 2019-07-16 03:20:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000400082110a, tsc=0xac3468fb7a3a, walltime=0x5d2cdffa, cpuid=0x000706a1, bank=0x00000001

9 2019-07-16 15:08:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000460082110a, tsc=0xe60f3181c782, walltime=0x5d2d85ea, cpuid=0x000706a1, bank=0x00000001

10 2019-07-16 15:08:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000460082110a, tsc=0xe60f31852002, walltime=0x5d2d85ea, cpuid=0x000706a1, bank=0x00000001

11 2019-07-17 02:52:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x942000460082110a, addr=0x249c5f840, tsc=0x11f964ae442b2, walltime=0x5d2e2aea, cpuid=0x000706a1, bank=0x00000001

12 2019-07-17 15:24:09 +0700 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-2 Generic Error, mcg mcgstatus=0, mci Corrected_error Error_enabled Threshold based error status: green, Large number of corrected cache errors. System operating, but might lead to uncorrected errors soon, mcgcap=0x00000c07, status=0x902000440082110a, tsc=0x15d0984e5de54, walltime=0x5d2edb2a, cpuid=0x000706a1, bank=0x00000001
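
The `status=` values in these events can be decoded by hand. The sketch below assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (the `cpuid` in the log points to an Intel part) and covers only a few fields. The function name `decode_mci_status` and the dictionary keys are made up for the example. The threshold and corrected-count fields are only meaningful when the matching capability bits are set in `mcgcap`.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine check bank status value into a few fields."""
    threshold = {0: "no tracking", 1: "green", 2: "yellow", 3: "reserved"}
    code = status & 0xFFFF                     # MCA error code, bits 15:0
    fields = {
        "valid": bool((status >> 63) & 1),             # VAL
        "overflow": bool((status >> 62) & 1),          # OVER
        "uncorrected": bool((status >> 61) & 1),       # UC
        "enabled": bool((status >> 60) & 1),           # EN
        "misc_valid": bool((status >> 59) & 1),        # MISCV
        "addr_valid": bool((status >> 58) & 1),        # ADDRV
        "context_corrupt": bool((status >> 57) & 1),   # PCC
        "threshold_status": threshold[(status >> 53) & 0x3],  # bits 54:53
        "corrected_count": (status >> 38) & 0x7FFF,            # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,        # bits 31:16
        "mca_error_code": code,
    }
    # Cache hierarchy errors use the compound form 000F 0001 RRRR TTLL.
    if (code & 0xEF00) == 0x0100:
        fields["filtered"] = bool((code >> 12) & 1)            # F bit
        fields["cache_level"] = ("L0", "L1", "L2", "generic")[code & 0x3]
    return fields


first = decode_mci_status(0x942000460082110A)   # event 1 in the log
third = decode_mci_status(0x902000420082110A)   # event 3 in the log
```

Under these assumptions, event 1 decodes as a valid, corrected, filtered Level-2 cache error with threshold status "green" and a valid address. Event 3 has the address-valid bit clear, which matches the missing `addr=` field on that line.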

Machine Check Error occurs when installing Ubuntu (include image of log)

Here is the image for the logs: Image here. The installation process hung at this point. The message was not displayed.

The MCEs (at the bottom of the image) occurred soon after I selected "install Ubuntu" from the menu. I don't have any idea what the values in them mean. Can someone explain them? And, based on your experience or expertise, what may be the problem that triggered these messages? RAM, CPU, PSU or something else?

Also, the log mentions a command to run. Where can I run a command like this in this situation?

Here are some spec for my setup:

  • USB stick for Ubuntu 16.04, created with UNetBootin;
  • Processor: Xeon E5-1650 v3;
  • Motherboard: ASRock X99 WS-E;
  • Power supply: EVGA SUPERNOVA 1600 G2 120-G2-1600-X1;
  • RAM: 16GB 288-Pin SDRAM DDR4 2400 ECC Registered;
  • GPU: EVGA GTX 680;

If any more information is helpful, please let me know. I really appreciate your help!

Edit: Just to be clear, my computer does not have any OS installed yet. I am building it from scratch. I encountered this problem when I was trying to install Ubuntu. Later, I made a Windows USB stick, but it didn't work either. After the Windows logo was displayed for 5 seconds, the screen went black and nothing happened.

A new customSetting has been created (starting with 8.0 HF6 and later) to enforce the connection string to resolve the real machine:
key="AgentPushPreferFqdn".
Default value = 0

The core setting is used to force FQDN instead of netbios for access.

Set AgentPushPreferFqdn to "1" in CoreSettings.config (under c:\programdata\symantec\smp\settings) on your SMP server.
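
As a rough way to confirm the value after editing, the setting can be read back with a short script. This is a sketch under assumptions: the XML shape in `SAMPLE` (a `customSetting` element with `key` and `value` attributes) is a guess at the file layout rather than documented vendor syntax, and `read_custom_setting` is a hypothetical helper. Only the key name and its default of 0 come from the text above.

```python
import xml.etree.ElementTree as ET

# Assumed layout of a CoreSettings.config fragment; check it against the real file.
SAMPLE = """<configuration>
  <customSettings>
    <customSetting key="AgentPushPreferFqdn" type="local" value="1" />
  </customSettings>
</configuration>"""


def read_custom_setting(xml_text: str, key: str, default: str = "0") -> str:
    """Return the value of the first customSetting with this key, else the default."""
    root = ET.fromstring(xml_text)
    for node in root.iter("customSetting"):
        if node.get("key") == key:
            return node.get("value", default)
    return default


prefer_fqdn = read_custom_setting(SAMPLE, "AgentPushPreferFqdn")
```

To check a live server, the contents of the actual file would be passed in place of `SAMPLE`.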

 

Note:
Please note that agent push relies on the RPC server being available and properly configured in your environment. In some situations you may need to troubleshoot "The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)" with your network team.

Even after enabling the mentioned coresetting above, the agent push may still fail the same way.
In the NS logs you may see this warning:

(ClientMachine316813.domain.com) Intermediate discovery failures: WMI, NetAPI, Registry
(NetAPI) Failed to retrieve name/domain: API error with HRESULT: 0x00000035
(WMI) Failed to retrieve name/domain: The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
(Registry) Failed to retrieve name/domain: The network path was not found.
-----------------------------------------------------------------------------------------------------
Date: 4/9/2019 10:42:36 AM, Tick Count: 37212234 (10:20:12.2340000), Size: 577 B
Process: AeXSvc (5068), Thread ID: 69, Module: Altiris.NS.dll
Priority: 2, Source: DiscoverMachines.All

To validate that you have a network or configuration issue, you could, for example, try to connect to 'Services' (using the "Connect to another computer..." option in the Services console) from your SMP server to one of the affected machines. If you get "Error 1722: The RPC server is unavailable" on the same machines where the agent push is failing, then you need to troubleshoot the RPC service on those client machines.
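
A quicker first check, before opening the Services console, is to see whether the client answers on TCP port 135, the well-known RPC endpoint mapper port. The sketch below is a reachability test only. It says nothing about the dynamic RPC ports, WMI, or permissions, and `rpc_endpoint_reachable` is a hypothetical helper name.

```python
import socket


def rpc_endpoint_reachable(host: str, port: int = 135, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `rpc_endpoint_reachable("ClientMachine316813.domain.com")` from the SMP server would show whether a firewall or a name resolution problem stands in the way before RPC itself is involved.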

Some suggestions around this issue are the following:

Make sure that the following services are running on the Target Machines:

  • Remote Procedure Call (RPC)
  • Computer Browser
  • Server
  • Remote Registry
  • Windows Management Instrumentation
  • Netlogon
  • Remote Desktop Services
  • Windows Remote Management (WS-Management)
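
The list above can be checked from the SMP server with the built-in `sc` command. The sketch below only builds the command lines and parses their text output. The short service names in `SERVICES` are the usual Windows names for these display names, but verify them on your systems, since some services (Computer Browser, for example) may be absent on newer Windows versions. The helper names and the sample output are illustrative assumptions.

```python
import re
from typing import List

# Short service name -> display name from the list above.
SERVICES = {
    "RpcSs": "Remote Procedure Call (RPC)",
    "Browser": "Computer Browser",
    "LanmanServer": "Server",
    "RemoteRegistry": "Remote Registry",
    "Winmgmt": "Windows Management Instrumentation",
    "Netlogon": "Netlogon",
    "TermService": "Remote Desktop Services",
    "WinRM": "Windows Remote Management (WS-Management)",
}


def query_command(host: str, service: str) -> List[str]:
    """Build the argument list for: sc \\\\host query service."""
    return ["sc", f"\\\\{host}", "query", service]


def is_running(sc_output: str) -> bool:
    """Return True if the STATE line of `sc query` output reports RUNNING."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return bool(match) and match.group(1) == "RUNNING"


# Illustrative output, not captured from a real machine.
SAMPLE_OUTPUT = """SERVICE_NAME: RpcSs
        TYPE               : 20  WIN32_SHARE_PROCESS
        STATE              : 4  RUNNING
"""
```

On a Windows host, each command from `query_command` could be passed to `subprocess.run(..., capture_output=True, text=True)` and its `stdout` fed to `is_running`.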

Also, you could look at these suggestions: https://www.techjunkie.com/rpc-server-is-unavailable/

 
