Sunday 3 July 2022

You must check these four things on your computer at least once a year

If you have purchased a new computer, perhaps have built your own PC or have a PC/notebook which is over a year old, you should check these things now and at least once a year.

Introduction

Modern computers are able to monitor various temperatures, fan speeds and voltages throughout the system. You can easily download software to display these values (often the BIOS setup/configuration menu system will display them too). Using SMART, we can also find out about the health of our internal disk drives too.

If you have received a new or second-hand system recently, a typical problem is that during transportation, the CPU heatsink may have come loose. This would mean that your CPU will get too hot and when this happens, modern PCs will usually 'throttle back' and go slow to avoid permanent damage to the CPU. 

If you have a computer that is a few years old, the thermal paste between the CPU and the CPU heatsink may have dried out. Again, your CPU may be getting too hot.

Also, if your PC\Notebook is older than a year or so, it may have accumulated dust and cobwebs! This can seriously affect the air flow through the system and can thus cause the hard drives and CPU to overheat.

If you have overclocked your system, you will need to be particularly careful, because you are probably running it at its maximum performance capabilities - any overheating may cause damage or a system crash!

Some monitoring programs which I recommend you to try are:

  • Core Temp - Displays CPU temperatures in stay-ontop box or in  System Tray.
  • Open Hardware Monitor - reports and plots many different parameters
  • Speed Fan - reports and can control fan speeds (depending on hardware compatibility)
  • HWInfo - very techy app. (can be dangerous) but extremely powerful. Tick the 'Sensors only' checkbox. Can report if Thermal Throttling has occurred due to CPU overheating.
Many other programs are available, including OEM manufacturer-specific (Asus, Gigabyte, Dell, etc.) applications to control fan speeds, voltages, etc. Use a search engine to try out some if you are feeling brave! This type of software can have hardware compatibility issues, so find one that works well on your PC\Notebook.

Here are screenshots of the ones mentioned above.

Temperatures can be display in the System Tray and set a threshold.

OHM can plot selected parameters.

SpeedFan can also control fans speeds (if compatible)



HWInfo is very powerful - use with care if exploring the other options!

1. CPU Temperature

Your CPU will have a maximum transistor junction temperature specified by the manufacturer (Intel or AMD) called TjMax. Core Temp will display this maximum temp. if it recognises your CPU and knows the specification. However, it may not be correct, so it is best to double-check on the CPU manufacturers website.

The TjMax temperature should never be exceeded because most CPUs will simply slow down if they get too hot and thus dissipate less heat. However, if your CPU Core Temperature is getting within 5 deg C of TjMax, you should be very worried! A CPU Core temperature of 60-70 deg C or less is desirable.

What you should do is stress your CPU (there is a Benchmark stress test tab in CPUID's CPU-Z application - or download a copy of Prime95). Then make a note of the maximum CPU temperature recorded by Core Temp and set a warning threshold above that so that if your CPU starts to get hotter after 6 months or so, you know that your thermal paste is drying up or the air vents are blocked or you have faulty fans!

2. CPU Fan Speed

When I worked for Research Machines, I was responsible for introducing hardware monitoring software into the PC production process in the factory. We produced up to 1000 PCs and notebooks a day. The software would run a stress test and then measure the temperatures, fans speeds and voltages. Because it knew what the expected values were for each system (depending on CPU, case type and mainboard model) we could detect and fail any system which had readings outside of the expected range.

After trialling the software for a few weeks, it quickly became apparent that measuring CPU temperature was NOT a good indication of a faulty heatsink or CPU fan! The reason for this was that if the CPU got too hot, then the CPU fan would simply speed up in order to reduce the temperature of the CPU. Therefore, unless the heatsink was completely missing, the CPU would never get too hot!

Also, because each mainboard was tested using Automated Test Equipment at the manufacturers factory, we very rarely saw any bad voltages on any of these new OEM mainboards or notebooks.

In fact, the most important parameter to monitor was the CPU fan speed! If the CPU heatsink had not been attached correctly or the wrong part used, then the fan would spin at a much faster rate than expected for that CPU+case type+mainboard combination. If you are producing hundreds of systems a day, you can quickly plot a graph from the logged data to show the min., max. and median values for each type of system.

Before the sensor monitoring software was introduced, RM had an overall system customer 'return rate' of 0.4% (half of these were always due to end user error). Once the monitoring software was introduced, the return rate fell to 0.2% (most of these returns were thus due to customer error), and 95% of the extra build faults that were detected in the factory burn-in test stage were due to incorrectly fitted heatsinks on the assembly line! An incorrectly fitted heatsink would be likely to come loose during delivery, thus causing a customer return - hence the great reduction in the end-user return rate.

The moral of this story is that your CPU temperatures may look OK, but your CPU fan may spinning like mad and desperately trying to keep the CPU cool! This could be due to:

  • Faulty CPU fan (or fan control software, if installed)
  • Loose CPU heatsink
  • Insufficient or dry thermal paste
  • High internal case air temperature due to blocked vents
  • Inadequate heatsink/fan assembly
Consider inspecting and cleaning your system every year and renewing the thermal paste every few years.

If your CPU is fitted with a stock Intel heatsink, you might want to consider replacing it with a better one. The Artic range are efficient and good value. I just replaced the stock Intel cooler for my Haswell 1150 CPU with an Artic  Alpine 12 CO and it is now much quieter and runs a lot cooler too. My i5 4670K now stays well below 60 deg C even under stress.

Arctic 12 CO for 115x CPUs

If you just need to replace any dried-out thermal paste on your heatsink with new paste, make sure to buy decent thermal paste which has good performance and does not dry out. Also ensure that when you replace the heatsink, it makes good contact with the top of the CPU (clean all surfaces with isopropyl alcohol first, it is very useful stuff!).

3. Hard Disk Temperatures and Disk Health

Because 'spinning rust' hard disks do not like getting too hot (over 60 deg C) you should check their temperatures in case they have a fever (or Covid19 ;-). Use SMART to do this for each drive (Open Hardware Monitor, Speed Fan, CPU-Z or HWInfo will display disk temperatures via SMART).

SMART will also display your disk's health, although in my experience you rarely get a SMART 'fail' indication until it is too late. If you suspect a disk may be on it's way out, use CrystalDiskInfo to check all the SMART data, especially the Read Error rate, Reallocated Sector value, Seek Error Rate and Uncorrectable errors (the Raw Values column often contains the 'event count' value). 


You are supposed to check that the Current Value is well above the Threshold value, but these numbers are 'mangled' by the drive manufacturers and although they seem to use 100 as the start value (for 100% good?), some values start at 200. See here for a real-world explanation and how manufacturers have turned a good idea into a complete and useless mess!

Spinning rust hard disks can be expected to get seek errors but the Reallocated Sector Count (raw value) should be 0, or at least a low number, because it means that the disk has got so many errors from some sectors that it has stopped using that sector and used a different sector instead. Uncorrectable errors are very bad news (it means the OS should have reported a serious disk error and failed in some way). If you get DMA CRC errors it usually indicates you may have a bad disk data cable or bad data connection (more common with older IDE drives).


4. Internal RTC Battery

Lastly, you should check the voltage of your Internal Real Time Clock (RTC) battery at least once a year. All PCs and laptops have these and they are usually a CR2032 coin cell batteries in PCs which have a voltage of approx. 3.2V when new. These usually last only 2-5 years.

The CMOS RTC chip also contains some BIOS settings for the computer which are remembered even when the system is unplugged from the mains.

Lest we forget...

Software such as Open Hardware Monitor or HWInfo may report a 'VBAT' voltage of around 3V, however the reported voltage cannot be trusted because the power supply is on and it may be supplying power to the RTC circuit via the 3V standby rail - so all you are measuring is the PSU voltage.

Some computer designs supply the RTC chip with power from the standby rail of the PSU when off and in Standby mode, so the RTC is powered from the mains via the Standby rail (the mains is still connected to the PSU and so it is getting a tiny bit of power). However, other mainboards put a constant drain on the battery even when the PC is connected to the mains supply. This is poor design but it seems to be quite common!

Typical symptoms of a flat CMOS RTC battery are:
  • When you switch on your system in the morning, it may reset/restart a few times before booting because it has lost its BIOS settings and the BIOS has to figure out what memory configuration/speed, etc. it has to set. It may also lose other BIOS settings on occasion.
  • The BIOS configuration menu shows the wrong date and time.
If you don't want to take your PC or laptop apart to remove and measure the voltage of the battery, you can check the battery like this:

Set correct BIOS Time and Date (+/- 1 minute).
  1. Reboot the computer and go into the BIOS settings - check the Date and Time is correct against your watch within 1 minute (it should be because Windows sets it to the correct (internet) time when it boots). If it is not correct, boot to Windows/Linux and then restart - or set the correct date and time manually in the BIOS.
  2. Power off the system.
  3. Disconnect the MAINS cable (if a laptop, also remove the main battery pack).
  4. Press the Power button (it should not switch on because it has no power!)
  5. Wait 10 minutes (1 minute is not long enough because you have to detect that the clock is slow by 10 minutes or so in step 7).
  6. Reconnect the mains cable and switch on the system - quickly press the key to get into the BIOS settings - e.g. ESC or F1 - do NOT let it boot to Windows\Linux (if it does - start at step 1 again!).
  7. Now read the time and date and check it against your watch - if the date has reset to a much earlier date then your battery is very flat. If the date is correct but the time is slow, it means it had stopped ticking due to insufficient voltage whilst it was off for 10 minutes. So unless it shows the current time and date, your battery needs replacing as it is nearly flat.
A battery that measures below say 2.8V should be replaced even if it is not yet causing a problem (they should measure at least 3V when new).

Tip: When replacing the battery, do not touch the battery with your fingers. The grease on your fingers is (very slightly) conductive and it if gets across the battery sides it can cause a (very tiny) drain on the battery which over a few years can hasten its demise. Also, do not use metal tweezers to handle the battery as you will short it out. Use gloves and plastic tweezers. This also applies when replacing watch batteries too (which is why, when you replace a watch battery, it never seems to last as long as it should!). If you have OCD, you can clean the battery with isopropyl alcohol before fitting it to remove all grease.

Resetting the 'CMOS'

P.S. You may see a small link or jumper on the PC mainboard which can 'reset' the CMOS BIOS settings. In some board designs, this jumper can short out the battery voltage to the CMOS RTC chip. Note that you MUST read the manual to see how to use this link. Some mainboards require the user to remove the power cable before shorting out the link and thus remove the Standby voltages. If the system still has the power connected and you short the link to reset the BIOS settings, it can cause permanent damage to the RTC battery circuit (usually blows a diode) and this can result in the RTC battery not lasting very long or the CMOS settings may be lost every time you remove power from the PC from then onwards. So never just short that link before reading the manual. Some mainboards require you to: short out the link - switch on the system - switch off the system - remove the link. So in this case you have to connect the power - again RTFM first!

If you found this useful, please subscribe.

Other Articles


No comments:

Post a Comment