Evaluating nvme-cli new feature logs and the need to monitor them

NVMe is a new, broader, standardized protocol for accessing flash-based storage over PCIe interfaces. It introduces a new command set, several new concepts and operations, and logging capabilities that differ from, or are more standardized than, the ones customary on SAS or SATA devices. It comes with a CLI tool, nvme-cli, which exposes a human-friendly interface to the device.

We examine the new logging capabilities to assess whether the current monitoring infrastructure at CERN, which uses smartmontools, is sufficient or whether we need the new capabilities exposed by nvme-cli.

Summary table

This summary covers only error, health and failure logs, leaving aside namespace and administration logs. All NVMe logs can be fetched with smartmontools, but only a few of them are interpreted and presented in human-readable form. The rest are returned as hex dumps.

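Log | Specification | Mandatory | Relevant for our purposes | Interpretable from smartmontools
Error | | Yes | Yes | Yes
SMART | | Yes | Yes | Yes
Self-test | | Yes | Yes | Yes
Telemetry | | Yes | Could be | No
Firmware | | Yes | Probably not | No
Endurance | | No | No | No
Sanitize | | No | No | No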

Error logs assessment in more detail

Description and relevance assessment of all retrievable logs and of their interpretation through smartmontools. By interpretation we mean the ability to retrieve a log and present its information in a human-readable format rather than as a hex dump.
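
The raw retrieval commands listed for each log below all follow the same `smartctl -l nvmelog,PAGE,SIZE DEVICE` pattern. As a minimal sketch, the page identifiers used in this document can be collected in one table and the command built from it. The names `LOG_PAGES` and `raw_dump_command` are hypothetical helpers introduced for illustration, and the default size of 0x1000 is simply the value used in several of the examples here.

```python
import shlex

# Log page identifiers as they appear in the smartctl examples of this document.
LOG_PAGES = {
    "error": 0x01,
    "firmware": 0x03,
    "changed-ns-list": 0x04,
    "effects": 0x05,
    "telemetry-host": 0x07,
    "telemetry-controller": 0x08,
    "endurance": 0x09,
    "ana": 0x0C,
    "sanitize": 0x81,
}


def raw_dump_command(log_name, device, size=0x1000):
    """Build the smartctl argument list that hex-dumps one NVMe log page."""
    page = LOG_PAGES[log_name]
    return ["smartctl", "-l", f"nvmelog,0x{page:02x},0x{size:x}", device]


if __name__ == "__main__":
    # Print one command line per known log page (nothing is executed).
    for name in LOG_PAGES:
        print(shlex.join(raw_dump_command(name, "/dev/nvme0n1p1")))
```

The function only builds the argument list, so it can be passed to `subprocess.run` by a collector or printed for manual use.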

Telemetry Log

  • What does it log: Telemetry data for the vendors or OEM
  • Is this feature optional?: No
  • Is it relevant for health and error monitoring purposes?: Could be, but only if we are interested in recovering a failing disk. Standard human-readable logs are encouraged but not enforced: none of the SSDs we tested had human-readable telemetry logs
  • Human interpretable on smartmontools?: No
  • How to retrieve it:
    • smartctl -l nvmelog,0x07,0x1000 /dev/nvme0n1p1 (Host-Initiated)
    • smartctl -l nvmelog,0x08,0x1000 /dev/nvme0n1p1 (Controller-Initiated)
    • nvme telemetry-log /dev/nvme0n1p1 -o outputfile (Host-Initiated)
    • nvme get-log /dev/nvme0n1p1 -i 0x08 -l 0x1000 (Controller-Initiated, NOT interpreted)

Firmware Log

  • What does it log: Firmware log, with fields such as firmware slot or firmware revision
  • Is this feature optional?: No
  • Is it relevant for health and error monitoring purposes?: Maybe, but probably not
  • Human interpretable on smartmontools?: No
  • How to retrieve it:
    • smartctl -l nvmelog,0x3,0x1000 /dev/nvme0n1p1
    • nvme fw-log /dev/nvme0n1p1
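
Because smartmontools returns this page only as a hex dump, a small decoder is enough to make it readable. The sketch below assumes the raw page bytes are already available (for example saved from a raw dump) and follows the Firmware Slot Information layout of the NVMe base specification as we understand it: byte 0 is the Active Firmware Info field and seven 8-byte ASCII firmware revisions start at byte 8. `decode_firmware_log` is a hypothetical helper name.

```python
def decode_firmware_log(page: bytes) -> dict:
    """Decode the first 64 bytes of a Firmware Slot Information log page."""
    if len(page) < 64:
        raise ValueError("firmware slot log needs at least 64 bytes")
    afi = page[0]  # Active Firmware Info
    revisions = {}
    for slot in range(1, 8):
        start = 8 + (slot - 1) * 8
        # Revisions are ASCII, padded with spaces or NULs; empty slots are skipped.
        text = page[start:start + 8].rstrip(b"\x00 ").decode("ascii", "replace")
        if text:
            revisions[slot] = text
    return {
        "active_slot": afi & 0x07,             # bits 2:0
        "next_reset_slot": (afi >> 4) & 0x07,  # bits 6:4, 0 means not indicated
        "revisions": revisions,
    }
```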

SMART Log

  • What does it log: SMART attributes as standardized by the NVMe specification
  • Is this feature optional?: No
  • Is it relevant for health and error monitoring purposes?: Yes
  • Human interpretable on smartmontools?: Yes
  • How to retrieve it:
    • smartctl -a /dev/nvme0n1p1
    • nvme smart-log /dev/nvme0n1p1
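
To illustrate what a monitoring check on this log could look like, the sketch below decodes a few fields of the raw SMART / Health Information page and flags suspicious values. It assumes the page bytes were obtained separately (for example from a raw binary dump) and uses the field offsets of the NVMe base specification as we understand them. The function names and the 90% wear threshold are hypothetical choices for illustration, not values taken from any tool or from our infrastructure.

```python
def decode_smart_log(page: bytes) -> dict:
    """Decode selected fields of a SMART / Health Information log page."""
    if len(page) < 192:
        raise ValueError("SMART log needs at least 192 bytes")

    def u128(offset):
        # Counters are 16-byte little-endian integers.
        return int.from_bytes(page[offset:offset + 16], "little")

    return {
        "critical_warning": page[0],
        "temperature_kelvin": int.from_bytes(page[1:3], "little"),
        "available_spare_pct": page[3],
        "spare_threshold_pct": page[4],
        "percentage_used": page[5],
        "power_on_hours": u128(128),
        "unsafe_shutdowns": u128(144),
        "media_errors": u128(160),
        "error_log_entries": u128(176),
    }


def health_problems(smart: dict, max_used_pct: int = 90) -> list:
    """Return human-readable findings; an empty list means nothing was flagged."""
    problems = []
    if smart["critical_warning"]:
        problems.append(f"critical warning bits set: 0x{smart['critical_warning']:02x}")
    if smart["available_spare_pct"] < smart["spare_threshold_pct"]:
        problems.append("available spare below threshold")
    if smart["percentage_used"] >= max_used_pct:  # assumed threshold
        problems.append(f"percentage used is {smart['percentage_used']}%")
    if smart["media_errors"]:
        problems.append(f"{smart['media_errors']} media errors")
    return problems
```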

Error Log

  • What does it log: Error log page of an SSD
  • Is this feature optional?: No
  • Is it relevant for health and error monitoring purposes?: Yes
  • Human interpretable on smartmontools?: Yes, although it is parsed slightly differently from nvme-cli. smartmontools is sufficient for our needs
  • How to retrieve it:
    • smartctl -l nvmelog,0x01,0xff /dev/nvme0n1p1
    • nvme error-log /dev/nvme0n1p1
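
For reference, the raw page fetched above is a sequence of 64-byte Error Information entries. The sketch below decodes the leading fields of each entry, assuming the entry layout of the NVMe base specification as we understand it and treating an error count of zero as an unused entry. `decode_error_log` and the dictionary keys are hypothetical names.

```python
import struct

ENTRY_SIZE = 64
# Error count, SQ id, command id, status field, parameter error location, LBA, namespace.
ENTRY_HEAD = struct.Struct("<QHHHHQI")


def decode_error_log(page: bytes) -> list:
    """Decode every complete, used 64-byte entry found in an error log dump."""
    entries = []
    for offset in range(0, len(page) - ENTRY_SIZE + 1, ENTRY_SIZE):
        count, sqid, cid, status, _param_loc, lba, nsid = ENTRY_HEAD.unpack_from(page, offset)
        if count == 0:  # unused entry
            continue
        entries.append({
            "error_count": count,
            "sqid": sqid,
            "cid": cid,
            "status_code": (status >> 1) & 0xFF,      # bit 0 is the phase tag
            "status_code_type": (status >> 9) & 0x07,
            "lba": lba,
            "nsid": nsid,
        })
    return entries
```

With the 0xff size used in the smartctl command above, only three complete entries fit in the dump; any trailing partial entry is ignored by the loop.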

Commands Supported and Effects Log

  • What does it log: the commands that the controller supports and the effects of those commands on the state of the NVM subsystem
  • Is this feature optional?: No
  • Is it relevant for health and error monitoring purposes?: No
  • Human interpretable on smartmontools?: No
  • How to retrieve it:
    • smartctl -l nvmelog,0x05,0x1000 /dev/nvme0n1p1
    • nvme effects-log /dev/nvme0n1p1

Endurance Log

  • What does it log: Whether an Endurance Group Event has occurred for a particular Endurance Group. An Endurance Group is a grouping of one or more NVM Sets, and an NVM Set is a collection of NVM that is separate (logically and potentially physically) from the NVM in other NVM Sets
  • Is this feature optional?: Yes
  • Is it relevant for health and error monitoring purposes?: No, we can monitor device endurance through the SMART log
  • Human interpretable on smartmontools?: No
  • How to retrieve it:
    • smartctl -l nvmelog,0x09,0xff /dev/nvme0n1p1
    • nvme endurance-log /dev/nvme0n1p1

NVMe Asymmetric Namespace Access Log

  • What does it log: NVMe Asymmetric Namespace Access log page
  • Is this feature optional?: Yes
  • Is it relevant for health and error monitoring purposes?: No, we are not interested in namespaces
  • Human interpretable on smartmontools?: No
  • How to retrieve it:
    • nvme ana-log /dev/nvme0n1p1
    • smartctl -l nvmelog,0x0c,0xff /dev/nvme0n1p1

Sanitize Log

  • What does it log: Sanitize operations status
  • Is this feature optional?: Yes
  • Is it relevant for health and error monitoring purposes?: No
  • Human interpretable on smartmontools?: No
  • How to retrieve it:
    • nvme sanitize-log /dev/nvme0n1p1
    • smartctl -l nvmelog,0x81,0xff /dev/nvme0n1p1

Self-test Log

  • What does it log: the self-test operation results
  • Is this feature optional?: No
  • Is it relevant for health and error monitoring purposes?: Yes
  • Human interpretable on smartmontools?: Yes
  • How to retrieve it:
    • nvme self-test-log /dev/nvme0n1p1
    • smartctl -l selftest /dev/nvme0n1p1
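
As an illustration of what the interpreted output is built from, the sketch below decodes a raw Device Self-test log page: a 4-byte header followed by 20 result entries of 28 bytes each, according to the NVMe base specification as we understand it. It assumes the raw bytes were fetched separately; `decode_self_test_log` and the dictionary keys are hypothetical names.

```python
import struct

HEADER_SIZE = 4
ENTRY_SIZE = 28
MAX_ENTRIES = 20
UNUSED = 0xF  # result value marking an unused entry
TEST_TYPES = {0x1: "short", 0x2: "extended", 0xE: "vendor specific"}


def decode_self_test_log(page: bytes) -> dict:
    """Decode the header and the used result entries of a self-test log page."""
    if len(page) < HEADER_SIZE + ENTRY_SIZE * MAX_ENTRIES:
        raise ValueError("self-test log needs at least 564 bytes")
    results = []
    for index in range(MAX_ENTRIES):
        offset = HEADER_SIZE + index * ENTRY_SIZE
        # Status, segment number, valid-info flags, one reserved byte, power-on hours.
        status, segment, _valid, poh = struct.unpack_from("<BBBxQ", page, offset)
        result = status & 0x0F
        if result == UNUSED:
            continue
        code = status >> 4
        results.append({
            "type": TEST_TYPES.get(code, f"code 0x{code:x}"),
            "result": result,
            "passed": result == 0,
            "segment": segment,
            "power_on_hours": poh,
        })
    return {
        "operation_in_progress": page[0] & 0x0F,  # 0 means no test is running
        "completion_pct": page[1] & 0x7F,
        "results": results,
    }
```

A monitoring check could then alert on any entry whose `passed` field is false.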

Changed Namespace List Log

  • What does it log: the NVMe Changed Namespace List log page from an NVMe device
  • Is this feature optional?: No
  • Is it relevant for health and error monitoring purposes?: No
  • Human interpretable on smartmontools?: No
  • How to retrieve it:
    • nvme changed-ns-list-log /dev/nvme0n1p1
    • smartctl -l nvmelog,0x04,0xff /dev/nvme0n1p1

References and resources / Interesting reads