Evaluating nvme-cli new feature logs and the need to monitor them
NVMe is a new, broader, standardized protocol for accessing flash-based storage over PCIe interfaces. It introduces a new command set, several new concepts and operations, and logging capabilities that differ from, or are more standardized than, those customary on SAS or SATA devices. It comes with a CLI tool, nvme-cli, which exposes a human-friendly interface to the device.
We look at the new logging capabilities to assess whether the monitoring infrastructure currently in place at CERN, based on smartmontools, is sufficient or whether we need the additional capabilities exposed by nvme-cli.
Summary table
This summary covers only error, health and failure logs, leaving aside namespace and administration logs. smartmontools can fetch every NVMe log, but it can interpret and present only a few of them in human-readable form. The rest are returned as hex dumps.
Log | Mandatory in specification | Relevant for our purposes | Interpretable by smartmontools |
---|---|---|---|
Error | ✔ | ✔ | ✔ |
SMART | ✔ | ✔ | ✔ |
Self-test | ✔ | ✔ | ✔ |
Telemetry | ✔ | ✖ | ✖ |
Firmware | ✔ | ✖ | ✖ |
Endurance | ✖ | ✖ | ✖ |
Sanitize | ✖ | ✖ | ✖ |
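Read as data, the table above lends itself to a simple coverage check: which of the relevant logs smartmontools can interpret, and which it cannot. The following is a minimal Python sketch of that check. The `LogPage` class and the helper names are illustrative only and belong to neither tool; the rows are transcribed from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogPage:
    name: str
    mandatory: bool      # mandatory according to the table
    relevant: bool       # relevant for our monitoring purposes
    interpretable: bool  # smartmontools prints it in human-readable form


# Rows transcribed from the summary table.
SUMMARY = [
    LogPage("Error", True, True, True),
    LogPage("SMART", True, True, True),
    LogPage("Self-test", True, True, True),
    LogPage("Telemetry", True, False, False),
    LogPage("Firmware", True, False, False),
    LogPage("Endurance", False, False, False),
    LogPage("Sanitize", False, False, False),
]


def monitored(rows: List[LogPage]) -> List[str]:
    """Logs we care about that smartmontools can already interpret."""
    return [r.name for r in rows if r.relevant and r.interpretable]


def coverage_gaps(rows: List[LogPage]) -> List[str]:
    """Logs we care about that smartmontools only returns as hex dumps."""
    return [r.name for r in rows if r.relevant and not r.interpretable]


print("monitored:", monitored(SUMMARY))
print("gaps:", coverage_gaps(SUMMARY))
```

An empty gap list means that, for the rows as classified in the table, every log marked relevant is also one smartmontools can interpret.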
Error logs assessment in more detail
Description and relevance assessment of every retrievable log and of its interpretation through smartmontools. By interpretation we mean the ability to retrieve the log and present it in a human-readable format rather than as a hex dump.
Telemetry Log
- What does it log: Telemetry data intended for the vendor or OEM
- Is this feature optional?: No
- Is it relevant for health and error monitoring purposes?: Possibly, but only if we are interested in recovering a failing disk. Standard human-readable logs are encouraged but not enforced: none of the SSDs we tested had human-readable telemetry logs
- Human interpretable on smartmontools?: No
- How to retrieve it:
smartctl -l nvmelog,0x07,0x1000 /dev/nvme0n1p1 (Host-Initiated)
smartctl -l nvmelog,0x08,0x1000 /dev/nvme0n1p1 (Controller-Initiated)
nvme telemetry-log /dev/nvme0n1p1 -o outputfile (Host-Initiated)
nvme get-log /dev/nvme0n1p1 -i 0x08 -l 0x1000 (Controller-Initiated, NOT interpreted)
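The smartctl invocations in this document all follow the same `-l nvmelog,PAGE,SIZE` pattern, so a monitoring script can build them from a log page identifier and a size. The sketch below is a hedged illustration in Python. The function names are hypothetical, the device path is the one used in this document, and `fetch_raw_log` needs root privileges and a real NVMe device, so it is only defined here and never called.

```python
import subprocess
from typing import List

# Telemetry log page identifiers, as used in the commands above.
TELEMETRY_HOST_INITIATED = 0x07
TELEMETRY_CONTROLLER_INITIATED = 0x08


def smartctl_nvmelog_cmd(device: str, page: int, size: int) -> List[str]:
    """Build the argument list for `smartctl -l nvmelog,PAGE,SIZE DEVICE`."""
    if not 0 <= page <= 0xFF:
        raise ValueError("log page identifier must fit in one byte")
    if size <= 0:
        raise ValueError("size must be a positive number of bytes")
    return ["smartctl", "-l", f"nvmelog,0x{page:02x},0x{size:x}", device]


def fetch_raw_log(device: str, page: int, size: int) -> str:
    """Run smartctl and return its output (a hex dump for uninterpreted pages)."""
    result = subprocess.run(
        smartctl_nvmelog_cmd(device, page, size),
        capture_output=True,
        text=True,
        check=False,  # smartctl uses its exit status as a bit mask, so do not raise
    )
    return result.stdout


print(" ".join(smartctl_nvmelog_cmd("/dev/nvme0n1p1", TELEMETRY_HOST_INITIATED, 0x1000)))
```

Passing the command as a list rather than as a shell string avoids quoting problems with device paths.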
Firmware Log
- What does it log: Firmware log, with fields such as firmware slot or firmware revision
- Is this feature optional?: No
- Is it relevant for health and error monitoring purposes?: Maybe, but probably not
- Human interpretable on smartmontools?: No
- How to retrieve it:
smartctl -l nvmelog,0x3,0x1000 /dev/nvme0n1p1
nvme fw-log /dev/nvme0n1p1
SMART Log
- What does it log: SMART attributes as standardized by the NVMe specification
- Is this feature optional?: No
- Is it relevant for health and error monitoring purposes?: Yes
- Human interpretable on smartmontools?: Yes
- How to retrieve it:
smartctl -a /dev/nvme0n1p1
nvme smart-log /dev/nvme0n1p1
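For automated monitoring, the SMART log is easier to consume as JSON than as text. The sketch below assumes the output of `smartctl -a -j` (JSON output needs smartmontools 7.0 or later) and assumes that the health data sits under the `nvme_smart_health_information_log` key with the field names shown; both should be checked against the installed version. The Critical Warning bit meanings follow the NVMe 1.4 specification. The sample report is invented for illustration and is not output from a real device.

```python
import json
from typing import Dict, List

# Critical Warning bits of the SMART / Health Information log (NVMe 1.4).
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
    5: "persistent memory region read-only or unreliable",
}


def health_findings(report: Dict) -> List[str]:
    """Return human-readable findings from a parsed `smartctl -j` report."""
    log = report.get("nvme_smart_health_information_log", {})
    warning = log.get("critical_warning", 0)
    findings = [
        text for bit, text in CRITICAL_WARNING_BITS.items() if warning & (1 << bit)
    ]
    media_errors = log.get("media_errors", 0)
    if media_errors:
        findings.append(f"{media_errors} media and data integrity errors")
    return findings


# Hypothetical report, trimmed to the fields used above.
SAMPLE_REPORT = json.loads("""
{
  "nvme_smart_health_information_log": {
    "critical_warning": 5,
    "available_spare": 3,
    "available_spare_threshold": 10,
    "percentage_used": 97,
    "media_errors": 2,
    "num_err_log_entries": 14
  }
}
""")

print(health_findings(SAMPLE_REPORT))
```

A report with no findings yields an empty list, which maps naturally onto an "OK" state in an alerting system.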
Error Log
- What does it log: Error log page of an SSD
- Is this feature optional?: No
- Is it relevant for health and error monitoring purposes?: Yes
- Human interpretable on smartmontools?: Yes, although smartmontools parses it slightly differently from nvme-cli. smartmontools is sufficient for our purposes
- How to retrieve it:
smartctl -l nvmelog,0x01,0xff /dev/nvme0n1p1
nvme error-log /dev/nvme0n1p1
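One way to turn the error log into an alert is to remember the highest error count already seen and report only newer entries. The sketch below assumes JSON such as that produced by `nvme error-log -o json`, with an `errors` array whose entries carry an `error_count` field; these key names are assumptions and should be verified against the installed nvme-cli version. Per the NVMe specification, an error count of 0 marks an unused entry. The sample data and helper names are invented for illustration.

```python
from typing import Dict, List


def valid_error_entries(report: Dict) -> List[Dict]:
    """Drop unused entries (error_count == 0) and sort newest first."""
    entries = [e for e in report.get("errors", []) if e.get("error_count", 0) > 0]
    return sorted(entries, key=lambda e: e["error_count"], reverse=True)


def new_errors_since(report: Dict, last_seen_count: int) -> List[Dict]:
    """Entries logged after the error count recorded at the previous poll."""
    return [e for e in valid_error_entries(report) if e["error_count"] > last_seen_count]


# Hypothetical error log with two used entries and one unused slot.
SAMPLE_ERROR_LOG = {
    "errors": [
        {"error_count": 13, "sqid": 1, "cmdid": 4, "status_field": 2, "nsid": 1},
        {"error_count": 14, "sqid": 2, "cmdid": 9, "status_field": 2, "nsid": 1},
        {"error_count": 0, "sqid": 0, "cmdid": 0, "status_field": 0, "nsid": 0},
    ]
}

for entry in new_errors_since(SAMPLE_ERROR_LOG, last_seen_count=13):
    print("new error entry:", entry["error_count"])
```

The value to keep between polls is the largest `error_count` returned, since the controller increments it for every new error.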
Commands Supported and Effects Log
- What does it log: the commands that the controller supports and the effects of those commands on the state of the NVM subsystem
- Is this feature optional?: No
- Is it relevant for health and error monitoring purposes?: No
- Human interpretable on smartmontools?: No
- How to retrieve it:
smartctl -l nvmelog,0x05,0x1000 /dev/nvme0n1p1
nvme effects-log /dev/nvme0n1p1
Endurance Log
- What does it log: Indicates whether an Endurance Group Event has occurred for a particular Endurance Group. An Endurance Group is a grouping of one or more NVM Sets, and an NVM Set is a collection of NVM that is separate (logically and potentially physically) from the NVM in other NVM Sets
- Is this feature optional?: Yes
- Is it relevant for health and error monitoring purposes?: No, we can monitor device endurance through the SMART log
- Human interpretable on smartmontools?: No
- How to retrieve it:
smartctl -l nvmelog,0x09,0xff /dev/nvme0n1p1
nvme endurance-log /dev/nvme0n1p1
NVMe Asymmetric Namespace Access Log
- What does it log: NVMe Asymmetric Namespace Access log page
- Is this feature optional?: Yes
- Is it relevant for health and error monitoring purposes?: No, not interested in namespaces
- Human interpretable on smartmontools?: No
- How to retrieve it:
nvme ana-log /dev/nvme0n1p1
smartctl -l nvmelog,0x0c,0xff /dev/nvme0n1p1
Sanitize Log
- What does it log: Sanitize operations status
- Is this feature optional?: Yes
- Is it relevant for health and error monitoring purposes?: No
- Human interpretable on smartmontools?: No
- How to retrieve it:
nvme sanitize-log /dev/nvme0n1p1
smartctl -l nvmelog,0x81,0xff /dev/nvme0n1p1
Self-test Log
- What does it log: the self-test operation results
- Is this feature optional?: No
- Is it relevant for health and error monitoring purposes?: Yes
- Human interpretable on smartmontools?: Yes
- How to retrieve it:
nvme self-test-log /dev/nvme0n1p1
smartctl -l selftest /dev/nvme0n1p1
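Each entry in the self-test log carries a one-byte status whose upper four bits identify the kind of test and whose lower four bits give the result. The sketch below decodes that byte following the Device Self-test log definition in the NVMe 1.4 specification. The function names are illustrative, and how each tool prints this byte is not assumed here.

```python
from typing import Tuple

# Upper nibble: which self-test was run.
SELF_TEST_CODES = {
    0x1: "short",
    0x2: "extended",
    0xE: "vendor specific",
}

# Lower nibble: how the operation ended.
SELF_TEST_RESULTS = {
    0x0: "completed without error",
    0x1: "aborted by a Device Self-test command",
    0x2: "aborted by a controller level reset",
    0x3: "aborted due to removal of a namespace",
    0x4: "aborted due to a Format NVM command",
    0x5: "fatal or unknown test error, operation did not complete",
    0x6: "completed with a failed segment, segment not known",
    0x7: "completed with one or more failed segments",
    0x8: "aborted for an unknown reason",
    0x9: "aborted due to a sanitize operation",
    0xF: "entry not used",
}

# Results that indicate a problem found by the test itself.
FAILURE_RESULTS = {0x5, 0x6, 0x7}


def decode_self_test_status(status: int) -> Tuple[str, str]:
    """Split the status byte into (test kind, result description)."""
    if not 0 <= status <= 0xFF:
        raise ValueError("status must be a single byte")
    code, result = status >> 4, status & 0x0F
    return (
        SELF_TEST_CODES.get(code, "reserved"),
        SELF_TEST_RESULTS.get(result, "reserved"),
    )


def is_failure(status: int) -> bool:
    """True when the test ran and reported a failure."""
    return (status & 0x0F) in FAILURE_RESULTS


print(decode_self_test_status(0x27), is_failure(0x27))
```

Aborted tests are deliberately not counted as failures here, since an abort says nothing about the health of the device; a stricter policy could treat repeated aborts as a warning.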
Changed Namespace List Log
- What does it log: the NVMe Changed Namespace List log page from an NVMe device
- Is this feature optional?: No
- Is it relevant for health and error monitoring purposes?: No
- Human interpretable on smartmontools?: No
- How to retrieve it:
nvme changed-ns-list-log /dev/nvme0n1p1
smartctl -l nvmelog,0x04,0xff /dev/nvme0n1p1
References and resources / Interesting reads
- NVMe 1.4 specification: https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf
- How SSDs Fail – NVMe™ SSD Management, Error Reporting and Logging Capabilities: https://nvmexpress.org/how-ssds-fail-nvme-ssd-management-error-reporting-and-logging-capabilities/
- Open Source NVMe™ Management Utility – NVMe Command Line Interface (NVMe-CLI): https://nvmexpress.org/open-source-nvme-management-utility-nvme-command-line-interface-nvme-cli/
- Status of NVMe support on smartmontools: https://www.smartmontools.org/wiki/NVMe_Support
- A quick tour of NVMe: https://metebalci.com/blog/a-quick-tour-of-nvm-express-nvme/
- NVM Express Specifications - Mastering Today’s Architecture and Preparing for Tomorrow’s: https://www.snia.org/sites/default/files/SDC/2019/presentations/NVMe/Metz_J_Adams_Nick_NVM_Express_Specifications_Mastering_Today%E2%80%99s_Architecture_and_Preparing_for_Tomorrows.pdf
- NVME tips and tricks (2018): https://www.nvmedeveloperdays.com/English/Collaterals/Proceedings/2018/20181204_PRECON2_Hands.pdf
- Failure Trends in a Large Disk Drive Population (2007): http://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf
- Disk Failures in the EOS setup at CERN: https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_04046.pdf
- Flash Reliability in Production - The Expected and the Unexpected (2016): https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf
- The SSD Anthology: Understanding SSDs and New Drives from OCZ (2009): https://www.anandtech.com/show/2738
- NVM Express: SCSI Translation Reference: https://www.nvmexpress.org/wp-content/uploads/NVM-Express-SCSI-Translation-Reference-1_1-Gold.pdf
- Performance Analysis of NVMe SSDs and their Implication on Real World Databases: https://www.cs.utah.edu/~manua/pubs/systor15.pdf
- NVMe 1.3 Specification Published With New Features For Client And Enterprise SSDs: https://www.anandtech.com/show/11436/nvme-13-specification-published-new-features
- Implementing and Configuring Modern SANs with NVMe/FC: https://www.netapp.com/us/media/tr-4684.pdf