Overview

Physical DPOD cell members that are required to process a high transactions per second (TPS) load include 4 CPU sockets and NVMe disks to maximize server I/O throughput.

DPOD uses NUMA (Non-Uniform Memory Access) technology to bind each of the Store's logical nodes to a specific physical processor, disks and memory, in a way that minimizes the latency of persisting data to disk.

Note: If the cell member server does not have 4 CPU sockets or does not have NVMe disks, do not perform the steps in this document.

Enabling NUMA in BIOS

Make sure NUMA is enabled in the physical server's BIOS. You may need to consult the hardware manufacturer's documentation on how to do that.

Note: The number of NUMA nodes configured in BIOS should be 4 (it should match the number of physical CPU sockets in the server).
Some servers allow increasing the number of NUMA nodes (e.g. to double the number of CPU sockets), which is not suitable for DPOD.

...

Installing RAM Modules

Follow the hardware manufacturer's documentation to install the same amount of RAM for each of the CPUs of the physical server.

Verify NUMA

Once NUMA has been enabled in BIOS and the RAM modules have been installed, verify the installation using the following command.
Make sure the RAM size of each node is the same (small differences, as in the sample output, are expected) and that there are 4 nodes available:

Code Block
languagebash
themeRDark
numactl -H | grep -e size -e available

Expected output:

available: 4 nodes (0-3)
node 0 size: 128292 MB
node 1 size: 128994 MB
node 2 size: 129010 MB
node 3 size: 129009 MB
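
The check described above can also be scripted. The following is a minimal sketch and not part of the DPOD tooling: the function name check_numa, its arguments, and the default 2% size tolerance are assumptions made for illustration. It reads the output of numactl -H from standard input and reports whether the expected number of nodes is available and whether the node RAM sizes are close to each other.

```bash
#!/usr/bin/env bash
# check_numa [expected_nodes] [tolerance_pct]
# Reads `numactl -H` output on stdin. The name, arguments and the 2% default
# tolerance are illustrative assumptions, not DPOD requirements.
check_numa() {
    local expected_nodes="${1:-4}" tolerance_pct="${2:-2}"
    awk -v want="$expected_nodes" -v tol="$tolerance_pct" '
        /^available:/ { nodes = $2 + 0 }          # "available: 4 nodes (0-3)"
        /^node [0-9]+ size:/ {                    # "node 0 size: 128292 MB"
            size = $4 + 0
            if (!seen || size < min) min = size
            if (!seen || size > max) max = size
            seen = 1
        }
        END {
            if (nodes != want + 0) {
                printf "FAIL: found %d NUMA nodes, expected %d\n", nodes, want
                exit 1
            }
            if (!seen) {
                print "FAIL: no node sizes found"
                exit 1
            }
            # Compare the spread between smallest and largest node to the tolerance.
            if ((max - min) * 100 > tol * max) {
                printf "FAIL: node RAM sizes differ by more than %s%%\n", tol
                exit 1
            }
            printf "OK: %d NUMA nodes, sizes %d-%d MB\n", nodes, min, max
        }'
}
```

For example, `numactl -H | check_numa 4` run against the sample output shown in this section prints `OK: 4 NUMA nodes, sizes 128292-129010 MB` and returns exit status 0.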

Required Information for NVMe Disks

The following table lists the OS mount points that should be configured, along with the information about the additional installed disks that must be gathered beforehand in order to create the mount points required for federating the DPOD cell member to the cell environment.
Please copy this table, use it during the procedure, and complete the information in the empty cells as you follow the procedure.
The table should have 6 or 9 rows, according to the number of disks installed in your server.

Store Node | Mount Point | Disk Bay | Disk Serial | Disk OS Path | PCI Slot Number | NUMA Node (CPU #)
2          | /data2      |          |             |              |                 |
2          | /data22     |          |             |              |                 |
2 *        | /data222    |          |             |              |                 |
3          | /data3      |          |             |              |                 |
3          | /data33     |          |             |              |                 |
3 *        | /data333    |          |             |              |                 |
4          | /data4      |          |             |              |                 |
4          | /data44     |          |             |              |                 |
4 *        | /data444    |          |             |              |                 |

* Rows marked with an asterisk (*) are relevant only if the DPOD sizing team recommends 9 disks instead of 6 disks per cell member. You may remove these rows if you have only 6 disks per cell member.

Installing NVMe Disks in the Correct Disk Bays

Use the hardware manufacturer's documentation to find out which disk bay is bound to which of the CPUs. CPUs should be numbered from 0 to 3.

You should install the same number of NVMe disks (2 or 3) for CPUs 1, 2 and 3. CPU 0 should not have any NVMe disks bound to it.

Update table: Write down the disk bay and the disk's serial number by visually observing the disk and the bay where it is installed.

Identifying Disk OS Paths

To list the OS path of each disk, execute the following command.

Update table: Write down the disk OS path (e.g.: /dev/nvme0n1) according to the disk's serial number (e.g.: PHLE8XXXXXXC3P2EGN).

Code Block
languagebash
themeRDark
nvme list

Expected output:
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     PHLE8XXXXXXC3P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
/dev/nvme1n1     PHLE8XXXXXXM3P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
/dev/nvme2n1     PHLE8XXXXXX83P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
/dev/nvme3n1     PHLE8XXXXXXN3P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
/dev/nvme4n1     PHLE8XXXXXX63P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
/dev/nvme5n1     PHLE8XXXXXXJ3P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
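
Matching serial numbers to OS paths by eye is error-prone when there are 6 or 9 disks. The following is a small sketch, assuming the column layout shown in the expected output above (OS path in the first column, serial number in the second); the function names nvme_serials and nvme_path_for_serial are hypothetical helpers, not part of nvme-cli.

```bash
#!/usr/bin/env bash
# Both helpers read `nvme list` output on stdin and assume the layout shown
# in this document: column 1 = OS path, column 2 = serial number.

# Print "<serial> <os-path>" for every NVMe disk; header lines are skipped
# because they do not start with /dev/nvme.
nvme_serials() {
    awk '$1 ~ /^\/dev\/nvme/ { print $2, $1 }'
}

# Print the OS path of the disk whose serial number is given as $1.
nvme_path_for_serial() {
    awk -v sn="$1" '$1 ~ /^\/dev\/nvme/ && $2 == sn { print $1 }'
}
```

For example, `nvme list | nvme_path_for_serial PHLE8XXXXXXC3P2EGN` prints `/dev/nvme0n1` for the sample output above.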

Identifying PCI Slot Numbers

To list the PCI slot for each disk OS path, execute the following command.

Update table: Write down the PCI slot (e.g.: 0c:00.0) according to the last part of the disk OS path (e.g.: nvme0n1).

Code Block
languagebash
themeRDark
lspci -nn | grep NVM | awk '{print $1}' | xargs -Innn bash -c "printf 'PCI Slot: nnn     '; ls -la /sys/dev/block | grep nnn"

Expected output:
PCI Slot: 0c:00.0     lrwxrwxrwx. 1 root root 0 May 16 10:26 259:2 -> ../../devices/pci0000:07/0000:07:00.0/0000:08:00.0/0000:09:02.0/0000:0c:00.0/nvme/nvme0/nvme0n1
PCI Slot: 0d:00.0     lrwxrwxrwx. 1 root root 0 May 16 10:26 259:5 -> ../../devices/pci0000:07/0000:07:00.0/0000:08:00.0/0000:09:03.0/0000:0d:00.0/nvme/nvme1/nvme1n1
PCI Slot: ad:00.0     lrwxrwxrwx. 1 root root 0 May 16 10:26 259:1 -> ../../devices/pci0000:ac/0000:ac:02.0/0000:ad:00.0/nvme/nvme2/nvme2n1
PCI Slot: ae:00.0     lrwxrwxrwx. 1 root root 0 May 16 10:26 259:0 -> ../../devices/pci0000:ac/0000:ac:03.0/0000:ae:00.0/nvme/nvme3/nvme3n1
PCI Slot: c5:00.0     lrwxrwxrwx. 1 root root 0 May 16 10:26 259:3 -> ../../devices/pci0000:c4/0000:c4:02.0/0000:c5:00.0/nvme/nvme4/nvme4n1
PCI Slot: c6:00.0     lrwxrwxrwx. 1 root root 0 May 16 10:26 259:4 -> ../../devices/pci0000:c4/0000:c4:03.0/0000:c6:00.0/nvme/nvme5/nvme5n1

Tip: you may execute the following command to list the details of all PCI slots with NVMe disks installed in the server:
lspci -nn | grep -i nvme | awk '{print $1}' | xargs -Innn lspci -v -s nnn

Tip: you may execute the following command to list all disk OS paths in the server:
ls -la /sys/dev/block
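
The PCI slot can also be read directly from the sysfs path of each disk, because the path component just before /nvme/ is the PCI address of the controller. The following is a sketch under that assumption; it relies on the path layout shown in the expected output above, which may differ on systems where native NVMe multipath is enabled. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Extract the PCI slot (e.g. 0c:00.0) from a resolved sysfs device path such as
# .../0000:09:02.0/0000:0c:00.0/nvme/nvme0/nvme0n1
pci_slot_from_sysfs_path() {
    local path="$1"
    local pci="${path%/nvme/*}"   # drop the trailing "/nvme/nvmeX/nvmeXnY"
    pci="${pci##*/}"              # keep the last component, e.g. 0000:0c:00.0
    printf '%s\n' "${pci#*:}"     # drop the PCI domain prefix -> 0c:00.0
}

# Print "<disk> <pci-slot>" for every NVMe disk found under /sys/block.
list_nvme_pci_slots() {
    local dev
    for dev in /sys/block/nvme*n*; do
        [ -e "$dev" ] || continue   # no NVMe disks: the glob stays unexpanded
        printf '%s %s\n' "${dev##*/}" \
            "$(pci_slot_from_sysfs_path "$(readlink -f "$dev")")"
    done
}
```

Running `list_nvme_pci_slots` on the cell member should print one line per disk, which can be compared with the output of the lspci command above before the table is updated.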

Identifying NUMA Nodes

To list the NUMA node of each PCI slot, execute the following command.

Update table: Write down the NUMA node (e.g.: 1) according to the PCI slot (e.g.: 0c:00.0).

Code Block
languagebash
themeRDark
lspci -nn | grep -i nvme | awk '{print $1}' | xargs -Innn bash -c "printf 'PCI Slot: nnn'; lspci -v -s nnn | grep NUMA"

Expected output:
PCI Slot: 0c:00.0	Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 1
PCI Slot: 0d:00.0	Flags: bus master, fast devsel, latency 0, IRQ 52, NUMA node 1
PCI Slot: ad:00.0	Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 2
PCI Slot: ae:00.0	Flags: bus master, fast devsel, latency 0, IRQ 49, NUMA node 2
PCI Slot: c5:00.0	Flags: bus master, fast devsel, latency 0, IRQ 51, NUMA node 3
PCI Slot: c6:00.0	Flags: bus master, fast devsel, latency 0, IRQ 55, NUMA node 3
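
As a cross-check, the kernel exposes the NUMA node of each PCI device in sysfs. The sketch below reads the numa_node attribute of every NVMe controller and fails if any controller is outside NUMA nodes 1 to 3, which matches the requirement that CPU 0 has no NVMe disks bound to it. The function names and the optional directory argument (used only to make the functions testable) are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Print "<controller> <numa-node>" for every NVMe controller.
# $1 (optional) overrides the sysfs class directory; default /sys/class/nvme.
nvme_numa_nodes() {
    local class_dir="${1:-/sys/class/nvme}" ctrl attr
    for ctrl in "$class_dir"/nvme*; do
        attr="$ctrl/device/numa_node"
        [ -r "$attr" ] || continue
        printf '%s %s\n' "${ctrl##*/}" "$(cat "$attr")"
    done
}

# Fail (exit status 1) if any controller is not on NUMA node 1, 2 or 3.
# A value of -1 means the kernel does not know the node and also fails.
check_nvme_numa_nodes() {
    nvme_numa_nodes "$@" | awk '
        { count++ }
        $2 < 1 || $2 > 3 { printf "FAIL: %s is on NUMA node %s\n", $1, $2; bad = 1 }
        END {
            if (!count) { print "FAIL: no NVMe controllers found"; exit 1 }
            if (bad) exit 1
            printf "OK: %d NVMe controllers on NUMA nodes 1-3\n", count
        }'
}
```

Note that this reports controllers (nvme0, nvme1, ...) and not PCI slots, so it complements the lspci output above instead of replacing it.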

Example of Required Information

This is an example of how a row of the table should look:

...

Verifying Required Information

Your required information table should be complete by now.

Make sure you have gathered information about all the installed NVMe disks, and that the NUMA nodes are between 1 and 3 (and do not include NUMA node 0).

Verifying NVMe Disks Speed

Execute the following command and verify all NVMe disks have the same speed (e.g.: 8GT/s):

Code Block
languagebash
themeRDark
lspci -nn | grep -i nvme | awk '{print $1}' | xargs -Innn bash -c "printf 'PCI Slot: nnn'; lspci -vvv -s nnn | grep LnkSta:"

Expected output:
PCI Slot: 0c:00.0		LnkSta:	Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
PCI Slot: 0d:00.0		LnkSta:	Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
PCI Slot: ad:00.0		LnkSta:	Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
PCI Slot: ae:00.0		LnkSta:	Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
PCI Slot: c5:00.0		LnkSta:	Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
PCI Slot: c6:00.0		LnkSta:	Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

Configuring Mount Points

Configure the mount points according to the table with all gathered information.
It is highly recommended to use LVM (Logical Volume Manager) to allow flexibility for future storage needs.

The following example uses LVM. You may use it for each mount point (replace vg_data2 with vg_data22/vg_data222/vg_data3 etc., and /dev/nvme0n1 with the matching disk OS path from the table):

Code Block
languagebash
themeRDark
pvcreate -ff /dev/nvme0n1
vgcreate vg_data2 /dev/nvme0n1
lvcreate -l 100%FREE -n lv_data vg_data2
mkfs.xfs -f /dev/vg_data2/lv_data

The following example is the line that should be added to /etc/fstab for each mount point (replace vg_data2 and /data2 with the appropriate values from the table):

Code Block
languagetext
themeRDark
/dev/vg_data2/lv_data    /data2                   xfs     defaults        0 0

Create a directory for each mount point (replace /data2 with the appropriate values from the table):

Code Block
languagebash
themeRDark
mkdir -p /data2

Inspecting Final Configuration

Note: This example is for 6 disks per cell member and does not include other mount points that should exist, as described in Hardware and Software Requirements.

Execute the following command and verify the mount points:

Code Block
languagebash
themeRDark
lsblk

Expected output:
NAME                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1               259:2    0   2.9T  0 disk 
└─vg_data2-lv_data    253:0    0   2.9T  0 lvm  /data2
nvme1n1               259:5    0   2.9T  0 disk 
└─vg_data22-lv_data   253:11   0   2.9T  0 lvm  /data22
nvme2n1               259:1    0   2.9T  0 disk 
└─vg_data3-lv_data    253:9    0   2.9T  0 lvm  /data3
nvme3n1               259:0    0   2.9T  0 disk 
└─vg_data33-lv_data   253:10   0   2.9T  0 lvm  /data33
nvme4n1               259:3    0   2.9T  0 disk 
└─vg_data44-lv_data   253:8    0   2.9T  0 lvm  /data44
nvme5n1               259:4    0   2.9T  0 disk 
└─vg_data4-lv_data    253:7    0   2.9T  0 lvm  /data4
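
The lsblk inspection of the final configuration can also be checked mechanically. The following is a sketch, not a DPOD-provided tool: it assumes the raw output format of `lsblk -rno NAME,TYPE,MOUNTPOINT` (one device per line, children listed after their parent disk), and the function name check_nvme_mounts is hypothetical. It prints each NVMe disk with the mount point of its LVM volume and fails if a disk has no mounted volume.

```bash
#!/usr/bin/env bash
# Reads `lsblk -rno NAME,TYPE,MOUNTPOINT` output on stdin.
# Prints "<disk> <mount-point>" per NVMe disk; exit status 1 if any NVMe disk
# has no mounted LVM volume or if no NVMe disk is listed at all.
check_nvme_mounts() {
    awk '
        $2 == "disk" {
            disk = ""
            if ($1 ~ /^nvme/) { disk = $1; order[++n] = disk }   # remember listing order
            next
        }
        # An LVM volume listed under the current NVMe disk, with a mount point.
        $2 == "lvm" && disk != "" && $3 != "" { mount[disk] = $3 }
        END {
            for (i = 1; i <= n; i++) {
                d = order[i]
                if (d in mount) printf "%s %s\n", d, mount[d]
                else { printf "%s NOT MOUNTED\n", d; bad = 1 }
            }
            exit (bad || n == 0)
        }'
}
```

A possible invocation is `lsblk -rno NAME,TYPE,MOUNTPOINT | check_nvme_mounts`; the resulting disk-to-mount-point list can then be compared with the required information table.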