Overview
Physical DPOD Cell Members that are required to process a high transactions-per-second (TPS) load include NVMe disks to maximize server I/O throughput.
DPOD uses NUMA (Non-Uniform Memory Access) technology to bind each of the Store's logical nodes to a specific physical processor, disks and memory in a way that minimizes the latency of persisting data to disk.
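To illustrate the kind of binding involved, the following sketch first prints the server's NUMA topology and then shows the general form of pinning a process to a single NUMA node. The `some-store-process` command is a hypothetical placeholder for illustration only, not an actual DPOD binary (DPOD performs this binding itself):

```
# Show the server's NUMA topology: nodes, their CPUs, and memory sizes.
numactl --hardware

# General form of the binding: run a command on NUMA node 1's CPUs and
# allocate its memory from node 1 only ("some-store-process" is a
# hypothetical placeholder).
numactl --cpunodebind=1 --membind=1 some-store-process
```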
Enabling NUMA in BIOS
Make sure to enable NUMA in the physical server's BIOS. You may need to consult the hardware manufacturer's documentation on how to achieve that.
Note: The number of NUMA nodes should match the number of physical CPU sockets in the server. Some servers allow increasing the number of NUMA nodes (e.g. to double the number of CPU sockets), which is not suitable for DPOD.
Once NUMA has been enabled in BIOS, verify that the number of NUMA nodes matches the number of CPU sockets using the following command:
```
numactl -s | grep cpubind
```
Expected output for a cell member with 4 CPU sockets:
```
cpubind: 0 1 2 3
```
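As a quick cross-check of the note above, `lscpu` reports both the socket count and the NUMA node count; the two figures should be equal (a minimal sketch, assuming the standard util-linux `lscpu`):

```
# Socket(s) and NUMA node(s) should report the same value.
lscpu | grep -E '^(Socket\(s\)|NUMA node\(s\)):'
```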
Connecting Disks
For servers with 2 CPU sockets: all disks should be connected to the bus of CPU 1.
Required information
The following table contains the list of OS mount points that should be configured, along with additional information that must be gathered before federating the DPOD cell member into the cell environment.
Please copy this table, use it during the procedure, and complete the empty cells as you follow the steps:
Store Node | Mount Point Path | Disk Bay | Disk Serial | Disk OS Path | PCI Slot Number | NUMA Node (CPU #) |
---|---|---|---|---|---|---|
2 | /data2 | | | | | |
2 | /data22 | | | | | |
2 * | /data222 | | | | | |
3 | /data3 | | | | | |
3 | /data33 | | | | | |
3 * | /data333 | | | | | |
4 | /data4 | | | | | |
4 | /data44 | | | | | |
4 * | /data444 | | | | | |
* Lines marked with an asterisk (*) are relevant only if the DPOD sizing team recommends 9 disks instead of 6 disks per cell member. You may remove these lines if you have only 6 disks per cell member.
Identifying disk bays and disk serial numbers
To identify which of the server's NVMe disk bays is bound to which CPU, use the hardware manufacturer's documentation.
Write down each disk's bay as well as its serial number by visually inspecting the disk.
Identifying disk OS paths
To list the OS path of each disk, execute the following command and write down the disk OS path (e.g.: /dev/nvme0n1) according to the disk's serial number (e.g.: PHLE8XXXXXXC3P2EGN):
```
nvme list
```
Expected output:
```
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     PHLE8XXXXXXC3P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
/dev/nvme1n1     PHLE8XXXXXXM3P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
/dev/nvme2n1     PHLE8XXXXXX83P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
/dev/nvme3n1     PHLE8XXXXXXN3P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
/dev/nvme4n1     PHLE8XXXXXX63P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
/dev/nvme5n1     PHLE8XXXXXXJ3P2EGN   SSDPE2KE032T7L                           1         3.20 TB / 3.20 TB          512 B + 0 B      QDV1LV46
```
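To pair each OS path with its serial number at a glance, you may print just the first two columns (a minimal sketch, assuming the two-line header shown above):

```
# Print only the Node (disk OS path) and SN (serial) columns, skipping the header.
nvme list | awk 'NR > 2 {print $1, $2}'
```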
Identifying PCI slot numbers
To list the PCI slot for each disk OS path, execute the following command and write down the PCI slot (e.g.: 0c:00.0) according to the last part of the disk OS path (e.g.: nvme0n1):
```
lspci -nn | grep NVM | awk '{print $1}' | xargs -Innn bash -c "printf 'PCI Slot: nnn '; ls -la /sys/dev/block | grep nnn"
```
Expected output:
```
PCI Slot: 0c:00.0 lrwxrwxrwx. 1 root root 0 May 16 10:26 259:2 -> ../../devices/pci0000:07/0000:07:00.0/0000:08:00.0/0000:09:02.0/0000:0c:00.0/nvme/nvme0/nvme0n1
PCI Slot: 0d:00.0 lrwxrwxrwx. 1 root root 0 May 16 10:26 259:5 -> ../../devices/pci0000:07/0000:07:00.0/0000:08:00.0/0000:09:03.0/0000:0d:00.0/nvme/nvme1/nvme1n1
PCI Slot: ad:00.0 lrwxrwxrwx. 1 root root 0 May 16 10:26 259:1 -> ../../devices/pci0000:ac/0000:ac:02.0/0000:ad:00.0/nvme/nvme2/nvme2n1
PCI Slot: ae:00.0 lrwxrwxrwx. 1 root root 0 May 16 10:26 259:0 -> ../../devices/pci0000:ac/0000:ac:03.0/0000:ae:00.0/nvme/nvme3/nvme3n1
PCI Slot: c5:00.0 lrwxrwxrwx. 1 root root 0 May 16 10:26 259:3 -> ../../devices/pci0000:c4/0000:c4:02.0/0000:c5:00.0/nvme/nvme4/nvme4n1
PCI Slot: c6:00.0 lrwxrwxrwx. 1 root root 0 May 16 10:26 259:4 -> ../../devices/pci0000:c4/0000:c4:03.0/0000:c6:00.0/nvme/nvme5/nvme5n1
```
Tip: you may execute the following command to list the details of all PCI slots with NVMe disks installed in the server:
```
lspci -nn | grep -i nvme | awk '{print $1}' | xargs -Innn lspci -v -s nnn
```
Tip: you may execute the following command to list all disk OS paths in the server:
```
ls -la /sys/dev/block
```
Identifying NUMA nodes
To list the NUMA node of each PCI slot, execute the following command and write down the NUMA node (e.g.: 1) according to the PCI slot (e.g.: 0c:00.0):
```
lspci -nn | grep -i nvme | awk '{print $1}' | xargs -Innn bash -c "printf 'PCI Slot: nnn'; lspci -v -s nnn | grep NUMA"
```
Expected output:
```
PCI Slot: 0c:00.0 Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 1
PCI Slot: 0d:00.0 Flags: bus master, fast devsel, latency 0, IRQ 52, NUMA node 1
PCI Slot: ad:00.0 Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 2
PCI Slot: ae:00.0 Flags: bus master, fast devsel, latency 0, IRQ 49, NUMA node 2
PCI Slot: c5:00.0 Flags: bus master, fast devsel, latency 0, IRQ 51, NUMA node 3
PCI Slot: c6:00.0 Flags: bus master, fast devsel, latency 0, IRQ 55, NUMA node 3
```
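An alternative cross-check that avoids parsing `lspci` output is to read the `numa_node` attribute the kernel exposes in sysfs for each NVMe controller (a minimal sketch, assuming the standard sysfs layout):

```
# For each NVMe controller, print its PCI address and NUMA node from sysfs.
for ctrl in /sys/class/nvme/nvme*; do
    pci=$(basename "$(readlink -f "$ctrl/device")")
    echo "$(basename "$ctrl"): PCI $pci, NUMA node $(cat "$ctrl/device/numa_node")"
done
```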
Example of required information
This is an example of how a completed row of the table should look:
Store Node | Mount Point Path | Disk Bay | Disk Serial | Disk OS Path | PCI Slot Number | NUMA Node (CPU #) |
---|---|---|---|---|---|---|
2 | /data2 | Bay 1 | PHLE8XXXXXXC3P2EGN | /dev/nvme0n1 | 0c:00.0 | 1 |
Verifying NVMe disk speeds
Execute the following command and verify that all NVMe disks have the same speed (e.g.: 8GT/s):
```
lspci -nn | grep -i nvme | awk '{print $1}' | xargs -Innn bash -c "printf 'PCI Slot: nnn'; lspci -vvv -s nnn | grep LnkSta:"
```
Expected output:
```
PCI Slot: 0c:00.0 LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
PCI Slot: 0d:00.0 LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
PCI Slot: ad:00.0 LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
PCI Slot: ae:00.0 LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
PCI Slot: c5:00.0 LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
PCI Slot: c6:00.0 LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
```
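On servers with many disks, comparing these lines by eye is error-prone. The following one-liner (a sketch based on the same `lspci` output) counts the distinct negotiated link speeds, so a single output line means all disks run at the same speed:

```
# One output line (e.g. "6 Speed 8GT/s") means all NVMe links negotiated the same speed.
lspci -nn | grep -i nvme | awk '{print $1}' | xargs -Innn lspci -vvv -s nnn \
    | grep 'LnkSta:' | grep -oE 'Speed [0-9.]+GT/s' | sort | uniq -c
```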
Configuring mount points
Configure the mount points according to the table, using all of the gathered information.
It is highly recommended to use LVM (Logical Volume Manager) to allow flexibility for future storage needs.
The following example uses LVM. Repeat it for each mount point (replace vg_data2 with vg_data22/vg_data222/vg_data3 etc., and /dev/nvme0n1 with the matching disk OS path from the table):
```
pvcreate -ff /dev/nvme0n1
vgcreate vg_data2 /dev/nvme0n1
lvcreate -l 100%FREE -n lv_data vg_data2
mkfs.xfs -f /dev/vg_data2/lv_data
```
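To avoid repeating the four commands by hand, you may drive them from a mapping of volume group names to disk OS paths. The mapping below is hypothetical and must be replaced with the pairs recorded in your completed table (a sketch for a 6-disk cell member):

```
#!/bin/bash
# Hypothetical mapping - replace the device paths with the "Disk OS Path"
# column from your completed table.
declare -A disks=(
    [vg_data2]=/dev/nvme0n1  [vg_data22]=/dev/nvme1n1
    [vg_data3]=/dev/nvme2n1  [vg_data33]=/dev/nvme3n1
    [vg_data4]=/dev/nvme5n1  [vg_data44]=/dev/nvme4n1
)
for vg in "${!disks[@]}"; do
    dev=${disks[$vg]}
    pvcreate -ff "$dev"                     # initialize the physical volume
    vgcreate "$vg" "$dev"                   # one volume group per disk
    lvcreate -l 100%FREE -n lv_data "$vg"   # one logical volume using all space
    mkfs.xfs -f "/dev/$vg/lv_data"          # XFS filesystem on the logical volume
done
```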
The following example is the line that should be added to /etc/fstab for each mount point (replace vg_data2 and /data2 with the appropriate values from the table):
```
/dev/vg_data2/lv_data /data2 xfs defaults 0 0
```
Create a directory for each mount point (replace /data2 with the appropriate values from the table):
```
mkdir -p /data2
```
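Once all /etc/fstab entries and directories are in place, you can mount everything and confirm the filesystems are visible (a minimal sketch for the 6-disk layout; extend the list with /data222, /data333 and /data444 if you have 9 disks):

```
# Mount every filesystem listed in /etc/fstab, then verify the data mounts.
mount -a
df -h /data2 /data22 /data3 /data33 /data4 /data44
```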
Inspecting final configuration
This example is for 6 disks per cell member and does not include other mount points that should exist, as described in Hardware and Software Requirements.
Execute the following command and verify mount points:
```
lsblk
```
Expected output:
```
NAME                 MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme0n1              259:2    0  2.9T  0 disk
└─vg_data2-lv_data   253:0    0  2.9T  0 lvm  /data2
nvme1n1              259:5    0  2.9T  0 disk
└─vg_data22-lv_data  253:11   0  2.9T  0 lvm  /data22
nvme2n1              259:1    0  2.9T  0 disk
└─vg_data3-lv_data   253:9    0  2.9T  0 lvm  /data3
nvme3n1              259:0    0  2.9T  0 disk
└─vg_data33-lv_data  253:10   0  2.9T  0 lvm  /data33
nvme4n1              259:3    0  2.9T  0 disk
└─vg_data44-lv_data  253:8    0  2.9T  0 lvm  /data44
nvme5n1              259:4    0  2.9T  0 disk
└─vg_data4-lv_data   253:7    0  2.9T  0 lvm  /data4
```
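As a final sanity check, you can tie each mount point back to its NUMA node and compare the result against the completed table (a sketch, assuming the LVM layout above with one physical volume per volume group and namespace 1 on every controller):

```
# For each data mount point, resolve its backing NVMe device and NUMA node.
for mp in /data2 /data22 /data3 /data33 /data4 /data44; do
    lv=$(findmnt -n -o SOURCE "$mp")         # e.g. /dev/mapper/vg_data2-lv_data
    pv=$(lvs --noheadings -o devices "$lv" | tr -d ' ' | cut -d'(' -f1)  # e.g. /dev/nvme0n1
    ctrl=$(basename "${pv%n*}")              # e.g. nvme0
    echo "$mp -> $pv (NUMA node $(cat /sys/class/nvme/$ctrl/device/numa_node))"
done
```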