# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, OpenStack Foundation
# This file is distributed under the same license as the Swift package.
# FIRST AUTHOR , YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: Swift 2.35.0.dev140\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-01-15 18:44+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME \n"
"Language-Team: LANGUAGE \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

#: ../../source/ops_runbook/diagnose.rst:3
msgid "Identifying issues and resolutions"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:6
msgid "Is the system up?"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:8
msgid ""
"If you have a report that Swift is down, perform the following basic checks:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:10
msgid "Run swift functional tests."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:12
msgid ""
"From a server in your data center, use ``curl`` to check ``/healthcheck`` "
"(see below)."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:15
msgid "If you have a monitoring system, check your monitoring system."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:17
msgid "Check your hardware load balancer infrastructure."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:19
msgid "Run swift-recon on a proxy node."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:22
msgid "Functional tests usage"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:24
msgid ""
"We would recommend that you set up the functional tests to run against your "
"production system. Run regularly, this can be a useful tool to validate that "
"the system is configured correctly. In addition, it can provide early "
"warning about failures in your system (if the functional tests stop working, "
"user applications will also probably stop working)."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:30
msgid ""
"A script for running the functional tests is located in ``swift/.functests``."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:34
msgid "External monitoring"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:36
msgid ""
"We use pingdom.com to monitor the external Swift API. We suggest the "
"following:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:39
msgid "Do a GET on ``/healthcheck``"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:41
msgid ""
"Create a container, make it public (``x-container-read: .r*,.rlistings``), "
"create a small file in the container; do a GET on the object"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:46
msgid "Diagnose: General approach"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:48
msgid "Look at service status in your monitoring system."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:50
msgid ""
"In addition to system monitoring tools and issue logging by users, swift "
"errors will often result in log entries (see :ref:`swift_logs`)."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:53
msgid "Look at any logs your deployment tool produces."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:55
msgid ""
"Log files should be reviewed for error signatures (see below) that may point "
"to a known issue, or root cause issues reported by the diagnostics tools, "
"prior to escalation."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:60
msgid "Dependencies"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:62
msgid ""
"The Swift software is dependent on overall system health. Operating system "
"level issues with network connectivity, domain name resolution, user "
"management, hardware and system configuration and capacity in terms of "
"memory and free disk space, may result in secondary Swift issues. System "
"level issues should be resolved prior to diagnosis of swift issues."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:71
msgid "Diagnose: Swift-dispersion-report"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:73
msgid ""
"The swift-dispersion-report is a useful tool to gauge the general health of "
"the system. Configure the ``swift-dispersion`` report to cover at a minimum "
"every disk drive in your system (usually 1% coverage). See :ref:"
"`dispersion_report` for details of how to configure and use the dispersion "
"reporting tool."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:79
msgid ""
"The ``swift-dispersion-report`` tool can take a long time to run, especially "
"if any servers are down. We suggest you run it regularly (e.g., in a cron "
"job) and save the results. This makes it easy to refer to the last report "
"without having to wait for a long-running command to complete."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:86
msgid "Diagnose: Is system responding to ``/healthcheck``?"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:88
msgid ""
"When you want to establish if a swift endpoint is running, run ``curl -k`` "
"against ``https://$ENDPOINT/healthcheck``."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:94
msgid "Diagnose: Interpreting messages in ``/var/log/swift/`` files"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:98
msgid ""
"In the Hewlett Packard Enterprise Helion Public Cloud we send logs to "
"``proxy.log`` (proxy-server logs), ``server.log`` (object-server, account-"
"server, container-server logs), ``background.log`` (all other servers "
"[object-replicator, etc])."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:103
msgid "The following table lists known issues:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:109
msgid "**Logfile**"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:110
msgid "**Signature**"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:111
msgid "**Issue**"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:112
msgid "**Steps to take**"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:113
#: ../../source/ops_runbook/diagnose.rst:118
#: ../../source/ops_runbook/diagnose.rst:123
#: ../../source/ops_runbook/diagnose.rst:127
msgid "/var/log/syslog"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:114
msgid "kernel: [] sd .... [csbu:sd...] Sense Key: Medium Error"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:115
msgid "Suggests disk surface issues"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:116
msgid ""
"Run ``swift-drive-audit`` on the target node to check for disk errors, "
"repair disk errors"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:119
msgid "kernel: [] sd .... [csbu:sd...] Sense Key: Hardware Error"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:120
msgid "Suggests storage hardware issues"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:121
msgid ""
"Run diagnostics on the target node to check for disk failures, replace "
"failed disks"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:124
msgid "kernel: [] .... I/O error, dev sd.... ,sector ...."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:126
msgid "Run diagnostics on the target node to check for disk errors"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:128
msgid "pound: NULL get_thr_arg"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:129
msgid "Multiple threads woke up"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:130
msgid "Noise, safe to ignore"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:131
#: ../../source/ops_runbook/diagnose.rst:137
msgid "/var/log/swift/proxy.log"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:132
#: ../../source/ops_runbook/diagnose.rst:143
msgid ".... ERROR .... ConnectionTimeout ...."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:133
msgid "A storage node is not responding in a timely fashion"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:134
msgid ""
"Check if node is down, not running Swift, unconfigured, storage off-line or "
"for network issues between the proxy and non-responding node"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:138
msgid "proxy-server .... HTTP/1.0 500 ...."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:139
msgid "A proxy server has reported an internal server error"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:140
msgid ""
"Examine the logs for any errors at the time the error was reported to "
"attempt to understand the cause of the error."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:142
#: ../../source/ops_runbook/diagnose.rst:148
msgid "/var/log/swift/server.log"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:144
#: ../../source/ops_runbook/diagnose.rst:172
msgid "A storage server is not responding in a timely fashion"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:145
#: ../../source/ops_runbook/diagnose.rst:157
msgid ""
"Check if node is down, not running Swift, unconfigured, storage off-line or "
"for network issues between the server and non-responding node"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:149
msgid ".... ERROR .... Remote I/O error: '/srv/node/disk...."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:150
msgid "A storage device is not responding as expected"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:151
msgid ""
"Run ``swift-drive-audit`` and check the filesystem named in the error for "
"corruption (unmount & xfs_repair). Check if the filesystem is mounted and "
"working."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:154
#: ../../source/ops_runbook/diagnose.rst:160
#: ../../source/ops_runbook/diagnose.rst:166
#: ../../source/ops_runbook/diagnose.rst:170
#: ../../source/ops_runbook/diagnose.rst:175
#: ../../source/ops_runbook/diagnose.rst:180
#: ../../source/ops_runbook/diagnose.rst:184
msgid "/var/log/swift/background.log"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:155
msgid "object-server ERROR container update failed .... Connection refused"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:156
msgid "A container server node could not be contacted"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:161
msgid "object-updater ERROR with remote .... ConnectionTimeout"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:162
msgid "The remote container server is busy"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:163
msgid ""
"If the container is very large, some errors updating it can be expected. "
"However, this error can also occur if there is a networking issue."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:167
msgid "account-reaper STDOUT: .... error: ECONNREFUSED"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:168
msgid "Network connectivity issue or the target server is down."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:169
msgid "Resolve network issue or reboot the target server"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:171
msgid ".... ERROR .... ConnectionTimeout"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:173
#: ../../source/ops_runbook/diagnose.rst:178
msgid ""
"The target server may be busy. However, this error can also occur if there "
"is a networking issue."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:176
msgid ".... ERROR syncing .... Timeout"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:177
msgid "A timeout occurred syncing data to another node."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:181
msgid ".... ERROR Remote drive not mounted ...."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:182
#: ../../source/ops_runbook/diagnose.rst:186
msgid "A storage server disk is unavailable"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:183
#: ../../source/ops_runbook/diagnose.rst:187
msgid "Repair and remount the file system (on the remote node)"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:185
msgid "object-replicator .... responded as unmounted"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:188
msgid "/var/log/swift/\\*.log"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:189
msgid "STDOUT: EXCEPTION IN"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:190
msgid "An unexpected error occurred"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:191
msgid ""
"Read the Traceback details; if it matches known issues (e.g. active network/"
"disk issues), check for re-occurrences after the primary issues have been "
"resolved"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:194
msgid "/var/log/rsyncd.log"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:195
msgid "rsync: mkdir \"/disk....failed: No such file or directory...."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:196
msgid "A local storage server disk is unavailable"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:197
msgid "Run diagnostics on the node to check for a failed or unmounted disk"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:199
msgid "/var/log/swift*"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:200
msgid "Exception: Could not bind to 0.0.0.0:6xxx"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:201
msgid ""
"Possible Swift process restart issue. This indicates an old swift process is "
"still running."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:203
msgid ""
"Restart Swift services. If some swift services are reported down, check if "
"they left residual processes behind."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:207
msgid "Diagnose: Parted reports the backup GPT table is corrupt"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:209
msgid ""
"If a GPT table is broken, a message like the following should be observed "
"when the following command is run:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:223
msgid "To fix, go to :ref:`fix_broken_gpt_table`"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:227
msgid "Diagnose: Drives diagnostic reports a FS label is not acceptable"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:229
msgid ""
"If diagnostics reports something like \"FS label: obj001dsk011 is not "
"acceptable\", it indicates that a partition has a valid disk label, but an "
"invalid filesystem label. In such cases proceed as follows:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:233
msgid "Verify that the disk labels are correct:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:241
msgid ""
"If partition labels are inconsistent, then resolve the disk label issues "
"before proceeding:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:250
msgid "If the Filesystem label is missing then create it with care:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:271
msgid "Diagnose: Failed LUNs"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:275
msgid ""
"The HPE Helion Public Cloud uses direct attach SmartArray controllers/"
"drives. The information here is specific to that environment. The hpacucli "
"utility mentioned here may be called hpssacli in your environment."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:280
msgid ""
"The ``swift_diagnostics`` mount checks may return a warning that a LUN has "
"failed, typically accompanied by DriveAudit check failures and device errors."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:284
msgid ""
"Such cases are typically caused by a drive failure, and if drive check also "
"reports a failed status for the underlying drive, then follow the procedure "
"to replace the disk."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:288
msgid "Otherwise the LUN can be re-enabled as follows:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:290
msgid ""
"Generate a hpssacli diagnostic report. This report allows the DC team to "
"troubleshoot potential cabling or hardware issues so it is imperative that "
"you run it immediately when troubleshooting a failed LUN. You will come back "
"later and grep this file for more details, but just generate it for now."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:300
msgid ""
"Export the following variables using the below instructions before "
"proceeding further."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:303
msgid ""
"Print a list of logical drives and their numbers and take note of the failed "
"drive's number and array value (example output: \"array A logicaldrive 1..."
"\" would be exported as LDRIVE=1):"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:311
msgid ""
"Export the number of the logical drive that was retrieved from the previous "
"command into the LDRIVE variable:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:318
msgid ""
"Print the array value and Port:Box:Bay for all drives and take note of the "
"Port:Box:Bay for the failed drive (example output: \" array A physicaldrive "
"2C:1:1...\" would be exported as PBOX=2C:1:1). Match the array value of this "
"output with the array value obtained from the previous command to be sure "
"you are working on the same drive. Also, the array value usually matches the "
"device name (For example, /dev/sdc in the case of \"array c\"), but we will "
"run a different command to be sure we are operating on the correct device."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:333
msgid ""
"Sometimes a LUN may appear to be failed as it is not and cannot be mounted "
"but the hpssacli/parted commands may show no problems with the LUNS/drives. "
"In this case, the filesystem may be corrupt and it may be necessary to run "
"``sudo xfs_check /dev/sd[a-l][1-2]`` to see if there is an xfs issue. The "
"results of running this command may require that ``xfs_repair`` is run."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:340
msgid "Export the Port:Box:Bay for the failed drive into the PBOX variable:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:346
msgid ""
"Print the physical device information and take note of the Disk Name "
"(example output: \"Disk Name: /dev/sdk\" would be exported as DEV=/dev/sdk):"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:354
msgid ""
"Export the device name variable from the preceding command (example: /dev/"
"sdk):"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:361
msgid ""
"Export the filesystem variable. Disks that are split between the operating "
"system and data storage, typically sda and sdb, should only have repairs "
"done on their data filesystem, usually /dev/sda2 and /dev/sdb2. Other data "
"only disks have just one partition on the device, so the filesystem will be "
"1. In any case you should verify the data filesystem by running ``df -h | "
"grep /srv/node`` and using the listed data filesystem for the device in "
"question as the export. For example: /dev/sdk1."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:374
msgid "Verify the LUN is failed, and the device is not:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:383
#: ../../source/ops_runbook/diagnose.rst:674
msgid "Stop the swift and rsync service:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:390
msgid "Unmount the problem drive, fix the LUN and the filesystem:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:396
msgid ""
"If umount fails, you should run lsof to search for the mountpoint and kill "
"any lingering processes before repeating the unmount:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:404
msgid ""
"If the ``xfs_repair`` complains about possible journal data, use the "
"``xfs_repair -L`` option to zeroise the journal log."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:407
msgid ""
"Once complete, test-mount the filesystem, and tidy up its lost and found area."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:416
msgid "Mount the filesystem and restart swift and rsync."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:418
msgid ""
"Run the following to determine if a DC ticket is needed to check the cables "
"on the node:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:426
msgid ""
"If the output reports any non-0x00 values, it suggests that the cables "
"should be checked. For example, log a DC ticket to check the SAS cables "
"between the drive and the expander."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:433
msgid "Diagnose: Slow disk devices"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:437
msgid "collectl is an open-source performance gathering/analysis tool."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:439
msgid ""
"If the diagnostics report a message such as ``sda: drive is slow``, you "
"should log onto the node and run the following command (remove ``-c 1`` "
"option to continuously monitor the data):"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:470
msgid ""
"Look at the ``Wait`` and ``SvcTime`` values. It is not normal for these "
"values to exceed 50msec. This is known to impact customer performance "
"(upload/download). For a controller problem, many/all drives will show long "
"wait and service times. A reboot may correct the problem; otherwise hardware "
"replacement is needed."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:476
msgid "Another way to look at the data is as follows:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:507
msgid ""
"This shows the historical distribution of the wait and service times over a "
"day. This is how you read it:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:510
msgid ""
"sda did 54580 operations with a short wait time, 371 operations with a "
"longer wait time and 65 with an even longer wait time."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:513
msgid ""
"sdl did 50106 operations with a short wait time, but as you can see many "
"took longer."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:516
msgid ""
"There is a clear pattern that sdf to sdl have a problem. Actually, sda to "
"sde would more normally have lots of zeros in their data. But maybe this is "
"a busy system. In this example it is worth changing the controller as the "
"individual drives may be ok."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:521
msgid ""
"After the controller is changed, use collectl -s D as described above to see "
"if the problem has cleared. disk-anal.pl will continue to show historical "
"data. You can look at recent data as follows. It only looks at data from "
"13:15 to 14:15. As you can see, this is a relatively clean system (few if "
"any long wait or service times):"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:555
msgid ""
"For long wait times where the service time appears normal, check the "
"logical drive cache status. While the cache may be enabled, it can be "
"disabled on a per-drive basis."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:560
msgid "Diagnose: Slow network link - Measuring network performance"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:562
msgid ""
"Network faults can cause performance between Swift nodes to degrade. Testing "
"with ``netperf`` is recommended. Other methods (such as copying large files) "
"may also work, but can produce inconclusive results."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:566
msgid ""
"Install ``netperf`` on all systems if not already installed. Check that the "
"UFW rules for its control port are in place. However, there are no pre-"
"opened ports for netperf's data connection. Pick a port number. In this "
"example, 12866 is used because it is one higher than netperf's default "
"control port number, 12865. If you get very strange results including zero "
"values, you may not have gotten the data port opened in UFW at the target or "
"may have gotten the netperf command-line wrong."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:575
msgid ""
"Pick a ``source`` and ``target`` node. The source is often a proxy node and "
"the target is often an object node. Using the same source proxy you can test "
"communication to different object nodes in different AZs to identify "
"possible bottlenecks."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:581
msgid "Running tests"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:583
msgid "Prepare the ``target`` node as follows:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:589
msgid "Or, do:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:595
msgid ""
"On the ``source`` node, run the following command to check throughput. Note "
"the double-dash before the -P option. The command takes 10 seconds to "
"complete. The ``target`` node is 192.168.245.5."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:610
msgid "On the ``source`` node, run the following command to check latency:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:625
msgid "Expected results"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:627
msgid ""
"Faults will show up as differences between different pairs of nodes. "
"However, for reference, here are some expected numbers:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:630
msgid ""
"For throughput, proxy to proxy, expect ~9300 Mbit/sec (proxies have a 10Ge "
"link)."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:633
msgid ""
"For throughput, proxy to object, expect ~920 Mbit/sec (at time of writing "
"this, object nodes have a 1Ge link)."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:636
msgid "For throughput, object to object, expect ~920 Mbit/sec."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:638
msgid "For latency (all types), expect ~11000 transactions/sec."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:641
msgid "Diagnose: Remapping sectors experiencing UREs"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:643
msgid "Find the bad sector, device, and filesystem in ``kern.log``."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:645
msgid "Set the environment variables SEC, DEV & FS, for example:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:653
msgid "Verify that the sector is bad:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:659
msgid "If the sector is bad this command will output an input/output error:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:667
msgid ""
"Prevent chef from attempting to re-mount the filesystem while the repair is "
"in progress:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:681
msgid "Unmount the problem drive:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:687
msgid "Overwrite/remap the bad sector:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:693
msgid ""
"This command should report an input/output error the first time it is run. "
"Run the command a second time, if it successfully remapped the bad sector it "
"should not report an input/output error."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:697
msgid "Verify the sector is now readable:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:703
msgid ""
"If the sector is now readable this command should not report an input/output "
"error."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:706
msgid ""
"If more than one problem sector is listed, set the SEC environment variable "
"to the next sector in the list:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:713
msgid "Repeat from step 8."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:715
#: ../../source/ops_runbook/diagnose.rst:986
msgid "Repair the filesystem:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:721
msgid ""
"If ``xfs_repair`` reports that the filesystem has valuable filesystem "
"changes:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:739
msgid ""
"You should attempt to mount the filesystem, and clear the lost+found area:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:748
msgid ""
"If the filesystem fails to mount then you will need to use the ``xfs_repair -"
"L`` option to force log zeroing. Repeat step 11."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:752
msgid ""
"If ``xfs_repair`` reports that an additional input/output error has been "
"encountered, get the sector details as follows:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:759
msgid ""
"If a new input/output error is reported then set the SEC environment "
"variable to the problem sector number:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:766
msgid "Repeat from step 8"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:769
msgid "Remount the filesystem and restart swift and rsync."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:771
msgid ""
"If all UREs in the kern.log have been fixed and you are still unable to "
"repair the disk with xfs_repair, it is possible that the UREs have corrupted "
"the filesystem or possibly destroyed the drive altogether. In this case, the "
"first step is to re-format the filesystem and if this fails, get the disk "
"replaced."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:779
msgid "Diagnose: High system latency"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:783
msgid ""
"The latency measurements described here are specific to the HPE Helion "
"Public Cloud."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:786
msgid ""
"A bad NIC on a proxy server. However, as explained above, this usually "
"causes the peak to rise, but average should remain near normal parameters. A "
"quick fix is to shut down the proxy."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:790
msgid ""
"A stuck memcache server. Accepts connections, but then will not respond. "
"Expect to see timeout messages in ``/var/log/proxy.log`` (port 11211). Swift "
"Diags will also report this as a failed node/port. A quick fix is to "
"shut down the proxy server."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:795
msgid ""
"A bad/broken object server can also cause problems if the accounts used by "
"the monitor program happen to live on the bad object server."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:798
msgid ""
"A general network problem within the data center. Compare the results with "
"the Pingdom monitors to see if they also have a problem."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:802
msgid "Diagnose: Interface reports errors"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:804
msgid ""
"Should a network interface on a Swift node begin reporting network errors, "
"it may well indicate a cable, switch, or network issue."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:807
msgid "Get an overview of the interface with:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:814
msgid ""
"The ``Link Detected:`` indicator will read ``yes`` if the NIC is cabled."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:817
msgid "Establish the adapter type with:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:823
msgid "Gather the interface statistics with:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:829
msgid "If the NIC supports self test, this can be performed with:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:835
msgid "Self tests should read ``PASS`` if the NIC is operating correctly."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:837
msgid ""
"NIC module drivers can be re-initialised by carefully removing and re-"
"installing the modules (this avoids rebooting the server). For example, "
"mellanox drivers use a two part driver mlx4_en and mlx4_core. To reload "
"these you must carefully remove the mlx4_en (ethernet) then the mlx4_core "
"modules, and reinstall them in the reverse order."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:844
msgid ""
"As the interface will be disabled while the modules are unloaded, you must "
"be very careful not to lock yourself out so it may be better to script this."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:849
msgid "Diagnose: Hung swift object replicator"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:851
msgid ""
"A replicator reports in its log that remaining time exceeds 100 hours. This "
"may indicate that the swift ``object-replicator`` is stuck and not making "
"progress. Another useful way to check this is with the 'swift-recon -r' "
"command on a swift proxy server:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:869
msgid ""
"The ``Oldest completion`` line in this example indicates that the object-"
"replicator on swift object server 192.168.245.3 has not completed the "
"replication cycle in 12 days. This replicator is stuck. The object "
"replicator cycle is generally less than 1 hour. However, a replicator cycle "
"of 15-20 hours can occur if nodes are added to the system and a new ring has "
"been deployed."
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:876
msgid ""
"You can further check if the object replicator is stuck by logging on the "
"object server and checking the object replicator progress with the following "
"command:"
msgstr ""

#: ../../source/ops_runbook/diagnose.rst:900
msgid ""
"The above status is output every 5 minutes to ``/var/log/swift/background."
"log``."
msgstr "" #: ../../source/ops_runbook/diagnose.rst:904 msgid "" "The 'remaining' time is increasing as time goes on, normally the time " "remaining should be decreasing. Also note the partition number. For example, " "15344 remains the same for several status lines. Eventually the object " "replicator detects the hang and attempts to make progress by killing the " "problem thread. The replicator then progresses to the next partition but " "quite often it again gets stuck on the same partition." msgstr "" #: ../../source/ops_runbook/diagnose.rst:911 msgid "" "One of the reasons for the object replicator hanging like this is filesystem " "corruption on the drive. The following is a typical log entry of a corrupted " "filesystem detected by the object replicator:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:923 msgid "" "An ``ls`` of the problem file or directory usually shows something like the " "following:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:930 msgid "" "If no entry with ``Remote I/O error`` occurs in the ``background.log`` it is " "not possible to determine why the object-replicator is hung. It may be that " "the ``Remote I/O error`` entry is older than 7 days and so has been rotated " "out of the logs. In this scenario it may be best to simply restart the " "object-replicator." 
msgstr "" #: ../../source/ops_runbook/diagnose.rst:936 msgid "Stop the object-replicator:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:942 msgid "" "Make sure the object replicator has stopped; if it has hung, the stop " "command will not stop the hung process:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:949 msgid "" "If the previous ps shows the object-replicator is still running, kill the " "process:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:956 msgid "Start the object-replicator:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:962 msgid "" "If the above grep did find a ``Remote I/O error`` then it may be possible " "to repair the problem filesystem." msgstr "" #: ../../source/ops_runbook/diagnose.rst:965 msgid "Stop swift and rsync:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:972 msgid "Make sure all swift processes have stopped:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:978 msgid "Kill any swift processes still running." msgstr "" #: ../../source/ops_runbook/diagnose.rst:980 msgid "Unmount the problem filesystem:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:992 msgid "" "If the ``xfs_repair`` fails then it may be necessary to re-format the " "filesystem. See :ref:`fix_broken_xfs_filesystem`. If the ``xfs_repair`` is " "successful, re-enable chef using the following command and replication " "should commence again." msgstr "" #: ../../source/ops_runbook/diagnose.rst:999 msgid "Diagnose: High CPU load" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1001 msgid "" "The CPU load average on an object server, as shown with the 'uptime' " "command, is typically under 10 when the server is lightly to moderately loaded:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1010 msgid "" "During times of increased activity, due to user transactions or object " "replication, the CPU load average can increase to around 30."
msgstr "" #: ../../source/ops_runbook/diagnose.rst:1013 msgid "" "However, sometimes the CPU load average can increase significantly. The " "following is an example of an object server that has extremely high CPU load:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1023 msgid "Further issues and resolutions" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1027 msgid "" "The urgency levels in each **Action** column indicates whether or not it is " "required to take immediate action, or if the problem can be worked on during " "business hours." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1035 msgid "**Scenario**" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1036 msgid "**Description**" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1037 msgid "**Action**" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1038 msgid "``/healthcheck`` latency is high." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1039 msgid "" "The ``/healthcheck`` test does not tax the proxy very much so any drop in " "value is probably related to network issues, rather than the proxies being " "very busy. A very slow proxy might impact the average number, but it would " "need to be very slow to shift the number that much." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1042 msgid "" "Check networks. Do a ``curl https://:/healthcheck`` where " "``ip-address`` is individual proxy IP address. Repeat this for every proxy " "server to see if you can pin point the problem." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1046 msgid "" "Urgency: If there are other indications that your system is slow, you should " "treat this as an urgent problem." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1048 msgid "Swift process is not running." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1049 msgid "" "You can use ``swift-init`` status to check if swift processes are running on " "any given server." 
msgstr "" #: ../../source/ops_runbook/diagnose.rst:1051 msgid "Run this command:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1057 msgid "" "Examine messages in the swift log files to see if there are any error " "messages related to any of the swift processes since the time you ran the " "``swift-init`` command." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1061 msgid "Take any corrective actions that seem necessary." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1063 #: ../../source/ops_runbook/diagnose.rst:1094 #: ../../source/ops_runbook/diagnose.rst:1107 #: ../../source/ops_runbook/diagnose.rst:1129 msgid "" "Urgency: If this only affects one server, and you have more than one, " "identifying and fixing the problem can wait until business hours. If this " "same problem affects many servers, then you need to take corrective action " "immediately." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1067 msgid "ntpd is not running." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1068 msgid "NTP is not running." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1069 msgid "Configure and start NTP." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1071 msgid "Urgency: For proxy servers, this is vital." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1073 msgid "Host clock is not synced to an NTP server." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1074 msgid "" "Node time settings do not match NTP server time. This may take some time " "to sync after a reboot." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1076 msgid "" "Assuming NTP is configured and running, you have to wait until the times " "sync." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1077 msgid "A swift process has hundreds to thousands of open file descriptors." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1078 msgid "" "May happen to any of the swift processes. Known to have happened with a " "``rsyslogd`` restart and where ``/tmp`` was hanging."
msgstr "" #: ../../source/ops_runbook/diagnose.rst:1081 msgid "Restart the swift processes on the affected node:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1088 msgid "If known performance problem: Immediate" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1089 msgid "Urgency:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1090 msgid "If system seems fine: Medium" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1091 msgid "A swift process is not owned by the swift user." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1092 msgid "" "If the UID of the swift user has changed, then the processes might not be " "owned by that UID." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1098 msgid "Object, account or container files not owned by swift." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1099 msgid "" "This typically happens if the UID of the swift user was changed during a " "reinstall or re-image of a server. The data files in the object, account " "and container directories are owned by the original swift UID. As a result, " "the current swift user does not own these files." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1103 msgid "" "Correct the UID of the swift user to reflect that of the original UID. An " "alternate action is to change the ownership of every file on all file " "systems. This alternate action is often impractical and will take " "considerable time." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1111 msgid "A disk drive has a high IO wait or service time." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1112 msgid "" "If high IO wait times are seen for a single disk, then the disk drive is the " "problem. If most/all devices are slow, the controller is probably the source " "of the problem. The controller cache may also be misconfigured, which " "will cause similar long wait or service times."
msgstr "" #: ../../source/ops_runbook/diagnose.rst:1116 msgid "" "As a first step, if your controllers have a cache, check that it is enabled " "and their battery/capacitor is working." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1119 msgid "" "Second, reboot the server. If problem persists, file a DC ticket to have the " "drive or controller replaced. See :ref:`diagnose_slow_disk_drives` on how to " "check the drive wait or service times." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1123 msgid "Urgency: Medium" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1124 msgid "The network interface is not up." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1125 msgid "" "Use the ``ifconfig`` and ``ethtool`` commands to determine the network state." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1126 msgid "" "You can try restarting the interface. However, generally the interface (or " "cable) is probably broken, especially if the interface is flapping." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1133 msgid "Network interface card (NIC) is not operating at the expected speed." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1134 msgid "" "The NIC is running at a slower speed than its nominal rated speed. For " "example, it is running at 100 Mb/s and the NIC is a 1Ge NIC." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1136 msgid "Try resetting the interface with:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1142 msgid "... and then run:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1148 msgid "" "See if size goes to the expected speed. Failing that, check hardware (NIC " "cable/switch port)." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1151 msgid "" "If persistent, consider shutting down the server (especially if a proxy) " "until the problem is identified and resolved. If you leave this server " "running it can have a large impact on overall performance." 
msgstr "" #: ../../source/ops_runbook/diagnose.rst:1155 msgid "Urgency: High" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1156 msgid "The interface RX/TX error count is non-zero." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1157 msgid "" "A value of 0 is typical, but counts of 1 or 2 do not indicate a problem." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1158 msgid "" "For low numbers (for example, 1 or 2), you can simply ignore them. Numbers in the " "range 3-30 probably indicate that the error count has crept up slowly over a " "long time. Consider rebooting the server to remove the report from the noise." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1162 msgid "" "Typically, when a cable or interface is bad, the error count goes to 400+. " "That is, it stands out. There may be other symptoms such as the " "interface going up and down or not running at correct speed. A server with a " "high error count should be watched." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1166 msgid "" "If the error count continues to climb, consider taking the server down until " "it can be properly investigated. In any case, a reboot should be done to " "clear the error count." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1170 msgid "Urgency: High, if the error count is increasing." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1172 msgid "" "In a swift log you see a message that a process has not replicated in over " "24 hours." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1173 msgid "" "The replicator has not successfully completed a run in the last 24 hours. " "This indicates that the replicator has probably hung." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1175 msgid "Use ``swift-init`` to stop and then restart the replicator process." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1177 msgid "" "Urgency: Low. However, if you recently added or replaced disk drives then you " "should treat this urgently."
msgstr "" #: ../../source/ops_runbook/diagnose.rst:1179 msgid "Container Updater has not run in 4 hour(s)." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1180 msgid "" "The service may appear to be running however, it may be hung. Examine their " "swift logs to see if there are any error messages relating to the container " "updater. This may potentially explain why the container is not running." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1183 msgid "" "Urgency: Medium This may have been triggered by a recent restart of the " "rsyslog daemon. Restart the service with:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1191 msgid "" "Object replicator: Reports the remaining time and that time is more than 100 " "hours." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1192 msgid "" "Each replication cycle the object replicator writes a log message to its log " "reporting statistics about the current cycle. This includes an estimate for " "the remaining time needed to replicate all objects. If this time is longer " "than 100 hours, there is a problem with the replication process." msgstr "" #: ../../source/ops_runbook/diagnose.rst:1196 msgid "Urgency: Medium Restart the service with:" msgstr "" #: ../../source/ops_runbook/diagnose.rst:1203 msgid "Check that the remaining replication time is going down." msgstr "" #: ../../source/ops_runbook/index.rst:3 msgid "Swift Ops Runbook" msgstr "" #: ../../source/ops_runbook/index.rst:5 msgid "" "This document contains operational procedures that Hewlett Packard " "Enterprise (HPE) uses to operate and monitor the Swift system within the HPE " "Helion Public Cloud. This document is an excerpt of a larger product-" "specific handbook. As such, the material may appear incomplete. The " "suggestions and recommendations made in this document are for our particular " "environment, and may not be suitable for your environment or situation. 
We " "make no representations concerning the accuracy, adequacy, completeness or " "suitability of the information, suggestions or recommendations. This " "document is provided for reference only. We are not responsible for your " "use of any information, suggestions or recommendations contained herein." msgstr "" #: ../../source/ops_runbook/maintenance.rst:3 msgid "Server maintenance" msgstr "" #: ../../source/ops_runbook/maintenance.rst:6 msgid "General assumptions" msgstr "" #: ../../source/ops_runbook/maintenance.rst:8 msgid "" "It is assumed that anyone attempting to replace hardware components will " "have already read and understood the appropriate maintenance and service " "guides." msgstr "" #: ../../source/ops_runbook/maintenance.rst:12 msgid "" "It is assumed that where servers need to be taken off-line for hardware " "replacement, that this will be done in series, bringing the server back on-" "line before taking the next off-line." msgstr "" #: ../../source/ops_runbook/maintenance.rst:16 msgid "" "It is assumed that the operations directed procedure will be used for " "identifying hardware for replacement." msgstr "" #: ../../source/ops_runbook/maintenance.rst:20 msgid "Assessing the health of swift" msgstr "" #: ../../source/ops_runbook/maintenance.rst:22 msgid "" "You can run the swift-recon tool on a Swift proxy node to get a quick check " "of how Swift is doing. Please note that the numbers below are necessarily " "somewhat subjective. Sometimes parameters for which we say 'low values are " "good' will have pretty high values for a time. Often if you wait a while " "things get better." msgstr "" #: ../../source/ops_runbook/maintenance.rst:28 msgid "For example:" msgstr "" #: ../../source/ops_runbook/maintenance.rst:48 msgid "" "In the example above we ask for information on replication times (-r), load " "averages (-l) and async pendings (-a). This is a healthy Swift system. 
Rules-" "of-thumb for 'good' recon output are:" msgstr "" #: ../../source/ops_runbook/maintenance.rst:52 msgid "" "Nodes that respond are up and running Swift. If all nodes respond, that is a " "good sign. But some nodes may time out. For example:" msgstr "" #: ../../source/ops_runbook/maintenance.rst:60 msgid "That could be okay or could require investigation." msgstr "" #: ../../source/ops_runbook/maintenance.rst:62 msgid "" "Low values (say < 10 for high and average) for async pendings are good. " "Higher values occur when disks are down and/or when the system is heavily " "loaded. Many simultaneous PUTs to the same container can drive async " "pendings up. This may be normal, and may resolve itself after a while. If it " "persists, one way to track down the problem is to find a node with high " "async pendings (with ``swift-recon -av | sort -n -k4``), then check its " "Swift logs, Often async pendings are high because a node cannot write to a " "container on another node. Often this is because the node or disk is offline " "or bad. This may be okay if we know about it." msgstr "" #: ../../source/ops_runbook/maintenance.rst:73 msgid "" "Low values for replication times are good. These values rise when new rings " "are pushed, and when nodes and devices are brought back on line." msgstr "" #: ../../source/ops_runbook/maintenance.rst:77 msgid "" "Our 'high' load average values are typically in the 9-15 range. If they are " "a lot bigger it is worth having a look at the systems pushing the average " "up. Run ``swift-recon -av`` to get the individual averages. To sort the " "entries with the highest at the end, run ``swift-recon -av | sort -n -k4``." 
msgstr "" #: ../../source/ops_runbook/maintenance.rst:83 msgid "" "For comparison, here is the recon output for the same system above when two " "entire racks of Swift are down:" msgstr "" #: ../../source/ops_runbook/maintenance.rst:141 msgid "" "The replication times and load averages are within reasonable parameters, " "even with 80 object stores down. Async pendings, however, are quite high. This " "is due to the fact that the containers on the servers which are down cannot " "be updated. When those servers come back up, async pendings should drop. If " "async pendings were at this level without an explanation, we have a problem." msgstr "" #: ../../source/ops_runbook/maintenance.rst:149 msgid "Recon examples" msgstr "" #: ../../source/ops_runbook/maintenance.rst:151 msgid "Here is an example of noting and tracking down a problem with recon." msgstr "" #: ../../source/ops_runbook/maintenance.rst:153 msgid "Running recon shows some async pendings:" msgstr "" #: ../../source/ops_runbook/maintenance.rst:171 msgid "" "Why? Running recon again with -av swift (not shown here) tells us that the " "node with the highest (23) is .72.61. Looking at the log files on " ".72.61 we see:" msgstr "" #: ../../source/ops_runbook/maintenance.rst:207 msgid "" "That is why this node has a lot of async pendings: a bunch of disks that are " "not mounted on and . There may be other issues, but " "clearing this up will likely drop the async pendings a fair bit, as other " "nodes will be having the same problem." msgstr "" #: ../../source/ops_runbook/maintenance.rst:213 msgid "Assessing the availability risk when multiple storage servers are down" msgstr "" #: ../../source/ops_runbook/maintenance.rst:217 msgid "" "This procedure will tell you if you have a problem; however, in practice you " "will find that you will not use this procedure frequently."
msgstr "" #: ../../source/ops_runbook/maintenance.rst:220 msgid "" "If three storage nodes (or, more precisely, three disks on three different " "storage nodes) are down, there is a small but nonzero probability that user " "objects, containers, or accounts will not be available." msgstr "" #: ../../source/ops_runbook/maintenance.rst:226 msgid "Procedure" msgstr "" #: ../../source/ops_runbook/maintenance.rst:230 msgid "" "Swift has three rings: one each for objects, containers and accounts. This " "procedure should be run three times, each time specifying the appropriate " "``*.builder`` file." msgstr "" #: ../../source/ops_runbook/maintenance.rst:234 msgid "" "Determine whether all three nodes are in different Swift zones by running " "the ring builder on a proxy node to determine which zones the storage nodes " "are in. For example:" msgstr "" #: ../../source/ops_runbook/maintenance.rst:251 msgid "" "Here, node .4 is in zone 1. If two or more of the three nodes " "under consideration are in the same Swift zone, they do not have any ring " "partitions in common; there is little/no data availability risk if all three " "nodes are down." msgstr "" #: ../../source/ops_runbook/maintenance.rst:256 msgid "" "If the nodes are in three distinct Swift zones it is necessary to determine whether " "the nodes have ring partitions in common. Run ``swift-ring-builder`` again, " "this time with the ``list_parts`` option and specify the nodes under " "consideration. For example:" msgstr "" #: ../../source/ops_runbook/maintenance.rst:276 msgid "" "The ``list_parts`` option to the ring builder indicates how many ring " "partitions the nodes have in common. If, as in this case, the first entry " "in the list has a 'Matches' column of 2 or less, there is no data " "availability risk if all three nodes are down." msgstr "" #: ../../source/ops_runbook/maintenance.rst:281 msgid "" "If the 'Matches' column has entries equal to 3, there is some data " "availability risk if all three nodes are down. 
The risk is generally small, " "and is proportional to the number of entries that have a 3 in the Matches " "column. For example:" msgstr "" #: ../../source/ops_runbook/maintenance.rst:301 msgid "A quick way to count the number of rows with 3 matches is:" msgstr "" #: ../../source/ops_runbook/maintenance.rst:309 msgid "" "In this case the nodes have 30 out of a total of 2097152 partitions in " "common; about 0.001%. In this case the risk is small/nonzero. Recall that a " "partition is simply a portion of the ring mapping space, not actual data. So " "having partitions in common is a necessary but not sufficient condition for " "data unavailability." msgstr "" #: ../../source/ops_runbook/maintenance.rst:317 msgid "" "We should not bring down a node for repair if it shows Matches entries of 3 " "with other nodes that are also down." msgstr "" #: ../../source/ops_runbook/maintenance.rst:320 msgid "" "If three nodes that have 3 partitions in common are all down, there is a " "nonzero probability that data are unavailable and we should work to bring " "some or all of the nodes up ASAP." msgstr "" #: ../../source/ops_runbook/maintenance.rst:325 msgid "Swift startup/shutdown" msgstr "" #: ../../source/ops_runbook/maintenance.rst:327 msgid "Use reload - not stop/start/restart." msgstr "" #: ../../source/ops_runbook/maintenance.rst:329 msgid "" "Try to roll sets of servers (especially proxy) in groups of less than 20% of " "your servers." msgstr "" #: ../../source/ops_runbook/procedures.rst:3 msgid "Software configuration procedures" msgstr "" #: ../../source/ops_runbook/procedures.rst:8 msgid "Fix broken GPT table (broken disk partition)" msgstr "" #: ../../source/ops_runbook/procedures.rst:10 msgid "" "If a GPT table is broken, a message like the following should be observed " "when the command..." msgstr "" #: ../../source/ops_runbook/procedures.rst:17 msgid "... is run." 
msgstr "" #: ../../source/ops_runbook/procedures.rst:26 msgid "To fix this, firstly install the ``gdisk`` program to fix this:" msgstr "" #: ../../source/ops_runbook/procedures.rst:32 msgid "Run ``gdisk`` for the particular drive with the damaged partition:" msgstr "" #: ../../source/ops_runbook/procedures.rst:55 msgid "" "On the command prompt, type ``r`` (recovery and transformation options), " "followed by ``d`` (use main GPT header) , ``v`` (verify disk) and finally " "``w`` (write table to disk and exit). Will also need to enter ``Y`` when " "prompted in order to confirm actions." msgstr "" #: ../../source/ops_runbook/procedures.rst:93 msgid "Running the command:" msgstr "" #: ../../source/ops_runbook/procedures.rst:99 msgid "Should now show that the partition is recovered and healthy again." msgstr "" #: ../../source/ops_runbook/procedures.rst:101 msgid "Finally, uninstall ``gdisk`` from the node:" msgstr "" #: ../../source/ops_runbook/procedures.rst:110 msgid "Procedure: Fix broken XFS filesystem" msgstr "" #: ../../source/ops_runbook/procedures.rst:112 msgid "" "A filesystem may be corrupt or broken if the following output is observed " "when checking its label:" msgstr "" #: ../../source/ops_runbook/procedures.rst:125 msgid "" "Run the following commands to remove the broken/corrupt filesystem and " "replace. (This example uses the filesystem ``/dev/sdb2``) Firstly need to " "replace the partition:" msgstr "" #: ../../source/ops_runbook/procedures.rst:168 msgid "Next step is to scrub the filesystem and format:" msgstr "" #: ../../source/ops_runbook/procedures.rst:186 msgid "You should now label and mount your filesystem." 
msgstr "" #: ../../source/ops_runbook/procedures.rst:188 msgid "You can now check to see if the filesystem is mounted using the command:" msgstr "" #: ../../source/ops_runbook/procedures.rst:197 msgid "Procedure: Checking if an account is okay" msgstr "" #: ../../source/ops_runbook/procedures.rst:201 msgid "" "``swift-direct`` is only available in the HPE Helion Public Cloud. Use " "``swiftly`` as an alternate (or use ``swift-get-nodes`` as explained here)." msgstr "" #: ../../source/ops_runbook/procedures.rst:205 msgid "" "You must know the tenant/project ID. You can check if the account is okay as " "follows from a proxy." msgstr "" #: ../../source/ops_runbook/procedures.rst:211 msgid "" "The response will either be similar to a swift list of the account " "containers, or an error indicating that the resource could not be found." msgstr "" #: ../../source/ops_runbook/procedures.rst:214 msgid "" "Alternatively, you can use ``swift-get-nodes`` to find the account database " "files. Run the following on a proxy:" msgstr "" #: ../../source/ops_runbook/procedures.rst:221 msgid "" "The response will print curl/ssh commands that will list the replicated " "account databases. Use the indicated ``curl`` or ``ssh`` commands to check " "the status and existence of the account." msgstr "" #: ../../source/ops_runbook/procedures.rst:226 msgid "Procedure: Getting swift account stats" msgstr "" #: ../../source/ops_runbook/procedures.rst:230 msgid "" "``swift-direct`` is specific to the HPE Helion Public Cloud. Use " "``swiftly`` as an alternate, or use ``swift-get-nodes`` as explained in :ref:" "`checking_if_account_ok`." msgstr "" #: ../../source/ops_runbook/procedures.rst:234 msgid "" "This procedure describes how you determine the swift usage for a given swift " "account, that is, the number of containers, number of objects and total bytes " "used. To do this you will need the project ID."
msgstr "" #: ../../source/ops_runbook/procedures.rst:238 msgid "Log onto one of the swift proxy servers." msgstr "" #: ../../source/ops_runbook/procedures.rst:240 msgid "Use swift-direct to show this accounts usage:" msgstr "" #: ../../source/ops_runbook/procedures.rst:256 msgid "" "This account has 1 container. That container has 8436776 objects. The total " "bytes used is 67440225625994." msgstr "" #: ../../source/ops_runbook/procedures.rst:260 msgid "Procedure: Revive a deleted account" msgstr "" #: ../../source/ops_runbook/procedures.rst:262 msgid "" "Swift accounts are normally not recreated. If a tenant/project is deleted, " "the account can then be deleted. If the user wishes to use Swift again, the " "normal process is to create a new tenant/project -- and hence a new Swift " "account." msgstr "" #: ../../source/ops_runbook/procedures.rst:267 msgid "" "However, if the Swift account is deleted, but the tenant/project is not " "deleted from Keystone, the user can no longer access the account. This is " "because the account is marked deleted in Swift. You can revive the account " "as described in this process." msgstr "" #: ../../source/ops_runbook/procedures.rst:274 msgid "" "The containers and objects in the \"old\" account cannot be listed anymore. " "In addition, if the Account Reaper process has not finished reaping the " "containers and objects in the \"old\" account, these are effectively " "orphaned and it is virtually impossible to find and delete them to free up " "disk space." msgstr "" #: ../../source/ops_runbook/procedures.rst:280 msgid "" "The solution is to delete the account database files and re-create the " "account as follows:" msgstr "" #: ../../source/ops_runbook/procedures.rst:283 msgid "" "You must know the tenant/project ID. The account name is AUTH_. " "In this example, the tenant/project is ``4ebe3039674d4864a11fe0864ae4d905`` " "so the Swift account name is ``AUTH_4ebe3039674d4864a11fe0864ae4d905``." 
msgstr "" #: ../../source/ops_runbook/procedures.rst:287 msgid "" "Use ``swift-get-nodes`` to locate the account's database files (on three " "servers). The output has been truncated so we can focus on the important pieces " "of data:" msgstr "" #: ../../source/ops_runbook/procedures.rst:308 msgid "" "Before proceeding, check that the account is really deleted by using curl. " "Execute the commands printed by ``swift-get-nodes``. For example:" msgstr "" #: ../../source/ops_runbook/procedures.rst:318 msgid "" "Repeat for the other two servers (192.168.245.3 and 192.168.245.4). A ``404 " "Not Found`` indicates that the account is deleted (or never existed)." msgstr "" #: ../../source/ops_runbook/procedures.rst:321 msgid "If you get a ``204 No Content`` response, do **not** proceed." msgstr "" #: ../../source/ops_runbook/procedures.rst:323 msgid "" "Use the ssh commands printed by ``swift-get-nodes`` to check if database " "files exist. For example:" msgstr "" #: ../../source/ops_runbook/procedures.rst:336 #: ../../source/ops_runbook/procedures.rst:353 msgid "Repeat for the other two servers (192.168.245.3 and 192.168.245.4)." msgstr "" #: ../../source/ops_runbook/procedures.rst:338 msgid "If no files exist, no further action is needed." msgstr "" #: ../../source/ops_runbook/procedures.rst:340 msgid "" "Stop Swift processes on all nodes listed by ``swift-get-nodes`` (In this " "example, that is 192.168.245.3, 192.168.245.4 and 192.168.245.5)." msgstr "" #: ../../source/ops_runbook/procedures.rst:343 msgid "We recommend you make backup copies of the database files." msgstr "" #: ../../source/ops_runbook/procedures.rst:345 msgid "Delete the database files. For example:" msgstr "" #: ../../source/ops_runbook/procedures.rst:355 msgid "Restart Swift on all three servers" msgstr "" #: ../../source/ops_runbook/procedures.rst:357 msgid "" "At this stage, the account is fully deleted. 
If you enable the auto-create " "option, the next time the user attempts to access the account, the account " "will be created. You may also use swiftly to recreate the account." msgstr "" #: ../../source/ops_runbook/procedures.rst:363 msgid "" "Procedure: Temporarily stop load balancers from directing traffic to a proxy " "server" msgstr "" #: ../../source/ops_runbook/procedures.rst:365 msgid "" "You can stop the load balancers sending requests to a proxy server as " "follows. This can be useful when a proxy is misbehaving but you need Swift " "running to help diagnose the problem. By removing the proxy from the load balancers, " "customers are not impacted by the misbehaving proxy." msgstr "" #: ../../source/ops_runbook/procedures.rst:370 msgid "" "Ensure that in /etc/swift/proxy-server.conf the ``disable_path`` variable is " "set to ``/etc/swift/disabled-by-file``." msgstr "" #: ../../source/ops_runbook/procedures.rst:373 msgid "Log onto the proxy node." msgstr "" #: ../../source/ops_runbook/procedures.rst:375 msgid "Shut down Swift as follows:" msgstr "" #: ../../source/ops_runbook/procedures.rst:383 msgid "Shutdown, not stop." msgstr "" #: ../../source/ops_runbook/procedures.rst:385 msgid "Create the ``/etc/swift/disabled-by-file`` file. For example:" msgstr "" #: ../../source/ops_runbook/procedures.rst:391 msgid "Optionally, restart Swift:" msgstr "" #: ../../source/ops_runbook/procedures.rst:397 msgid "" "This works because the healthcheck middleware looks for /etc/swift/disabled-by-" "file. If it exists, the middleware will return 503/error instead of 200/OK. " "This means the load balancer should stop sending traffic to the proxy." msgstr "" #: ../../source/ops_runbook/procedures.rst:402 msgid "Procedure: Ad-Hoc disk performance test" msgstr "" #: ../../source/ops_runbook/procedures.rst:404 msgid "You can get an idea whether a disk drive is performing as follows:" msgstr "" #: ../../source/ops_runbook/procedures.rst:410 msgid "" "You can expect ~600MB/sec. 
If you get a low number, repeat many times, as " "Swift itself may also read or write to the disk, hence giving a lower number." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:3 msgid "Troubleshooting tips" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:6 msgid "" "Diagnose: Customer complains they receive an HTTP status 500 when trying to " "browse containers" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:8 msgid "" "This entry is prompted by a real customer issue and focuses exclusively on " "how that problem was identified. There are many reasons why an HTTP status " "of 500 could be returned. If there are no obvious problems with the Swift " "object store, then it may be necessary to take a closer look at the user's " "transactions. After finding the user's Swift account, you can search the " "Swift proxy logs on each Swift proxy server for transactions from this user. " "The Linux ``bzgrep`` command can be used to search all the proxy log files " "on a node, including the ``.bz2`` compressed files. For example:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:34 msgid "This shows a ``GET`` operation on the user's account." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:38 msgid "" "The HTTP status returned is 404, Not found, rather than 500 as reported by " "the user." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:40 msgid "" "Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3``, you can " "search the swift object servers log files for this transaction ID:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:75 msgid "" "The 3 GET operations to 3 different object servers that hold the 3 replicas " "of this user's account. Each ``GET`` returns an HTTP status of 404, Not " "found."
msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:79 msgid "" "Next, use the ``swift-get-nodes`` command to determine exactly where the " "user's account data is stored:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:119 msgid "" "Check each of the primary servers, .31, .204.70 and " ".72.16, for this user's account. For example on .72.16:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:131 msgid "" "So this user's account db, an sqlite db, is present. Use sqlite to check " "out the account:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:167 msgid "" "Next, try to find the ``DELETE`` operation for this account in the proxy " "server logs:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:182 msgid "" "From this you can see the operation that resulted in the account being " "deleted." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:185 msgid "Procedure: Deleting objects" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:188 msgid "Simple case - deleting a small number of objects and containers" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:192 msgid "" "``swift-direct`` is specific to the Hewlett Packard Enterprise Helion Public " "Cloud. Use ``swiftly`` as an alternative." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:197 msgid "" "Object and container names are in UTF8. Swift direct accepts UTF8 directly, " "not URL-encoded UTF8 (the REST API expects UTF8 and then URL-encoded). In " "practice cut and paste of foreign language strings to a terminal window will " "produce the right result." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:202 msgid "Hint: Use the ``head`` command before any destructive commands."
msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:204 msgid "" "To delete a small number of objects, log into any proxy node and proceed as " "follows:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:207 msgid "Examine the object in question:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:213 msgid "" "If ``X-Object-Manifest`` or ``X-Static-Large-Object`` is set, then this is " "the manifest object and segment objects may be in another container." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:217 msgid "" "If the ``X-Object-Manifest`` attribute is set, the object is a DLO and you " "need to find the names of its segment objects. For example, if ``X-Object-" "Manifest`` is ``container2/seg-blah``, list the contents of the container " "container2 as follows:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:226 msgid "" "Pick out the objects whose names start with ``seg-blah``. Delete the segment " "objects as follows:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:235 msgid "" "If ``X-Static-Large-Object`` is set, you need to read the contents. Do this " "by:" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:237 msgid "Using swift-get-nodes to get the details of the object's location." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:238 msgid "Change the ``-X HEAD`` to ``-X GET`` and run ``curl`` against one copy." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:239 msgid "This returns a JSON body listing containers and object names" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:240 msgid "Delete the objects as described above for DLO segments" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:242 msgid "" "Once the segments are deleted, you can delete the object using ``swift-" "direct`` as described above." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:245 msgid "Finally, use ``swift-direct`` to delete the container."
msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:248 msgid "Procedure: Decommissioning swift nodes" msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:250 msgid "" "Should Swift nodes need to be decommissioned (e.g., where they are being re-" "purposed), it is very important to follow these steps." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:253 msgid "" "In the case of object servers, follow the procedure for removing the node " "from the rings." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:255 msgid "" "In the case of swift proxy servers, have the network team remove the node " "from the load balancers." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:257 msgid "Open a network ticket to have the node removed from network firewalls." msgstr "" #: ../../source/ops_runbook/troubleshooting.rst:259 msgid "" "Make sure that you remove the ``/etc/swift`` directory and everything in it." msgstr ""