Portal:Toolforge/Admin/Runbooks/ToolsNfsAlmostFull

The ToolsNfsAlmostFull alert fires when the Toolforge NFS server is almost out of disk space. This happens surprisingly often, as the NFS share has no quotas. When it does, a task similar to task T247315 is generally created, with administrator work logged on it and a tree of user tasks that we assign to end users so they can clean up their tool shares or project shares, with some advice and assistance where possible.

The procedures in this runbook require admin permissions to complete.

Error / Incident

The Toolforge NFS server is almost out of disk space. This generally means that some space needs to be freed up. As of 2024-01-03 the NFS server is tools-nfs-2.tools.eqiad1.wikimedia.cloud.

Note that this alert comes in multiple severity levels: a warning alert means that there is much more space available than for a critical or a page alert.
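Before starting a cleanup, it can help to confirm how full the volume actually is. The following is a minimal sketch, assuming the share is mounted at /srv/tools as in the commands used elsewhere in this runbook. usage_percent is a hypothetical helper name, and the alert's real thresholds are not stated here, so none are encoded.

```shell
# usage_percent is a hypothetical helper, not part of any Toolforge tooling.
# It prints the used-space percentage (an integer, without the % sign) of the
# filesystem that holds the given path.
usage_percent() {
    # df -P forces the one-line-per-filesystem POSIX format;
    # column 5 of the data line is the capacity, e.g. "87%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example call; /srv/tools is assumed to be the NFS mount point:
# usage_percent /srv/tools
```

Running it before and after a cleanup gives a quick sense of how much space was actually recovered.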

Admin actions include, but are not limited to:

  • Truncate automatically-created *.out and *.err files that were created by job file logs. For example:
sudo ionice -c 3 nice -19 find /srv/tools -type f -size +10M -name "*.err" -exec truncate -s 1M {} \;
  • Log files generated by the webservice command such as access.log and error.log can be treated similarly.
  • Other files should probably be checked with the user before deleting unless the situation is very urgent (usually asking the user in the Phabricator task is enough).
  • If a service is consistently filling up NFS volumes, and users cannot be reached, it could be shut down as a danger to the overall service. We should make our best effort to avoid needing to do that, of course.
  • Remove older files (or, if necessary, all files) from project/.shared/cache -- some tools (mostly cewbot) use that for temporary storage and often fail to clean up afterwards.
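Because truncation is destructive, it may be worth listing the affected files first. The sketch below wraps the find invocation from the first bullet in a helper that defaults to a dry run. truncate_logs and its "apply" argument are hypothetical names introduced for illustration; the 10M and 1M sizes are the ones from the example above.

```shell
# truncate_logs is a hypothetical helper, not an existing Toolforge tool.
# Usage: truncate_logs DIR PATTERN [apply]
truncate_logs() {
    dir=$1                # directory to scan, e.g. /srv/tools
    pattern=$2            # file name glob, e.g. '*.err'
    mode=${3:-dry-run}    # anything other than "apply" only lists files

    if [ "$mode" = "apply" ]; then
        # Cut every matching file larger than 10 MiB down to its first 1 MiB.
        find "$dir" -type f -size +10M -name "$pattern" -exec truncate -s 1M {} \;
    else
        # Dry run: print the files that would be truncated.
        find "$dir" -type f -size +10M -name "$pattern" -print
    fi
}

# Example calls:
# truncate_logs /srv/tools '*.err'          # list only
# truncate_logs /srv/tools '*.err' apply    # actually truncate
```

Note that truncate -s 1M keeps the beginning of each file, so the oldest log lines survive rather than the newest. On the NFS server the underlying find would still need the sudo, ionice and nice wrapping shown in the example above.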

The Debugging section has some tips for finding what is filling up the volume.

Debugging

Try what was done last time

If the alert fires again shortly (about a week or so) after the last cleanup, it is usually caused by the same thing as last time. Look at the task for that cleanup to see what was done, clean up the same files again, and nudge the same maintainers.

Locate disk hogs

# Individual files:
ionice -c 3 nice -19 find /srv/tools -type f -size +100M -printf "%k KB %p\n" | sort -h > tools_large_files_$(date +%Y%m%d).txt
# Largest tools overall:
du -sh /srv/tools/{project,home}/* | tee tools_large_users_$(date +%Y%m%d).txt

These will take a few hours to complete.
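Once the reports exist, the largest entries are the ones worth looking at first. As a rough sketch, a report can be sorted by size, largest first, and cut to the top few lines. top_entries is a hypothetical helper name, and the default of 20 lines is an arbitrary choice.

```shell
# top_entries is a hypothetical helper for reading the reports produced above.
# Usage: top_entries REPORT_FILE [COUNT]
top_entries() {
    report=$1
    count=${2:-20}    # number of lines to show; 20 is an arbitrary default
    # sort -h understands both plain numbers and human-readable sizes such as
    # 200M or 1.5G at the start of each line; -r puts the largest first.
    sort -rh "$report" | head -n "$count"
}

# Example call against a du report:
# top_entries tools_large_users_$(date +%Y%m%d).txt 10
```

The same helper works on the find report, since each of its lines starts with a plain size in KB.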

Common issues

Add here any new common issues you find.

Related information

Old incidents

Add here any new tasks for incidents you might encounter.

Retrieved from "https://wikitech.wikimedia.org/w/index.php?title=Portal:Toolforge/Admin/Runbooks/ToolsNfsAlmostFull&oldid=2304298"