From jkitchin at andrew.cmu.edu Mon Nov 4 12:23:49 2013
From: jkitchin at andrew.cmu.edu (John Kitchin)
Date: Mon, 4 Nov 2013 12:23:49 -0500
Subject: gilgamesh: another raid array problem
Message-ID:

Hi everyone,

Another disk in the RAID array is causing a problem, and as a result the home directories are not available. A replacement has been ordered and we are working to resolve the issue.

Unfortunately I am out of town, and won't be back until Friday. Hopefully we can fix it before then, but it may take that long.

John

-----------------------------------
John Kitchin
Associate Professor
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7803
http://kitchingroup.cheme.cmu.edu

From jkitchin at andrew.cmu.edu Sun Nov 10 16:52:07 2013
From: jkitchin at andrew.cmu.edu (John Kitchin)
Date: Sun, 10 Nov 2013 16:52:07 -0500
Subject: gilgamesh: gilgamesh update
Message-ID:

Hi everyone,

I wanted to give you an update on gilgamesh. It seems that last week another drive failed in the home directory RAID array, putting the array in a degraded state and causing some issues. Today I was able to figure out which drive caused that and replace it, and the array is presently rebuilding. That will probably take overnight, and if all goes well I plan to re-enable logins to gilgamesh tomorrow. If the system is stable for another day, I will turn the queue back on, and slowly start turning on nodes.

I do not believe any data has been lost, but I will take this opportunity to remind you that the home directories are not backed up.

I also do not know if there is any reason for two drives to have failed close together.
The disks are a little over 2 years old, which seems young to die, but they also were not used much in those 2 years until we made them the new home directories.

John

From jkitchin at andrew.cmu.edu Mon Nov 11 08:42:39 2013
From: jkitchin at andrew.cmu.edu (John Kitchin)
Date: Mon, 11 Nov 2013 08:42:39 -0500
Subject: gilgamesh: gilgamesh is back up
Message-ID:

The RAID array successfully rebuilt last night. You should be able to log in now and see your home directories.

I have not restarted the queue yet. If all is still good this afternoon, I will start that back up and turn on a few nodes.

John

From jkitchin at andrew.cmu.edu Tue Nov 12 14:18:25 2013
From: jkitchin at andrew.cmu.edu (John Kitchin)
Date: Tue, 12 Nov 2013 14:18:25 -0500
Subject: gilgamesh: very heavy disk io
Message-ID:

Gilgamesh is presently sluggish because of very heavy disk I/O. I suspect this is because there are 456 CPUs used by kparrish that are maxing out the disk I/O. Please kill some of those jobs ASAP.
John

From jkitchin at andrew.cmu.edu Tue Nov 12 18:10:52 2013
From: jkitchin at andrew.cmu.edu (John Kitchin)
Date: Tue, 12 Nov 2013 18:10:52 -0500
Subject: gilgamesh: please limit quantum espresso and other disk-io intensive jobs
Message-ID:

Hi everyone,

I believe we have identified the issue that was causing gilgamesh to be sluggish. The issue was likely a large number of Quantum ESPRESSO jobs running at once. This code is very disk I/O intensive, and was maxing out the disk write rate on the home directories, leading to the sluggish behavior.

We are working to solve this by using local disk I/O on the nodes, and if we figure that out I will let you know.

John
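
For readers wondering what "local disk I/O on the nodes" could look like in practice, here is a minimal sketch, not the setup actually adopted on gilgamesh. It assumes each compute node has a writable local scratch area (taken here from `$TMPDIR`, falling back to the system temp directory); the helper name `run_in_local_scratch` and its parameters are hypothetical. The job writes its files on the node's own disk, and the outputs are copied back to the shared home directory once, at the end, instead of streaming every write over the network file system.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_in_local_scratch(command, results_dir, scratch_root=None):
    """Run `command` in a node-local scratch directory, then copy outputs back.

    Hypothetical helper for illustration only. `scratch_root` is an assumed
    node-local path; it defaults to $TMPDIR or the system temp directory.
    Requires Python 3.8+ (for `dirs_exist_ok`).
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    scratch_root = scratch_root or os.environ.get("TMPDIR") or tempfile.gettempdir()

    # TemporaryDirectory deletes the scratch directory on exit, even on errors.
    with tempfile.TemporaryDirectory(dir=scratch_root, prefix="job-") as scratch:
        # All of the job's reads and writes happen on the local disk.
        completed = subprocess.run(command, cwd=scratch, check=False)
        # One bulk copy back to shared storage (kept even if the job failed,
        # so partial output can be inspected).
        shutil.copytree(scratch, results_dir, dirs_exist_ok=True)
    return completed.returncode


# Example call (the command and destination are placeholders):
#   run_in_local_scratch(["my_simulation", "input.in"], Path.home() / "results")
```

The trade-off is that intermediate files are not visible from the head node while the job runs, and anything the job needs as input has to be referenced by absolute path or copied into the scratch directory first.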