From jkitchin at andrew.cmu.edu Mon Dec 8 15:00:21 2014
From: jkitchin at andrew.cmu.edu (John Kitchin)
Date: Mon, 8 Dec 2014 15:00:21 -0500
Subject: gilgamesh: update on cluster
Message-ID:

It seems we have had another hard drive failure on the home RAID. I am in the process of getting it back online. Hopefully that will go as well as it has gone in the past. Right now it looks like it will be at least another day before I know more. Some filesystem repair commands are running, and they take a long time on 10 TB of disk.

John
-----------------------------------
John Kitchin
Professor
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7803
http://kitchingroup.cheme.cmu.edu

From jkitchin at andrew.cmu.edu Thu Dec 11 10:53:12 2014
From: jkitchin at andrew.cmu.edu (John Kitchin)
Date: Thu, 11 Dec 2014 10:53:12 -0500
Subject: gilgamesh: update on cluster
Message-ID:

It appears there are multiple drive issues on the home server RAID array, including one drive that has totally failed and one or more with some errors. Odds are high it is lost.

The array is "online", which means it exists, but it is in a degraded state, and I am unable to mount the homes due to XFS errors that cannot be repaired, possibly because of the failed disk.

This afternoon I am going to try replacing the failed drive and see if the array will rebuild. If that works, I will replace the other drives one at a time, rebuilding after each. It will take a few days to find out whether this works, because it takes a long time to rebuild the ~10 TB array.

If that works, then I can try repairing the file system errors, and if it all goes well we may have minimal data loss. If not, we will likely have total loss of the homes. Hopefully not.
John
-----------------------------------
Professor John Kitchin
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7803
@johnkitchin
http://kitchingroup.cheme.cmu.edu

From jkitchin at andrew.cmu.edu Fri Dec 12 09:12:06 2014
From: jkitchin at andrew.cmu.edu (John Kitchin)
Date: Fri, 12 Dec 2014 09:12:06 -0500
Subject: gilgamesh: new update on cluster
Message-ID:

Hi all,

I am making progress in recovering the cluster homes. Last night I replaced a dead drive, and the RAID array rebuilt successfully. There is still a drive with errors on it, so I replaced that one this morning, and the array is rebuilding again. Assuming that goes well and finishes tonight, tomorrow I will try to repair the file system again, and if that goes well, we may have our data back.

Please remind your group members to register at https://lists.andrew.cmu.edu/mailman/listinfo/gilgamesh-users to get updates.

Thanks for your patience.

John

From jkitchin at andrew.cmu.edu Sat Dec 13 09:54:43 2014
From: jkitchin at andrew.cmu.edu (John Kitchin)
Date: Sat, 13 Dec 2014 09:54:43 -0500
Subject: gilgamesh: christmas miracle - of sorts - i think
Message-ID:

Hi all,

I believe I have managed to get the cluster to a state where we can recover data. Here is the current situation.

I replaced two drives, and the array has rebuilt and reports as mostly healthy.
However, there are some "bad stripes" that prevent me from repairing the file system, and the data on those stripes is almost certainly lost. We are going to have to do some work that will probably involve destroying the array and recreating it.

In the meantime, I have mounted the homes read-only, and it appears I can access some of the data. Some of the data is unavailable. That means I can list directory contents and see the contents of some files, but other files give errors when I try to read or copy them.

I am working on some options for backing this data up. To avoid pandemonium and overloading the server with file transfers, I have disabled login for now. I think the best plan is to connect an external drive to the server and transfer the data to it (rather than have lots of people trying to rsync).

In the meantime, if there are some super urgent needs, please let me know and I will see what I can do.

John

From jkitchin at andrew.cmu.edu Wed Dec 24 06:22:35 2014
From: jkitchin at andrew.cmu.edu (John Kitchin)
Date: Wed, 24 Dec 2014 06:22:35 -0500
Subject: gilgamesh: cluster update - mostly back on line
Message-ID:

Hi all,

I believe the cluster is back online. I was not able to save all of the data, and some people lost everything in their home directory. You will have to trust me that I spent the last two weeks trying to save what we could.

I have logged in, and nothing obviously wrong or funny happened. I do not know what you will see. There may be a log file in your home that shows which files were not saved.
When you log in, if you get an error about no home directory existing, please send me a note, and I will make a new one.

I have not turned the compute nodes on yet; I plan to do that on Jan 5. I am still verifying that the setup works, and there may still be maintenance to do. I do not want to complicate that with running jobs yet.

I would use this time to think about an effective backup plan if you don't already have one.

I will be offline from 12/26 to 1/3, so if you need anything, please let me know before then. Otherwise, I will address it when I return.

John