From awadell at andrew.cmu.edu Wed Sep 1 16:23:28 2021
From: awadell at andrew.cmu.edu (Alexius Wadell)
Date: Wed, 1 Sep 2021 16:23:28 -0400
Subject: Arjuna Status Update
Message-ID:

Hello,

Yesterday, at approximately 1:00 pm, Arjuna experienced a filesystem failure. Thus far, we have been unable to recover the data. We understand that this situation is undesirable and recognize the severity of its impact on your research.

We are currently reaching out to PSC, ECE, and CIT for additional guidance, but are not optimistic about any data recovery. Arjuna will remain down indefinitely as we exhaust all available avenues for recovering data.

Best,
the Arjuna Admin Team

From abills at andrew.cmu.edu Tue Sep 7 17:14:13 2021
From: abills at andrew.cmu.edu (Alexander Bills)
Date: Tue, 7 Sep 2021 17:14:13 -0400
Subject: Arjuna Update
Message-ID:

Hello,

We have not been able to recover data from the RAID storage and have exhausted all recommended solutions from PSC, CMU ECE, and others. At this time we do not believe that the data is recoverable.

After an extensive investigation, we determined that the cause of the data loss was a reformat of the /home filesystem. We changed the configuration of the worker nodes to mount their drives (previously, the OS and all files on the workers were stored in RAM and the drives were unused) in an attempt to free RAM and to alleviate crashes caused by users filling the /tmp filesystem. c003 had a connection to the RAID system that we believed was inactive but that became active upon this change, and c003 reformatted the RAID storage rather than its internal disk. During a power fluctuation at PSC on Tuesday, c001 briefly lost connectivity to the RAID and, upon remount, was unable to reconstruct the reformatted filesystem.
We attempted to reconstruct the filesystem using all tools recommended to us, ranging from simply rebuilding the partition table to "forensics" tools typically used by law enforcement to reconstruct destroyed evidence. However, we were unable to recover any missing data.

The connection from c003 to the RAID was removed, ensuring that this incident will not be repeated. While we do not anticipate any more major changes of this type to the Arjuna configuration, we strongly reiterate our recommendation that all users back up all important data on all machines.

We will return Arjuna to service this week, and at that time will provide instructions on how to request a new account.

Best Regards,
the Arjuna Admin Team

From abills at andrew.cmu.edu Fri Sep 10 09:37:35 2021
From: abills at andrew.cmu.edu (Alexander Bills)
Date: Fri, 10 Sep 2021 09:37:35 -0400
Subject: Arjuna Returning to Service
Message-ID:

Hi All,

Arjuna will be returning to service today. You will need to request an account using this form. We will closely monitor this form for the next two weeks; after that, please open an issue on GitHub for account requests. Please also report any problems on GitHub.

Once your account has been created, you will receive an email from root at coe.psc.edu, and at that time you will be able to log into Arjuna.

If you are interested in actively participating in the administration of Arjuna, please send an email to Alec Bills (abills at andrew.cmu.edu) and Alex Wadell (awadell at andrew.cmu.edu).

Best,
the Arjuna Admin Team

From ddardzin at andrew.cmu.edu Mon Sep 20 13:04:01 2021
From: ddardzin at andrew.cmu.edu (Derek Dardzinski)
Date: Mon, 20 Sep 2021 13:04:01 -0400
Subject: Salloc Time Limit
Message-ID:

Hello,

My nodes that are checked out using salloc are being killed after one day.
Previously you could have a node for seven days. Is there some additional flag I need to add to extend the time limit? Here is the command I used to check out a node:

    salloc -N 1 -n 54 -J derek -p cpu --mem=0

Best,
Derek

--
Derek Dardzinski
PhD Candidate, Materials Science and Engineering
Carnegie Mellon University
Website: derekdardzinski.com
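
On the salloc question, one thing worth checking: when no --time option is given, Slurm applies the partition's default time limit, so a one-day default on the cpu partition would be consistent with the behavior described. The sketch below adds an explicit --time request to the original command. The seven-day value comes from the question; whether the partition currently permits it is an assumption that the partition's MaxTime setting would confirm or rule out. The variable names are illustrative only.

```shell
# Requested wall time in Slurm's days-hours:minutes:seconds format.
# Seven days is taken from the question; whether the cpu partition
# still allows it is an assumption to verify on the cluster.
TIME_LIMIT="7-00:00:00"

# Same request as in the message, plus an explicit --time option.
SALLOC_CMD="salloc -N 1 -n 54 -J derek -p cpu --mem=0 --time=${TIME_LIMIT}"

# Print the command instead of running it, so it can be reviewed first.
echo "$SALLOC_CMD"

# On the cluster, the partition's limits can be inspected with:
#   scontrol show partition cpu    # look at DefaultTime and MaxTime
#   sinfo -p cpu -o "%P %l"        # partition name and its time limit
```

If the requested time exceeds the partition's MaxTime, the request will typically either stay pending or be rejected, depending on how the site is configured, and in that case only the administrators can raise the limit.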