Controls Group Weekly Telecon: "To what extent should HA drive the control system architecture...?"

Present:
ANL: John Carwardine, Frank Lenkszus, Claude Saunders
DESY: Kay Rehlich
FNAL: Brian Chase, Erik Gottschalk, Paul Joireman, Paul Kasley, Peter Kasper, Jim Patrick, Vince Pavlicek, Margaret Votava
KEK: Shinichiro Michizono
SLAC: Bob Downing, Tom Himel, Ray Larsen, Nan Phinney, Marc Ross, S. Smith, A. Young

Upcoming Schedule
=================
There is an ILC Americas review at FNAL April 4th-6th; see http://ilcagenda.cern.ch/conferenceDisplay.py?confId=159 for the agenda. Marc Ross and John Carwardine will be at FNAL during those days and then at Argonne on April 7th. Various meetings will be scheduled during that week while so many people are in town, including one with Nikolay Solyak and Chris Adolphsen to discuss Linac and positron civil construction (including coax cable requirements and motors). A meeting to discuss RDR costing strategy, including operation centers, is scheduled for Tuesday at 11:00. Keep your eyes on the mailing list, as meeting proposals will be forthcoming.

The following week (April 10th), several people from SLAC (Marc Ross, Janice Nelson, and Keith Jobe) will visit. They will be here Tuesday through Thursday. Wednesday morning they will meet with the ILCTA software groups on an EPICS theme: LLRF control at SLAC and MDB control efforts, including a demo of the cryo system. ANL EPICS people will also be there to offer advice. Wednesday afternoon they will meet with Manfred and the instrumentation group. Tuesday is currently free, with a proposal to talk about controls architecture, with telecons to [at least] SLAC and DESY in the morning. Again, watch for more detail to appear on the mailing lists.

Upcoming weekly Thursday meeting topics are now posted on the ILC Indico server: http://ilcagenda.cern.ch/materialDisplay.py?materialId=4&confId=332

There is still no feedback about the requirements process from the points of contact. John will contact them individually. It might help to have a list of questions to focus the contacts on what details we are soliciting. Claude's spreadsheet and the initial list of questions developed by Frank are the best starting point; these have already been distributed to the Points of Contact. The spreadsheet is available in the Fermilab docdb, document number 243.

Stefan is on vacation starting April 23rd and will be gone for two weeks. Given that the RDR Linac LLRF review at DESY takes place shortly after he returns, we will need to cover the LLRF topic before he leaves. Again, watch for more detail to appear on the mailing list.

High Availability Discussion
============================
There was the usual discussion of what we are required to achieve, and what we are striving for, in terms of availability. Overall control system availability should be 99%. There was no decision on further subdividing this into hardware and software allotments. It was noted that Tevatron downtime due to software problems is negligible. Tom commented that redundancy will add 20% to hardware reliability numbers. Adding hardware redundancy to COTS components such as network routers is not a cost driver and seems logical, but one would not double up the instruments themselves; instead, problems with individual instruments are alleviated by building in system-level redundancy, e.g., not all BPMs are required. With COTS redundancy there will be some software overhead in automatic failover. (A rough illustration of these availability numbers is sketched below.)
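As a rough illustration of the numbers mentioned above (the 99% overall availability target and doubling up COTS components), the short sketch below works out the implied downtime and the effect of a redundant pair. The 5000-hour scheduled run time and the 98%-available example component are assumptions for illustration only, not figures from the meeting.

    # Rough availability arithmetic (illustrative assumptions only).
    SCHEDULED_HOURS = 5000          # assumed scheduled run time per year

    # 99% overall control-system availability target from the meeting.
    target = 0.99
    allowed_downtime = (1 - target) * SCHEDULED_HOURS
    print(f"Allowed control-system downtime: {allowed_downtime:.0f} h "
          f"per {SCHEDULED_HOURS} h of scheduled running")        # 50 h

    # Doubling up a COTS component (e.g. a network router): two
    # independent units in parallel are down only when both are down.
    single_unit = 0.98              # assumed availability of one unit
    redundant_pair = 1 - (1 - single_unit) ** 2
    print(f"Single unit: {single_unit:.4f}, redundant pair: {redundant_pair:.4f}")

The parallel-pair arithmetic (both units must fail at the same time) is what makes duplicating inexpensive COTS items such as routers attractive, while redundancy for the instruments themselves is handled at the system level instead.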
At the beginning, not all failover needs to be automatic, but the hooks need to be available in the control system infrastructure so that it can be added as needed. Diagnostics in failure situations are critical.

We walked through the slides at http://ilcagenda.cern.ch/materialDisplay.py?materialId=slides&confId=381. It is believed that the higher downtimes of collider machines compared to light sources such as APS are due primarily to more hardware (e.g., more parts to fail). There is also more software controlling the collider machines, and that complexity is large. Collider machines are constantly evolving, and the addition of new features contributes to the downtime. The underlying controls toolkit must be flexible and robust enough to handle the evolutionary nature of the machine. Testing new features and releases on an integration system is crucial, and we need the ability to roll software changes in and out with little impact on performance. The SLC experience is that testing on an integration system went extremely well, except for the database, which is still problematic. We also need to be able to reconfigure feedback loops without much impact.

Looking at the availability index on slide 3: do we want levels 10, 11, and 12 for things like power supplies, but nothing else? RF is redundant at a system level. What about crate controllers? The table on slide 6 may be about right for redundancy; we should refer back to Tom's MTBF list to compare.

There was general consensus that, relative to present accelerator experience, much could be gained by adopting greater QA rigor in the design, testing, and deployment of control system hardware and software. Quality assurance needs to be integrated into the culture, including applications.

Do we buy SNMP diagnostic software for networking, or roll our own? Licenses are in the range of a million dollars. How does this mesh with ATCA self-management? How does commercial SNMP diagnostic software interface with the control system software? Claude sent some followup mail.

Good alarm management is vital; we need the ability to filter out noisy alarms (a minimal sketch of such a filter appears after these notes). Jim's list of human aspects on slide 8 is very valid. The machine controls group is often different from the groups that write applications, which can lead to redundant application code. How do we make sure the interfaces are correct?

We want hardware redundancy in the controls backbone (networking, databases, etc.), but for applications we slide more to the left of the availability-level scale.

Do we want UPS (uninterruptible power supplies), and where? Cryogenics, personnel and machine safety, and motor controllers (not the motors themselves). Some things (computers) need time to shut down gracefully and should come back automatically, but don't necessarily need a generator to keep them running during a long outage. Power dips are worse than power outages; much of the burden of handling dips is on the machine designers, and we need to be able to resume automatically after a power dip.

Do we need carrier-grade OSs (e.g., Linux)? These versions have redundant servers (e.g., DNS, NFS). It is not clear whether we can get source code for these kernels; maybe this is an add-on if we need it. The Tevatron experience with Linux is that once we get a stable version of the kernel, things are good. Proprietary software is risky.

Hopefully the job of figuring out how to make the control system highly available will be easier once the machine designers start to pick components.
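On the alarm-filtering point above, here is a minimal, hypothetical sketch of one way to suppress chattering alarms by requiring a quiet period before an alarm is re-announced. The class, the 10-second hold-off, and the VAC:GAUGE:01:HIGH alarm name are invented for illustration and are not tied to EPICS or any other alarm package discussed here.

    import time
    from collections import defaultdict

    class AlarmFilter:
        """Hypothetical debounce filter: an alarm is announced only if it
        has been quiet for at least `holdoff` seconds since it last fired."""

        def __init__(self, holdoff=10.0):
            self.holdoff = holdoff
            self.last_seen = defaultdict(lambda: float("-inf"))

        def accept(self, alarm_id, now=None):
            now = time.time() if now is None else now
            quiet_for = now - self.last_seen[alarm_id]
            self.last_seen[alarm_id] = now
            # Re-announce only if the alarm has not fired recently.
            return quiet_for >= self.holdoff

    # Usage: a noisy gauge raising the same alarm every second is
    # reported once, then suppressed until it stays quiet long enough.
    f = AlarmFilter(holdoff=10.0)
    for t in (0.0, 1.0, 2.0, 15.0):
        print(t, f.accept("VAC:GAUGE:01:HIGH", now=t))   # True, False, False, True

In practice this kind of debounce is usually combined with deadbands on the underlying readings and with grouping of related alarms, so that operators see one actionable message rather than a flood.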