We're starting the ATLAS Tier 2/3 workshop now. This is my first attempt to liveblog anything, so hope it goes well. I'll try to keep up.
- What's next for US ATLAS
- FY13 Facilities Milestones
- March - Analysis jobs using FAX for data access
- April - Complete deployment of 2013 pledged resources
- May - Tier3 in the cloud prototype
- June - All US Tier2 connected to LHCONE
- July - Cloud production at scale
- HLT Farm at P1 running production in cloud
- 100 Gbit pilot between T1 and an international site
- FAX analysis milestone
- new stress tests coming for FDR
- need examples, testing, documentation
- Cloudy Tier 2
- Follow BNL work on Tier 1
- Support growth of T3 into T2
- Skim Slim Service - Ilija
- in pre-production, uses FAX and UC3
- help researchers with data set reduction
- many extensions possible - new mode of analysis support
- Cloud activities at BNL
- Condor Scaling
- $50k grant for ec2 testing
- naive approach: single schedd, collector, etc / single process per daemon / password authN / connection broker (CCB)
- maxed out at ~3000 nodes
- refined approach: split schedd from collector, negotiator, and CCB / 20 collector processes, with each startd randomly choosing one / sub-collectors report up to a single top-level collector / tune OS limits / shared port daemon to multiplex TCP connections / enable session authN to reduce repeated authentications (see the worker-config sketch at the end of this section)
- smooth up to ~5000 nodes for two weeks
- production simulation jobs
- spent $13k; only $750 for data transfer
- moderate spot termination
- actual spot price remained close to baseline
- roughly, without real statistics, pricing is "competitive"
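To make the sub-collector step concrete, here is a minimal sketch of how a worker VM could pick one of the twenty collector processes at boot and write a local HTCondor config fragment. The hostname, port range, and file path are hypothetical placeholders, not the settings BNL actually used; the knob names (COLLECTOR_HOST, USE_SHARED_PORT, SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION) are standard HTCondor ones.

```python
#!/usr/bin/env python
"""Illustrative sketch only: point a worker's startd at one of many sub-collectors.

Assumes the central manager runs 20 sub-collector processes on ports
9620-9639 (hypothetical values; the talk did not give the actual ports).
"""
import random

CENTRAL_MANAGER = "cm.example.org"        # hypothetical central manager hostname
SUB_COLLECTOR_PORTS = range(9620, 9640)   # 20 sub-collectors

def local_config():
    port = random.choice(SUB_COLLECTOR_PORTS)
    return "\n".join([
        # startd reports to one randomly chosen sub-collector
        "COLLECTOR_HOST = %s:%d" % (CENTRAL_MANAGER, port),
        # multiplex TCP connections through a single inbound port
        "USE_SHARED_PORT = True",
        # reuse security sessions instead of re-authenticating every contact
        "SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True",
    ]) + "\n"

if __name__ == "__main__":
    with open("/etc/condor/config.d/99-subcollector.conf", "w") as f:
        f.write(local_config())
```

On the real pool the sub-collectors would also forward their ads to the single top-level collector (for example via CONDOR_VIEW_HOST) so the whole pool remains visible from one place.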
- EC2 Spot Pricing
- m1.small has 1.7 GB of memory
- not really enough for ATLAS jobs; one EC2 compute unit is roughly half as powerful as a physical CPU core
- does better with larger instance types
- currently bidding 3x baseline; is that optimal? (see the arithmetic sketch at the end of this section)
- spot prices tend to run around 10% of on-demand pricing; based on T1 costs, this is competitive
- 1.7 GB per "compute unit", not per CPU; empirically insufficient for ATLAS work
- do 7 job slots on an m1.xlarge perform economically?
- very interesting table on EC2 instance type performance comparison in the slides, not reproduced here
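As a back-of-the-envelope check on the bidding strategy, here is the arithmetic as a small sketch. All prices are placeholder numbers, not actual 2013 EC2 rates; the point is that a bid cap of 3x the baseline is only a ceiling - you pay the going spot price whenever it is below your cap, and that price tends to sit near 10% of on-demand.

```python
# Back-of-the-envelope spot vs. on-demand comparison (illustrative numbers only).
ON_DEMAND_PER_HOUR = 0.10                   # hypothetical on-demand $/hr for some instance type
SPOT_BASELINE = 0.1 * ON_DEMAND_PER_HOUR    # talk: spot tends toward ~10% of on-demand
BID_CAP = 3 * SPOT_BASELINE                 # talk: currently bidding 3x baseline

JOB_HOURS = 10000                           # hypothetical workload size

# You are charged the spot price, not your bid, as long as spot stays under the cap.
expected_spot_cost = JOB_HOURS * SPOT_BASELINE
worst_case_cost = JOB_HOURS * BID_CAP       # only if spot sat right at the cap the whole time
on_demand_cost = JOB_HOURS * ON_DEMAND_PER_HOUR

print("expected spot: $%.0f" % expected_spot_cost)
print("worst case:    $%.0f" % worst_case_cost)
print("on-demand:     $%.0f" % on_demand_cost)
```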
- New considerations
- spot pricing has sudden shutdowns, so:
- shorter jobs, < 1h - opposite to conventional ATLAS approach
- checkpointing jobs in condor would be nice
- with sub-hour jobs, we could possibly get free time from EC2 (Amazon does not charge for the final partial hour when it terminates a spot instance)
- launching a large batch at once drives the spot price up (your own utilization affects the price)
- if you trickle instances in, pricing remains relatively steady (see the request sketch below)
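A rough sketch of the trickle-in idea using the boto 2.x EC2 API of that era. The AMI ID, bid, batch size, and pacing are placeholders, and the request_spot_instances call should be checked against your boto version; this only illustrates spreading many small requests over time instead of making one large one.

```python
import time
import boto.ec2

# Credentials come from the environment or the boto config file.
conn = boto.ec2.connect_to_region("us-east-1")

AMI_ID = "ami-00000000"   # placeholder worker image
BID = 0.02                # placeholder bid in $/hr (e.g. ~3x an assumed baseline)
TOTAL = 100               # total workers wanted
BATCH = 5                 # request a few at a time...
PAUSE = 300               # ...spread over minutes, rather than all at once

launched = 0
while launched < TOTAL:
    n = min(BATCH, TOTAL - launched)
    # One small spot request; a single request for all 100 would itself push the price up.
    conn.request_spot_instances(price=str(BID), image_id=AMI_ID,
                                count=n, instance_type="m1.xlarge")
    launched += n
    time.sleep(PAUSE)
```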
- Out of the Basement, Into the Cloud
- Motivations
- the Tier3 user community is sizeable; Tier3 resources themselves are not
- where to get more resources?
- opportunistically on t2
- campus grid
- cloud rental
- Using a Tier 2
- mwt2 is overpledged on compute
- send tier3 users into tier2
- KVM platform on the Tier2
- Campus Grid
- no kvm (yet)
- use htcondor
- mutual opportunity in using campus grid resources
- Cloud
- only need to provide iaas with a vm image
- figure out how to provide VMs as needed (bursty jobs)
- Setup
- boxgrinder
- htcondor
- There are paradigm shifts!
- home directory use
- access via xrootd-fuse
- fax
- cvmfs
- reduce expectations at the start to change behavior for the long term
- fair amount of manual steps for KVM deployment; hope for more streamlining under OpenStack
- Condor classads to identify appropriate deployment locations
- Boxgrinder definition very similar to the Tier2's
- EC2 workers sit behind Amazon NAT - use CCB
- caveats:
- how to mount NFS? we don't - security issues; use FAX!
- how to export users? we don't - run everything under a VO-like role user for easier configuration and consistency
- more work to do on file access
- Tier 3 on UC3
- U Chicago Computing Cooperative
- 500 slots on "seeder cluster"
- IT Services: +100 slots
- access to 3500+ slots on the Research Computing Center
- registered VO with OSG
- lots of resources for Tier3 users to take advantage of
- majority of jobs will need cvmfs; relatively firm job requirement
- use parrot - see Suchandra's Skeleton Key talk (a wrapper sketch follows this section)
- more complexity to abstract
- side benefit: more slots for UC3 users as well
- Goals
- eventually want to help other Tier3 sites run in Tier2 space
- fork Boxgrinder definitions on GitHub for a shared, reusable Tier3 image
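For the "use parrot" point above, a minimal sketch of the kind of wrapper Skeleton Key generates: run the payload under cctools' parrot_run so that /cvmfs paths resolve on nodes without a local CVMFS mount. The repository URL and proxy are placeholders, and this assumes parrot_run has already been staged onto the worker.

```python
import os
import subprocess
import sys

# Tell parrot which CVMFS repository to serve and which squid proxy to use
# (values here are placeholders, not the real UC3 configuration).
env = dict(os.environ)
env["PARROT_CVMFS_REPO"] = "atlas.cern.ch:url=http://cvmfs.example.org/cvmfs/atlas.cern.ch"
env["HTTP_PROXY"] = "http://squid.example.org:3128"

# Run the real payload under parrot so opens of /cvmfs/... are intercepted.
payload = sys.argv[1:] or ["ls", "/cvmfs/atlas.cern.ch"]
sys.exit(subprocess.call(["parrot_run"] + payload, env=env))
```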
- ATLAS and the PKI transition
- March 23: no more DoEGrids certificates issued
- existing certs are still good until their natural expiration (a quick expiry-check sketch is below)
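Since existing certificates stay valid until they expire, here is a quick way to see how long yours has left; a sketch using pyOpenSSL, assuming the conventional ~/.globus/usercert.pem location.

```python
import os
from OpenSSL import crypto

# Typical grid user certificate location; adjust if yours lives elsewhere.
cert_path = os.path.expanduser("~/.globus/usercert.pem")
with open(cert_path, "rb") as f:
    cert = crypto.load_certificate(crypto.FILETYPE_PEM, f.read())

# notAfter is an ASN.1 time string like 20140323235959Z
print("certificate expires:", cert.get_notAfter())
```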
- US ATLAS Open Networking Discussion