We're starting the ATLAS Tier 2/3 workshop now. This is my first attempt to liveblog anything, so hope it goes well. I'll try to keep up.
- What's next for US ATLAS
- FY13 Facilities Milestones
- March - Analysis jobs using FAX for data access
- April - Complete deployment of 2013 pledged resources
- May - Tier3 in the cloud prototype
- June - All US Tier2 connected to LHCONE
- July - Cloud production at scale
- HLT Farm at P1 running production in cloud
- 100 Gbit pilot between T1 and an international site
- FAX analysis milestone
- new stress tests coming for FDR
- need examples, testing, documentation
- Cloudy Tier 2
- Follow BNL work on Tier 1
- Support growth of T3 into T2
- Skim Slim Service - Ilija
- in pre-production, uses FAX and UC3
- help researchers with data set reduction
- many extensions possible - new mode of analysis support
- Cloud activities at BNL
- Condor Scaling
- $50k grant for ec2 testing
- naive approach: single schedd, collector, etc / single process per daemon / password authN / connection broker (CCB)
- maxed out at ~3000 nodes
- refined approach: split schedd from collector, negotiator, and CCB / 20 collector processes, with each startd randomly choosing one / sub-collectors report up to a single top-level collector / tune OS limits / shared port daemon to multiplex TCP connections / enable session authN to reduce repeated authentications (see the worker-config sketch at the end of this section)
- smooth up to ~5000 nodes for two weeks
- production simulation jobs
- spent $13k; only $750 for data transfer
- moderate spot termination
- actual spot price remained close to baseline
- roughly, without real statistics, pricing is "competitive"
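To make the sub-collector step concrete, here is a minimal sketch of how a worker VM could pick one of the twenty collector processes at boot and write a local HTCondor config fragment. The hostname, port range, and file path are hypothetical placeholders, not the settings BNL actually used; the knob names (COLLECTOR_HOST, USE_SHARED_PORT, SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION) are standard HTCondor ones.

```python
#!/usr/bin/env python
"""Illustrative sketch only: point a worker's startd at one of many sub-collectors.

Assumes the central manager runs 20 sub-collector processes on ports
9620-9639 (hypothetical values; the talk did not give the actual ports).
"""
import random

CENTRAL_MANAGER = "cm.example.org"        # hypothetical central manager hostname
SUB_COLLECTOR_PORTS = range(9620, 9640)   # 20 sub-collectors

def local_config():
    port = random.choice(SUB_COLLECTOR_PORTS)
    return "\n".join([
        # startd reports to one randomly chosen sub-collector
        "COLLECTOR_HOST = %s:%d" % (CENTRAL_MANAGER, port),
        # multiplex TCP connections through a single inbound port
        "USE_SHARED_PORT = True",
        # reuse security sessions instead of re-authenticating every contact
        "SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True",
    ]) + "\n"

if __name__ == "__main__":
    with open("/etc/condor/config.d/99-subcollector.conf", "w") as f:
        f.write(local_config())
```

On the real pool the sub-collectors would also forward their ads to the single top-level collector (for example via CONDOR_VIEW_HOST) so the whole pool remains visible from one place.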
- EC2 Spot Pricing
- m1.small has 1.7 GB of memory
- not really enough for ATLAS jobs; one EC2 compute unit is roughly half as powerful as a physical CPU core
- does better with larger instance types
- currently bidding 3x baseline; is that optimal? (see the arithmetic sketch at the end of this section)
- spot prices tend to run around 10% of on-demand pricing; based on T1 costs, this is competitive
- 1.7 GB per "compute unit", not per CPU; empirically insufficient for ATLAS work
- do 7 job slots on an m1.xlarge perform economically?
- very interesting table on EC2 instance type performance comparison in the slides, not reproduced here
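As a back-of-the-envelope check on the bidding strategy, here is the arithmetic as a small sketch. All prices are placeholder numbers, not actual 2013 EC2 rates; the point is that a bid cap of 3x the baseline is only a ceiling - you pay the going spot price whenever it is below your cap, and that price tends to sit near 10% of on-demand.

```python
# Back-of-the-envelope spot vs. on-demand comparison (illustrative numbers only).
ON_DEMAND_PER_HOUR = 0.10                   # hypothetical on-demand $/hr for some instance type
SPOT_BASELINE = 0.1 * ON_DEMAND_PER_HOUR    # talk: spot tends toward ~10% of on-demand
BID_CAP = 3 * SPOT_BASELINE                 # talk: currently bidding 3x baseline

JOB_HOURS = 10000                           # hypothetical workload size

# You are charged the spot price, not your bid, as long as spot stays under the cap.
expected_spot_cost = JOB_HOURS * SPOT_BASELINE
worst_case_cost = JOB_HOURS * BID_CAP       # only if spot sat right at the cap the whole time
on_demand_cost = JOB_HOURS * ON_DEMAND_PER_HOUR

print("expected spot: $%.0f" % expected_spot_cost)
print("worst case:    $%.0f" % worst_case_cost)
print("on-demand:     $%.0f" % on_demand_cost)
```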
- New considerations
- spot pricing has sudden shutdowns, so:
- shorter jobs, < 1h - opposite to conventional ATLAS approach
- checkpointing jobs in condor would be nice
- with sub-hour jobs, we could possibly get free time from EC2 (Amazon does not charge for the final partial hour when it terminates a spot instance)
- launching a large batch at once drives the spot price up (your own utilization affects the price)
- if you trickle instances in, pricing remains relatively steady (see the request sketch below)
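A rough sketch of the trickle-in idea using the boto 2.x EC2 API of that era. The AMI ID, bid, batch size, and pacing are placeholders, and the request_spot_instances call should be checked against your boto version; this only illustrates spreading many small requests over time instead of making one large one.

```python
import time
import boto.ec2

# Credentials come from the environment or the boto config file.
conn = boto.ec2.connect_to_region("us-east-1")

AMI_ID = "ami-00000000"   # placeholder worker image
BID = 0.02                # placeholder bid in $/hr (e.g. ~3x an assumed baseline)
TOTAL = 100               # total workers wanted
BATCH = 5                 # request a few at a time...
PAUSE = 300               # ...spread over minutes, rather than all at once

launched = 0
while launched < TOTAL:
    n = min(BATCH, TOTAL - launched)
    # One small spot request; a single request for all 100 would itself push the price up.
    conn.request_spot_instances(price=str(BID), image_id=AMI_ID,
                                count=n, instance_type="m1.xlarge")
    launched += n
    time.sleep(PAUSE)
```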
- Out of the Basement, Into the Cloud
- Motivations
- the Tier3 user community is sizeable; Tier3 resources themselves are not
- where to get more resources?
- opportunistically on t2
- campus grid
- cloud rental
- Using a Tier 2
- mwt2 is overpledged on compute
- send tier3 users into tier2
- KVM platform on the Tier2
- Campus Grid
- no kvm (yet)
- use htcondor
- mutual opportunity in using campus grid resources
- Cloud
- only need to provide iaas with a vm image
- figure out how to provide VMs as needed (bursty jobs)
- Setup
- boxgrinder
- htcondor
- There are paradigm shifts!
- home directory use
- access via xrootd-fuse
- fax
- cvmfs
- reduce expectations at the start to change behavior for the long term
- fair amount of manual steps for KVM deployment; hope for more streamlining under OpenStack
- Condor classads to identify appropriate deployment locations
- Boxgrinder definition very similar to the Tier2's
- EC2 workers sit behind Amazon NAT - use CCB
- caveats:
- how to mount NFS? we don't - security issues; use FAX!
- how to export users? we don't - run everything under a VO-like role user for easier configuration and consistency
- more work to do on file access
- Tier 3 on UC3
- U Chicago Computing Cooperative
- 500 slots on "seeder cluster"
- IT Services: +100 slots
- access to 3500+ slots on the Research Computing Center
- registered VO with OSG
- lots of resources for Tier3 users to take advantage of
- majority of jobs will need cvmfs; relatively firm job requirement
- use parrot - see Suchandra's Skeleton Key talk (a wrapper sketch follows this section)
- more complexity to abstract
- side benefit: more slots for UC3 users as well
- Goals
- eventually want to help other Tier3 sites run in Tier2 space
- fork Boxgrinder definitions on GitHub for a shared, reusable Tier3 image
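For the "use parrot" point above, a minimal sketch of the kind of wrapper Skeleton Key generates: run the payload under cctools' parrot_run so that /cvmfs paths resolve on nodes without a local CVMFS mount. The repository URL and proxy are placeholders, and this assumes parrot_run has already been staged onto the worker.

```python
import os
import subprocess
import sys

# Tell parrot which CVMFS repository to serve and which squid proxy to use
# (values here are placeholders, not the real UC3 configuration).
env = dict(os.environ)
env["PARROT_CVMFS_REPO"] = "atlas.cern.ch:url=http://cvmfs.example.org/cvmfs/atlas.cern.ch"
env["HTTP_PROXY"] = "http://squid.example.org:3128"

# Run the real payload under parrot so opens of /cvmfs/... are intercepted.
payload = sys.argv[1:] or ["ls", "/cvmfs/atlas.cern.ch"]
sys.exit(subprocess.call(["parrot_run"] + payload, env=env))
```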
- ATLAS and the PKI transition
- March 23: no more DoEGrids certificates issued
- existing certs are still good until their natural expiration (a quick expiry-check sketch is below)
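Since existing certificates stay valid until they expire, here is a quick way to see how long yours has left; a sketch using pyOpenSSL, assuming the conventional ~/.globus/usercert.pem location.

```python
import os
from OpenSSL import crypto

# Typical grid user certificate location; adjust if yours lives elsewhere.
cert_path = os.path.expanduser("~/.globus/usercert.pem")
with open(cert_path, "rb") as f:
    cert = crypto.load_certificate(crypto.FILETYPE_PEM, f.read())

# notAfter is an ASN.1 time string like 20140323235959Z
print("certificate expires:", cert.get_notAfter())
```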
- US ATLAS Open Networking Discussion