Astronomical Data Centres
What will the ADCs of the future look like?
- The whole idea of the Virtual Observatory (VO) is to put all the world's astronomical data archives at the service
of any astronomer, regardless of where the archive or the astronomer may be located. This would seem to
indicate that the VO will empower anyone to make his or her own data available through it.
- However, with all this data available, astronomers will not want simply to download it onto their workstations
for manipulation by whatever programs they have installed. As well as making data available, the Grid will make
massive computing resources available, and it is these that the astronomer will want to use.
- Now, these supercomputers, clusters, visualisation engines and complex suites of algorithms are not going to be
cheap. It is likely that they will be located at only a few centres in any country. So, after waiting a few times
for results from an analysis run because the data has to be shifted from some remote source to one of these
central sites, the astronomer will naturally ask why the data cannot be co-located with the resources to process it.
And, in order to provide a useful service so that the funding agencies keep the money pouring in, the ADC will
naturally comply.
- So, IMNSHO, it seems that the natural trend will be towards a few very large data centres in each country.
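The co-location argument above comes down to simple arithmetic, sketched below. This is only an illustration: the function name `transfer_hours`, the dataset size and both link speeds are assumptions invented for the example, not figures from any real archive or network.

```python
def transfer_hours(dataset_bytes, link_bits_per_second, efficiency=1.0):
    """Return the hours needed to move a dataset across a network link.

    efficiency is the assumed fraction of the nominal link speed that is
    actually achieved (1.0 means the full nominal rate).
    """
    if dataset_bytes < 0:
        raise ValueError("dataset_bytes must not be negative")
    if link_bits_per_second <= 0:
        raise ValueError("link_bits_per_second must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = dataset_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


# Hypothetical figures, chosen only to illustrate the arithmetic.
ONE_TERABYTE = 1e12     # bytes
WIDE_AREA_LINK = 100e6  # bits per second (assumed link to a remote archive)
LOCAL_LINK = 10e9       # bits per second (assumed link inside one data centre)

remote = transfer_hours(ONE_TERABYTE, WIDE_AREA_LINK)
local = transfer_hours(ONE_TERABYTE, LOCAL_LINK)
print(f"remote: {remote:.1f} h, co-located: {local:.2f} h")
```

Under these assumed numbers the remote transfer takes most of a day while the co-located one takes minutes, which is the kind of gap that would prompt the astronomer's question.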
So, what's wrong with that?
- Let's ask why existing ADCs exist – and this is solely based on my own un-astronomical observations. The first
reason is that an institute with such an ADC can make the data available to its researchers easily and quickly. So,
institutes were willing to put a small amount of resource into maintaining mirrors of data archives to make their
researchers' jobs easier. This also provided a service to other researchers located nearby, so funding agencies were
willing to pay for the upkeep of these services.
- The second reason seems to be that hosting the data from some new mission or survey guarantees that your
researchers will be among the first to get their hands on the data and so will be more likely to make important
discoveries and publish the all-important papers.
- In the future, the first reason will be irrelevant since the VO/Grid will make all data instantly available to
everyone. But if there are only a few large ADCs, does that mean the researchers associated with the institutes
hosting the data will always get first grab at the data?
- NOT if the ADCs are established as semi-autonomous bodies with identities that are distinct from the astronomical
departments that currently host them.
What about pipeline processing?
- This is certainly a question that needs addressing. I don't know enough about pipelines to posit an answer but it
seems to be an activity that sits halfway between research and programming. It seems that most pipeline work is
undertaken by astronomers who have drifted into programming, which raises the question of whether it would be better
undertaken by people from an IT or Computer Science background who are trained in data analysis techniques.
- Anyway, if pipeline processing is also located in the ADCs so that expertise can be built up in the fields
associated with data analysis and processing, it seems fair that these teams also be autonomous from the research
side.
So, what will these ADCs do differently?
- Probably the easiest way of getting an idea of this is to look at commercial data centres and the services they
provide. I've pulled together a short list of services from a cursory look at a few commercial data centres' web
sites:
- monitoring
- data backup
- hosting
- load balancing
- managed VPNs
- physical security
- This list doesn't look much like the services provided by existing ADCs. Are they relevant? I think so. If you are
making data and services available not just to your own researchers and others in your country but to people around
the world, you have an obligation to make those services available 24 hours/day. This means monitoring all the
services and ensuring they are working properly and have the resources they need to work optimally.
- Data backup is less relevant in the case of ADCs where most archives will be mirrored at least once somewhere
around the world. But it is relevant where the ADC is storing the data for projects and users – a service that
we have explicitly included as part of the AstroGrid infrastructure.
- With all these resources available, it is likely that ADCs will begin to offer hosting and VPN services to
projects or even departments. This is where the ADC completely mirrors the role of the commercial data centre.
- Physical security, environmental control etc then become of utmost importance. A project will not be happy if it
cannot access its data and programs because an ADC has lost its air-conditioning and the service contract it holds
only stipulates a 7-day turnaround! Data security is also an issue – people with proprietary access to certain
datasets will need to be assured that no-one else can get at their data.
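The monitoring obligation described above could be sketched roughly as follows. This is a minimal illustration, not a description of any AstroGrid or commercial tooling: the names `check_services` and `http_probe` are hypothetical, as is the URL in the usage comment. A real centre would more likely rely on an established monitoring package.

```python
import urllib.request


def http_probe(url, timeout=5.0):
    """Build a probe that reports whether an HTTP GET on url succeeds."""
    def probe():
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    return probe


def check_services(probes):
    """Run each probe once and return a mapping of service name to status.

    probes maps a service name to a zero-argument callable that returns
    True when the service is healthy. A probe that raises is reported as
    an error rather than being allowed to stop the remaining checks.
    """
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = "up" if probe() else "down"
        except Exception as exc:
            results[name] = f"error: {exc}"
    return results


# Usage sketch (the URL is a placeholder, not a real service):
# status = check_services({"archive": http_probe("http://adc.example.org/")})
# for name, state in status.items():
#     print(name, state)
```

Passing the probes in as plain callables keeps the loop independent of what is being checked, so disk space, queue length or air-conditioning alarms could be added without changing `check_services`.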
How can astronomers trust IT people to know about astronomical data?
- You could say the same about marketing data, sales data, customer service data etc. I don't want to minimise the
difficulty of astronomical data acquisition, cleaning and analysis but the idea that only astronomers could understand
what is involved is not true. An ADC will employ both astronomers and IT personnel in the short term. In the long
term, they will all turn into experts in astronomical data analysis.
- What is more, the ADCs need not handle only astronomical data. If they become experts in data curation,
analysis, visualisation etc, these skills could easily be transferred to allow them to manage data from
bioinformatics, chemistry, engineering etc. Perhaps they will become Science Data Centres (SDCs) instead of ADCs!
Won't the ADCs just turn into glorified computer departments?
- That is certainly a real danger. And it would happen if all that is 'outsourced' is the data storage and the only
role of the ADC is to maintain data. I think the best way to avoid this is to ensure that the ADCs have active
roles in all areas of astronomical research. So the pipeline specialists work with mission managers to ensure that
data is efficiently captured and delivered to the researchers; the storage specialists work with researchers to ensure
the data is held in ways that facilitate its use; and the operations people work with everyone to ensure that
bandwidth, hardware etc are available.
- And the ADC should have its own people working on research teams in astronomy and computer science looking at new
ways of visualising data, new storage and access mechanisms etc.
I've read enough, what does all this mean?
- The reason I wrote all this is to try and get across my own view of how the world of ADCs might change under the
influence of the VO and Grid, to get some feedback on those views and to open a discussion about what people want from
future ADCs.
- It is important for each institute to look carefully at what will be involved in hosting an ADC to decide if it
really wants to be in that business. It is also important that those who do get funding for the extra resources
required to host an ADC really understand the issues involved and come up with long-term plans for moving from
the current state of archive hosting to the future one. A lot of the grunt-work in setting the ADCs up will
likely fall to people who do not find topics like backup cycles, shift rotas, service level agreements and
air-conditioning maintenance contracts all that interesting!
--
TonyLinde - 29 Apr 2003