COST Action

We reproduce here, in full and unaltered, the text of the COST Action.

SeqAhead: A Next Generation Sequencing Data Analysis Network

A. ABSTRACT
B. BACKGROUND
B.1 General background
B.2 Current state of knowledge
B.3 Reasons for the Action
B.4 Complementarity with other research programmes
C. OBJECTIVES AND BENEFITS
C.1 Main/primary objectives
C.2 Secondary objectives
C.3 How will the objectives be achieved?
C.4 Benefits of the Action
C.5 Target groups/end users
D. SCIENTIFIC PROGRAMME
D.1 Scientific focus
D.2 Scientific work plan methods and means
E. ORGANISATION
E.1 Coordination and organisation
E.2 Working Groups
E.3 Liaison and interaction with other research programmes
E.4 Gender balance and involvement of early-stage researchers
F. TIMETABLE
G. ECONOMIC DIMENSION
H. DISSEMINATION PLAN
H.1 Who?
H.2 What?
H.3 How?
Part II
A. History of the proposal

A. ABSTRACT

Next generation sequencing (NGS) is a highly parallelised approach for quickly and economically sequencing new genomes, re-sequencing large numbers of known genomes, or for rapidly investigating transcriptomes under different conditions. Producing data on an unprecedented scale, these techniques are now driving the generation of knowledge (especially in biomedicine and molecular life sciences) to new dimensions. The massive data volumes being generated by these new technologies require new data handling and storage methods. Hence, the life science community urgently needs new and improved approaches to facilitate NGS data management and analysis. This COST Action unites bioinformaticians, computer scientists and biomedical scientists, harnessing their expertise to bring NGS data management and analysis to new levels of efficiency and integration. Rigorous surveillance of NGS technology and NGS-related software developments will allow the partners to generate software solutions for future NGS opportunities in a timely manner. The Action will increase the ability of European groups to benefit as fully as possible from NGS technology, and will create a nucleus for world-wide activities to jointly address the upcoming biomedical informatics revolution.

Keywords
next generation sequencing, sequence data analysis, peta-byte data storage, technology transfer, NGS technology watch and dissemination

B. BACKGROUND

B.1 General background
Next Generation Sequencing (NGS) technology massively parallelises nucleotide sequencing procedures, making the sequencing of genomes and of transcriptomes much faster and much cheaper than ever before. The new technology is, however, throwing up massive (bio)informatics challenges, which are demanding new ways of thinking and novel solutions.
Today, Europe has more than 200 NGS systems, with a combined capacity to sequence more than 300 human genomes per day. Technologies such as 454, SOLiD, and Genome Analyzer continuously increase their sequencing capacity. Illumina quotes 30x coverage of a human genome in one experiment for under €10,000. New companies promise to deliver both sequence read lengths of up to 50,000 base pairs (bp), and methods to sequence the DNA or RNA of single cells. Experiments like this will soon hit infrastructural and physical limits.
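
As a rough illustration of the scale implied by "30x coverage", the sketch below applies the standard relation coverage = (number of reads × read length) / genome size. The genome size (about 3.1 Gb) and the 100 bp read length are assumptions chosen for illustration; they are not figures from this proposal.

```python
import math

def reads_needed(coverage, genome_size_bp, read_length_bp):
    """Smallest number of reads with reads * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)

GENOME_SIZE_BP = 3_100_000_000   # assumed human genome size
READ_LENGTH_BP = 100             # assumed short-read length

n_reads = reads_needed(30, GENOME_SIZE_BP, READ_LENGTH_BP)
total_bases = n_reads * READ_LENGTH_BP
print(f"{n_reads:,} reads, about {total_bases / 1e9:.0f} Gb of raw sequence")
```

Under these assumptions a single 30x genome already corresponds to close to a billion reads before any quality values or alignments are stored.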

The vast amount of data arriving daily at computing centres creates completely new challenges in hardware (e.g. new data-storage facilities, large bandwidth for data transfer) and software (data security, algorithms for data quality control, analysis). New solutions are urgently needed to extract information from all the data, for example for the comparison of thousands of genomes and their transcriptomes; for the detection and analysis of genomic variability, DNA structure, alternative splicing, new protein families, non-coding genes, transcriptional regulation, or epigenetic effects of DNA methylation. Additionally, mapping and assembly tools are becoming increasingly difficult to use with today’s large data volumes.

The Europe-wide dearth of tools for supporting almost all aspects of research involving NGS has forced different groups to construct ad hoc solutions, often using algorithms that are no longer appropriate for today’s problems. This COST Action will unite experts from all NGS-related fields. Their cooperation within a large team will yield several advantages: it will

• improve coordination of software-development efforts, reducing redundancy, shortening the design cycle, and providing better quality products;

• influence ontology and data-transfer format issues in a concerted way, to ensure that European groups work with and generate internationally compatible systems;

• monitor technological developments and their consequences for future software needs more effectively, yielding better, more timely advice on necessary software solutions;

• disseminate information more efficiently, maintaining a unique portal for the distribution of data, software, protocols and ideas, and including a register of relevant trainers for both current and future in-reach and out-reach events.
Genuine cooperation between the participating groups has already been demonstrated in three workshops, each including an extended hack-a-thon session:

• November 2009 in Rome (Building Next Generation Sequencing platforms and pipeline solutions);

• June 2010 in Bari (Building Next Generation Sequencing II) under the auspices of EMBRACE (EU FP6 NoE), EMBnet, UPPMAX, CASPUR and CNR-ITB (see www.nextgenerationsequencing.org);

• June 2010 in Helsinki (Next Generation Sequencing data analysis) organised by CSC (www.csc.fi/english/csc/courses/archive/ngs_workshop).

B.2 Current state of knowledge
A wide range of NGS applications exists already, including, for example, de novo sequencing, whole genome genotyping, gene and SNP discovery, RNA expression analysis and small RNA discovery. The state of the art can be logically separated into data storage, data analysis, and visualisation/graphical interfaces for end users.

Data storage
NGS data-storage needs are huge because of the enormous sequencing speeds and the often non-binary file formats. Data handling requires fast access both to small random-access files and to large sequential files. Mixed or tiered storage is preferred, with large, cheap disks combined with fast, more expensive disks. File systems should preferably be parallel, allowing for a coherent view of the system. Some NGS applications need large local scratch disks on the compute nodes. Back-up facilities are generally a problem because of the sheer data sizes. Many labs consider it cheaper/easier to occasionally sequence something again than to maintain a secure back-up system.

Data analysis
Data analysis is the most diverse and the most challenging NGS area. Many techniques are available to answer biological questions; the number and types of experiment, and the research fields to which they can be applied, are too numerous to list here. We can, at best, mention some of the applications and discuss a few in more detail.
Assembly and mapping of genomic sequences. When a new organism is to be sequenced, the principal strategy is the whole-genome shotgun (WGS) method. Several programs are available for the hard assembly task: Arachne and PCAP for capillary sequences; Velvet and ABySS for more recent styles of NGS reads. The reduced NGS read-lengths and the insertions between paired reads, though, make it difficult to deal with sequence repeats. De novo assembly with short reads is generally difficult, and tends to produce sets of disconnected contigs. Other methods exist, but all suffer from a trade-off between precision and computational complexity.
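
To illustrate why repeats are troublesome for short reads, here is a minimal sketch of a de Bruijn graph, the data structure on which short-read assemblers such as Velvet and ABySS are built. The reads and the value of k are hypothetical toy values, and the code ignores reverse complements, sequencing errors and coverage, all of which a real assembler must handle.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Nodes with more than one successor: they cannot be extended unambiguously."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

# Toy reads (hypothetical): both contain "GAT", followed by a different base in each.
reads = ["ACGATTC", "TTGATCA"]
graph = de_bruijn(reads, k=4)
print(branching_nodes(graph))
```

A sequence shared by two different contexts becomes a node with several outgoing edges; the assembler has to stop there, which is one way in which disconnected contigs arise.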
RNA-Seq is the sequencing of RNA from biological samples to identify an expression profile. It is, roughly speaking, the NGS-equivalent of transcriptomics. The first step is mapping the sequence reads on a reference genome, with the additional challenge of identifying known or novel splice junctions. In species where a reference genome is not available, a (challenging) de novo assembly approach will be necessary. The second step of the analysis is quantifying the transcriptional events across the entire genome for each sample, both to quantify known elements or alternatively spliced isoforms, and to detect novel transcribed regions. Afterwards, statistical methods need to be used to capture differential gene expression, to identify sample-specific alternative splicing isoforms and to estimate their relative abundance. Several statistical approaches and computational tools have already been developed, but they require further improvement in order to account for specific data structures.
ChIP-seq is chromatin immunoprecipitation followed by NGS to identify the actual binding sites of specific proteins on a genome. The ChIP procedure generates a pool of DNA fragments enriched for the nucleotide sequence to which a protein of interest binds. The goal now becomes to identify regions with higher coverage than the background signal, and, more challenging, to determine their precise boundaries. Exotic techniques, like the use of tunable femtosecond UV lasers, can even be used to crosslink protein and DNA. This will likely discriminate direct/indirect chromatin binding, thus allowing novel definitions for transcription factors and enhancers.
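
The idea of finding regions with higher coverage than the background can be sketched as follows. This is a deliberately naive illustration, not the method of any particular peak caller: it assumes reads have already been counted in fixed-size windows, models the background as a single Poisson rate equal to the mean count (which the peak windows themselves inflate), and merges adjacent enriched windows. The counts, window size and threshold are hypothetical.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def call_peaks(counts, window_size, p_threshold=1e-3):
    """Return (start, end) regions made of consecutive enriched windows."""
    lam = sum(counts) / len(counts)          # background: mean reads per window
    peaks, start = [], None
    for i, c in enumerate(counts):
        enriched = poisson_sf(c, lam) < p_threshold
        if enriched and start is None:
            start = i                        # a new region opens here
        elif not enriched and start is not None:
            peaks.append((start * window_size, i * window_size))
            start = None
    if start is not None:                    # region still open at the last window
        peaks.append((start * window_size, len(counts) * window_size))
    return peaks

counts = [2, 3, 1, 2, 25, 30, 2, 1, 3, 2]    # hypothetical reads per 200 bp window
print(call_peaks(counts, window_size=200))
```

The boundaries returned here are only as precise as the window size, which is exactly the limitation the text describes as the more challenging part of the problem.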

Targeted re-sequencing is an important application of NGS. It can be used to verify the quality of existing sequences, to enhance poorly sequenced regions, or to resolve important differential features among related species or organisms, like genetic rearrangements.
Methylation-seq and epigenetics determine inherited changes in phenotype or gene expression that are not associated with change in the DNA sequence. Methylation of cytosines at CpG-sites in the DNA, and modifications of the histone proteins in the chromosomes, are well-known examples. NGS technologies can determine epigenetic information provided the sequencing is done after sample pre-treatment. Epigenomics is an expanding research field, and epigenetic determinants of diseases or other phenotypes should be looked upon as likely rather than rare exceptions. The continued deciphering of the epigenetic code will further increase the speed of developments in this area.
Small RNAs constitute a large family of regulatory molecules with diverse functions in eukaryotes. Most classes of small RNA are less than 30 bases in length, and hence even the short Illumina and SOLiD reads cover them completely. NGS allows sequencing of millions of small RNAs to quantify relative frequencies of different sequences in healthy and diseased tissues, at different developmental stages, and under different experimental settings. Mapping them to the genome and analysing their characteristics can reveal the class of small RNA they belong to, and opens the way to study the genetic effects of mutations on the biogenesis of classes of small RNA.
Genotype-(molecular) phenotype relations can be established for a considerable fraction of newly detected point mutations found in human genetics diagnostic labs. Knowledge of the effect of a mutation on a protein’s 3D structure may provide insight into its mechanism of action, can aid the design of further experiments, and may ultimately inform the development of new medicines and diagnostics. Collecting information about a protein from the wide range of sources available on the Web is a challenge. The HOPE and Utopia software suites, which are based on the successful interoperability concepts of the FP6 NoE EMBRACE, provide useful seeds for dissemination and adoption of these technologies in the NGS arena.
Metagenomics goes beyond the isolated genome and recognises the relevance of mixed species in populations and their synergistic/antagonistic interactions. The genomes of some species may provide the functionality missing in others, or the presence of one species may keep growth of another under control. This calls for dealing with the sum of genomes as an entity and requires novel analytical approaches to identify potential relationships among component genomes, and associations between different genomic combinations and their resulting coordinated functionality, behaviour and evolution.

In the field of ancient DNA, high-throughput sequencing technologies have opened up new research avenues, allowing the sequencing of genomes of extinct organisms. Recovery and analysis of DNA extracted from fossils provides insight into evolutionary processes, and provides data that were impossible to generate using traditional sequencing approaches. The mass of data generated by high-throughput sequencers makes it possible to generate sufficient material that the minute amounts typically available from the species of interest still constitute a large enough data-set for genome-scale analyses.
Statistical analyses are needed in nearly all branches of NGS research. NGS-related statistics is complicated by the fact that different data-generation techniques and protocols will require different statistical approaches.

Graphical user interfaces
The vast number of sequences delivered to the end user requires powerful graphical user interfaces (GUIs) that are linked to the analysis tools to enable biologists, who normally have little or no programming background, to manage NGS data and analysis results efficiently.

Currently there are separate analysis applications of good quality, but no coherent platforms that bundle the best applications behind a user-friendly interface. For visualisation, there are many tools that have proved useful for classical sequence visualisation, but they require extending to handle the large data sizes of NGS. Open source tools are being developed to address these issues. For example, the Chipster analysis platform has evolved to support the analysis and visualisation of ChIP-seq and RNA-seq data, and Utopia is a generic visualisation engine that can deal with very large data volumes.

To maximise their utility, platforms and tools should be tailored to different user groups and NGS labs, which requires IT expertise. This highlights the value of this Action in disseminating such know-how. Tools tend to rely on advanced data access algorithms to deliver the required throughput, and thus will benefit from input from hardware, networking, file system, and data storage experts, and most crucially, from life science researchers adapting them to real biological problems.

Beyond intuitive interfaces, the nature of NGS analysis inherently involves complex workflows, and if these new protocols are to be integrated in standard laboratory practices, users will also need support for creating and modifying workflows in their analysis tools.

B.3 Reasons for the Action
Nowadays, many small institutes can afford either to buy the hardware for NGS, or to have their sequencing done by an NGS service facility. The scientific opportunities offered by NGS, and consequently the software needed to ask appropriate biological questions of the data, are fast-moving targets. It is hence becoming increasingly difficult to keep up with, and to maintain, the constantly evolving software. Indeed, the effort needed to achieve this for NGS data analyses is so huge and diverse that only a coordinated effort, including agreement on ontologies and data-exchange formats, is likely to be successful.
The outcome of this COST Action will benefit the whole, rapidly growing NGS community. Young researchers will have a unique opportunity to connect with NGS technology expertise either via the vast amount of information this COST Action will publish or through its short-term exchange programme. Many European groups recently started (or plan to start) using NGS or are planning to acquire a sequencer. The organised knowledge of this Action will allow such groups to speed up their learning processes and to successfully adopt NGS technologies. European research groups will benefit most, but there will also be a broader economic relevance because of the increased efficiency achieved by researchers in health, hospital diagnostics, food industries, agronomy, pharmaceutical industries, etc.

NGS technology is certain to produce new, ever more massive data-generation platforms. This COST Action cannot therefore produce a final, finished product solving all NGS needs. However, it will:

• report new developments in hardware and software (through topical reports that will be published via workshops, meetings and electronic media) so that users can access information and tools in a timely way, and can meanwhile concentrate on the biological questions at hand;

• coordinate the development of the data-management and analysis systems;

• provide tools and means for Europe-wide exchanges of knowledge, ideas and experiences related to NGS;

• prepare an Action Plan that will form the basis for future funding proposals; and establish a productive, strongly interacting, and complementary network of experts that will survive long after this COST Action terminates.

B.4 Complementarity with other research programmes (if appropriate)
This COST Action will be executed in close consultation with ESFRI projects such as ELIXIR, BBMRI, ECRIN, and INSTRUCT, to harmonise activities and guarantee the highest benefits for Europe. Many partners are also active partners in these ESFRI projects. The Action will also collaborate with other related initiatives, such as the existing Iberoamerican Network on Free Software for the Life Sciences (FreeBIT), or forthcoming COST or FP7 actions, through its extensive network of contacts.

C. OBJECTIVES AND BENEFITS

C.1 Main/primary objectives
The primary objective is to develop a coordinated Action Plan for the NGS community, to help deal with the flood of NGS data in an efficient and coherent manner on the informatics side.

C.2 Secondary objectives
The primary objective naturally leads to five actions or secondary objectives:

1. Establish a strong European network of NGS centres, and data-analysis and informatics experts to facilitate and stimulate the exchange of data, protocols, software, experiences and ideas. A logical and a physical platform will be created to facilitate this exchange and to ensure that all NGS challenges can be faced collaboratively.

2. Implement a ‘technology watch’ to monitor developments in bioinformatics software, in NGS technology, in data-storage and processing hardware, in data visualisation and graphical interfaces, and in informatics solutions across diverse scientific disciplines. The ‘technology watch’ will report on likely future bottlenecks that require software design, and it will monitor world-wide developments, continuously seeking existing solutions to avoid duplication of effort.

3. Define priority areas in which NGS challenges are currently most acute; use results from the ‘technology watch’ to focus the software design activities in areas relevant to data storage and management, data analysis and statistics, data visualisation and integration; and disseminate software solutions between partner labs, and beyond, through work-benches and browsers, building explicitly on the interoperability concepts laid-out by the EU NoE EMBRACE.

4. Create an NGS bioinformatics Action Plan to address European NGS data challenges, to explicitly define NGS research and development needs and prepare a framework for the design of robust new software tools.

5. Develop a strategic communication, dissemination, and education plan for NGS bioinformatics, to distribute knowledge and expertise via concerted education and publication programs. These will involve the following tasks:

  • design and implement courses based on the recommendations of the ‘technology watch’, embracing state-of-the-art e-learning concepts, virtual communities, etc. Develop a schedule for training within and outside the partner labs.
  • publish the work of the Action in peer-reviewed journals, and disseminate the results at international conferences, technical and end-user workshops, and via an open Web-portal, providing a discussion forum and one-stop-shop for software, documentation, technical documents, etc.
  • provide coordinated information relevant to scientists, funding agencies, politicians, educators, and other stakeholders across Europe.
  • identify the most appropriate instruments by which future developments may be funded.


C.3 How will the objectives be achieved?
The goals listed in sections C.1 and C.2 logically lead to a series of Working Group (WG) activities. Six WGs will be established. WG1 will be responsible for monitoring emerging technologies and developments in all fields. WG2 will write, and continuously update, an Action Plan for NGS (bio)informatics. WG3 will coordinate the development of all software that is non-generic, i.e. software written in response to a specific biological question (this will include statistics). WG4 will deal with generic informatics topics such as large data storage, interoperability, Grid and Cloud computing, semantic applications, etc. WG5 will deal with all aspects of dissemination. WG6 will be the Steering Committee, which will coordinate the management.

At a kick-off meeting, the partners will establish WG coordinators, who will also form the Steering Committee, and they will define the details of the management strategy. Immediately after the kick-off meeting WG3 and WG4 will define and implement a system for documentation, storage, and rapid distribution of software elements between project partners. WG1 and WG2 will coordinate the determination of priorities in the problem domain, while WG3 and WG4 will ensure that software answers are designed and implemented with optimal efficiency. WG5 will ensure that the results become available to users, within and outwith this COST Action; an important activity of WG5 will be to include as many NGS research groups as possible.

A budget will be made available within the COST Action to fund a large number of workshops, both for partner education and for end-user training. All educational material will be made freely available through the dissemination portal.

C.4 Benefits of the Action
After the completion of the Human Genome Project, almost every field in the life sciences has become concerned with DNA sequences and their derivatives. The last 10 years have brought major advances in sequencing technology, and in hardware and storage systems, but the necessary software and analysis platforms are lagging behind. The major benefit of this COST Action will be that the necessary software will be designed and implemented in a coordinated, and thus efficient, manner, with inclusion of experts from all the relevant disciplines, ranging from biological problem definition and NGS hardware to fundamental informatics.

These benefits will be most important for small research groups and small companies, who will gain ready (open) access to advice, ‘best practice’ documentation, software, and training in a cost-effective way. Several other important benefits will accrue as follows:

• the synergy between the joint activities will bootstrap new research and discoveries;

• pooling pan-European resources will allow educational activities to tap into a wider and broader group of teachers than those available at national scales;

• by uniting the voices of a large number of labs, the COST Action board will become a powerful advocate in international discussions on standards (relating to ontologies, data-exchange formats, data-deposition strategies, etc).

Many research fields in the life sciences use NGS data and will benefit enormously from the activities of this Action. Some examples are epigenetics (the study of inherited changes in phenotype or gene expression without change in the DNA); comparative genomics (which compares (parts of) genomes to understand their information content); human genetics (involving the detection of genotype variations that relate to disease state phenotypes); personalised medicine (which will, in the coming decade, rely critically on the ability to rapidly and reliably map the genetic make-up of individual patients); genomic selection for agriculture (which will exploit SNP chip data in farmed livestock to select the next generation of breeding animals and crop plants); and metagenomics (which studies dependencies between microbes that are the foundation of the biosphere, controlling the biogeochemical cycles that affect geology, hydrology, and local and global climates, and other aspects of biodiversity and ecology).

C.5 Target groups/end users
The target groups of this COST Action are:

• life scientists (including biomedical and clinical researchers) in academia and industry working towards answers to biomedical questions, who will gain access to the right tools, and to newly emerging applications;

• young scientists interested in learning about and using NGS technologies;

• data-production and data-analysis groups, who will gain direct access both to information via the Action’s Web-portal and to the Action’s experts;

• hardware and software providers, who will be able to access holistic pictures of NGS “worlds” and will be able to propose and develop new technologies;

• public institutions, journalists, educators, politicians, decision makers, etc., who will gain a single point of access to (reliable) information.


D. SCIENTIFIC PROGRAMME

D.1 Scientific focus
The principal focus of this Action is to prepare the scientific community for the NGS data avalanche. The fast-changing nature of NGS technology will necessitate continuous surveillance, and steering of the Action to deal with new technologies, and their resulting software needs, as they become available.

The Action will achieve its objectives by bringing together scientists with in-depth knowledge of the various aspects of NGS technology. The Action has a multidisciplinary character, including technologists, (bio)informaticians, computer scientists and biomedical scientists. These scientists bring all the necessary know-how and infrastructure, including a wide range of state-of-the-art algorithms and analysis pipelines, and storage and computing platforms, to the Action. Six Working Groups (WGs) will focus their activities on the main objective (see D.2).

An intensive consultation with European experts has suggested the following strategy:

1. Design and establish an information and communication platform as a prerequisite for all further steps. The platform will unite Europe’s NGS (bio)informatics community in a concerted effort to cope with NGS challenges. The whole community will contribute to the resource, and will share information and tools.

2. Establish a ‘technology watch’ team to ascertain the status quo of the present NGS field, covering technologies, applications and analysis software, so that clear scientific directions can be laid out. This ‘technology watch’ must continue to monitor the emerging technologies to guarantee timely update of the scientific focus and an adequate software response to biological applications using this technology.

3. On the basis of the ‘technology watch’ inventory, discuss and define existing gaps, bottlenecks and upcoming challenges, and steer the Action accordingly.

4. Discuss the list of tasks and formulate an Action Plan describing how solutions can be achieved. For each challenge, the Action Plan must contain recommendations that groups/companies/partners can use to address their questions and provide working solutions. The Action Plan will be a concerted task for the whole community. It will include stakeholders, companies, and other interested parties, and will contain recommendations for raising the funds needed to make the solutions concrete.

5. Disseminate the activities and achievements of the Action to ensure that the whole community benefits from its activities.

D.2 Scientific work plan methods and means
The tasks described above logically lead to an implementation with six Working Groups (WGs). This section discusses the WG tasks, their coherence, and their interactions.

WG 1: Technology watch for new developments
A technology watch task force (tech-watch) will be formed by partners who work in, or in close collaboration with, NGS labs. Tech-watch aims at the early detection of informatics bottlenecks. It will do so through a series of coherent activities:

  • Reports on (recent) NGS technologies, software and hardware.
  • Communication with the sales and PR departments of NGS-technology-related hardware and software companies.
  • Attendance at major conferences where NGS is on the agenda, and representation at smaller events advertised on the Internet.
  • Frequent tele-conferences and establishment of a common, wiki-based information exchange system (on the WG5 portal) to ensure that informatics bottlenecks are detected early and that solutions are proposed.

Deliverables:

  • Initial technology report.
  • Material on the portal related to database, software, elements for communication (blogs, wiki, fora etc.) and e-learning.
  • Twice-yearly reports on recent technology advances, and available solutions from inside and outside the COST Action, published as white papers, peer-reviewed articles, blogs, wikis, etc.

WG 2: Development of an Action Plan for NGS bioinformatics to cope with challenges for the European Research Area
WG2 is the seamless continuation of WG1. A collaborative effort will be required to address the challenges and gaps in analysis pipelines detected by WG1. Sub-committees will therefore be formed to coordinate responses to technological developments, and to coordinate the design and implementation (by WG3) of software solutions.

Deliverables:

  • An Action Plan that will be updated continuously, but at least quarterly. This will list all ongoing and planned software actions.
  • A continuously updated white paper to inform NGS vendors, informaticians, stakeholders, etc., about NGS opportunities, bottlenecks and ongoing activities.
  • A continuous monitor of ongoing software implementation activities, aiming to minimise redundancy between them.

WG 3: Design, implementation and incorporation of software solutions

WG3 will develop an up-to-date scientific programme based on the outcome of WG1 and WG2, and will coordinate the actual implementation efforts by the COST Action partners. This will include liaison with WG4 when basic informatics solutions are required.

At the kick-off meeting decisions will have to be made about issues proposed during the consultation phase as topics needing concerted software efforts. The selected topics will be worked on under the coordination of WG3. Those requiring the largest effort are listed in the remainder of this section – they are formulated here as mini research proposals to help clarify what actions need to be taken.

For read assembly and mapping, the most immediate task required to exploit today’s parallel computing environments to the full is to develop parallelised envelopes based on MPI (Message Passing Interface). Achieving such a task would not only facilitate intelligent use of the computational power, but would also help to customise and distribute runs relative to available hardware and time constraints – often crucial in lab activities. Another central issue for the analysis of NGS data in the form of paired reads is the development of 1-gapped alignment algorithms. In fact, fast mapping algorithms equipped and optimised with such a feature can strongly ease their use for structural variation discovery and analysis, a goal that will become more and more important as the amount of available NGS data grows. Finally, we mention two further important bioinformatics issues in this rapidly changing technological landscape:

1- the study of de novo assemblers based on the flexible indexing designs that are now available, and

2- the development of (automated) pipelines optimising the Sanger and NGS data proportions in assembly projects using mixed input data.
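
The MPI-based envelope mentioned above for read mapping essentially has to split the reads between processes, run an existing mapper on each share, and collect the results. The sketch below shows only that decomposition logic, under stated assumptions: no MPI library is called, the ranks are simulated by a plain loop, and `map_reads` is a hypothetical placeholder rather than a real mapper.

```python
def block_bounds(n_items, rank, size):
    """Half-open slice [start, end) of n_items assigned to `rank` out of `size` ranks."""
    start = rank * n_items // size
    end = (rank + 1) * n_items // size
    return start, end

def map_reads(reads):
    """Hypothetical placeholder for a real mapper: it only records each read's length."""
    return [(read, len(read)) for read in reads]

reads = ["ACGT", "GGCA", "TTAG", "CCGA", "ATAT", "GCGC", "TACG"]
size = 3  # number of ranks; under MPI this would come from the communicator

# Each loop iteration stands in for one rank working on its own slice.
partial = []
for rank in range(size):
    start, end = block_bounds(len(reads), rank, size)
    partial.append(map_reads(reads[start:end]))

# Gather step: concatenate the per-rank results in rank order.
results = [hit for chunk in partial for hit in chunk]
print([len(chunk) for chunk in partial])
```

Computing the bounds from the rank alone means no process needs to know what the others received, and the slices differ in size by at most one read.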

RNA-Seq faces several critical computational issues. Existing computational methods will be extended to allow for new junction detection and for the use of paired-end reads. Many different methods have been developed for the detection of splice variants of certain genes, mainly for RNA-chip technology. A benchmark method is ARH, based on the entropy distribution of the different exons of a gene. The significance of the statistical approach is judged by its deviation from a suitable background distribution.

Statistical methods will be proposed to assess the significance of transcribed regions, and for the construction of genes and the quantification of the isoforms. True and unobserved RNA-seq data will be modelled as a Poisson process with suitable assumptions on the intensity function. Penalised functional methods will be used to detect expressed regions. The detection of differential expression will be done with mixed effect models in conjunction with a generalized linear model to account for different sources of noise.
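
As a minimal sketch of the Poisson view of read counts, the following assumes the simplest possible intensity function, a constant background rate per window, and flags windows whose count is unlikely under that rate. The function names, the constant-rate assumption and the threshold `alpha` are illustrative choices, not the methods the Action will develop.

```python
import math
from typing import List, Sequence


def poisson_sf(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) obtained from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def enriched_windows(counts: Sequence[int], background_rate: float,
                     alpha: float = 0.01) -> List[int]:
    """Indices of windows whose read count is significant under the background rate."""
    return [i for i, c in enumerate(counts) if poisson_sf(c, background_rate) < alpha]
```

A position-dependent intensity function would replace the single `background_rate` with one rate per window; the tail probability itself is computed in the same way.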

New, integrated downstream data analysis tools will be developed for ChIP-seq experiments. These will involve assigning transcripts and genes to ChIPs, determining the preferred transcription factor binding sites (TFBSs) per experiment using de novo motif finding algorithms, locating TFBSs within ChIP-seq peak regions, performing combinatorial analysis of binding sites (cis-regulatory module detection), and determining the conservation/divergence of TFBSs among genomes from related species. Algorithmic developments in motif detection are also needed to study chromosome structure and epigenetic state, or to find the correspondence between TFBSs and quantitative trait loci (QTL).

Methylation-seq (meth-seq) refers to the genome-wide analysis of epigenetic modifications. An increasingly popular approach is MEDIP (Methyl-DNA immunoprecipitation) followed by sequencing (MEDIP-seq). The main bottleneck is the bioinformatics analysis of such data. In particular, normalization of raw signals must take into account unspecific enrichment of CpG sites in the genome. Existing algorithms normalize genome-wide methylation data by parametrization with a linear model, assuming that the read count in a given window is a linear function of the CpG density. Better analysis of meth-seq data sets will require two main improvements of the method: extension of the linear model to other parametrizations, and inclusion of further influence factors (for example non-CpG methylation).
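
The linear-model normalization described above can be sketched as an ordinary least-squares fit of window read counts against CpG density, with the residual taken as the normalized signal. This is a simplified illustration under that assumption; the function names are hypothetical and published MEDIP-seq tools calibrate the model in more elaborate ways.

```python
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two or more paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("CpG density must vary across windows")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def normalise(counts: Sequence[float], cpg_density: Sequence[float]) -> List[float]:
    """Subtract the read count expected from CpG density alone in each window."""
    intercept, slope = fit_line(cpg_density, counts)
    return [c - (intercept + slope * d) for c, d in zip(counts, cpg_density)]
```

Windows with a large positive residual carry more reads than their CpG density alone would predict. Extending the model to other parametrizations would mean replacing `fit_line` while keeping the residual step.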

Knowledge of non-coding RNAs (ncRNAs) is integral to understanding complex cellular mechanisms. The discovery of novel ncRNAs is still difficult. NGS technologies provide the ideal solution, allowing parallel quantification of a large number of small RNAs with a high degree of reliability and sensitivity, compared to microarray hybridization and Q-PCR. Data analysis, however, can be very complex, as many small RNAs consist of degradation products of tRNA, rRNA, or mRNA. A balance has to be struck between false positives, if the miRNA definition is not strict enough, and false negatives, if analysis parameters are too stringent. Recent detection methods focus on direct predictions, without relying on comparative genomics. An ideal application of deep-sequencing is expression profiling of known and novel miRNAs because of their similar length distribution and their relatively high abundance within the reads. Comparing different NGS data sets with different experimental backgrounds is a statistical challenge. Reliable and effective algorithms for small RNA expression profiling data analysis are urgently needed and will be developed.

The identification of gene regulatory networks is important to understand mechanisms in patho-physiological conditions and the function of disease genes. Biological pathways can be formalized as a mathematical model that can be used to generate hypotheses that can be tested experimentally. This approach became feasible once high-throughput techniques started to generate information at a genome-wide system level. Today’s algorithms mainly use measurements of mRNA levels from microarray experiments and need to be adapted for use with NGS. Reverse engineering tools exist already that rely on thousands of microarrays taken from public repositories. The inferred networks can be explored and queried with genes of interest through web-based applications. The integration of NGS data in these reverse-engineering algorithms will improve the quality and reliability of the inferred gene networks thanks to the clean results coming from RNA-seq.

Comparative genomics can be used to assess origins of protein families and presence/absence in, for example, the last common eukaryotic ancestor. Such knowledge can aid many types of studies like function prediction, etc. One of the main problems associated with the analysis has been the depth of sampling within lineages. NGS technology will allow many more species to be sampled. For the results to be most useful to the comparative genomics community, streamlined search tools should be developed across genomes/proteomes. Stored homology searches would also be useful.

The statistical analysis of NGS data should rely on the expertise that has been acquired on previous technologies such as micro-arrays. Over the past decade, the statistical community has invested a world-wide effort in developing tools for biologists to perform basic tasks like group comparison, subgroup discovery or biological marker identification. This experience should now be converted and used to tackle the analysis of NGS data.

UTOPIA is a generic, semantically-integrated tool for protein sequence and structure analysis. In UTOPIA, the user experience is pivotal: the idea is that the software should be so easy and intuitive to use that it effectively becomes invisible to users. This kind of approach will be crucial for NGS data analysis and will drive the design of new NGS data-visualisation tools and GUIs. Visual comparison of thousands of genomes or transcripts in a meaningful way is hard. Today’s genome browsers are only able to visualise a few genomes, so how do we realistically capture, present and assimilate the information sequestered in many tens, hundreds or thousands of them? To achieve this, new ways of thinking about the problem are required. Accordingly, this Action will explore the use of techniques developed in the fields of computer graphics, large model rendering and visualisation. What is clear is that users will increasingly be biologists for whom ergonomic GUIs will be a fundamental part of the data-analysis process. Such interfaces will need to provide access to the best algorithms within intuitive, easy-to-use front-ends, and allow users both to inject their own biological knowledge into the underlying expert systems and to collaborate and share that knowledge with their peers, interactively.

Chipster has the statistical and graphical potential for NGS data analysis. It was originally designed to perform DNA microarray data analysis with R/Bioconductor and other tools through an intuitive GUI. Currently, tools for various aspects of ChIP-seq analysis and a genome browser for NGS-data visualisation are being implemented. Functionalities for RNA-seq (detection of alternative transcripts, fusion genes, and somatic mutations), miRNA-seq (correlation with target expression), methylation-seq (locations), and exome-seq (variant detection) are planned and will position Chipster to be a powerful end-user application for a wide variety of NGS data-analysis problems.

HOPE is a next-generation Web application for automatic mutant analysis. It is an NGS downstream application to explain the molecular origin of disease-related phenotypes caused by mutations in human proteins. The NGS revolution has led to a rapid increase in detected disease-related mutations, a considerable fraction of which reside in protein-coding regions and can thus affect the structure and function of the encoded protein. Knowledge of these structural and functional effects can help inform further experiments and may eventually lead to better disease diagnostics and medicines. The data needed by HOPE can have very different origins, ranging from a protein’s 3D structure to its role in biological pathways, or from information generated by mutagenesis experiments to predicted functional motifs. Collecting specific, relevant information for every protein of interest is a challenging and time-consuming task, but is imperative if meaningful conclusions about the effects of mutations are to be drawn. The collection of data, using EMBRACE’s Web service technology and the BioSapiens DAS servers, will benefit enormously from a large-scale collaboration with NGS bioinformaticians.

EMBOSS is the leading open source package for sequence analysis. EMBOSS is currently focusing on support of NGS data and public data resources. It actively collaborates with other open-source bioinformatics developers (BioPerl, BioPython, BioRuby, BioJava) through joint membership of the Open Bioinformatics Foundation (OBF). Recent examples include reaching a common understanding of the FASTQ next-generation data formats. The current focus is on the SAM and BAM formats for next-generation sequence and alignment data. For non-specialist users, EMBOSS has been integrated in many of the most popular user interfaces, including UTOPIA and Chipster. New applications in EMBOSS are immediately available to users through these packages, while also being used by high throughput pipelines using the command line, or using EMBRACE’s Web service technology. Active development areas for NGS data will include manipulation of large sequence sets, analysis of reads aligned to reference sequences, and the management of reference sequence data. Support will be added for visualisation by a wide spectrum of tools and browsers through exchange of common data formats for numerical and positional sequence annotations.
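
To make the FASTQ format concrete, here is a minimal reader sketch. It assumes the common four-line record layout and the Sanger quality encoding (Phred score = ASCII code minus 33); other FASTQ variants use different offsets, and multi-line records are not handled. The function name `read_fastq` is hypothetical and the code is unrelated to the EMBOSS implementation.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (name, sequence, phred_scores) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:            # end of input
            return
        header = header.rstrip("\r\n")
        if not header:            # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        # Sanger encoding (assumption): quality character code minus 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


example = io.StringIO("@read1\nACGT\n+\nIIII\n")
for name, seq, quals in read_fastq(example):
    print(name, seq, quals)  # read1 ACGT [40, 40, 40, 40]
```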

ESYSBIO is an eScience project developing a virtual collaboration workspace integrating several types of genome-scale data. NGS data play an important role in several of the key test cases used to drive the development of the system. The user interface will present biologists, experimentalists, biostatisticians and bioinformaticians with different views to interact with data and analysis results, as well as to define and execute workflows in the same research project. The system is implemented in a pure Service Oriented Architecture, enabling both the serving of back-end services to many front-ends and clean integration with back-end high-performance computing resources.

WG3 Deliverables:

  • Frequent, but irregularly timed, reports on successfully completed software projects, and when possible and opportune, coordination of subsequent publications.
  • Twice-yearly software design workshops organised jointly with WG5 and others.

WG4: Generic informatics topics
WG3 can be summarised as ‘software design’. Much of the software will run into the same problems, like CPU time limitations, storage, interoperability issues (including ontology issues), and other generic informatics challenges. WG4 will coordinate activities in this area, including:

  • Determining whether certain problems are better suited for a cluster, for the GRID, for Cloud computing, for a super-computer, or perhaps for a combination of these.
  • Analysis of data-compression options that are especially crucial for out-sourced computing.
  • Definition of ontologies needed for database and software interoperability.
  • Support for WG1 and WG2 with the technology watch of emergent computing/informatics fields, such as Cloud computing and the European PRACE initiative.
  • Support for WG3 with knowledge about generic algorithms that might be common in the informatics community but little known among bioinformaticians.
  • Organisation of 1-week workshops (following the successful EMBRACE model), where programmers can gain one-on-one support from informaticians who are specialists in relevant fields, such as parallelisation, Web services, database design, etc.
  • Discussion with WG3 immediately after the kick-off meeting and implementation of methodologies for the documentation, storage and rapid distribution of software elements.

WG5: Development of a strategic dissemination and education program for NGS bioinformatics
WG5 will coordinate all the Action’s dissemination activities. In the first six months, it will establish a portal, probably modelled after the successful BioSapiens, EMBRACE and ENFIN portals used by FP6 NoEs in bioinformatics. This portal will hold separate fields for the following:

  • General information for students, teachers, researchers, policy makers, journalists, etc.
  • A calendar for workshops, meetings, courses, conferences, etc. (including those not organised by this COST Action).
  • Lists of names and addresses of NGS labs and researchers in Europe (and beyond) including scientific expertise.
  • All educational material used in workshops and courses.
  • Lists of protocols, recipes, best practice suggestions, etc.
  • Software categorised by type (Web service, Web server, pipeline, stand-alone package, work bench, etc.), by category of biological question and algorithm, and by facility (whether a parallel version is available, etc.).
  • A bulletin board, mail distribution lists, etc.
  • Lists of articles, hand-outs, etc.
  • A mechanism that allows institutes or researchers to seek help with NGS questions. A mechanism will be implemented that routes questions to the appropriate expert, and automatically logs question and answer in a FAQ.
  • Hardware and software vendors.

WG6: Management
At the kick-off meeting, coordinators for each WG will be appointed. Together with the Management Committee, they will form the coordination board of WG6. WG6 will monitor the activities of the other WGs, and will provide support when needed. Its further activities are described in Section E.

E. ORGANISATION

E.1 Coordination and organisation
The Action will be coordinated by WG6, which will contain all members of the Management Committee (MC), according to the published rules and procedures and with the support of the Scientific Secretariat in Brussels. To better organize and promote interactions between the multidisciplinary teams of scientists participating in the Action, five working groups will be established besides the MC.

The Management Committee will have as main responsibilities:

  • Appointment of chair, vice-chair(s), and WG co-ordinators; this will be carried out during the kick-off meeting
  • Appointment of a scientist responsible for generating and frequently updating the (WG5) portal (web-site coordinator).
  • Decision-making on the distribution of funds to the various activities of the Action
  • Planning and coordination of several types of meetings and teaching activities
  • Evaluation of meetings and other activities necessary to meet the set objectives
  • Evaluation and report of the progress of the different WGs and the Action as a whole.
  • Preparation of Annual Reports
  • Promotion of close collaboration between the different WG members and between the WGs
  • Establishment of extensive collaborations between members of the WGs and members of other related Actions and Scientific programs in Europe and world-wide.
  • Increasing the visibility of the Action and promoting interactions with non-members
  • Promoting interactions with the private sector and dealing with issues related to exploitation of results

A Steering Group (SG) consisting of the Chair, vice-chair(s), web-site coordinator and the WG coordinators will be established. Members of this group will be in frequent (at least once every two months) communication via e-mail and/or video conferences to discuss the progress and ensure good coordination of the activities of the different WGs.

The Chair will contact members of the MC during inter-meeting periods to inform them about SG discussions as needed, and to recruit the necessary elements for achieving the milestones. Besides these checkpoints, the MC/WG meetings will also play a crucial role in evaluating the progress of the Action.

The Steering Group will organize frequent meetings to adjust the scientific programme according to new needs arising from technical developments.

MC/WG meetings
The Management Committee will convene twice every year to ensure efficient coordination, evaluate the progress and make specific plans for future activities. These meetings will, with the exception of the first kick-off meeting, coincide with the Working Group meetings. They will take place at different locations reflecting the geographical distribution of the current and future members of the Action. Efforts will be made so that the MC/WG meetings coincide with larger meetings in the field (for example, ISMB, ECCB, EMBnet conferences, etc.) so as to increase the visibility of the Action, attract more participants, and save travel costs. Their duration is expected to be two full days. To meet WG-specific needs efficiently while maximising communication and exchange of information among the different WGs, the meetings will include day-long WG-specific sessions and plenary sessions involving representatives of all WGs. These will involve presentations of individual research results, as well as presentations and discussions on the latest developments in the field. Emphasis will be put on the inclusion of young scientists and women in these activities (oral presentations, chairmanship of sessions, etc.). Based on these presentations, the MC will evaluate the progress of the Action towards reaching its objectives and will decide on future plans accordingly.

Short Term Scientific Missions and other teaching activities
The STSMs are a major tool for the dissemination of know-how to young investigators and the promotion of collaborations between different research teams. Candidates to participate in these activities will be selected following an application process and assessment by the MC members. Summer schools and workshops for young investigators will be organized. For these training activities special efforts will be made to utilize available e-learning infrastructures, such as EMBER (European Multimedia Bioinformatics Educational Resource), developed under European Framework programmes. It should be emphasized that participants in the Action have the infrastructure and know-how on a very wide range of state-of-the-art methodologies and technologies. These teaching activities provide a unique opportunity to disseminate this knowledge to young researchers in Europe, promoting scientific excellence in the fields of NGS data management and analysis.

E.2 Working Groups
The COST activities will be divided into five working groups (WG1 to WG5), complemented by the management group WG6. The coordination of the research activities is planned to take place in WG3 and WG4, while WG1 and WG2 have the lead in steering the coordination activity. WG6 performs the overall management of the COST Action, and WG5 all dissemination activities.

WG 1 Technology watch for new developments

WG 2 Development of an Action Plan for NGS bioinformatics to cope with challenges for European Research Area

WG 3 Design, implementation, and incorporation of software solutions

WG 4 Generic informatics topics

WG 5 Development of a strategic dissemination and education program for NGS bioinformatics

WG 6 Management

E.3 Liaison and interaction with other research programmes
This COST Action will be executed in close consultation with ESFRI projects such as ELIXIR, BBMRI, ECRIN, and INSTRUCT to harmonize activities and guarantee the highest benefits for Europe. Many partners are also active partners in these ESFRI projects.

The Action will coordinate with other related initiatives, either existing (like the Iberoamerican Network on Free Software for the Life Sciences, FreeBIT) or upcoming, and with related COST or FP7 actions, through an extensive network of contacts.

The Action involves experimentalist groups, and many partners are involved in national projects that either revolve around, or have a large input from NGS. A complete list cannot be provided, but a few examples are:

  • The Swedish partners are involved in the Uppsala-Stockholm Scilifelab collaboration on life science technology, which includes NGS, micro-arrays, proteomics, systems biology, comparative genomics, re-sequencing, etc.
  • The Italian partner is involved in the Drug Discovery and Optimization by Simulation project, which aims to optimize the use of mixed NGS and non-NGS data in assembly projects.
  • The Norwegian partner coordinates the National Bioinformatics platform that is collaborating closely with the Norwegian High-throughput Sequencing Center (www.sequencing.uio.no) on data management and analysis of NGS data from the Norwegian scientific community.
  • The Dutch partner is involved in a large systems biology project in which he is responsible for the integration of all data including NGS data.
  • The German partner is involved in major international consortia applying and developing new software and protocols for next generation sequencing. Firstly, the partner takes part in the 1000 Genomes Project where genome sequencing is performed on a large population of individuals in order to identify rare variants. Secondly, the partner is involved in the International Cancer Genome Consortium, where 500 cancer patients are characterized with MEDIP-seq, RNA-seq and genomic sequencing in order to identify and characterize disease-related information.
  • One of the Italian partners participates in the EU project CRESCENDO (Consortium for Research Into Nuclear Receptors in Development and Aging; http://www.crescendoip.org), which makes intensive use of NGS technologies for analysis of nuclear receptor activity in ‘in vivo’ and ‘in vitro’ models.

E.4 Gender balance and involvement of early-stage researchers
This COST Action will respect an appropriate gender balance in all its activities and the Management Committee will place this as a standard item on all its MC agendas. This Action is also committed to substantially involving early-stage researchers. This item will also be placed as a standard item on all MC agendas.

It is worth emphasizing that early-stage female scientists have already played a major role in the design and delineation of the objectives of the Action. It is therefore expected that young and female scientists will also play leading roles in the management of the Action. As described above, gender balance will be observed in the Workshops, Teaching Activities, and Short-Term Scientific Missions, in which early-stage researchers are also expected to form the overwhelming majority of participants.

F. TIMETABLE
The proposed duration for this Action is 4 years. The kick-off meeting will mark the starting point of the Action, where the chair, vice-chair(s), WG co-ordinators and web-site coordinator will be selected.

The Action-specific website will be generated in the first trimester and will be updated continuously according to the results generated. Major reports will be presented at the end of every year. Details about the frequency and the timing of MC meetings, WG meetings, Workshops, Short-Term Scientific Missions and Teaching activities are indicated in the table below. Please note that WG/MC meetings are planned to take place once a year, although this depends on the available budget and the specific needs of the Action; their frequency may increase to twice a year.

Activities Year 1 Year 2 Year 3 Year 4
Coordination, General Video meetings x x x x x x x x x x x x x x x x
Kick-off meeting x
Web site x
Web site updates x x x x x x x x x x x x x x x
Management Committee meetings x x x x
Scientific Programme meetings x x x x x x x x
WG1 meeting x x x x
WG2 meeting x x x x
WG3 meeting x x x x
WG4 meeting x x x x
WG5 meeting x x x x
Workshop x x x x x x x
Conferences x x


G. ECONOMIC DIMENSION

The following 15 COST countries have actively participated in the preparation of the Action or otherwise indicated their interest: SE, UK, FR, IT, NO, HU, FI, PO, DE, ES, SK, NL, GR, BE, CH

It is expected that more participants will join this Action.

On the basis of national estimates provided by representatives of these countries, the economic dimension of the activities to be carried out under the Action has been estimated at roughly EUR 30 million for the duration of the Action.

This estimate is valid under the assumption that all the countries mentioned above, but no other countries, will participate in the Action. Any departure from this will change the estimate accordingly.

H. DISSEMINATION PLAN

H.1 Who?
The activities of the Action will be disseminated as widely as possible to diverse groups of people including basic scientists from academic research institutions and industrial settings, clinicians, and the general public. Target groups specifically include:

  • scientists working in NGS and scientists starting to work with NGS
  • institutes having a facility with NGS technology (sequencer, data analysis centre) and institutes that start or plan to start an NGS facility
  • scientists and institutions which use NGS results
  • Small and Medium Enterprises focusing on the development of novel NGS technology, or on the use or commercialisation of NGS results
  • European, National, and Regional policy makers and stakeholders
  • The general public, including teachers, journalists, or high school students


H.2 What?
– An Action-specific website will be constructed in order to provide information to the international scientific community, to industry (NGS technology providers, pharmaceutical and biotechnology settings, NGS users) and to the general public. The Management Committee will assign this task to a partner (web-site coordinator). Part of the website will be accessible to the general public, whereas a section will be password-protected for the exchange of specific information and unpublished data between partners. The website will also contain information on the Action activities (meetings, workshops, etc.), proceedings of meetings, links to publications of participants, job/STSM announcements as well as material and presentations from the didactic activities.

– A brochure will be generated at the beginning of the Action describing its objectives and planned activities. This will be distributed to scientists and to representatives from industry and society at major international Life Sciences conferences (for example ISMB, ECCB, ESF) to inform potential users of NGS about the Action.

– Scientific publications in peer-reviewed scientific journals will be generated as a result of collaborative research during the Action, in the form of original, review, or technical articles. To increase the visibility of the Action, publication of the proceedings of WG meetings and of the final conference in highly cited peer-reviewed journals in the field will be pursued.

– Combined Management Committee and Working Group meetings and other scientific conferences. The MC/WG meetings are planned to take place on a regular basis, ideally every six months, in various geographic regions, in order to encourage participation of all interested members. To increase the visibility of the Action, they will preferably be organized as satellites to major scientific conferences in the field, such as the ISMB, ECCB, EMBnet, ENFIN and ELIXIR meetings.

– Short-Term Scientific Missions, targeting young scientists, especially those originating from developing regions, in order to foster exchange of ideas and technology transfer.

– Teaching activities (workshops, courses and summer schools), in order to disseminate the latest developments in the NGS field and also combine hands-on practical training with theoretical information. These will be offered mainly to young investigators.

Workshops will be organized according to the aim of delivering skills to users as soon as relevant developments are deployed. They will provide an essential contact between developers, providers and users. The level of such workshops will be “expert”, and delivering hands-on experience will be the focus.

Training courses will be developed around thematic areas. They will be tailored to participants that require maximum functional skills in a narrow area in minimum time. This Action will provide such opportunities by engaging in high quality skill deployment methods and tuned delivery of materials, in collaboration with existing initiatives in ELIXIR, EMBnet and ENFIN.

Summer schools will serve young users’ interest best. They are the perfect vehicle to deliver training to young people looking for first contact with the technologies. For those we propose tailored training at moderate costs. Summer Schools that are effective along these lines require an affordable platform of resources in place, and this includes the training facilities and also lodging and subsistence at adequate levels for this target population. A few participating countries can provide that easily.

– Spring of code. Some of the open source code developments in this Action lend themselves to group development on a management platform such as Git (http://git-scm.com/). Provision is made for this, when needed, by organizing development teams with young programmers on board, recruited along the lines of the “Google Summer of Code (TM)” model. This requires a mentoring capacity that the Action can provide, and the assignment of integrator roles for each of these developments. Open source code will be developed following the rules available at http://www.opensource.org/licenses. This will enable the participation of young developers in an organized way. As usual, a high level of leverage is expected from organized, collaborative development: much more can be achieved from concerted collaborative development than from single-team software writing. This is particularly important in this Action because a variety of competences is needed.

The method cannot be used for all the development that is needed in the Action, but there is clear evidence of areas where this approach can be appropriate, namely:

– Assemblers that can cope with mixed (NGS/non-NGS) data sets.

– Statistical tools for monitoring quality of results

– Visualization.

– Integration with data resources, namely:

– Haplotype and Variation databases

– Genome Wide Association Studies

– Predictive tools

– Web-services (EMBRACE, EMBnet)

A great deal of the development work involves testing. The software will need to be tested on different platforms, including high performance computing ones. Specific training for this will be made available to developers.

H.3 How?
The Management Committee (MC) will be responsible for implementing all of the above activities.

The representative members of each country will be responsible for disseminating the activities of the Action to research groups within their countries, industrial partners, medical societies and representatives of society. Each MC member is therefore expected to generate, regularly update and circulate to other MC members a list of target groups with contact information. For regional meetings and other activities, the MC will delegate responsibilities to WG coordinators and members of the WGs depending on their specialty. In addition, the MC will be responsible for providing all necessary information regarding the above-mentioned activities and their outcome, as well as for revising the dissemination plan according to the Domain Committee (DC) recommendations.

Part II

A. History of the proposal

Below is a historical timeline of how this COST Action proposal was conceived and prepared.

  1. The Proposer of this COST action, Dr. Erik Bongcam-Rudloff, was coordinator of Work Package 4 (WP4) “Test Cases” in the FP6 Network of Excellence (NoE) EMBRACE. The role of WP4 was to collect bioinformatics exemplars from the Life Science community, to identify current limitations, and to present those as Test Cases to the other EMBRACE WPs, in order to identify new bioinformatics solutions. Drs. Andreas Gisel, Eija Korpelainen and Peter Rice were also active members of WP4. Many of the Test Cases collected by WP4 involved problems relating to the use of Next Generation Sequencing (NGS) technologies.
  2. Awareness of the problems identified in WP4 motivated the Proposer to organise a meeting in Stockholm, in connection with the ISMB conference (June 27-July 2). Most of the partners involved in preparing the COST pre-proposal participated in that meeting. There, we discussed the needs of the research community and decided to hold a workshop in Rome that year. To organise the workshop, we decided to form a Task Force; Dr. Gert Vriend (EMBRACE WP5 coordinator) then joined that Task Force.
  3. The first task was to set up a website: www.nextgenerationsequencing.org or www.nextgensequencing.org
  4. The Task Force worked during the summer and autumn to plan the Rome workshop, inviting keynote speakers, preparing data for a “hack-a-thon”, and working out all practical details.
  5. The workshop took place on 18-20 November 2009, and was aimed at bioinformaticians interested in issues relating to the management and analysis of NGS data. In addition to lectures from leading scientists in the field, a hands-on “hack-a-thon” was included to give participants first-hand experience of using the tools to solve an unpublished, real-world genome sequence-assembly problem (www.nextgensequencing.org).
    A comprehensive report on the workshop is given here: http://journal.embnet.org/index.php/embnetnews/article/view/60/207
  6. The acuteness of the issues raised in the workshop led the Task Force to write a COST pre-proposal.
  7. The pre-proposal was written by six partners, but several workshop participants committed to joining the preparations for a full proposal if the first stage was successful.
  8. In May 2010, we were invited to prepare a full proposal.
  9. Encouraged by the success of the 1st workshop, a 2nd was organised on 16-17 June, in Bari, Italy (http://www.nextgenerationsequencing.org/).
  10. This was used as a platform to discuss preparations for the full proposal; new partners were solicited amongst participants and their colleagues.
  11. The full proposal was prepared using Google Documents to avoid problems handling multiple document versions; Web-based systems were used extensively both for conversations and group meetings. During the preparatory process, members of other EMBRACE WPs also joined (Drs. Vincent Breton, Teresa Attwood, Ralf Herwig), as did partners from other NoEs and EU initiatives (e.g., Dr. Jacques van Helden from BioSapiens, Dr. Alessandro Weisz from CRESCENDO, Dr. Lucia Altucci from ATLAS). Most partners are, or have been, involved directly or indirectly in collaborative work relating to the subject of this COST Action.
  12. During the preparations, all newly recruited partners were allowed to participate, using the Web-based tools mentioned above. This permitted all experts listed in this COST Action both to join the discussions and to help with the writing process.