For the BioVis 2011 Symposium, we will inaugurate a new data visualization/analysis contest. This contest will be focused on specific biological problem domains, and will be based on realistic domain data and domain questions. For BioVis 2011, the contest involves the analysis of expression Quantitative Trait Locus (eQTL) data.
- Read the contest description and rules provided below.
- Create an account for the contest forum. The information provided by you will only be used for the BioVis Data Visualization contest.
- Download the demonstration data set from the corresponding section of the forum. This data is provided in several increasingly realistic versions, from fundamentally clean, to biologically complex.
- Develop your approach to a solution using the demonstration data.
- Download the contest data set from the corresponding section of the forum. This data is provided only in realistically complex versions.
- Apply your approach (and tune if necessary) to the contest data and generate results.
- Use the forum to ask the contest organizers for clarifications or to discuss the contest with other participants.
- Your contest entry must be received no later than Sep 7th (extended from Sep 1st) via email to firstname.lastname@example.org.
- General Information
- The eQTL biological domain
- The eQTL data
- The Challenge
- The Fine Print
|Contest Announcement and Demonstration Data release
|Jun 1st (approx.)
|Web seminar on contest domain and data
|Final Contest Data release
|Bonus Question release
|Final Submissions due (deadline extended from Sep 1st)
The information on these pages is subject to amendment and update at any time. We will post notifications of any changes to the Announcements discussion forum. Please watch for any changes that may affect your submission.
|Contest General Address
|Contest Data Provider
& Domain Expert
We hope to attract diverse interesting biological problems, and feature a new biological theme each year for the contest. Through the mechanisms of this contest, we will support three main goals:
- The development of a better-informed Vis community, provided with deeper domain-specific intuition into the actual issues of interest to the user community.
- A better-tooled biological community, provided with enhanced applications specifically adjusted to meet their analysis needs.
- Finally, a mechanism to strongly promote fundable peer collaborations between Visualization and Bio/Life-Sciences researchers.
To meet these goals, each contest will be conducted using a pre-symposium "basic" phase, and, if participant interest is sufficient, a post-symposium "advanced" phase. Prior to the symposium, data and an explanation of the problem will be made available over the Web. A web-forum will also be available, where contest participants can discuss the problem and receive clarifications.
At the symposium there will be a session during which contest participants present their results. After the contest-results presentation there will be a biological tutorial, covering the specific problem being addressed by the contest. This will be presented by Bio/Life Sciences experts in the domain, including the group directly responsible for developing the contest data set. The tutorial will explain in detail the complexities of the data, and of the biological intuitions that the end-users are interested in acquiring.
After the symposium there will be a second round of the contest in which a second data set will be approached, using the additional domain-goal intuition acquired from the at-symposium tutorial. The results of this second round will be presented at a special session of the following year's BioVis symposium. Outstanding performers in the second round will be invited to co-publish their results with the Bio/Life-science researchers who have helped develop the data set and provided the expert instruction, and will be invited to co-propose fundable projects to the National Science Foundation, National Institutes of Health, or other national or international funding opportunities identified by the data set/problem-domain providers.
The eQTL Biological Domain
The biological domain for the 2011 contest is expression Quantitative Trait Locus data. eQTL experiments catalog massive collections of correlated genotype and phenotype data, in the hope of detecting important genome-sequence variations that affect phenotypic (specifically RNA expression level) outcomes through non-obvious mechanisms, and of identifying these mechanisms. Such non-obvious mechanisms typically involve networks of interacting polymorphisms that non-linearly affect specific gene expression levels, conditional on the presence of other polymorphisms, and combinations of polymorphisms, and conditional on the tissue type in which they are acting.
eQTL analysis is a relatively new biological field, because broad-coverage genotype and expression surveys are only just becoming technologically feasible. It is an attractive field for Visualization and Visual Analytics in the Biological Sciences for two reasons. First, it is a "hot topic" funding priority for the United States National Institutes of Health, with several high-profile Requests for Applications currently active for developing analysis capabilities in this area. Second, it is a phenomenally data-rich and information-dense domain that absolutely demands sophisticated mechanisms for summarizing and presenting the data to the end users: a typical complete analysis is envisioned to survey a few million genotypic loci and tens of thousands of expression levels, differentially in up to 100 tissue types, across 1000 subjects (our contest data is a rather restricted subset of this). eQTL analysis is also an ideal model system for the emerging field of personalized medicine, because the decision-support problem for personalized medicine, "have I considered all of the important factors in making this decision?", is exactly analogous to the eQTL interaction-network-discovery problem of "have I identified all of the relevant interacting factors?".
Our contest data is generated from actual eQTL analysis data, using an observation-shuffling technique. This technique preserves the biological complexity of the data, while allowing us to "spike in" specific interaction networks for the purpose of establishing ground truth that should be identifiable. Because our eventual goals are to encourage and enable the Visualization community to produce tools that are highly relevant to the Bio/Life-Sciences community, it is important that we maintain realistic complexity and realistic confounding factors within the data. By maintaining realism, we ensure that tools that address the contest data retain relevance for real data, and we enhance our participants' appreciation of the real complexity and sophistication of the visualization approaches that must be applied. On the flip side, the data contest will encourage the Bio/Life-Sciences community to engage more with the Visualization community, given the relevance and potential utility of the tools and methodology that will result.
The eQTL Data
For the "basic phase" contest, we are providing two data sets. The first is a demonstration data set, generated in multiple versions, containing successive realistic complications of the base measurement data. The base demonstration data is fundamentally clean. It contains only the deterministic spiked-in data, and any residual correlations inherent in the original biological samples. This data set is available as a control on which Visualization approaches may be prototyped, and in which all known information should be observable. Subsequent versions have been generated with increasing additional realistic biological complexity, including typical missing data, noisy data, mismeasurements and misassignments.
The second data set is the actual to-be-evaluated formal contest data. The format of this data is identical to that of the demonstration data, but the eQTL content is different, and it is only provided in the "dirty" forms.
Basic statistical interaction analyses are provided for both the demonstration and the formal contest data, for participants who do not wish to develop their own approaches. These analyses provide one statistical method's measure of two-member and three-member interactions. Our spiked-in data contains significantly more complex networks than these, which may only be accessible through visual approaches or more sophisticated bioinformatic analyses.
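To make the idea of a two-member interaction screen concrete, here is a minimal sketch in Python. It is not the statistical method used for the analyses supplied with the contest data, which is not described here. It assumes genotypes are held in a dictionary of per-subject lists and scores each SNP pair with a plain Pearson chi-square statistic of joint genotype against disease state. The names `chi_square`, `pair_scores`, and the toy SNP names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def chi_square(table):
    """Pearson chi-square statistic for a table given as {(row, col): count}."""
    rows, cols, total = Counter(), Counter(), 0
    for (r, c), n in table.items():
        rows[r] += n
        cols[c] += n
        total += n
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / total
            observed = table.get((r, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def pair_scores(genotypes, affected):
    """Score every SNP pair by how strongly the joint genotype tracks disease state.

    genotypes: dict mapping a SNP name to a list of genotype codes, one per subject
               (an assumed in-memory layout, not the contest file format)
    affected:  list of booleans, one per subject
    """
    scores = {}
    for a, b in combinations(sorted(genotypes), 2):
        table = Counter()
        for ga, gb, status in zip(genotypes[a], genotypes[b], affected):
            table[((ga, gb), status)] += 1
        scores[(a, b)] = chi_square(table)
    return scores


# Toy data (invented for illustration): disease occurs exactly when
# snp1 and snp2 differ, so neither SNP is informative on its own.
toy_genotypes = {
    "snp1": [0, 0, 1, 1, 0, 0, 1, 1],
    "snp2": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp3": [0, 0, 0, 0, 1, 1, 1, 1],
}
toy_affected = [g1 != g2 for g1, g2 in zip(toy_genotypes["snp1"], toy_genotypes["snp2"])]
scores = pair_scores(toy_genotypes, toy_affected)
```

On the toy data only the pair (snp1, snp2) receives a non-zero score. Note that a joint-genotype statistic like this also responds to single-SNP effects, so a real screen would need to separate those from true interactions, and the number of pairs grows quadratically with the number of loci.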
The eQTL data catalogs genomic polymorphisms (SNPs) for 7500 genomic loci, gene expression levels for 15 genes, and an associated affected/not-affected disease state for our hypothetical spiked-in disease, hoompalitis. The exact format of the data files is explained in text files included in the downloadable data archives, which are available in the forum.
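As a rough illustration of the shape of this data, the sketch below models one subject's record in Python. The actual file format is defined only in the text files shipped with the data archives, so everything here is an assumption: the `Subject` class, its field names, and the two-letter genotype strings are hypothetical. Only the counts (7500 loci, 15 genes, one disease state) come from the description above.

```python
from dataclasses import dataclass
from typing import List

N_SNPS = 7500   # genomic loci, as stated in the contest description
N_GENES = 15    # genes with expression levels, as stated in the description


@dataclass
class Subject:
    """One subject's record; the field layout is hypothetical."""
    genotypes: List[str]      # one genotype call per locus (encoding assumed, e.g. "AG")
    expression: List[float]   # one expression level per gene
    affected: bool            # hoompalitis disease state

    def __post_init__(self):
        # Reject records whose lengths do not match the stated dimensions.
        if len(self.genotypes) != N_SNPS:
            raise ValueError(f"expected {N_SNPS} genotypes, got {len(self.genotypes)}")
        if len(self.expression) != N_GENES:
            raise ValueError(f"expected {N_GENES} expression levels, got {len(self.expression)}")


def affected_fraction(subjects):
    """Fraction of subjects flagged as affected."""
    return sum(s.affected for s in subjects) / len(subjects)
```

A loader for the real files would need to follow the format notes in the archives, and would also have to decide how to represent the missing values present in the more realistic versions of the data.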
Our contest proposes a hypothetical disease, hoompalitis, around which we have synthesized our eQTL data. It is known that the expression levels of several different genes affect the presence or severity of hoompalitis. Significantly abnormal levels of the gene products increase the severity of the condition; however, some individuals with low and high levels appear unaffected. Some combinations of low levels of expression of one gene product and high levels of the others appear to produce minimal symptoms, though not all combinations have been studied.
Using the data provided, identify the pattern of genome-sequence variations, and expression-levels, that predict the occurrence of hoompalitis. To as great an extent as possible, elucidate and explain these factors, and the pattern of interaction amongst the factors, influencing the incidence of hoompalitis.
The factors that have been built into the data include both direct SNP effects on the genes that contain them (cis effects), SNP effects on distant genes (trans effects), networks of cis and trans-acting SNPs, and gene-gene interactions.
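The cis/trans distinction above can be expressed as a small labeling rule. The Python sketch below assumes an annotation that maps each SNP to the gene containing it (or to `None` for a SNP outside any gene). That annotation, the function name `classify_effect`, and the example SNP and gene names are all hypothetical and are not part of the contest files.

```python
def classify_effect(snp, target_gene, snp_to_gene):
    """Label a SNP-to-gene association as "cis" or "trans".

    snp_to_gene maps each SNP to the gene that contains it, or None for a
    SNP that lies outside every gene (a hypothetical annotation).
    A SNP acting on the gene that contains it is cis; anything else is trans.
    """
    host = snp_to_gene.get(snp)
    if host is not None and host == target_gene:
        return "cis"
    return "trans"


# Invented example annotation and associations.
annotation = {"snpA": "gene1", "snpB": "gene2", "snpC": None}
associations = [("snpA", "gene1"), ("snpB", "gene1"), ("snpC", "gene2")]
labels = [classify_effect(s, g, annotation) for s, g in associations]
```

Here `labels` is `["cis", "trans", "trans"]`. A network of cis- and trans-acting SNPs would then be a set of such labeled associations that share genes.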
The complete pattern of interaction spiked into the data is more complex than we anticipate any entrant will elucidate (and in fact is more complex than can be analyzed by any bioinformatic approach of which we are aware). This complexity is, in fact, biologically realistic, and solutions that are effective with real biological data will need to take not only what can be known and shown into account, but also ideally convey the approach's limitations and the limitations of the underlying analyses. There is a plethora of low-hanging fruit in the visualization domain for eQTL data, and we encourage contest participants to be creative both in their approaches and in the segment of the problem they choose to pursue.
A bonus question will be made available to registered entrants shortly before the contest deadline. This question will detail a specific individual within the data set, and a proposed "gene therapy" change to their genotype. The bonus question will be to predict the outcome for this individual if the proposed therapy were carried out, based on what you have learned about hoompalitis.
Participants will submit three items to document their contest entries.
- A simple gridded checklist in CSV form, itemizing in general terms, which genome-sequence variants, and which genes, you predict to be relevant to the disease phenotype described in the Challenge section. A template for this file will be provided here before the submission deadline.
- A text narrative of no more than three pages, explaining your findings, more precisely defining the network of interactions you have identified, and if possible, explaining the apparent biological "meaning" of the interactions. This document should also contain your answer to the Bonus Question.
- A PowerPoint (or other graphical) presentation of your work, including a demonstration of your approach, and a description of how you arrived at the results reported in the checklist, and the narrative. Video presentations are also acceptable.
Please submit your contest entries via email to: email@example.com no later than 11:59 pm on September 7th, 2011 (deadline extended from September 1st).
Contest results will be assayed both versus the ground truth, and versus the best understanding that Bio/Life-sciences researchers can attain using their current state-of-the-art tools. Participants will be encouraged to apply their techniques, and base their results presentation on the "worst" data set with which they are comfortable, and Biological-expert results will be available for comparison at each level.
Judging for the contest entries will be conducted by a panel of experts, including members from the Bio/Life-Sciences community and members from the Visualization community. The provider of the contest data sets will participate as the final arbiter of the results' recapitulation of known ground truth, and of direct applicability to researchers in the Bio/Life Sciences domain.
Results will be scored independently on:
- Recapture of ground truth;
- Application to the problem domain;
- and Visualization quality.
Decisions of the judging panel will be final.
Judges
Overall judging is chaired by Christopher Bartlett, The Ohio State University, USA.
Confirmatory/Biology Panel
Chaired by William Ray, The Ohio State University, USA (non-voting).
- Shana R. Spindler, Editor in Chief for The NICHD Connection at the Eunice Kennedy Shriver National Institute of Child Health and Human Development, USA
- Mark W. Logue, Assistant Research Professor, Boston University School of Medicine - Biomedical Genetics, USA
- Wolfgang Rumpf, Product Specialist & Director of Support: Rescentris Inc., USA
Exploratory/Visualization Panel
Chaired by Raghu Machiraju, The Ohio State University, USA (non-voting).
- Amitabh Varshney, University of Maryland - College Park, USA
- Tamara Munzner, University of British Columbia, Canada
- Ananth Grama, Purdue University, USA
The following is subject to change and final decisions regarding awards are still pending.
An award will be provided, at a minimum, for the best overall submission. Additional awards and award cash values will be announced shortly.
In addition to standard awards, a publication documenting the outstanding submissions, in collaboration with the lab that created the data sets for the contest (The Bartlett lab of the Battelle Center for Mathematical Medicine), will be sponsored by the contest. The Bartlett lab is also interested in co-proposing fundable projects to the National Science Foundation, National Institutes of Health, or other national or international funding opportunities.
The Fine Print
We are still printing the fine print.
Contest entrants agree to be bound by the following rules:
- No entrant or team member participating on an entry may be a current, or previous (within 3 years) collaborator of the contest organizers, judges, or data providers. Currently excluded participants are current and prior collaborators of the Bartlett or Ray labs from the Battelle Center for Mathematical Medicine. Additional exclusions will be made public as judges agree to serve for the contest. If you are participating in the contest, please inform your collaborators and mentors, so that we can avoid recruiting judges that would put you in conflict! If in doubt, ask the contest organizers.
- You may use any commercial or proprietary tools for either the bioinformatic analysis of what you want to visualize, or to create your visualizations.
- If you are providing a tool as an entry, it must run on a platform available to the judges. We encourage statically linked Intel/Linux (CentOS preferably) executables, standalone Windows (7) executables, standalone Mac OS X (10.6) executables, or web-based applications. We will make every effort to make your submission run on the judges' systems, but in the end we cannot guarantee our ability to adapt the judges' systems to any particular application.
- At least one author or entrant from each awarded submission must attend the symposium to present their work.
- We encourage creative "partial solutions". In fact, we anticipate that a complete solution is effectively intractable, because the biological reality is that everything interacts in biological systems in some fashion. As a result, genius is likely to be found in identifying what not to present, and how to inform the users of those exclusions, more readily than in trying to accomplish "everything" in one visual approach.