Data

The raw sequence data, as well as the results of several different types of analysis, are hosted on a protected Seven Bridges Genomics (SBG) platform. This lets us manage data sets and users, and may enable automated computation in the near future. To access and download the raw zipped data, please follow the instructions below.

  1. If you are already an SBG platform user, please go to step 3.
  2. To register as a new user, visit the SBG website and sign up.
  3. Send your username or email address to tengfei.yin@sbgenomics.com with the subject line "BioVis 2015 data contest user". He will invite or add you to the project "BioVis 2015 Data Contest".
  4. After you have been added to the project, sign in to the site. You should see the project "BioVis 2015 Data Contest" in the dashboard. Click the project name to browse and download all files, see all the participants' usernames, and send them messages.
  5. You can also copy the files into your own project and run further computation on the platform if needed (with $100 credit on AWS). You can even share or publish your method as a computation pipeline later. For more information, please contact tengfei.yin@sbgenomics.com.

The data consist of a short intronic (non-coding) region from the BRD2 gene, which has been strongly linked to epilepsy through co-inheritance with disease. Specifically, variations in the presented intronic region are most tightly linked. In each sequence there are 4 closely spaced "repeat regions" where different numbers of distinct repeating sequence motifs occur. Despite the linkage in inheritance, there is no strong association between disease and any of the sequence variants in the intron. Approximately 250 affected cases and 250 unaffected controls have had the region sequenced, and only 19 distinct combinations of these 4 repeats have been found. The 19 combinations that do occur are found in both healthy and affected individuals, but the several hundred other combinations that "should" occur have not been found in either population*.
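
The gap between the observed combinations and those that "should" occur can be made concrete with a small sketch. The repeat counts below are invented for illustration and are not contest data. Treating the space of possible combinations as the Cartesian product of the counts seen at each region is also an assumption, not the organizers' definition.

```python
from itertools import product


def unobserved_combinations(observed):
    """Return repeat-count combinations that are possible but not observed.

    `observed` is an iterable of 4-tuples, one repeat count per region.
    Assumption: the candidate space is the Cartesian product of the counts
    seen at each region. The contest may define it differently.
    """
    observed = set(observed)
    # Collect the distinct counts seen at each of the 4 repeat regions.
    per_region = [sorted({combo[i] for combo in observed}) for i in range(4)]
    candidates = set(product(*per_region))
    return sorted(candidates - observed)


# Hypothetical repeat counts, invented for illustration only.
example_observed = [(2, 5, 3, 7), (3, 5, 3, 7), (2, 6, 4, 8)]
missing = unobserved_combinations(example_observed)
print(len(missing), "candidate combinations were not observed")
```

With the real data, `example_observed` would be replaced by the 19 observed combinations once the repeat counts have been extracted from the sequences.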

Along with the raw sequence for each observed variable intron, we provide 3 sequences that should exist in the population, but that have not yet been observed. Our hope is that differences in the properties of these sequences that "don't exist", compared to the properties of the sequences that do, may provide the clues necessary to understand what this region is doing biologically, and why it is linked with disease.

Finally, we will provide several "unknown" sequences, which might or might not exist in nature, for you to try to classify as being more similar to the observed sequences or to the non-observed sequences. Since the functional difference does not appear to be "sequence based", we also provide several analyses of each of the sequences, including a selection of the most likely secondary structures into which the RNA might fold, energy calculations for the foldings, and a pairing-weight matrix that encapsulates the many different foldings that might occur in a single machine-readable format for analysis. We will also provide 3D structural predictions for the "best" secondary structures, along with instructions on how and where to calculate your own additional 3D structures, for anyone who wishes to explore the similarities or differences of the 3D conformational space. More details regarding the analyses and how to read the analysis files are provided with the data.

We will also provide the sequence of the full-length gene, and the position of the intron in the gene, in case you would like to explore how the different intronic sequences under study might interact with other exonic or distant intronic regions of the otherwise highly conserved gene.

The biologists' current best working hypothesis is that something about the structures eliminates the non-observed sequences from nature. This could be as simple as a conserved structural feature that does not form for the non-observed sequences. Or, since real molecules are not rigid and are constantly moving and undergoing slight rearrangements, it could be as subtle as a change in the distribution of structures that the sequences adopt, even though the structures themselves might be identical.
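
One way to probe the "distribution of structures" idea is to summarize each pairing-weight matrix as a per-position entropy profile and compare profiles between groups. The sketch below is only an illustration under stated assumptions: it assumes the matrix has already been parsed into a square list of lists where entry `[i][j]` is the probability that base `i` pairs with base `j`, and that the profiles being compared have equal length, which real sequences with different repeat counts will not have without alignment. The function names and toy matrices are hypothetical; the actual file format is described with the data.

```python
import math


def pairing_entropy(matrix):
    """Per-position Shannon entropy (bits) of a base-pairing probability matrix.

    Assumption: matrix[i][j] is the probability that base i pairs with base j.
    Any probability left over in row i is treated as "base i is unpaired".
    """
    entropies = []
    for row in matrix:
        unpaired = max(0.0, 1.0 - sum(row))
        probs = [p for p in list(row) + [unpaired] if p > 0.0]
        entropies.append(-sum(p * math.log2(p) for p in probs))
    return entropies


def mean_profile(profiles):
    """Average several equal-length entropy profiles position by position."""
    return [sum(column) / len(column) for column in zip(*profiles)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(profile, observed_mean, non_observed_mean):
    """Label a profile by whichever group mean it is closer to."""
    if squared_distance(profile, observed_mean) <= squared_distance(profile, non_observed_mean):
        return "observed-like"
    return "non-observed-like"


# Toy 3-base matrices, invented for illustration; they say nothing about BRD2.
rigid = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]     # one fixed pairing
flexible = [[0.0, 0.0, 0.5], [0.0, 0.0, 0.0], [0.5, 0.0, 0.0]]  # pairs half the time

observed_mean = mean_profile([pairing_entropy(rigid)])
non_observed_mean = mean_profile([pairing_entropy(flexible)])
label = classify(pairing_entropy(flexible), observed_mean, non_observed_mean)
print(pairing_entropy(flexible), label)
```

Nearest-mean classification is used here only because it is simple to read. It is not the organizers' method, and a real entry would need a way to handle sequences of different lengths.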

This contest is about developing tools that aid the biologist in identifying meaningful, specific hypotheses about such features for testing.

*This contest is primarily focused on hypotheses about the differences between the non-observed sequences and their properties, compared to the observed sequences, rather than on affected versus unaffected individuals, because there is no significant difference in the distribution of the 19 observed sequences between affected and unaffected individuals. The current assumption is that some co-inherited extrinsic factor can cause any of the observed sequences to be "bad" in an individual. The hope is that identifying why the non-observed sequences are automatically bad will shed light on what this extrinsic factor may be and what it may be doing.

If you have any questions about the contest, the contest data, or your prospective entry, please don't hesitate to contact contest@biovis.net.