OncoBlocks - Google Summer of Code 2015

Revisions

  • 1.0: Feb 19, 2015: First release.
  • 1.1: March 4, 2015: Added email contacts.
  • 1.2: March 11, 2015: Added registration form for online basecamps.
  • 1.3: March 13, 2015: Added link to project template.

How to get involved

Due to high student interest (hooray!), we have set up a number of online basecamp forums where students can post questions and where we have posted additional background information.

Register for Online Basecamp Forum

Project Template

When completing your application to GSoC, please use our project template.

OncoBlocks

OncoBlocks is a new open-source initiative, currently hosted at the Biostatistics and Computational Biology Department of the Dana-Farber Cancer Institute.

The goal of the project is to create reusable, open source software components to support cancer genomics research and enable precision (or "personalized") cancer medicine.

These components can then be used in multiple research and clinical application contexts, and may also form the basis of new features within the cBio Cancer Genomics Portal, another open-source project, originally created at Memorial Sloan-Kettering Cancer Center. For additional background regarding the cBio portal and cancer genomics in general, see our references at: Reference 1, Reference 2.

OncoBlocks is currently in its start-up phase, and we are hosting prototype code on GitHub. The mentors have a long track record of, and commitment to, open source software, including Cytoscape, a previous participant in GSoC.


Tumor Heterogeneity Analysis Tool

Goal

Create a web-based prototype for visualizing genomic heterogeneity, including multiple genomic snapshots over time and space.

Background

Join us in building a new open-source platform for precision (or "personalized") cancer medicine.

A key challenge in cancer genomics is the genomic heterogeneity seen within tumors. For example, two biopsies from different sites of the same tumor may have very different mutations. Likewise, a tumor biopsy taken after treatment may show the emergence of new resistance mutations that were not detected in the original sample.

For general background, see:

Project Synopsis

Create a web-based prototype for visualizing and analyzing multiple genomic snapshots over time and space.

Schematic of Tumor Heterogeneity Analysis Tool.

The prototype should aid researchers and clinicians in answering the following questions:

  • What type of treatment did the patient receive and when?
  • When were biopsies taken and which tissues were used?
  • What genomic changes have occurred over time? For example, have any resistance mutations emerged?
  • How do two or more biopsies taken at the same time point compare to one another?
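The last two questions come down to set comparisons between genomic snapshots. The following is a minimal sketch of the data side only, written in Python for brevity (the prototype itself is envisioned in JavaScript). The `Biopsy` record, its fields, the helper names, and the gene labels are all hypothetical and made up for illustration; they are not part of any OncoBlocks code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Biopsy:
    """One genomic snapshot of a tumor (all fields are illustrative)."""
    label: str
    day: int              # days since diagnosis
    site: str             # tissue the sample was taken from
    mutations: frozenset  # mutation labels detected in this sample


def emerged_mutations(earlier: Biopsy, later: Biopsy) -> frozenset:
    """Mutations seen in the later biopsy but not in the earlier one."""
    return later.mutations - earlier.mutations


def compare_biopsies(a: Biopsy, b: Biopsy) -> dict:
    """Split the mutations of two biopsies into shared and private sets."""
    return {
        "shared": a.mutations & b.mutations,
        "only_" + a.label: a.mutations - b.mutations,
        "only_" + b.label: b.mutations - a.mutations,
    }


# Made-up sample data for a single hypothetical patient.
pre = Biopsy("pre", 0, "lung", frozenset({"GENE_A p.V600E", "GENE_B p.R175H"}))
post = Biopsy(
    "post", 210, "lung",
    frozenset({"GENE_A p.V600E", "GENE_B p.R175H", "GENE_C p.L1196M"}),
)
liver = Biopsy("liver", 210, "liver", frozenset({"GENE_A p.V600E", "GENE_D p.G12D"}))

# Candidate resistance mutations: present after treatment, absent before.
print(sorted(emerged_mutations(pre, post)))

# Two biopsies taken at the same time point, from different sites.
print(compare_biopsies(post, liver))
```

A visualization layer could draw the same sets on a timeline, with one row per biopsy and one marker per mutation.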

This is an exploratory project that requires creative and interactive data visualization. The mentor will provide additional genomics background and sample data sets. The project requires a few rounds of prototyping with sample data, and students are free to explore multiple ideas.

A schematic of the proposed prototype is shown in the figure above. We currently envision that the application will be written in JavaScript using the D3 visualization library.

Students may also wish to draw upon inspiration from these projects:

  • Chronoline.js: An elegant JavaScript library for drawing compact timelines.
  • UpSet: An inspired visualization application for exploring sets.

Dependency

None.

Mentors

Ethan Cerami (Dana-Farber Cancer Institute).

Skills needed

Strong JavaScript skills. Bioinformatics skills and/or experience with the D3 data visualization library would be significant pluses.


Exceptional Responder Analysis Tool

Goal

Create a web-based tool for mining the cancer genomes of "exceptional responders", i.e., patients who respond uniquely well to a specific therapeutic intervention.

Background

The cancer research community has recently started to mine genomic data to better understand "exceptional responders", i.e., patients who respond uniquely well to a specific therapeutic intervention. In these cases, a patient's tumor may contain a unique mutation that renders the patient sensitive to therapy. Knowing which mutations confer sensitivity then allows clinicians to identify other patients who may benefit from the same therapeutic intervention.

For general background, see:

Project Synopsis

Create a web-based tool for comparing one or more cancer genomes against a background set of other cancer genomes.

Schematic of Exceptional Responder Analysis Tool.

The prototype should aid researchers and clinicians in answering the following questions:

  • Which patients are responding extremely well to therapy?
  • Do these "exceptional responders" have unique mutations that are not present in the rest of the patient cohort?
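The second question can be sketched as a comparison between responders and the rest of the cohort. The example below is a minimal sketch in Python, assuming each patient is reduced to a set of mutated gene labels. The function names, patient IDs, and gene labels are hypothetical. The p-value is a one-sided hypergeometric (Fisher-style) tail, which is one possible choice of test and not necessarily the one the project would adopt.

```python
from math import comb


def unique_to_responders(responders: dict, background: dict) -> set:
    """Mutations found in at least one responder and in no background patient."""
    responder_mutations = set().union(*responders.values())
    background_mutations = set().union(*background.values())
    return responder_mutations - background_mutations


def enrichment_p_value(responders_with: int, num_responders: int,
                       total_with: int, cohort_size: int) -> float:
    """One-sided hypergeometric tail probability.

    Probability of seeing at least `responders_with` mutated patients among
    `num_responders` responders, if `total_with` mutated patients were spread
    at random over a cohort of `cohort_size` patients.
    """
    upper = min(total_with, num_responders)
    favourable = sum(
        comb(total_with, i) * comb(cohort_size - total_with, num_responders - i)
        for i in range(responders_with, upper + 1)
    )
    return favourable / comb(cohort_size, num_responders)


# Made-up cohort: 2 exceptional responders and 8 other patients.
responders = {"P01": {"GENE_X", "GENE_Y"}, "P02": {"GENE_X"}}
background = {f"P{i:02d}": {"GENE_Y"} for i in range(3, 11)}

print(unique_to_responders(responders, background))

# GENE_X is mutated in both responders and in nobody else.
p = enrichment_p_value(responders_with=2, num_responders=2,
                       total_with=2, cohort_size=10)
print(round(p, 4))
```

With cohorts this small, any such p-value is only a rough screen; candidate mutations would still need review against the clinical and genomic context.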

This is an exploratory project that involves both data visualization and statistical analysis. The mentor will provide additional genomics background and sample data sets. The project requires a few rounds of prototyping with sample data, and students are free to explore multiple options for data visualization across diverse genomic data types.

A schematic of the proposed prototype is shown in the figure above. We currently envision that the application will be written in JavaScript using the D3 visualization library.

Dependency

None.

Mentors

Ethan Cerami (Dana-Farber Cancer Institute).

Skills needed

Strong JavaScript skills. Bioinformatics skills and/or experience with the D3 data visualization library would be significant pluses.


Scalable Genomic Data Warehouse

Goal

Create a prototype of a scalable data warehouse for storing cancer genomic data.

Background

With recent initiatives, including The Cancer Genome Atlas Project, Genomics England, and the recently announced NIH Precision Medicine Initiative, the scientific community will soon have access to hundreds of thousands, perhaps millions, of genomes. To leverage these data sets, the community requires new data warehouses for storing and accessing genomic alterations identified within these studies.

Project Synopsis

The goal of this project is to prototype and benchmark a new scalable warehouse for storing and querying the large set of genomic data that is likely to be generated over the next few years. Specific tasks include:

  • creating a robust set of simulated genomic data to be used for benchmarking.
  • benchmarking the storage and querying of large-scale genomic data sets in multiple non-relational databases, including MongoDB and SciDB.
  • building a prototype data warehouse.
  • building a prototype web service interface for querying the data warehouse.
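As a rough illustration of the first two tasks, the sketch below generates reproducible simulated mutation records and times a simple query over them. It is written in Python and uses only the standard library; it does not talk to MongoDB or SciDB. The record fields, the gene list, the `SIM-` patient ID scheme, and the function names are all assumptions made for this example. A real benchmark would load the same records into each candidate database and time equivalent queries there.

```python
import random
import time

# Small illustrative vocabularies; a realistic simulation would use far more.
GENES = ["TP53", "KRAS", "EGFR", "ALK", "BRAF", "PIK3CA"]
VARIANT_TYPES = ["missense", "nonsense", "frameshift", "splice_site"]


def simulate_mutations(num_patients, max_mutations_per_patient=5, seed=0):
    """Return a list of document-style mutation records (one dict per mutation)."""
    rng = random.Random(seed)  # seeded so that benchmark inputs are reproducible
    records = []
    for patient_index in range(num_patients):
        patient_id = f"SIM-{patient_index:06d}"
        for _ in range(rng.randint(1, max_mutations_per_patient)):
            records.append({
                "patient_id": patient_id,
                "gene": rng.choice(GENES),
                "variant_type": rng.choice(VARIANT_TYPES),
                "allele_fraction": round(rng.uniform(0.05, 1.0), 3),
            })
    return records


def patients_with_gene(records, gene):
    """Example query: which patients carry any mutation in the given gene?"""
    return {r["patient_id"] for r in records if r["gene"] == gene}


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


records = simulate_mutations(num_patients=1000, seed=42)
hits, elapsed = timed(patients_with_gene, records, "ALK")
print(f"{len(records)} records, {len(hits)} patients with ALK, {elapsed:.6f} s")
```

Fixing the seed keeps the data set identical across runs, so timing differences between storage backends are not confounded by differences in the input.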

The mentors will provide additional genomics background, including sample data sets, and access to cloud-based servers for building multi-node, sharded database servers.

Dependency

None.

Mentors

Ethan Cerami (Dana-Farber Cancer Institute). Will Oemler (Blueprint Medicines).

Skills needed

Strong Java skills. Desire to learn NoSQL database technologies, including MongoDB and SciDB. Bioinformatics skills would be a plus, but are not required.


Extraction of Clinical Trial Biomarkers

Goal

Build a prototype tool for extracting genomic eligibility criteria from clinical trial records generated by the National Institutes of Health (NIH).

Background

A project in natural language processing for cancer genomics.

One of the key promises of precision or personalized medicine is to match a patient's cancer genome to specific targeted therapies or enrollment in new clinical trials.

The National Institutes of Health (NIH) provides summary data for all clinical trials in the U.S. Most clinical trials enumerate specific patient eligibility requirements. For example, one clinical trial aims to test a targeted cancer therapy in patients with genomic alterations in a gene called ALK. Given knowledge of genomic alterations in ALK, one could therefore recommend or flag a patient for clinical trial enrollment.

Project Synopsis

The goal of this project is to build a prototype tool for extracting and reviewing genomic markers from clinical trial records. Complete clinical trial data is available from the National Institutes of Health, and data sets are also available for download in XML format. Unfortunately, genomic biomarkers are not specifically enumerated as distinct elements within the XML, and it remains an open question how best to extract this information.
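As a starting point for discussion, here is a deliberately naive baseline in Python: it pulls free-text eligibility criteria out of an XML record and flags lines that mention a known gene symbol. The sample record is made up, and its element names (`clinical_study`, `eligibility`, `criteria`, `textblock`) are an assumption about the record layout and should be checked against the actual downloads. The gene list and function name are also hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Tiny illustrative gene vocabulary; a real tool would use a full gene list.
GENE_SYMBOLS = {"ALK", "EGFR", "BRAF", "KRAS"}
GENE_PATTERN = re.compile(r"\b(" + "|".join(sorted(GENE_SYMBOLS)) + r")\b")

# Made-up record; the element names are assumed, not taken from a real trial.
SAMPLE_XML = """<clinical_study>
  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:
        - Confirmed ALK rearrangement or ALK mutation.
        - No prior treatment with EGFR inhibitors.
      </textblock>
    </criteria>
  </eligibility>
</clinical_study>"""


def extract_gene_mentions(xml_text):
    """Return (gene, line) pairs for every gene symbol found in criteria text."""
    root = ET.fromstring(xml_text)
    mentions = []
    for block in root.iter("textblock"):
        for line in (block.text or "").splitlines():
            line = line.strip()
            for gene in GENE_PATTERN.findall(line):
                mentions.append((gene, line))
    return mentions


for gene, line in extract_gene_mentions(SAMPLE_XML):
    print(gene, "|", line)
```

Note that the EGFR line is flagged even though it describes prior treatment, not a required genomic alteration. Keyword matching alone cannot tell these apart, which is why the project pairs natural language processing with a curation and review step.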

This is an exploratory project that involves both natural language processing and building a curation system in which individuals will be able to curate and review biomarkers. The mentor will provide additional genomics background and sample data sets. The project will require a few rounds of prototyping with sample data, and students are free to explore multiple options for natural language processing and data curation.

Dependency

None.

Mentors

Ethan Cerami (Dana-Farber Cancer Institute).

Skills needed

Strong Java or Python skills. Experience in natural language processing is a plus, as is experience or interest in cancer genomics.