Generated on 2023-02-11 at 16:44.

Snaptron Sample Metadata

Each of the above compilations has its own set of sample metadata with varying field names and definitions. Snaptron indexes these metadata fields in a document store (Lucene) for full text retrieval. Numeric columns (e.g. RIN in the GTEx compilation) are indexed to support range based lookups.

Query metadata and sample metadata text is converted to lower case before indexing/querying to make searches case-insensitive.

Both sample-only searches and junction searches limited by a sample predicate can be performed:

curl "http://snaptron.cs.jhu.edu/gtex/samples?sfilter=SMRIN>8"

will return a list of samples which have a RIN value > 8.

curl "http://snaptron.cs.jhu.edu/srav2/snaptron?regions=chr6:1-10714015&rfilter=samples_count>:10&sfilter=description:cortex"

will return a list of junctions and their list of summary stats calcuated from the intersection of the region and rfilter predicates and which contain at least one sample in the list of samples which have “cortex” in their description field.

Further, you can also query by sample ID to simply return all the metadata for the submitted set of sample IDs:

curl "http://snaptron.cs.jhu.edu/srav2/samples?ids=0,2,500"

Sample Metadata Fields

A complete list of all sample metadata fields and types stored and indexed by Snaptron are available for each compilation:

  • TCGA

http://snaptron.cs.jhu.edu/data/tcga/samples.fields.tsv

  • GTEx

http://snaptron.cs.jhu.edu/data/gtex/samples.fields.tsv

  • SRAv2

http://snaptron.cs.jhu.edu/data/srav2/samples.fields.tsv

  • SRAv1

http://snaptron.cs.jhu.edu/data/srav1/samples.fields.tsv

Sample Metadata Field Types

Lucene types are reported for each field in the above TSV files:

  • text

Input field tokenized into one or more terms by whitespace before indexing to support “contains” searching

Example: a free-text description of the RNA-seq sequencing protocol

  • string

Input field indexed as one term (not tokenized)

Example: controlled vocabulary field such as an NCBI sample accession

  • integer

Numeric input field indexed to support range searches

Example: age at diagnosis for a cancer patient

  • float

Numeric input field indexed to support range searches, used if any floating point values were present in input

Example: RNA-seq integrity value (RIN)

NOTE: Lucene stores the input field as a float, but range queries need to be specified as integers for now, even for float fields

If a metadata field for a particular sample is empty/NA or is a string and the field type is numeric, that particular entry is set to NULL in Lucene.