Generated on 2023-02-11 at 16:44.
Snaptron Reference Tables
Table 1. Query Types
Query Type | Description | Multiplicity | Format | Example |
---|---|---|---|---|
Region | chromosome based coordinates range (1-based); HUGO gene name | 1 | chr(1-22,X,Y,M):1-size of chromosome; gene_name | chr21:1-500; CD99 |
Filter | range over summary statistic column values | 1 or more | column_name(>:,<:,:)number (integer or float) | coverage_avg>:10 |
Sample Metadata | keyword and numeric range search over sample metadata | 1 or more | fieldname(>:,<:,:)keyword | description:cortex; SMRIN>:8 |
Sample IDs | limits results to only junctions found in specified sample IDs | 1 | sids=\d+[,\d+]* | sids=2,40,50,100 |
Sample Group* | limits to only junctions found in specified sample group | 1 | sids=groupname | sids=Brain (gtex) |
Snaptron IDs | one or more snaptron_ids | 1 | ids=\d+[,\d+]* | ids=5,7,8 |
Sample IDs | one or more sample_ids | 1 | ids=\d+[,\d+]* | ids=20,40,100 |
* The sids=groupname parameter is based on predefined groups of sample IDs. These group definitions are found in the data directory of the compilation being queried, typically in a file samples.groups.tsv, e.g. for GTEx: http://snaptron.cs.jhu.edu/data/gtex/samples.groups.tsv
The Region query type is required to be present if the Filter, Sample Metadata, Sample Group, and/or Sample IDs types are used.
Table 2. List of Snaptron Parameters
Parameter | WSI Endpoints | Values | # Occurrences | Example | Description |
---|---|---|---|---|---|
regions | snaptron;genes | chr[1-22XYM]:\d+-\d+;HUGO gene | 1 but can take multiple arguments separated by a comma representing an OR | chr1:1-5000;DRD4 | coordinate intervals and/or HUGO gene names |
sids | snaptron;genes | sids=(\d+[,\d+]*)\|(groupname) | 1 | sids=30,100,150 ; sids=Brain | filter to only junctions from >=1 samples in this list; uses the samples’ rail_ids, can also take a predefined sample group name (e.g. GTEx tissue) |
ids* | snaptron;genes;samples | ids=\d+[,\d+]* | 1 | ids=5,6,7 | ID filter for snaptron_id (endpoint=snaptron) and rail_id (endpoint=samples); this only returns the specific records with those IDs |
rfilter | snaptron;genes | fieldname[><!:]value | 0 or more | rfilter=samples_count>:5&rfilter=coverage_sum:3 | point range filter (inclusion) |
sfilter | snaptron;genes;samples | fieldname:value OR freetext | 0 or more | sfilter=description:Cortex&sfilter=library_strategy:RNA-Seq | sample metadata filter (inclusion) |
contains | snaptron;genes | 0,1 | 0-1 occurrences | contains=1 | return only those junctions whose start and end coordinates are within the boundaries of the region (using either coordinates directly or passed in gene name) |
exact | snaptron;genes | 0,1 | 0-1 occurrences | exact=1 | return only those junctions whose start and end coordinates match the boundaries of the region requested |
either | snaptron;genes | 0,1,2 | 0-1 occurrences | either=2 | return only those junctions whose start (either=1) or end (either=2) coordinate matches or is within the boundaries of the region requested |
header | snaptron;genes | 0,1 | 0-1 occurrences | header=0 | include the header as the first line (or not) |
fields** | snaptron;genes | fields=fieldname[,fieldname]* | 0 or more unique fieldnames within one fields clause | fields=snaptron_id,samples_count | which fields to return |
* The ids parameter cannot be used with other parameters.
** can include non-return field options such as: rc (result count)
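As an illustration of how the parameters in Table 2 combine, the following is a minimal sketch of a query-string builder for the snaptron endpoint. The function name build_query and its argument names are hypothetical and not part of Snaptron. The sketch enforces only the rules stated in this document: ids cannot be combined with other parameters, regions is required when sids, rfilter, or sfilter is used, and the flag parameters are limited to their listed values. It joins clauses with & as in the Table 2 examples and does not percent-encode anything; whether a given HTTP client needs to encode characters such as > is left to the caller.

```python
def build_query(regions=None, sids=None, ids=None, rfilters=(), sfilters=(),
                contains=None, exact=None, either=None, header=None, fields=()):
    """Assemble a Snaptron-style query string (hypothetical helper)."""
    # Allowed values for the flag parameters, as listed in Table 2.
    flags = {"contains": (contains, (0, 1)), "exact": (exact, (0, 1)),
             "either": (either, (0, 1, 2)), "header": (header, (0, 1))}
    for name, (value, allowed) in flags.items():
        if value is not None and value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}")

    others_used = (regions is not None or sids is not None or rfilters
                   or sfilters or fields
                   or any(v is not None for v, _ in flags.values()))
    if ids is not None:
        # Footnote to Table 2: ids cannot be used with other parameters.
        if others_used:
            raise ValueError("ids cannot be combined with other parameters")
        return "ids=" + ",".join(str(i) for i in ids)

    # Note under Table 1: a region is required alongside these query types.
    if regions is None and (sids is not None or rfilters or sfilters):
        raise ValueError("regions is required with sids, rfilter, or sfilter")

    parts = []
    if regions is not None:
        parts.append(f"regions={regions}")
    if sids is not None:
        # sids is either a group name (str) or an iterable of sample IDs.
        value = sids if isinstance(sids, str) else ",".join(str(s) for s in sids)
        parts.append(f"sids={value}")
    parts += [f"rfilter={f}" for f in rfilters]   # rfilter may repeat
    parts += [f"sfilter={f}" for f in sfilters]   # sfilter may repeat
    for name, (value, _) in flags.items():
        if value is not None:
            parts.append(f"{name}={value}")
    if fields:
        parts.append("fields=" + ",".join(fields))
    return "&".join(parts)


query = build_query(regions="chr1:1-5000",
                    rfilters=["samples_count>:5", "coverage_sum:3"],
                    header=0)
print(query)
```

The order in which parameters are emitted is a choice made for this sketch; the tables above do not prescribe one.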
Tables 3 and 4 show the queryable fields for the region and range query types, respectively. Fields from Tables 3 and 4 can be mixed together in the same query, though only one region predicate is allowed per query, as specified in Table 1 above.
Table 3. Region Query Fields (“regions” parameter)
Field | Range of Values | Example | Description |
---|---|---|---|
coordinate* | chr(1-22;X;Y;M):1-size of chromosome | chr1:4-100 | chromosome:start-end |
gene symbol* | a-zA-Z0-9 | CD99 | HUGO (HGNC) gene symbols |
*you can either pass a coordinate string or a gene symbol in the interval query segment, but not both
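To make the coordinate-or-gene distinction concrete, here is a small sketch that classifies a single region argument using only the value ranges given in Table 3. The function classify_region and both regular expressions are assumptions written for this example, not Snaptron code. The sketch cannot check the upper bound ("size of chromosome") because chromosome lengths are not listed here, and it restricts gene symbols to the a-zA-Z0-9 range shown in the table.

```python
import re

# chr followed by 1-22, X, Y, or M, then :start-end (per Table 3).
COORD_RE = re.compile(r"chr(?:[1-9]|1[0-9]|2[0-2]|X|Y|M):(\d+)-(\d+)")
# Gene symbols limited to the character range listed in Table 3.
GENE_RE = re.compile(r"[A-Za-z0-9]+")


def classify_region(region):
    """Return "coordinate" or "gene" for one region argument, else raise."""
    match = COORD_RE.fullmatch(region)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        # Coordinates are 1-based, and the interval must not be reversed.
        if start < 1 or end < start:
            raise ValueError(f"invalid coordinate range: {region}")
        return "coordinate"
    if GENE_RE.fullmatch(region):
        return "gene"
    raise ValueError(f"not a coordinate string or gene symbol: {region}")


print(classify_region("chr1:4-100"))
print(classify_region("CD99"))
```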
Often the query filter columns (Table 4) can be used as a way to reduce the number of false positive junctions. This can be done easily with the two columns: samples_count and coverage_sum. Some suggested values from our own research are presented in Table 5.
Table 4. Query Filter Fields (“rfilter” parameter)
Field | Range of Values | Example | Description |
---|---|---|---|
length | 1-500K | intron_length<:5000 | length of exon-exon junction (intron) |
annotated* | 0 or 1 | annotated:1 | whether both left and right splice sites are in one or more annotations (default is both) |
left_annotated* | 0 or 1 | left_annotated:1 | whether the left splice site is in one or more annotations |
right_annotated* | 0 or 1 | right_annotated:1 | whether the right splice site is in one or more annotations |
strand | | strand:+ | which strand to require (default is both) |
samples_count | 1-Inf | samples_count>:5 | number of samples in which this junction has one or more reads covering it |
coverage_sum | 1-Inf | coverage_sum>:10 | aggregate count of reads covering the junction across all samples the junction appears in |
coverage_avg | 1.0-Inf | coverage_avg>:5.0 | average of read coverage across all samples the junction appears in |
coverage_median | 1.0-Inf | coverage_median>:6.0 | median of read coverage across all samples the junction appears in |
* these fields are treated as booleans for the purpose of searching but as Strings when returned since if they are not 0, they will be a list of one or more annotation source abbreviations. Also, importantly, if each splice site of a junction (left/right) is annotated separately (not connected), annotated will be 0 but BOTH the left and right annotated fields will not be 0.
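The returned values of the three annotation fields can be interpreted as in the following sketch. It relies only on the rule above that a value of 0 means "not annotated" and anything else is a list of annotation source abbreviations. The function name annotation_status, the status labels, and the abbreviations in the usage lines (such as "aB") are invented for illustration.

```python
def is_set(value):
    """A returned annotation field counts as true unless it is the string "0"."""
    return value != "0"


def annotation_status(annotated, left_annotated, right_annotated):
    """Summarize the three annotation fields of one returned junction."""
    if is_set(annotated):
        # The junction as a whole (both sites, connected) is annotated.
        return "junction annotated"
    left, right = is_set(left_annotated), is_set(right_annotated)
    if left and right:
        # Each splice site is annotated, but not together as one junction.
        return "both splice sites annotated separately"
    if left or right:
        return "one splice site annotated"
    return "unannotated"


# "aB" and "cD" are made-up annotation source abbreviations.
print(annotation_status("aB", "aB", "aB"))
print(annotation_status("0", "aB", "cD"))
```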
The return format is a TAB-delimited series of fields where each line represents a unique intron call. Table 5 displays the complete list of fields in the return format of the Snaptron web service. The chromosome, start, and end fields are a special case where the index is a combination of all three of them together.
Table 5. Complete list of Snaptron Fields In Return Format
Field Index | Indexed? | Field Name | Type | Description |
---|---|---|---|---|
1 | No | DataSource:Type | Abbrev:Single Character | Differentiates between a return line of type Intron (I), Sample (S), or Gene (G). |
2 | Yes | snaptron_id | Integer | stable, unique ID for Snaptron junctions |
3 | Yes | chromosome | String | Reference ID for genomics coordinates |
4 | Yes | start | Integer | beginning (left) coordinate of intron |
5 | Yes | end | Integer | last (right) coordinate of intron |
6 | Yes | length | Integer | Length of intron coordinate span |
7 | Yes | strand | Single Character | Orientation of intron (Watson or Crick) |
8 | Yes | annotated | String | If both ends of the intron are annotated as a splice site in some annotation |
9 | No | left_motif | String | Splice site sequence bases at the left end of the intron |
10 | No | right_motif | String | Splice site sequence bases at the right end of the intron |
11 | Yes | left_annotated | String | If the left end splice site is annotated or not and which annotations it appears in (maybe more than once) |
12 | Yes | right_annotated | String | If the right end splice site is annotated or not, same as left_annotated |
13 | No | samples* | Comma separated list of tuples: integer:integer | The list of samples which had one or more reads covering the intron and their coverages. IDs are from the IntropolisDB. |
14 | Yes | samples_count | Integer | Total number of samples that have one or more reads covering this junction |
15 | Yes | coverage_sum | Integer | Sum of all samples coverage for this junction |
16 | Yes | coverage_avg | Float | Average coverage across all samples which had at least 1 read covering the intron in the first pass alignment |
17 | Yes | coverage_median | Float | Median coverage across all samples which had at least 1 read covering the intron in the first pass alignment |
18 | No | source_dataset_id | Integer | Snaptron ID for the compilation (GTEx=1, SRAv2=2, TCGA=4) |
* this field always starts with a comma (,); this is due to how it is searched when samples are used to filter a junction query (R+M or R+F+M). The format of this field is a comma-delimited list of samples and their raw read coverage in that sample. It uses the rail_id of the sample, in the form ,rail_id1:coverage1,rail_id2:coverage2,... This rail_id matches the first column in the relevant compilation’s samples.tsv file available from the links previously listed in the Raw Data and Indices section.
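A returned line can be decoded along the following lines. This is a sketch under stated assumptions: the field order and types are taken from Table 5, the helper names parse_samples and parse_line are hypothetical, and the example record (including the "X:I" data source value) is invented rather than real Snaptron output. The leading comma in the samples field is handled by skipping the empty first item.

```python
# Field names in the order given by Table 5.
FIELD_NAMES = [
    "DataSource:Type", "snaptron_id", "chromosome", "start", "end", "length",
    "strand", "annotated", "left_motif", "right_motif", "left_annotated",
    "right_annotated", "samples", "samples_count", "coverage_sum",
    "coverage_avg", "coverage_median", "source_dataset_id",
]
INT_FIELDS = ("snaptron_id", "start", "end", "length", "samples_count",
              "coverage_sum", "source_dataset_id")
FLOAT_FIELDS = ("coverage_avg", "coverage_median")


def parse_samples(field):
    """Turn ",rail_id1:coverage1,rail_id2:coverage2" into {rail_id: coverage}."""
    coverage_by_sample = {}
    for item in field.split(","):
        if not item:
            continue  # the leading comma produces an empty first item
        rail_id, coverage = item.split(":")
        coverage_by_sample[int(rail_id)] = int(coverage)
    return coverage_by_sample


def parse_line(line):
    """Split one TAB-delimited junction line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    record = dict(zip(FIELD_NAMES, values))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    for name in FLOAT_FIELDS:
        record[name] = float(record[name])
    record["samples"] = parse_samples(record["samples"])
    return record


# Invented example record; not taken from a real compilation.
example = "\t".join(["X:I", "42", "chr1", "100", "200", "101", "+", "0",
                     "GT", "AG", "0", "0", ",3:5,7:2", "2", "7", "3.5",
                     "3.5", "1"])
record = parse_line(example)
print(record["samples"], record["coverage_sum"])
```

Lines of type Sample (S) or Gene (G) may carry different columns; this sketch assumes every line follows the junction layout in Table 5 and that any header line has been removed first.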