GSA Data Standards

1. Metadata

1.1 BioProject

BioProject is a searchable collection of complete and incomplete (in-progress) large-scale molecular projects including genome sequencing and assembly, transcriptome, metagenomic, annotation, expression and mapping projects.

1.1.1 ID system

BioProject accession No. is prefixed with 'PRJC' and followed by one capital letter and 6 digits. For example, PRJCA000001.

1.1.2 Attributes

See the Standard of BioProject metadata for detailed information.

1.2 BioSample

BioSample contains descriptions of biological source materials used in studies that have data in other National Genomics Data Center databases such as Genome Sequence Archive, Genome Warehouse, Gene Expression Nebulas, Genome Variation Map, Methylation Bank, etc. The National Genomics Data Center, working collaboratively with multiple partner institutions/laboratories, develops a family of standards for big omics data representation, analysis, search and exchange.

1.2.1 ID system

BioSample accession No. is prefixed with 'SAMC' and followed by 6 digits. For example, SAMC000001.

1.2.2 Sample types
1.2.3 Attributes

See the Standard of BioSample metadata for detailed information.

1.3 GSA

The Genome Sequence Archive (GSA) is a data repository specialized for archiving raw sequence reads.

1.3.1 ID system

A GSA object consists of a series of Experiments and Runs.

GSA Accession No. is prefixed with 'CRA' and followed by 6 digits. For example, CRA000001.

Experiment Accession No. is prefixed with 'CRX' and followed by 6 digits. For example, CRX000001.

Run Accession No. is prefixed with 'CRR' and followed by 6 digits. For example, CRR000001.
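
The ID formats described above can be checked mechanically. The following is a minimal Python sketch, assuming the prefixes and digit counts are exactly as stated in sections 1.1.1, 1.2.1 and 1.3.1; the names ACCESSION_PATTERNS and classify_accession are illustrative and are not part of any GSA tool.

```python
import re

# Patterns transcribed from the ID rules stated in this document.
# Assumption: the digit counts are exactly as written (6 digits each).
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"PRJC[A-Z]\d{6}"),
    "BioSample": re.compile(r"SAMC\d{6}"),
    "GSA": re.compile(r"CRA\d{6}"),
    "Experiment": re.compile(r"CRX\d{6}"),
    "Run": re.compile(r"CRR\d{6}"),
}


def classify_accession(accession):
    """Return the object type an accession matches, or None if it matches none."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession):
            return kind
    return None


if __name__ == "__main__":
    for acc in ["PRJCA000001", "SAMC000001", "CRA000001", "CRX000001", "CRR000001"]:
        print(acc, "->", classify_accession(acc))
```

Because fullmatch is used, a string with extra leading characters, such as a SUBPRJCA-style submission ID, is not classified as a BioProject accession.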

1.3.2 Experiment
1.3.2.1 Attributes
Attributes (* mandatory attribute)
Name Description Tips Value Format
*ID Experiment IDs, prefixed with 'E' and followed by a natural number, such as E1, E2, E3.... The Experiment ID must be unique.
*Experiment title Short description that will identify the Experiment on public pages. It can have any format, but we suggest that you make it concise, unique, consistent, and as informative as possible. Every Experiment title from the same Sample must be unique. {text}
*BioProject accession BioProject accession. Typical of the form PRJCA [number], NOT SUBPRJCA [number], like PRJCA000005.
*BioSample name Sample Name is a name that you choose for the sample. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. Every Sample Name from a single Submitter must be unique. {text}
*Platform This column has drop-down menus that allow you to select from a controlled vocabulary. Once specified for one row, these values can be copied-and-pasted down. See the Platform form for details.
*Library Construction / Experimental Design Free-form description of the methods used to create the sequencing library; a brief 'materials and methods' section. e.g., DNA of sorted NCSCs was extracted from the cell line using a QIAamp DNA Mini Kit and sheared to approximately 300-500 bp using a Covaris S220 instrument. Then the libraries were constructed through end-repair, A-tailing, adapter ligation and bisulfite conversion using a ZymoEZ DNA Methylation Kit. {text}
Library name Name of Library.
*Strategy This column has drop-down menus that allow you to select from a controlled vocabulary. Once specified for one row, these values can be copied-and-pasted down. See the Strategy form for details.
*Source This column has drop-down menus that allow you to select from a controlled vocabulary. Once specified for one row, these values can be copied-and-pasted down. See the Source form for details.
*Selection This column has drop-down menus that allow you to select from a controlled vocabulary. Once specified for one row, these values can be copied-and-pasted down. See the Selection form for details.
*Layout This column has drop-down menus that allow you to select from a controlled vocabulary. Once specified for one row, these values can be copied-and-pasted down.
*Read length for mate 1 (bp) Planned read length of mate 1 for your submission. When the Platform is PacBio Sequel or an Ion Torrent series sequencer, this column may be left empty.
Read length for mate 2 (bp) Planned read length of mate 2 for your submission. Required for paired-end data only.
Insert size (bp) Fragment size for Paired reads. Please provide a numerical value for the median interval of the insert size.
Nominal size (bp) Nominal size
Nominal standard deviation (bp) Standard deviation of insert size
Planned number of cycles Planned number of cycles for your submission. When the Platform is Helicos HeliScope, the Planned number of cycles is required.
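
The conditional rules in the attribute table above (mate 2 length only for paired-end data, planned cycles for Helicos HeliScope, and so on) can be expressed as a small pre-submission check. This is only a sketch: the dictionary keys (id, bioproject, instrument_model, layout, read_length_1, read_length_2, planned_cycles) and the layout value "PAIRED" are hypothetical names chosen for illustration, and matching the exemption by instrument-model prefix is an assumption.

```python
import re


def check_experiment(row):
    """Return a list of problems found in one Experiment row (a dict).

    The keys and the "PAIRED" layout value are illustrative assumptions,
    not the column names used by the GSA submission template.
    """
    problems = []
    if not re.fullmatch(r"E\d+", row.get("id", "")):
        problems.append("id: expected 'E' followed by a number, e.g. E1")
    if not re.fullmatch(r"PRJC[A-Z]\d{6}", row.get("bioproject", "")):
        problems.append("bioproject: expected a PRJCA-style accession")

    model = row.get("instrument_model", "")
    # Assumption: the exemption is matched by instrument-model name prefix.
    exempt = model.startswith(("PacBio Sequel", "Ion Torrent"))
    if not exempt and not row.get("read_length_1"):
        problems.append("read_length_1: required for this platform")
    if row.get("layout") == "PAIRED" and not row.get("read_length_2"):
        problems.append("read_length_2: required for paired-end data")
    if model == "Helicos HeliScope" and not row.get("planned_cycles"):
        problems.append("planned_cycles: required for Helicos HeliScope")
    return problems


if __name__ == "__main__":
    example = {
        "id": "E1",
        "bioproject": "PRJCA000005",
        "instrument_model": "Illumina HiSeq 2000",
        "layout": "PAIRED",
        "read_length_1": 150,
    }
    print(check_experiment(example))
```

The example row is paired-end but lacks a mate 2 read length, so the check reports that single problem.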
1.3.2.2 Platform

Platform: the sequencing platform and instrument model.

Platform Instrument Model
LS454 454 GS
454 GS 20
454 GS FLX
454 GS FLX Titanium
454 GS FLX+
454 GS Junior
Capillary Technologies AB 310 Genetic Analyzer
AB 3130 Genetic Analyzer
AB 3130xL Genetic Analyzer
AB 3500 Genetic Analyzer
AB 3500xL Genetic Analyzer
AB 3730 Genetic Analyzer
AB 3730xL Genetic Analyzer
ABI Solid AB 5500 Genetic Analyzer
AB 5500xl Genetic Analyzer
AB 5500x-Wl Genetic Analyzer
AB SOLiD 4 System
AB SOLiD 4hq System
AB SOLiD PI System
AB SOLiD System 1.0
AB SOLiD System 2.0
AB SOLiD System 3.0
BGISeq BGISEQ-100
BGISEQ-500
BGISEQ-1000
BGISEQ-2000
DNBSEQ-T7
MGISEQ-2000RS
CapitalBio Company BioelectronSeq 4000
Bionano Genomics BioNano IRYS
BioNano SAPHYR
Complete Genomics Complete Genomics
Daan Gene DA8600
Helicos BioSciences Corporation Helicos HeliScope
HYK Genetic HYK-PSTAR-IIA
Illumina Illumina HiSeq X Ten
Illumina Genome Analyzer
Illumina Genome Analyzer II
Illumina Genome Analyzer IIx
Illumina HiScanSQ
Illumina HiSeq 1000
Illumina HiSeq 1500
Illumina HiSeq 2000
Illumina HiSeq 2500
Illumina HiSeq 3000
Illumina HiSeq 4000
Illumina MiSeq
Illumina MiniSeq
Illumina NovaSeq 5000
Illumina NovaSeq 6000
Illumina Nextseq 500
Illumina Nextseq 550
Illumina iSeq 100
Berry Genomics NextSeq CN500
IonTorrent Ion Torrent PGM
Ion Torrent Proton
Ion Torrent S5
Ion Torrent S5 XL
Oxford Nanopore OXFORD_NANOPORE GridION
OXFORD_NANOPORE MinION
OXFORD_NANOPORE PromethION
PacBio SMRT PacBio RS
PacBio RS II
PacBio Sequel
PacBio Sequel II
1.3.2.3 Strategy

Strategy: the sequencing technique intended for the library.

Strategy Sequencing strategy used in the experiment
WGA Whole genome amplification.
WGS Whole genome shotgun.
WES Whole exome sequencing is a genomic technique for sequencing all of the protein-coding genes in a genome (known as the exome).
WXS Random sequencing of exonic regions selected from the genome.
RNA-Seq Random sequencing of whole transcriptome.
miRNA-Seq Micro RNA and other small non-coding RNA sequencing.
Tn-Seq Gene fitness determination through transposon seeding.
WCS Whole chromosome (or other replicon) shotgun.
CLONE Genomic clone based (hierarchical) sequencing.
POOLCLONE Shotgun of pooled clones (usually BACs and Fosmids).
AMPLICON Sequencing of overlapping or distinct PCR or RT-PCR products.
CLONEEND Clone end (5', 3', or both) sequencing.
FINISHING Sequencing intended to finish (close) gaps in existing coverage.
ChIP-Seq Direct sequencing of chromatin immunoprecipitates.
MNase-Seq Direct sequencing following MNase digestion.
DNase-Hypersensitivity Sequencing of hypersensitive sites, or segments of open chromatin that are more readily cleaved by DNaseI.
Bisulfite-Seq Sequencing following treatment of DNA with bisulfite to convert cytosine residues to uracil depending on methylation status.
EST Single pass sequencing of cDNA templates.
FL-cDNA Full-length sequencing of cDNA templates.
CTS Concatenated Tag Sequencing.
MRE-Seq Methylation-Sensitive Restriction Enzyme Sequencing strategy.
MeDIP-Seq Methylated DNA Immunoprecipitation Sequencing strategy.
MBD-Seq Direct sequencing of methylated fractions sequencing strategy.
Synthetic-Long-Read Binning and barcoding of large DNA fragments to facilitate assembly of the fragment.
ATAC-seq Assay for Transposase-Accessible Chromatin (ATAC) strategy is used to study genome-wide chromatin accessibility. It is an alternative method to DNase-seq that uses an engineered Tn5 transposase to cleave DNA and to integrate primer DNA sequences into the cleaved genomic DNA.
ChIA-PET Direct sequencing of proximity-ligated chromatin immunoprecipitates.
FAIRE-seq Formaldehyde Assisted Isolation of Regulatory Elements.
Hi-C Chromosome Conformation Capture technique where a biotin-labeled nucleotide is incorporated at the ligation junction, enabling selective purification of chimeric DNA ligation junctions followed by deep sequencing.
ncRNA-Seq Capture of other non-coding RNA types, including post-translation modification types such as snRNA (small nuclear RNA) or snoRNA (small nucleolar RNA), or expression regulation types such as siRNA (small interfering RNA) or piRNA/piwi/RNA (piwi-interacting RNA).
RAD-Seq Restriction Site Associated DNA Sequence.
RIP-Seq Direct sequencing of RNA immunoprecipitates (includes CLIP-Seq, HITS-CLIP and PAR-CLIP).
SELEX Systematic Evolution of Ligands by EXponential enrichment.
ssRNA-seq Strand-specific RNA sequencing.
Targeted-Capture Targeted-Capture sequencing.
Tethered Chromatin Conformation Capture Tethered Chromatin Conformation Capture sequencing.
TCR-seq High throughput sequencing to map T-cell receptor (TCR) repertoires at high resolution.
BCR-seq High throughput sequencing to map B-cell receptor (BCR) repertoires at high resolution.
MeRIP-Seq MeRIP-Seq maps m6A-methylated RNA. Deep sequencing provides high-resolution reads of m6A-methylated RNA.
OTHER Library strategy not listed (please include additional info in the “design description”).
1.3.2.4 Source

Source: the library source specifies the type of source material that is being sequenced.

Source Type of genetic source material sequenced
GENOMIC Genomic DNA (includes PCR products from genomic DNA).
TRANSCRIPTOMIC Transcription products or non-genomic DNA (EST, cDNA, RT-PCR, screened libraries).
METATRANSCRIPTOMIC Transcription products from community targets.
METAGENOMIC Mixed material from metagenome.
SYNTHETIC Synthetic DNA.
VIRAL RNA Viral RNA.
OTHER Other, unspecified, or unknown library source material. (please include additional info in the “design description”)
1.3.2.5 Selection

Selection: whether any method was used to select and/or enrich the material being sequenced.

Selection Method of selection or enrichment used in the Experiment
unspecified Library enrichment, screening, or selection is not specified. (please include additional info in the “design description”)
RANDOM Random selection by shearing or other method.
PCR Source material was selected by designed primers.
RANDOM PCR Source material was selected by randomly generated primers.
RT-PCR Source material was selected by reverse transcription PCR.
HMPR Hypo-methylated partial restriction digest.
MF Methyl Filtrated.
CF-S Cot-filtered single/low-copy genomic DNA.
CF-M Cot-filtered moderately repetitive genomic DNA.
CF-H Cot-filtered highly repetitive genomic DNA.
CF-T Cot-filtered theoretical single-copy genomic DNA.
MDA Multiple displacement amplification.
MSLL Methylation Spanning Linking Library.
cDNA complementary DNA.
ChIP Chromatin immunoprecipitation.
MNase Micrococcal Nuclease (MNase) digestion.
DNAse Deoxyribonuclease (DNase) digestion.
Hybrid Selection Selection by hybridization in array or solution.
Reduced Representation Reproducible genomic subsets, often generated by restriction fragment size selection, containing a manageable number of loci to facilitate re-sampling.
Restriction Digest DNA fractionation using restriction enzymes.
5-methylcytidine antibody Selection of methylated DNA fragments using an antibody raised against 5-methylcytosine or 5-methylcytidine (m5C).
MBD2 protein methyl-CpG binding domain Enrichment by methyl-CpG binding domain.
CAGE Cap-analysis gene expression.
RACE Rapid Amplification of cDNA Ends.
size fractionation Physical selection of size appropriate targets.
Padlock probes capture method Circularized oligonucleotide probes.
Poly-A polyA enriched RNA-seq.
other Other library enrichment, screening, or selection process. (please include additional info in the “design description”)
1.3.3 Run
Attributes (* mandatory attribute)
Name Description Tips Value Format
*ID Run IDs, prefixed with 'R' and followed by a natural number such as: R1, R2, R3... The Run ID must be unique.
*Run title Short description that will identify the Run on public pages. It can have any format, but we suggest that you make it concise, unique, consistent, and as informative as possible. Every Run title from the same Experiment must be unique. {text}
*BioProject accession BioProject accession. Typical of the form PRJCA [number], NOT SubPRJCA [number], like PRJCA000005. PRJCA[number]
*Experiment ID Experiment IDs, prefixed with 'E' and followed by a natural number, such as E1, E2, E3…
*Run data file type This column has drop-down menus that allow you to select from a controlled vocabulary.
*File name 1 All data file names must be unique and must not contain spaces, brackets, periods, or forward (/) or backward (\) slashes. 1. Fastq files can be compressed using gzip or bzip2 (zip and rar are NOT accepted).
2. Do not compress BAM files.
3. Data from PacBio Sequel and Ion Torrent series sequencers can be uploaded as tar archives.
4. Double-check that your file names are accurate before sending them to us.
*MD5 checksum 1 An MD5 checksum is a 32-character alphanumeric string. 1. Mac and Linux users can generate MD5 checksums with the native command-line tools "md5sum" (Linux) and "md5" (Mac OS X).
2. Windows users need to download a third-party utility, such as winmd5free.
32-character alphanumeric string
File name 2 All data file names must be unique and must not contain spaces, brackets, periods, or forward (/) or backward (\) slashes. This field is required for paired-end data only.
MD5 checksum 2 MD5 checksums are a 32-character alphanumeric string. 32-character alphanumeric string
Reference file name Reference name. Applies when the Run data file type is BAM.
1. If you want to submit your reference file to our FTP site, fill in reference_name and reference_md5. We only accept Fasta files compressed with gzip or bzip2.
2. If your reference file is already in another database, fill in the Assembly Name or Accession and the Assembly Accession URL.
3. For PacBio Sequel and Ion Torrent series sequencers, this column may be left empty.
MD5 for reference file MD5 for reference file. 32-character alphanumeric string
Assembly Name or Accession Assembly Name or Accession.
Assembly Accession URL Assembly Accession URL. URL
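
For submitters who prefer a script to the command-line tools mentioned in the MD5 checksum row, the checksum can also be computed with Python's standard hashlib module. This is a minimal sketch; the function names md5_of_file and looks_like_md5 are illustrative.

```python
import hashlib
import re
import sys


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_md5(value):
    """Check that a string is 32 hexadecimal characters."""
    return re.fullmatch(r"[0-9a-fA-F]{32}", value) is not None


if __name__ == "__main__":
    # Usage: python md5_check.py file1.fastq.gz file2.fastq.gz
    for path in sys.argv[1:]:
        print(md5_of_file(path), path)
```

Reading in chunks keeps memory use constant, which matters for large sequencing files.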
2. Sequencing file
2.1 File type

This page reviews the submission file formats currently supported by the GSA, and gives guidance to submitters about current file formats and policies regarding GSA submissions.

File types File suffix Applicable platforms Is recommended
Fastq .fastq.gz, .fq.gz, .fastq.bz2, .fq.bz2 All Platforms Yes
Bam .bam All Platforms Yes
Sff .sff LS454, ION_TORRENT, BGISEQ-100, DA8600
Complete Genomics Native .tar.gz, .tar Complete Genomics, BGISEQ-500, BGISEQ-1000
Solid Native .tar.gz, .tar ABI SOLID
PacBio_HDF5 .tar, .tar.gz PacBio RS, PacBio RS II Recommended for PacBio RS / PacBio RS II
PacBio Sequel Native .tar, .tar.gz PacBio Sequel Recommended for PacBio Sequel
Ab1 .ab1 CAPILLARY
Oxford Nanopore Native .tar, .tar.gz Oxford Nanopore
10x Genomics .tar, .tar.gz
Bnx .bnx.gz, .bnx.bz2 Bionano Genomics
Fasta .fasta.gz, .fasta.bz2, .fa.gz, .fa.bz2
Helicos Native .tar, .tar.gz Helicos BioSciences Corporation
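
The suffix column of the table above can be used to screen file names before upload. The following Python sketch transcribes the suffixes from the table; the names ACCEPTED_SUFFIXES and file_types_for are illustrative, and treating suffix matching as case-sensitive is an assumption.

```python
# Suffixes transcribed from the file type table in section 2.1.
ACCEPTED_SUFFIXES = {
    "Fastq": (".fastq.gz", ".fq.gz", ".fastq.bz2", ".fq.bz2"),
    "Bam": (".bam",),
    "Sff": (".sff",),
    "Complete Genomics Native": (".tar.gz", ".tar"),
    "Solid Native": (".tar.gz", ".tar"),
    "PacBio_HDF5": (".tar", ".tar.gz"),
    "PacBio Sequel Native": (".tar", ".tar.gz"),
    "Ab1": (".ab1",),
    "Oxford Nanopore Native": (".tar", ".tar.gz"),
    "10x Genomics": (".tar", ".tar.gz"),
    "Bnx": (".bnx.gz", ".bnx.bz2"),
    "Fasta": (".fasta.gz", ".fasta.bz2", ".fa.gz", ".fa.bz2"),
    "Helicos Native": (".tar", ".tar.gz"),
}


def file_types_for(name):
    """Return every file type whose accepted suffixes match the file name."""
    return [
        file_type
        for file_type, suffixes in ACCEPTED_SUFFIXES.items()
        if name.endswith(suffixes)  # str.endswith accepts a tuple of suffixes
    ]


if __name__ == "__main__":
    for name in ["sample_R1.fastq.gz", "aligned.bam", "run1.tar.gz", "reads.zip"]:
        print(name, "->", file_types_for(name))
```

Several native formats share the .tar and .tar.gz suffixes, so the function returns a list of candidates; an empty list means the suffix does not appear in the table.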
2.2 File formats

Read data can be submitted in several standard and platform-specific formats. We recommend that read data be submitted in Fastq or BAM format.

Fastq format

Single and paired reads are accepted as Fastq files that meet the following requirements:

1) Quality scores must be in Phred scale. Both ASCII and space-delimited decimal encoding of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.

2) No technical reads (adapters, linkers, barcodes) are allowed.

3) Single reads must be submitted using a single Fastq file and can be submitted with or without read names.

4) Paired reads must be submitted using two Fastq files.

5) Paired read names must have a suffix identifying the first and second read from the pair, for example '/1' and '/2' (regular expression for the reads: "^@([a-zA-Z0-9_-]+:[0-9]+:[a-zA-Z0-9]+:[0-9]+:[0-9]+:[0-9-]+:[0-9-]+) ([12]):[YN]:[0-9]*[02468]:[ACGTN]+$").

6) The first line for each read must start with '@'.

7) The base calls and quality scores must be separated by a line starting with '+'.

8) The Fastq files must be compressed using gzip or bzip2.

9) The regular expression for bases is “^([ACGTNactgn.]*?)$”
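
Several of the requirements above (6, 7 and 9) can be checked locally before upload. The Python sketch below assumes four-line records with ASCII-encoded quality strings; it does not handle space-delimited decimal qualities or wrapped sequences. The offset guess is a common heuristic and is not a description of how GSA itself detects the offset. The function names are illustrative.

```python
import re

# Base pattern quoted from requirement 9.
BASES = re.compile(r"^([ACGTNactgn.]*?)$")


def check_fastq(lines):
    """Check four-line Fastq records; return a list of (record_number, message)."""
    lines = [line.rstrip("\n") for line in lines]
    problems = []
    if len(lines) % 4 != 0:
        problems.append((0, "line count is not a multiple of 4"))
    for i in range(0, len(lines) - len(lines) % 4, 4):
        header, bases, separator, quality = lines[i:i + 4]
        record = i // 4 + 1
        if not header.startswith("@"):
            problems.append((record, "header does not start with '@'"))
        if not BASES.match(bases):
            problems.append((record, "unexpected base character"))
        if not separator.startswith("+"):
            problems.append((record, "separator does not start with '+'"))
        if len(quality) != len(bases):
            problems.append((record, "quality length differs from sequence length"))
    return problems


def guess_phred_offset(quality_strings):
    """Guess the Phred offset (33 or 64); return None when it is ambiguous."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        return None
    if min(codes) < 64:
        return 33  # characters below '@' cannot occur with offset 64
    if max(codes) > 74:
        return 64  # heuristic: characters above 'J' are unusual with offset 33
    return None


if __name__ == "__main__":
    record = ["@read1/1", "ACGTN", "+", "#5?II"]
    print(check_fastq(record), guess_phred_offset([record[3]]))
```

To check a compressed file, the lines can be read with gzip.open(path, "rt") or bz2.open(path, "rt") from the standard library and passed to check_fastq.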

BAM format

Submitted BAM files must be readable with Samtools and Picard.

BAM file names must end with the .bam suffix (e.g. ‘a.bam’).