Standards - GSA - CNCB-NGDC

GSA Data Standards

1. Metadata

1.1 BioProject

BioProject is a searchable collection of complete and incomplete (in-progress) large-scale molecular projects including genome sequencing and assembly, transcriptome, metagenomic, annotation, expression and mapping projects.

1.1.1 ID system

A BioProject accession No. is prefixed with 'PRJC' and followed by one capital letter and six digits. For example, PRJCA000001.

1.1.2 Attributes

See the Standard of BioProject metadata for detailed information.

1.2 BioSample

BioSample contains descriptions of biological source materials used in studies that have data in other National Genomics Data Center databases such as Genome Sequence Archive, Genome Warehouse, Gene Expression Nebulas, Genome Variation Map, Methylation Bank, etc. The National Genomics Data Center, working collaboratively with multiple partner institutions/laboratories, develops a family of standards for big omics data representation, analysis, search and exchange.

1.2.1 ID system

A BioSample accession No. is prefixed with 'SAMC' and followed by 6 digits. For example, SAMC000001.

1.2.2 Sample types

Pathogen

Used for pathogen samples that are relevant to public health. Required attributes include those considered useful for the rapid analysis and trace back of pathogens.

Clinical or host-associated

Environmental, food or other

Microbe

Used for bacteria or other microbes when it is not appropriate or advantageous to use the Pathogen or Virus packages.

Animal

Used for multicellular samples or cell lines derived from common laboratory model organisms, e.g., mouse, rat, Drosophila, worm, fish, frog, or large mammals including zoo and farm animals.

Human

Raw sequence reads related to human genetics should be submitted to the GSA for Human database.

Plant

Used for any plant sample or cell line.

Virus

Used for all virus samples not directly associated with disease. Viral pathogens should be submitted using the Pathogen: Clinical or host-associated pathogen package.

Metagenome/Environmental Sample (GSC MIMS unsupported)

Used for metagenome/environmental samples when it is not appropriate or advantageous to use the GSC (Genomic Standards Consortium) MIMS (Minimum Information about a Metagenome Sequence) standards.

Metagenome/Environmental Sample (GSC MIMS compliant)

Describes and standardizes sample metadata as defined by the GSC (Genomic Standards Consortium) MIMS (Minimum Information about a Metagenome Sequence) standards for metagenome/environmental samples.

human-gut

soil

water

1.2.3 Attributes

See the Standard of BioSample metadata for detailed information.

1.3 GSA

The Genome Sequence Archive (GSA) is a data repository specialized for archiving raw sequence reads.

1.3.1 ID system

A GSA object consists of a series of Experiments and Runs.

GSA Accession No. is prefixed with 'CRA' and followed by 6 digits. For example, CRA000001.

Experiment Accession No. is prefixed with 'CRX' and followed by 6 digits. For example, CRX000001.

Run Accession No. is prefixed with 'CRR' and followed by 6 digits. For example, CRR000001.
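
The accession rules in sections 1.1.1, 1.2.1 and 1.3.1 can be expressed as regular expressions. The following is a minimal sketch that reads the stated digit counts literally; the function name `classify_accession` is hypothetical, and the assumption that an accession never carries more than six digits comes only from the wording above.

```python
import re

# Patterns transcribed from the ID rules in sections 1.1.1, 1.2.1 and 1.3.1.
# Assumption: the digit counts stated in the text are exact.
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"PRJC[A-Z]\d{6}"),
    "BioSample": re.compile(r"SAMC\d{6}"),
    "GSA": re.compile(r"CRA\d{6}"),
    "Experiment": re.compile(r"CRX\d{6}"),
    "Run": re.compile(r"CRR\d{6}"),
}


def classify_accession(accession):
    """Return the object type whose pattern matches the whole string, or None."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession):
            return kind
    return None


print(classify_accession("PRJCA000001"))
print(classify_accession("CRX000001"))
```

Using `fullmatch` rather than `match` rejects strings with extra leading or trailing characters, such as a SUBPRJCA-style identifier.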

1.3.2 Experiment

1.3.2.1 Attributes

Attributes (* marks a mandatory attribute)

Name	Description	Tips	Value Format
*ID	Experiment IDs, prefixed with 'E' and followed by a natural number, such as E1, E2, E3, etc. The Experiment ID must be unique.
*Experiment title	Short description that will identify the Experiment on public pages. It can have any format, but we suggest that you make it concise, unique, consistent, and as informative as possible.	Every Experiment title for the same Sample must be unique.	{text}
*BioProject accession	BioProject accession.	Typically of the form PRJCA[number], not SUBPRJCA[number]; for example, PRJCA000005.
*BioSample name	Sample Name is a name that you choose for the sample. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible.	Every Sample Name from a single Submitter must be unique.	{text}
*Platform	This column has drop-down menus that allow you to select from a controlled vocabulary.	Once specified for one row, these values can be copied-and-pasted down. See the Platform form for details.
*Library Construction / Experimental Design	Free-form description of the methods used to create the sequencing library; a brief 'materials and methods' section.	e.g., DNA of sorted NCSCs was extracted from the cell line using a QIAamp DNA Mini Kit and sheared to approximately 300-500 bp using a Covaris S220 instrument. The libraries were then constructed through end-repair, A-tailing and adapter ligation, and bisulfite-converted using a Zymo EZ DNA Methylation Kit.	{text}
Library name	Name of Library.
*Strategy	This column has drop-down menus that allow you to select from a controlled vocabulary.	Once specified for one row, these values can be copied-and-pasted down. See the Strategy form for details.
*Source	This column has drop-down menus that allow you to select from a controlled vocabulary.	Once specified for one row, these values can be copied-and-pasted down. See the Source form for details.
*Selection	This column has drop-down menus that allow you to select from a controlled vocabulary.	Once specified for one row, these values can be copied-and-pasted down. See the Selection form for details.
*Layout	This column has drop-down menus that allow you to select from a controlled vocabulary.	Once specified for one row, these values can be copied-and-pasted down.
*Read length for mate 1 (bp)	Planned Read Length of Mate 1 for your submission.	For PacBio Sequel and Ion Torrent series sequencers, this column may be left empty.
Read length for mate 2 (bp)	Planned Read Length of Mate 2 for your submission.	Required for paired-end data only.
Insert size (bp)	Fragment size for paired reads. Please provide a numerical value for the median interval of the insert size.	Required for paired-end data only.
Nominal size (bp)	Nominal size
Nominal standard deviation (bp)	Standard deviation of insert size
Planned number of cycles	Planned number of cycles for your submission.	When the Platform is Helicos HeliScope, the Planned number of cycles is required.
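
The mandatory and conditional rules in the table above can be checked before submission. The following is a rough sketch, not the GSA's own validator: the function name `check_experiment`, the use of a dict keyed by attribute name, the layout value "PAIRED", and the assumption that the Platform cell holds an instrument model name such as "PacBio Sequel" are all illustrative assumptions.

```python
import re

# Attributes marked with '*' in the table, except read length for mate 1,
# which the Tips column allows to be empty for some sequencers.
ALWAYS_REQUIRED = [
    "ID", "Experiment title", "BioProject accession", "BioSample name",
    "Platform", "Library Construction / Experimental Design",
    "Strategy", "Source", "Selection", "Layout",
]
PAIRED_ONLY = ["Read length for mate 2 (bp)", "Insert size (bp)"]


def check_experiment(row):
    """Return a list of problems found in one Experiment row (a dict)."""
    def filled(key):
        value = row.get(key)
        return value is not None and bool(str(value).strip())

    problems = []
    for name in ALWAYS_REQUIRED:
        if not filled(name):
            problems.append(f"missing mandatory attribute: {name}")
    if filled("ID") and not re.fullmatch(r"E[1-9]\d*", str(row["ID"])):
        problems.append("ID must be 'E' followed by a natural number")

    platform = str(row.get("Platform", "")).strip()
    # Assumption: the Platform cell contains the instrument model name.
    exempt = platform.startswith(("PacBio Sequel", "Ion Torrent"))
    if not exempt and not filled("Read length for mate 1 (bp)"):
        problems.append("missing mandatory attribute: Read length for mate 1 (bp)")

    # Assumption: paired-end layout is spelled "PAIRED" in the drop-down.
    if str(row.get("Layout", "")).strip().upper() == "PAIRED":
        for name in PAIRED_ONLY:
            if not filled(name):
                problems.append(f"paired-end data requires: {name}")

    if platform == "Helicos HeliScope" and not filled("Planned number of cycles"):
        problems.append("Helicos HeliScope requires: Planned number of cycles")
    return problems
```

An empty list means that no rule from the table was violated; uniqueness of IDs and titles across rows would need a separate pass over the whole sheet.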

1.3.2.2 Platform

Platform: the sequencing platforms and instrument models.

Platform	Instrument Model
LS454	454 GS
	454 GS 20
	454 GS FLX
	454 GS FLX Titanium
	454 GS FLX+
	454 GS Junior
Capillary Technologies	AB 310 Genetic Analyzer
	AB 3130 Genetic Analyzer
	AB 3130xL Genetic Analyzer
	AB 3500 Genetic Analyzer
	AB 3500xL Genetic Analyzer
	AB 3730 Genetic Analyzer
	AB 3730xL Genetic Analyzer
ABI Solid	AB 5500 Genetic Analyzer
	AB 5500xl Genetic Analyzer
	AB 5500x-Wl Genetic Analyzer
	AB SOLiD 4 System
	AB SOLiD 4hq System
	AB SOLiD PI System
	AB SOLiD System 1.0
	AB SOLiD System 2.0
	AB SOLiD System 3.0
BGISeq	BGISEQ-100
	BGISEQ-500
	BGISEQ-1000
	BGISEQ-2000
	DNBSEQ-T7
	MGISEQ-2000RS
CapitalBio Company	BioelectronSeq 4000
Bionano Genomics	BioNano IRYS
Bionano Genomics	BioNano SAPHYR
Complete Genomics	Complete Genomics
Daan Gene	DA8600
Helicos BioSciences Corporation	Helicos HeliScope
HYK Genetic	HYK-PSTAR-IIA
Illumina	Illumina HiSeq X Ten
	Illumina Genome Analyzer
	Illumina Genome Analyzer II
	Illumina Genome Analyzer IIx
	Illumina HiScanSQ
	Illumina HiSeq 1000
	Illumina HiSeq 1500
	Illumina HiSeq 2000
	Illumina HiSeq 2500
	Illumina HiSeq 3000
	Illumina HiSeq 4000
	Illumina MiSeq
	Illumina MiniSeq
	Illumina NovaSeq 5000
	Illumina NovaSeq 6000
	Illumina Nextseq 500
	Illumina Nextseq 550
	Illumina iSeq 100
Berry Genomics	NextSeq CN500
IonTorrent	Ion Torrent PGM
	Ion Torrent Proton
	Ion Torrent S5
	Ion Torrent S5 XL
Oxford Nanopore	OXFORD_NANOPORE GridION
	OXFORD_NANOPORE MinION
	OXFORD_NANOPORE PromethION
PacBio SMRT	PacBio RS
	PacBio RS II
	PacBio Sequel
	PacBio Sequel II

1.3.2.3 Strategy

Strategy: sequencing technique intended for the library.

Strategy	Sequencing strategy used in the experiment
WGA	Whole genome amplification.
WGS	Whole genome shotgun.
WES	Whole exome sequencing is a genomic technique for sequencing all of the protein-coding genes in a genome (known as the exome).
WXS	Random sequencing of exonic regions selected from the genome.
RNA-Seq	Random sequencing of whole transcriptome.
miRNA-Seq	Micro RNA and other small non-coding RNA sequencing.
Tn-Seq	Gene fitness determination through transposon seeding.
WCS	Whole chromosome (or other replicon) shotgun.
CLONE	Genomic clone based (hierarchical) sequencing.
POOLCLONE	Shotgun of pooled clones (usually BACs and Fosmids).
AMPLICON	Sequencing of overlapping or distinct PCR or RT-PCR products.
CLONEEND	Clone end (5', 3', or both) sequencing.
FINISHING	Sequencing intended to finish (close) gaps in existing coverage.
ChIP-Seq	Direct sequencing of chromatin immunoprecipitates.
MNase-Seq	Direct sequencing following MNase digestion.
DNase-Hypersensitivity	Sequencing of hypersensitive sites, or segments of open chromatin that are more readily cleaved by DNaseI.
Bisulfite-Seq	Sequencing following treatment of DNA with bisulfite to convert cytosine residues to uracil depending on methylation status.
EST	Single pass sequencing of cDNA templates.
FL-cDNA	Full-length sequencing of cDNA templates.
CTS	Concatenated Tag Sequencing.
MRE-Seq	Methylation-Sensitive Restriction Enzyme Sequencing strategy.
MeDIP-Seq	Methylated DNA Immunoprecipitation Sequencing strategy.
MBD-Seq	Direct sequencing of methylated fractions sequencing strategy.
Synthetic-Long-Read	Binning and barcoding of large DNA fragments to facilitate assembly of the fragment.
ATAC-seq	The Assay for Transposase-Accessible Chromatin (ATAC) strategy is used to study genome-wide chromatin accessibility. It is an alternative method to DNase-seq that uses an engineered Tn5 transposase to cleave DNA and to integrate primer DNA sequences into the cleaved genomic DNA.
ChIA-PET	Direct sequencing of proximity-ligated chromatin immunoprecipitates.
FAIRE-seq	Formaldehyde Assisted Isolation of Regulatory Elements.
Hi-C	Chromosome Conformation Capture technique where a biotin-labeled nucleotide is incorporated at the ligation junction, enabling selective purification of chimeric DNA ligation junctions followed by deep sequencing.
ncRNA-Seq	Capture of other non-coding RNA types, including post-translation modification types such as snRNA (small nuclear RNA) or snoRNA (small nucleolar RNA), or expression regulation types such as siRNA (small interfering RNA) or piRNA/piwi/RNA (piwi-interacting RNA).
RAD-Seq	Restriction Site Associated DNA Sequence.
RIP-Seq	Direct sequencing of RNA immunoprecipitates (includes CLIP-Seq, HITS-CLIP and PAR-CLIP).
SELEX	Systematic Evolution of Ligands by EXponential enrichment.
ssRNA-seq	Strand-specific RNA sequencing.
Targeted-Capture	Targeted-Capture sequencing.
Tethered Chromatin Conformation Capture	Tethered Chromatin Conformation Capture sequencing.
TCR-seq	High throughput sequencing to map T-cell receptor (TCR) repertoires at high resolution.
BCR-seq	High throughput sequencing to map B-cell receptor (BCR) repertoires at high resolution.
MeRIP-Seq	MeRIP-Seq maps m6A-methylated RNA. Deep sequencing provides high-resolution reads of m6A-methylated RNA.
OTHER	Library strategy not listed (please include additional info in the “design description”).

1.3.2.4 Source

Source: the library source specifies the type of source material that is being sequenced.

Source	Type of genetic source material sequenced
GENOMIC	Genomic DNA (includes PCR products from genomic DNA).
TRANSCRIPTOMIC	Transcription products or non-genomic DNA (EST, cDNA, RT-PCR, screened libraries).
METATRANSCRIPTOMIC	Transcription products from community targets.
METAGENOMIC	Mixed material from metagenome.
SYNTHETIC	Synthetic DNA.
VIRAL RNA	Viral RNA.
OTHER	Other, unspecified, or unknown library source material. (please include additional info in the “design description”)

1.3.2.5 Selection

Selection: whether any method was used to select and/or enrich the material being sequenced.

Selection	Method of selection or enrichment used in the Experiment
unspecified	Library enrichment, screening, or selection is not specified. (please include additional info in the “design description”)
RANDOM	Random selection by shearing or other method.
PCR	Source material was selected by designed primers.
RANDOM PCR	Source material was selected by randomly generated primers.
RT-PCR	Source material was selected by reverse transcription PCR.
HMPR	Hypo-methylated partial restriction digest.
MF	Methyl Filtrated.
CF-S	Cot-filtered single/low-copy genomic DNA.
CF-M	Cot-filtered moderately repetitive genomic DNA.
CF-H	Cot-filtered highly repetitive genomic DNA.
CF-T	Cot-filtered theoretical single-copy genomic DNA.
MDA	Multiple displacement amplification.
MSLL	Methylation Spanning Linking Library.
cDNA	Complementary DNA.
ChIP	Chromatin immunoprecipitation.
MNase	Micrococcal Nuclease (MNase) digestion.
DNAse	Deoxyribonuclease (DNase) digestion.
Hybrid Selection	Selection by hybridization in array or solution.
Reduced Representation	Reproducible genomic subsets, often generated by restriction fragment size selection, containing a manageable number of loci to facilitate re-sampling.
Restriction Digest	DNA fractionation using restriction enzymes.
5-methylcytidine antibody	Selection of methylated DNA fragments using an antibody raised against 5-methylcytosine or 5-methylcytidine (m5C).
MBD2 protein methyl-CpG binding domain	Enrichment by methyl-CpG binding domain.
CAGE	Cap-analysis gene expression.
RACE	Rapid Amplification of cDNA Ends.
size fractionation	Physical selection of size appropriate targets.
Padlock probes capture method	Circularized oligonucleotide probes.
Poly-A	polyA enriched RNA-seq.
other	Other library enrichment, screening, or selection process. (please include additional info in the “design description”)

1.3.3 Run

Attributes (* marks a mandatory attribute)

Name	Description	Tips	Value Format
*ID	Run IDs, prefixed with 'R' and followed by a natural number such as: R1, R2, R3... The Run ID must be unique.
*Run title	Short description that will identify the Run on public pages. It can have any format, but we suggest that you make it concise, unique, consistent, and as informative as possible.	Every Run title for the same Experiment must be unique.	{text}
*BioProject accession	BioProject accession.	Typically of the form PRJCA[number], not SUBPRJCA[number]; for example, PRJCA000005.	PRJCA[number]
*Experiment ID	Experiment IDs, prefixed with 'E' and followed by a natural number, such as E1, E2, E3…
*Run data file type	This column has drop-down menus that allow you to select from a controlled vocabulary.
*File name 1	All data file names must be unique and must not contain spaces, brackets, periods, or forward (/) or backward (\) slashes.	1. Fastq files can be compressed using gzip or bzip2 (zip and rar are NOT accepted). 2. Do not compress BAM files. 3. Data from PacBio Sequel and Ion Torrent series sequencers can be uploaded as tar archives. 4. Double-check that your file names are accurate before sending them to us.
*MD5 checksum 1	An MD5 checksum is a 32-character alphanumeric string.	1. Mac and Linux users can generate MD5 checksums with the native command-line tools "md5sum" (Linux) and "md5" (macOS). 2. Windows users need to download a third-party utility, such as winmd5free.	32-character alphanumeric string
File name 2	All data file names must be unique and must not contain spaces, brackets, periods, or forward (/) or backward (\) slashes.	Required for paired-end data only.
MD5 checksum 2	An MD5 checksum is a 32-character alphanumeric string.	Required for paired-end data only.	32-character alphanumeric string
Reference file name	Reference name.	Applies when the Run data file type is BAM. 1. If you want to submit your reference file to our FTP site, fill in the reference_name and reference_md5 fields. We only accept Fasta files compressed with gzip or bzip2. 2. If your reference file is already in another database, fill in the Assembly Name or Accession and the Assembly Accession URL. 3. For PacBio Sequel and Ion Torrent series sequencers, this column may be left empty.
MD5 for reference file	MD5 for reference file.		32-character alphanumeric string
Assembly Name or Accession	Assembly Name or Accession.
Assembly Accession URL	Assembly Accession URL.		URL
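
For submitters without `md5sum` or `md5` at hand, the checksum columns above can also be filled using Python's standard `hashlib` module. The following is a minimal sketch; the helper names `md5_checksum` and `looks_like_md5` are hypothetical, and the file name in the usage comment is a placeholder.

```python
import hashlib
import re


def md5_checksum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_md5(value):
    """Check that a value is a 32-character hexadecimal string."""
    return re.fullmatch(r"[0-9a-fA-F]{32}", value) is not None


# Usage (placeholder file name):
#   print(md5_checksum("sample_1.fq.gz"))
```

Reading in chunks keeps memory use flat for large sequencing files; the checksum should be computed on the file exactly as it will be uploaded (that is, after compression).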

2. Sequencing file

2.1 File type

This page reviews the submission file formats currently supported by the GSA, and gives guidance to submitters about current file formats and policies regarding GSA submissions.

File type	File suffix	Applicable platforms	Recommended
Fastq	.fastq.gz .fq.gz .fastq.bz2 .fq.bz2	All Platforms	Yes
Bam	.bam	All Platforms	Yes
Sff	.sff	LS454 ION_TORRENT BGISEQ-100 DA8600
Complete Genomics Native	.tar.gz .tar	Complete Genomics BGISEQ-500 BGISEQ-1000
Solid Native	.tar.gz .tar	ABI SOLID
PacBio_HDF5	.tar .tar.gz	PacBio RS PacBio RS II	Recommended for PacBio RS / PacBio RS II
PacBio Sequel Native	.tar .tar.gz	PacBio Sequel	Recommended for PacBio Sequel
Ab1	.ab1	CAPILLARY
Oxford Nanopore Native	.tar .tar.gz	Oxford Nanopore
10x Genomics	.tar .tar.gz
Bnx	.bnx.gz .bnx.bz2	Bionano Genomics
Fasta	.fasta.gz .fasta.bz2 .fa.gz .fa.bz2
Helicos Native	.tar .tar.gz	Helicos BioSciences Corporation
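
The suffix column of the table above can be turned into a simple lookup. The following is a sketch under the assumption that suffix matching is case-sensitive (the page does not say); the function name `candidate_file_types` is hypothetical. Because several native formats share the .tar and .tar.gz suffixes, the lookup returns every candidate type rather than a single answer.

```python
# Suffixes transcribed from the "File suffix" column of the table above.
TAR = (".tar.gz", ".tar")
SUFFIX_TABLE = {
    "Fastq": (".fastq.gz", ".fq.gz", ".fastq.bz2", ".fq.bz2"),
    "Bam": (".bam",),
    "Sff": (".sff",),
    "Complete Genomics Native": TAR,
    "Solid Native": TAR,
    "PacBio_HDF5": TAR,
    "PacBio Sequel Native": TAR,
    "Ab1": (".ab1",),
    "Oxford Nanopore Native": TAR,
    "10x Genomics": TAR,
    "Bnx": (".bnx.gz", ".bnx.bz2"),
    "Fasta": (".fasta.gz", ".fasta.bz2", ".fa.gz", ".fa.bz2"),
    "Helicos Native": TAR,
}


def candidate_file_types(file_name):
    """Return every file type whose listed suffixes match the file name."""
    return [
        file_type
        for file_type, suffixes in SUFFIX_TABLE.items()
        if file_name.endswith(suffixes)  # str.endswith accepts a tuple
    ]


print(candidate_file_types("sample_1.fq.gz"))
print(candidate_file_types("reads.zip"))
```

An empty result indicates a suffix that the table does not list, such as .zip or an uncompressed .fastq.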

2.2 File formats

Read data can be submitted in several standard and platform-specific formats. We recommend that read data be submitted in Fastq or BAM format.

Fastq format

Single and paired reads are accepted as Fastq files that meet the following requirements:

1) Quality scores must be in Phred scale. Both ASCII and space-delimited decimal encodings of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.

2) No technical reads (adapters, linkers, barcodes) are allowed.

3) Single reads must be submitted using a single Fastq file and can be submitted with or without read names.

4) Paired reads must be submitted using two Fastq files.

5) Paired read names must have a suffix identifying the first and second read from the pair, for example '/1' and '/2' (regular expression for the reads: "^@([a-zA-Z0-9_-]+:[0-9]+:[a-zA-Z0-9]+:[0-9]+:[0-9]+:[0-9-]+:[0-9-]+) ([12]):[YN]:[0-9]*[02468]:[ACGTN]+$").

6) The first line for each read must start with '@'.

7) The base calls and quality scores must be separated by a line starting with '+'.

8) The Fastq files must be compressed using gzip or bzip2.

9) The regular expression for bases is “^([ACGTNactgn.]*?)$”
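
Several of the requirements above (6 to 9) can be checked locally before upload. The following is a rough sketch rather than the GSA's own validator: the function names are hypothetical, only ASCII-encoded qualities in four-line records are handled, and the Phred offset guess uses a simple heuristic (any quality character below ASCII 64 implies offset 33), which is an assumption and not the archive's documented detection method.

```python
import bz2
import gzip
import re

BASES = re.compile(r"^([ACGTNactgn.]*?)$")  # pattern quoted in requirement 9


def open_fastq(path):
    """Open a gzip- or bzip2-compressed Fastq file as text (requirement 8)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    raise ValueError("Fastq files must be compressed using gzip or bzip2")


def check_fastq(path, max_records=10000):
    """Check the first records; return (problems, guessed Phred offset)."""
    problems = []
    min_code = None
    with open_fastq(path) as handle:
        for index in range(1, max_records + 1):
            record = [handle.readline().rstrip("\r\n") for _ in range(4)]
            header, bases, separator, quality = record
            if not header:
                break  # end of file (a blank header line also stops the scan)
            if not header.startswith("@"):
                problems.append(f"record {index}: first line must start with '@'")
            if not separator.startswith("+"):
                problems.append(f"record {index}: separator line must start with '+'")
            if not BASES.match(bases):
                problems.append(f"record {index}: unexpected character in bases")
            if quality:
                lowest = min(ord(char) for char in quality)
                min_code = lowest if min_code is None else min(min_code, lowest)
    if min_code is None:
        offset = None  # no quality strings seen
    else:
        offset = 33 if min_code < 64 else 64  # heuristic, see lead-in
    return problems, offset
```

A file whose sampled reads all happen to have high qualities can be misjudged as offset 64, so the guess is best treated as a hint. Paired files would additionally need their read names compared record by record.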

BAM format

Submitted BAM files must be readable with Samtools and Picard.

BAM file names must end with the .bam suffix (e.g. ‘a.bam’).
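
A quick local sanity check can catch files that are obviously not BAM before upload. The following sketch relies on two properties of the BAM format: the file is BGZF-compressed (which Python's `gzip` module can read) and the decompressed stream starts with the magic bytes `BAM\x01`. The function name `looks_like_bam` is hypothetical, and passing this check does not replace verifying that the file is readable with Samtools and Picard.

```python
import gzip
import zlib


def looks_like_bam(path):
    """Return True if the name ends with .bam and the BAM magic bytes are present."""
    if not path.endswith(".bam"):
        return False
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError, zlib.error):
        # Not gzip/BGZF data, truncated, or corrupted.
        return False
```

The check reads only the first few bytes, so it is cheap even for very large files, but it says nothing about whether the rest of the file is intact.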