GSA - Documentation

Documents for GSA

Items in each metadata object

The list of items in each metadata object (Version 2.1), which contains detailed descriptions of the data items, is freely available (CN, US).

GSA Submission Quick Start Guide

The GSA Submission Quick Start Guide (Version 2.3), which describes the submission procedure, is freely available (CN, US).

Tutorial

GSA Data Model

Designed for compatibility, the Genome Sequence Archive (GSA) follows the data standards and structures of the International Nucleotide Sequence Database Collaboration (INSDC). The organizational framework of GSA data is based on the concepts of BIOPROJECT (corresponding to PROJECT in the BioProject database), BIOSAMPLE (corresponding to SAMPLE in the BioSample database), EXPERIMENT, and RUN.

Figure 1. Data model in GSA
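The four concepts can be pictured as a simple nested structure. The sketch below is only an illustration of that hierarchy, not GSA's actual schema: the class names, field names, and file names are hypothetical, and a strict tree is a simplification of how the records are linked.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    # A RUN lists the sequence data files that belong together.
    files: List[str]


@dataclass
class Experiment:
    # An EXPERIMENT is one sequencing result for a specific sample.
    title: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class BioSample:
    name: str
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class BioProject:
    title: str
    samples: List[BioSample] = field(default_factory=list)


def count_files(project: BioProject) -> int:
    """Count all sequence files reachable from a project."""
    return sum(
        len(run.files)
        for sample in project.samples
        for experiment in sample.experiments
        for run in experiment.runs
    )


# Hypothetical example: one project, one sample, one experiment,
# and one run holding a pair of read files.
project = BioProject(
    title="Example project",
    samples=[
        BioSample(
            name="sample_1",
            experiments=[
                Experiment(
                    title="sample_1 paired-end library",
                    runs=[Run(files=["sample_1_1.fq.gz", "sample_1_2.fq.gz"])],
                )
            ],
        )
    ],
)
print(count_files(project))
```

Walking the structure from project down to run mirrors the order in which the records are registered and linked during submission.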

Organization of metadata objects

The following are examples of how metadata can be organized. Submitters can organize metadata objects flexibly.

♦ Comparative genome sequencing of three strains (paired-end). Include the paired-end read files in a Run (Figure 2).

Figure 2. Comparative genome sequencing of three strains (paired-end)

♦ Technical and biological replicates. Biological replicates should be classified as two different samples; technical replicates should be considered as two different experiments.

Figure 3. Technical and biological replicates
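The replicate rule above can be sketched as a small planning helper: each biological replicate becomes its own sample, and each technical replicate of that sample becomes its own experiment. The function name and the naming scheme for samples and experiments are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def plan_replicates(condition: str, biological: int, technical: int) -> Dict[str, List[str]]:
    """Map each sample name to the experiment names planned for it."""
    plan: Dict[str, List[str]] = {}
    for b in range(1, biological + 1):
        # One sample per biological replicate.
        sample = f"{condition}_bio{b}"
        # One experiment per technical replicate of that sample.
        plan[sample] = [f"{sample}_tech{t}" for t in range(1, technical + 1)]
    return plan


plan = plan_replicates("treated", biological=2, technical=2)
for sample, experiments in plan.items():
    print(sample, "->", ", ".join(experiments))
```

With two biological and two technical replicates, this yields two samples with two experiments each.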

Data submission process

To create a submission, users need to register and log in to the BIG Data Center Submission Portal (BIG Sub, https://ngdc.cncb.ac.cn/gsub/). To simplify the submission procedure, GSA provides a user-friendly input wizard for data submission (Figure 4).

♦ All data associated with the same BIOPROJECT should be submitted to a single GSA.

♦ EXPERIMENT and RUN objects contain instrument and library information and are directly associated with sequence data.

♦ Each EXPERIMENT is a unique sequencing result for a specific sample.

♦ Paired-end data files (forward/reverse) must be listed together in the same RUN in order for the two files to be correctly processed as paired-end.

Figure 4. Graphic illustration of data submissions to GSA
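Because forward and reverse files must be listed in the same RUN, it can help to pair files by name before filling in the metadata. The following is a minimal sketch that assumes a hypothetical naming convention (`<prefix>_1` / `<prefix>_2` or `<prefix>_R1` / `<prefix>_R2`, followed by a compressed FASTQ suffix); GSA does not prescribe this convention, and the function name is illustrative only.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed convention: <prefix>_1.<ext> / <prefix>_2.<ext>
# or <prefix>_R1.<ext> / <prefix>_R2.<ext>.
MATE_PATTERN = re.compile(
    r"^(?P<prefix>.+)_R?(?P<mate>[12])\.(?:fastq|fq)\.(?:gz|bz2)$"
)


def group_into_runs(filenames: List[str]) -> Dict[str, List[str]]:
    """Group forward/reverse files that share a prefix into one run."""
    runs: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        match = MATE_PATTERN.match(name)
        # Files that do not match the convention are kept as their own run.
        key = match.group("prefix") if match else name
        runs[key].append(name)
    # Sort so the forward file precedes the reverse file in each run.
    return {key: sorted(files) for key, files in runs.items()}


runs = group_into_runs([
    "strainA_2.fq.gz",
    "strainA_1.fq.gz",
    "strainB_R1.fastq.gz",
    "strainB_R2.fastq.gz",
    "single.fq.gz",
])
for prefix, files in runs.items():
    print(prefix, files)
```

Any run that ends up with exactly two files under this convention is a candidate paired-end RUN; the result should still be checked by hand before submission.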

Release of linked BioProject/BioSample/GSA

Linked BioProject, BioSample, and GSA data are released as follows (Figure 5):

♦ Release of BioProject records does NOT trigger release of the other linked data.

♦ Release of BioSample records triggers release of the linked BioProject only; it does NOT trigger release of the referencing GSA.

♦ Release of GSA nucleotide sequence data DOES trigger release of the linked BioProject and BioSample records.

Figure 5. Release of linked BioProject/BioSample/GSA

Release Policies and Disclaimers

1. A date can be set by authors to withhold the release of new submissions for a specified period.

2. The release date can be changed through the BIG Sub portal: https://ngdc.cncb.ac.cn/gsub/submit/gsa/[substitute your GSA accession number]/contents

3. If a paper citing the sequence or accession number is published prior to the specified date, the sequence will be released upon publication. Otherwise, GSA will release sequence data on the specified date.

4. As soon as the full publication details are available (all authors, title, journal, volume, pages and date), please send them to the following address: gsa@big.ac.cn
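The release rule in items 1 to 3 amounts to taking the earlier of the specified hold date and the publication date. A minimal sketch follows; the function and parameter names are hypothetical, and it only illustrates the rule as stated rather than how GSA implements it.

```python
from datetime import date
from typing import Optional


def effective_release_date(hold_until: date, published_on: Optional[date] = None) -> date:
    """Return the date on which the sequence data would be released."""
    # A paper published before the hold date triggers release on publication.
    if published_on is not None and published_on < hold_until:
        return published_on
    # Otherwise the data are released on the specified date.
    return hold_until


print(effective_release_date(date(2025, 6, 1), published_on=date(2025, 3, 15)))
print(effective_release_date(date(2025, 6, 1)))
```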

Data curation & quality control process

The submitted data go through a three-step review process before being archived (Figure 1). The first step is online validation during metadata submission, in which both the structure and the vocabulary of the metadata are checked automatically. The second step is manual (expert) review, in which the data administrator double-checks the metadata to ensure that the information is accurate. The last step is quality control of the sequence files, in which both the format and the content of the files are checked and the quality of the files is evaluated.

Figure 1. GSA data curation & quality control process

Frequently Asked Questions

Answers to some of the questions most frequently submitted to GSA are listed below.

Introduction
1. What is GSA?
2. How can I submit data to GSA?
GSA accounts
1. How do I acquire a BIG Sub account?
2. I have forgotten my BIG Sub username and password.
Data entry and transmit
Data release and cite
Help
1. Contact information
2. Collaboration & visit

Introduction
1. What is GSA?
  GSA is short for Genome Sequence Archive, a data repository for raw sequencing data from genome, transcriptome and other omics studies. It archives raw sequence data produced by a wide variety of sequencing platforms. GSA is one of the database resources of the National Genomics Data Center (NGDC), part of the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS), and serves as a primary archive of genome sequencing data for institutions and laboratories worldwide.
2. How can I submit data to GSA?
  Only registered users can submit data, using the BIG Submission Portal (BIG Sub, https://ngdc.cncb.ac.cn/gsub/). Please refer to the GSA Submission Quick Start Guide.
GSA accounts
1. How do I acquire a BIG Sub account?
  Any user can register and create a BIG Sub account free of charge. After you submit your registration data, a confirmation email is automatically sent to you so that you can activate your account.
2. I have forgotten my BIG Sub username and password.
  1) If you have only forgotten your password, you can reset it by clicking “Forgot password”.
  
  2) After submitting the request, you will receive a response email. Click the URL in that email within 10 minutes to update your password; otherwise you will need to submit your email address again.
  
  If you have any problems with your account, please email gsa@big.ac.cn for assistance.

Data entry and transmit

How do I get started?
After logging in, follow the steps below to complete the submission:

1) Create a GSA submission in the GSA database.

2) Register your project (BioProject) and biological samples (BioSamples) if you have not already registered them in the BioProject and BioSample databases, respectively. Please refer to the GSA Submission Quick Start Guide.

3) Submit the GSA metadata, that is, the information that links your project, samples/experiments and file names.

4) Upload sequence data files by FTP.

How do I connect to the GSA data by FTP?
In the current version of GSA, we strongly recommend uploading your files with a dedicated FTP client (such as FileZilla Client); log in to the FTP server and follow the client's instructions to set the transfer mode. If you use the command-line FTP client, type the “binary” command before the “mput” command.

Transmitting your data files to the GSA FTP site

Address: ftp://submit.big.ac.cn

The username and password are the same as those you use to log in to BIG Sub.

NOTICE: Navigate (using the cd command) to the GSA folder in the Remote Site box, then upload your files. Uploaded files will be removed from the FTP site once the whole submission has finished processing.

After you have finished all of the above tasks, the GSA team will check your information and files and give you feedback.
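For submitters who prefer scripting over a graphical client, the upload can be sketched with Python's standard `ftplib` module. This is a minimal sketch under stated assumptions, not an official GSA tool: the host comes from the address above, while the remote folder name `GSA` (including its case), the function names, and the absence of any per-submission subfolder are assumptions that should be checked against what you see after logging in.

```python
from ftplib import FTP
from pathlib import Path
from typing import Iterable

GSA_FTP_HOST = "submit.big.ac.cn"  # host taken from the address above
REMOTE_FOLDER = "GSA"              # assumed name of the folder to cd into


def upload_files(ftp: FTP, paths: Iterable[Path], remote_folder: str = REMOTE_FOLDER) -> None:
    """Upload files in binary mode into the given remote folder."""
    ftp.cwd(remote_folder)
    for path in paths:
        with open(path, "rb") as handle:
            # storbinary transfers in binary (image) mode, which is the
            # scripted equivalent of typing "binary" before "mput".
            ftp.storbinary(f"STOR {Path(path).name}", handle)


def main(username: str, password: str, paths: Iterable[Path]) -> None:
    """Log in with BIG Sub credentials and upload the given files."""
    with FTP(GSA_FTP_HOST) as ftp:
        ftp.login(user=username, passwd=password)
        upload_files(ftp, paths)
```

Calling `main("your_user", "your_password", [Path("reads_1.fq.gz"), Path("reads_2.fq.gz")])` would attempt the upload; nothing is contacted until `main` is called.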

What kind of data file format do we recommend?

In the current version, we recommend that read data be submitted in either FASTQ or BAM format. GSA accepts only GZIP and BZIP2 compression formats (it DOES NOT accept 7-ZIP, RAR or TAR). In addition, GSA does not accept multiplexed data.

Format	File suffix	Description
Fastq format	.fastq.gz .fq.gz .fastq.bz2 .fq.bz2	fastq files with constant read length
BAM format	.bam	Binary SAM format for use by loaders that combine alignment and sequencing data
HDF5 format	.bax.h5 .bas.h5	HDF5 is a data model, library, and file format for storing and managing data.
Reference_FASTA	.fasta.gz .fa.gz	Reference sequence file in single fasta format used to construct SRA archive file format.
SFF format	.sff	454 Standard Flowgram Format file
SRF format	.srf	SRF is a generic format for DNA sequence data. This format has sufficient flexibility to store data from current and future DNA sequencing technologies.
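File names can be checked against the suffixes in the table before upload. The sketch below simply encodes the suffix column of the table; the function name is hypothetical, the check is case-sensitive by assumption, and a matching suffix says nothing about whether the file content is valid.

```python
# Suffixes copied from the table above.
ACCEPTED_SUFFIXES = (
    ".fastq.gz", ".fq.gz", ".fastq.bz2", ".fq.bz2",  # FASTQ
    ".bam",                                          # BAM
    ".bax.h5", ".bas.h5",                            # HDF5
    ".fasta.gz", ".fa.gz",                           # reference FASTA
    ".sff",                                          # SFF
    ".srf",                                          # SRF
)


def has_accepted_suffix(filename: str) -> bool:
    """Return True if the file name ends with one of the listed suffixes."""
    # str.endswith accepts a tuple and matches any of its entries.
    return filename.endswith(ACCEPTED_SUFFIXES)


for name in ["reads_1.fq.gz", "reads.fastq.zip", "reads.tar.gz", "aligned.bam"]:
    print(name, has_accepted_suffix(name))
```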

What is the process for submitted files?
All submitted files are regularly moved from the FTP site to a staging area for processing, so it is quite normal for files to “disappear” from the FTP site. Files that pass processing are made publicly available or placed under controlled access according to the release date set by the user.

What is an MD5 checksum and how do I compute it?
MD5 checksums are used to verify the integrity of transmitted data. An MD5 checksum is a 32-character hexadecimal string such as "e3b5dd475c449300dd11f258538ff494".

♦  For Linux users, use: $ md5sum

♦  For Mac users, use: $ md5

♦  Windows users need to use a third-party tool, e.g. winmd5free.
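As a platform-independent alternative to the commands above, the checksum can also be computed with Python's standard `hashlib` module. This is a minimal sketch; the function name and chunk size are arbitrary choices, and the file is read in chunks so that large sequence files do not have to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the MD5 checksum of a file as a 32-character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file("reads_1.fq.gz")` returns the string to compare with the checksum recorded for that file.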

Data release and cite
1. How do I share my data?
  After accessing the GSA database through your BIG Sub account, find the “Share” button in the last column (“Operation”) of the list, as shown below.
  
  By clicking the “Share” button, you can get the “Shared URL”, as shown in the figure below. You can copy the URL and send it to editors so that your data can be accessed during peer review.
2. How do I make my data publicly available?
  After the article is published, you can click the "Release Now" button in the last column (“Operation”) of the list, as shown below.
  
  Please click "Yes" in the confirmation box to trigger the GSA release. The release of GSA data triggers the release of the linked BioProject and BioSample records, so you DO NOT need to release the BioProject and BioSample in their respective systems separately.
  
  NOTICE: Data can be searched and downloaded in the GSA database as soon as they are archived.
3. Which accession numbers should be cited in my publication?
  When you have successfully submitted data to GSA, please consider using the following wording to describe data deposition in your manuscript:
  
  The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (Genomics, Proteomics & Bioinformatics 2021) in National Genomics Data Center (Nucleic Acids Res 2021), China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRAxxxxxx) that are publicly accessible at https://ngdc.cncb.ac.cn/gsa.
  
  Please cite the following required publications.
  
  The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types. Genomics, Proteomics & Bioinformatics 2021, https://doi.org/10.1016/j.gpb.2021.08.001
  Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2021. Nucleic Acids Res 2021, 49(D1):D18–D28. https://doi.org/10.1093/nar/gkaa1022 [PMID=33175170]
Help
1. Contact information
  If you have any questions, suggestions or comments, or would like to report a bug, please feel free to contact us by email at gsa@big.ac.cn or through instant messaging (QQ Group: 548170081).
2. Collaboration & visit
  You are also welcome to visit us to explore possible collaboration or to learn more about GSA.
  
  Address:
  
        National Genomics Data Center
  
        China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences
  
        No.1 Beichen West Road, Chaoyang District
  
        Beijing 100101, China
  
        Tel: +86 (10) 8409-7340