FAQ - GSA - CNCB-NGDC

Frequently Asked Questions

Answers to some of the most frequently asked questions submitted to the GSA are listed as follows.

Introduction
1. What is GSA?
2. How can I submit data to GSA?
GSA accounts
1. How do I acquire a BIG Sub account?
2. I have forgotten my BIG Sub username and password.
Data entry and transmit
Data release and cite

Introduction
1. What is GSA?
  GSA is short for Genome Sequence Archive, a data repository for raw sequencing data from genomics, transcriptomics and other omics studies. It archives raw sequence data produced by a wide variety of sequencing platforms. GSA is one of the database resources of the National Genomics Data Center (NGDC), part of the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS), and serves as a primary archive of genome sequencing data for institutions and laboratories worldwide.
2. How can I submit data to GSA?
  Only registered users can submit data, using the BIG Submission (BIG Sub, https://ngdc.cncb.ac.cn/gsub/) Portal. Please refer to the GSA Submission Quick Start Guide.
GSA accounts
1. How do I acquire a BIG Sub account?
  Any user can register and create a BIG Sub account free of charge. After you submit your registration details, a confirmation email is automatically sent to you so that you can activate your account.
2. I have forgotten my BIG Sub username and password.
  1) If you have only forgotten your password, you can reset it by clicking “Forgot password”.
  
  2) After submitting, you will receive a response email. Click the URL in that email to update your password within 10 minutes; otherwise you will need to submit your email address again.
  
  If you have any problems with your account, please email gsa@big.ac.cn for assistance.

Data entry and transmit

How do I get started?
After logging in, follow the steps below to complete the submission:

1) Create a GSA submission in the GSA database.

2) Register your project (BioProject) and biological samples (BioSamples) in the BioProject and BioSample databases, respectively, if you have not registered them before. Please refer to the GSA Submission Quick Start Guide.

3) Submit GSA metadata: information that links your project, samples/experiments and file names.

4) Upload sequence data files by FTP.

How do I connect to the GSA FTP server?
In the current version of GSA, we strongly recommend uploading your files with a dedicated FTP client (such as FileZilla Client): log in to the FTP server and follow the client's instructions to set the transfer mode. If you use the command-line FTP client instead, type the “binary” command before the “mput” command.

Transmitting your data files to the GSA FTP site

Address: ftp://submit.big.ac.cn

The user name and password are the same as those you use to log in to BIG Sub.

NOTICE: Navigate (use the cd command) to the GSA folder in the Remote Site box, then upload your files. Uploaded files will be removed after the whole submission has finished processing.

After you have finished all the tasks above, the GSA team will check your information and files and give you feedback.
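
The upload described above can also be scripted. The following is a minimal sketch using Python's standard ftplib module, not an official GSA tool. It assumes the server accepts a plain FTP login with your BIG Sub credentials and that files go directly into the GSA folder; the function names upload_files and upload_to_gsa are hypothetical and chosen for illustration only.

```python
from ftplib import FTP
from pathlib import Path

GSA_FTP_HOST = "submit.big.ac.cn"  # address given in this FAQ
REMOTE_DIR = "GSA"                 # folder named in the NOTICE above


def upload_files(ftp, local_paths, remote_dir=REMOTE_DIR):
    """Upload each local file into remote_dir and return the uploaded names.

    `ftp` is an already logged-in ftplib.FTP connection.
    """
    ftp.cwd(remote_dir)  # same as "cd GSA" in a command-line client
    uploaded = []
    for path in map(Path, local_paths):
        with path.open("rb") as handle:
            # storbinary switches the connection to binary (image) mode
            # before sending, the equivalent of typing "binary" first.
            ftp.storbinary(f"STOR {path.name}", handle)
        uploaded.append(path.name)
    return uploaded


def upload_to_gsa(username, password, local_paths):
    """Hypothetical convenience wrapper: log in, upload, disconnect."""
    with FTP(GSA_FTP_HOST) as ftp:
        ftp.login(user=username, passwd=password)
        return upload_files(ftp, local_paths)
```

A call such as upload_to_gsa("my_user", "my_password", ["sample_R1.fastq.gz"]) would then place the file in the GSA folder, provided the assumptions above hold for your account.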

Which data file formats do we recommend?

In the current version, we recommend submitting read data in either FASTQ or BAM format. GSA accepts only GZIP and BZIP2 compression formats (it DOES NOT accept 7-ZIP, RAR or TAR). In addition, GSA does not accept multiplexed data.

Format	File suffix	Description
Fastq format	.fastq.gz .fq.gz .fastq.bz2 .fq.bz2	fastq files with constant read length
BAM format	.bam	Binary SAM format for use by loaders that combine alignment and sequencing data
HDF5 format	.bax.h5 .bas.h5	HDF5 is a data model, library, and file format for storing and managing data.
Reference_FASTA	.fasta.gz .fa.gz	Reference sequence file in single fasta format used to construct SRA archive file format.
SFF format	.sff	454 Standard Flowgram Format file
SRF format	.srf	SRF is a generic format for DNA sequence data. This format has sufficient flexibility to store data from current and future DNA sequencing technologies.
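
Before uploading, it can help to check file names against the suffixes in the table above. The sketch below is one possible way to do that in Python. The name classify_file is hypothetical, the match is case-sensitive by assumption, and the check looks only at the file name, not at the file contents.

```python
# Suffixes copied from the format table above.
ACCEPTED_SUFFIXES = {
    "Fastq": (".fastq.gz", ".fq.gz", ".fastq.bz2", ".fq.bz2"),
    "BAM": (".bam",),
    "HDF5": (".bax.h5", ".bas.h5"),
    "Reference_FASTA": (".fasta.gz", ".fa.gz"),
    "SFF": (".sff",),
    "SRF": (".srf",),
}


def classify_file(filename):
    """Return the format name for an accepted suffix, or None otherwise."""
    for format_name, suffixes in ACCEPTED_SUFFIXES.items():
        # str.endswith accepts a tuple and matches any of its entries.
        if filename.endswith(suffixes):
            return format_name
    return None


if __name__ == "__main__":
    for name in ["reads_1.fq.gz", "aln.bam", "reads.tar.gz", "reads.fastq"]:
        print(name, "->", classify_file(name))
```

Uncompressed FASTQ files and TAR archives return None here, which matches the compression rule stated above.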

How are submitted files processed?
All submitted files are regularly moved from the FTP server to a staging area for processing, so it is quite normal for files to “disappear” from FTP. Files that pass processing are made public or placed under controlled access according to the release date set by the user.

What is an MD5 checksum and how do I compute it?
MD5 checksums are used to verify the integrity of transmitted data. An MD5 checksum is a 32-character alphanumeric string like "e3b5dd475c449300dd11f258538ff494".

♦  For Linux users, use: $ md5sum <file>

♦  For Mac users, use: $ md5 <file>

♦  Windows users need to use a third-party tool, e.g. winmd5free.
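
As a platform-independent alternative, the checksum can be computed with Python's standard hashlib module. This is a minimal sketch; the function name md5_of_file and the 1 MiB chunk size are illustrative choices, not GSA requirements.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a 32-character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large sequence files are not loaded
        # into memory all at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Usage: python md5_check.py file1.fastq.gz file2.fastq.gz
    for file_path in sys.argv[1:]:
        print(md5_of_file(file_path), file_path)
```

The printed value should match the output of md5sum or md5 for the same file.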

Data release and cite
1. How do I share my data?
  After logging in to the GSA database with your BIG Sub account, find the “Share” button in the last column (“Operation”) of the list, as shown below.
  
  Clicking the “Share” button displays the “Shared URL”, as shown in the figure below. You can copy this URL and send it to editors so that they can review your data during peer review.
2. How do I make my data publicly available?
  After the article is published, you can click the "Release Now" button in the last column (“Operation”) of the list, as shown below.
  
  Click "Yes" in the "Confirmation Box" to trigger the GSA release. Releasing the GSA record also releases the associated BioProject and BioSample, so you DO NOT need to release them separately in their respective systems.
  
  NOTICE: Data can be searched and downloaded in the GSA database as soon as they are archived.
3. Which accession numbers should be cited in my publication?
  When you have successfully submitted data to GSA, please consider using the following wording to describe data deposition in your manuscript:
  
  The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (Genomics, Proteomics & Bioinformatics 2021) in the National Genomics Data Center (Nucleic Acids Res 2024), China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRAxxxxxx) that are publicly accessible at https://ngdc.cncb.ac.cn/gsa.
  
  Please cite the following required publications.
  
  The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types. Genomics, Proteomics & Bioinformatics 2021, https://doi.org/10.1016/j.gpb.2021.08.001
  Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2024. Nucleic Acids Res 2024, Jan 5;52(D1):D18-D32. https://doi.org/10.1093/nar/gkad1078 [PMID=38018256]