GSA Handbook

The GSA Handbook (Version 1.1, March 2016), which describes each data item in detail, is freely available here.

GSA Submission Quick Start Guide

The GSA Submission Quick Start Guide (Version 1.0, Jan 2017), which describes the submission process, is freely available here.

GSA Data Model

The data model adopted by GSA consists of Project, Sample, Experiment and Run. Unlike other data repositories, GSA features an "Umbrella Project", which is optional when registering a BioProject and is used to manage multiple closely related projects supported by a collaborative grant or mega grant, e.g., the 1000 Human Genomes Project or the Dog 10K Genomes Project.

GSA Data Relationships

Data relationships in GSA are as follows.

BioProject is an overall description of a single research initiative, typically involving multiple samples.

BioSample describes biological source material; each physically unique specimen should be registered as a single BioSample with a unique set of attributes.

Experiment describes detailed treatment for each BioSample. Each sample may have multiple experiments and each experiment belongs to a specific BioSample.

Run describes the files of one technical batch that belong to a specific Experiment. Each Run may have multiple files.
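
The relationships above can be sketched as nested records. This is only an illustration of the hierarchy, not GSA's actual schema: the class names mirror the object names in the text, while the field names, the helper count_files and the accession values used with it are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    accession: str
    files: List[str] = field(default_factory=list)  # a Run may hold several files


@dataclass
class Experiment:
    accession: str
    runs: List[Run] = field(default_factory=list)  # each Run belongs to one Experiment


@dataclass
class BioSample:
    accession: str
    experiments: List[Experiment] = field(default_factory=list)  # one sample, many experiments


@dataclass
class BioProject:
    accession: str
    samples: List[BioSample] = field(default_factory=list)  # a project typically has many samples


def count_files(project: BioProject) -> int:
    """Count every data file reachable from a project (hypothetical helper)."""
    return sum(
        len(run.files)
        for sample in project.samples
        for experiment in sample.experiments
        for run in experiment.runs
    )
```

Walking from a BioProject down to its files always passes through exactly one BioSample, one Experiment and one Run, which is why files from different samples cannot share a Run.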

Frequently Asked Questions
Answers to some of the most frequently asked questions submitted to the GSA are listed as follows.
  1. Introduction
    1. What is GSA?

      GSA is short for Genome Sequence Archive, a data repository for raw sequencing data from genome, transcriptome and other omics studies. It archives raw sequence data produced by a wide variety of sequencing platforms. GSA is one of the database resources of the BIG Data Center (BIGD), part of the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS), and serves as a primary archive of genome sequencing data for institutions and laboratories worldwide.

    2. How can I submit data to GSA?

      Only registered users can submit data to GSA. Briefly, data submission requires the following steps.

         a)    Create a GSA account and/or log in to GSA;

         b)    Enter metadata information;

         c)    Submit data files;

         d)    Specify the release date.

  2. GSA Accounts
    1. How do I acquire a GSA account?

      Any user can freely register and create a GSA account. After your registration data is submitted, a confirmation email will be automatically sent to you for activating your account.

    2. What if I’ve forgotten my GSA username and password?

      ♦  If you have forgotten only your password, click “Forgot password”. You will receive an e-mail; please follow the URL in it to reset your password within 30 minutes.

      ♦  If you are already a member and you’ve forgotten both your GSA username and password, please feel free to contact us. We will do our best to help you.

  3. Data Entry and Transmission
    1. How do I get started?

      Data submission requires that you log into GSA, so you need to create an account if you are not a member.

      Please note that fields marked * are required when submitting metadata.

    2. How do I transfer data to GSA by FTP?

      In the current version (1.1) of GSA, it is highly recommended that you submit your files using a dedicated FTP tool (e.g., FileZilla). Please transmit your data files to the GSA FTP site using the following credentials:



            Password: PPtaT385

      Upload files to the /GSA directory.

      Please use binary mode to transfer files. If you are using an FTP client, follow the tool's instructions to set the transfer mode; if you are using the ftp command, type the binary command before the put command.

      Please note that you should create a unique folder for your files on the FTP server.
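
      For scripted uploads, the same steps (change to /GSA, create a unique folder, transfer in binary mode) might look like the sketch below, which uses Python's standard ftplib. The host, username, password and folder name are placeholders to be supplied by the submitter, and the function names are illustrative only.

```python
import os
from ftplib import FTP, error_perm


def upload_binary(ftp, local_path, remote_dir, user_folder):
    """Upload one file in binary mode into a per-submitter folder."""
    ftp.cwd(remote_dir)
    try:
        ftp.mkd(user_folder)  # create the unique folder
    except error_perm:
        pass  # assumption: the server refuses because the folder already exists
    ftp.cwd(user_folder)
    with open(local_path, "rb") as handle:
        # storbinary switches the connection to binary (image) mode itself
        ftp.storbinary("STOR " + os.path.basename(local_path), handle)


def submit_file(host, user, password, local_path, user_folder):
    """Connect with caller-supplied credentials and upload into /GSA."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        upload_binary(ftp, local_path, "/GSA", user_folder)
```

      Because storbinary always transfers in binary mode, no separate equivalent of the binary command is needed here.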

    3. Which data file formats do you accept?

      In the current version, we recommend that read data be submitted in either FASTQ or BAM format. GSA accepts only GZIP and BZIP2 compression formats (it DOES NOT accept 7-ZIP, RAR or TAR). In addition, GSA does not accept multiplexed data.
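
      A quick local check of file names before upload can catch unsupported archives early. The sketch below judges only by file extension; the function name and the extension lists are assumptions, and treating .tar.gz, .tar.bz2 and .tgz as rejected is an interpretation of the "no TAR" rule rather than a documented GSA behaviour.

```python
REJECTED_SUFFIXES = (".7z", ".rar", ".tar", ".tar.gz", ".tar.bz2", ".tgz")
ACCEPTED_SUFFIXES = (".gz", ".bz2")


def compression_status(filename):
    """Classify a file name as 'accepted', 'rejected' or 'uncompressed'."""
    name = filename.lower()
    # Check the rejected list first so that .tar.gz is not mistaken for plain gzip.
    if name.endswith(REJECTED_SUFFIXES):
        return "rejected"
    if name.endswith(ACCEPTED_SUFFIXES):
        return "accepted"
    return "uncompressed"
```

      An extension check says nothing about the file contents, so it complements rather than replaces the MD5 verification described later.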

    4. How do I name the transmitted data files?

      Data files submitted in FASTQ format are listed in a Run and merged into one or several sequence archive files (please do not exceed 10 GB). Therefore, data files from different samples or replicates should not be grouped in the same Run. Single reads must be submitted in a single archive file, which can be named with a suffix such as ‘_1’, ‘_2’, etc. Paired-end data files (forward/reverse), by contrast, MUST be listed in order in a single Run. For example, forward and reverse reads alternate in the file and are named in order with “F” and “R” appended, respectively (i.e., read "1F", followed by read "1R", then read "2F", then "2R").
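
      The alternating order in the example above can be verified with a few lines of code. This sketch assumes read names are plain strings that end in "F" or "R" exactly as in the example; the function name is hypothetical and real FASTQ headers may need to be parsed first.

```python
def is_interleaved(read_names):
    """Return True if names alternate as 1F, 1R, 2F, 2R, ... with matching stems."""
    if len(read_names) % 2 != 0:
        return False  # an unpaired read is left over
    for i in range(0, len(read_names), 2):
        forward, reverse = read_names[i], read_names[i + 1]
        if not (forward.endswith("F") and reverse.endswith("R")):
            return False
        if forward[:-1] != reverse[:-1]:
            return False  # the two mates must share the same read identifier
    return True
```

      For instance, the order 1F, 1R, 2F, 2R passes, whereas 1F, 2F, 1R, 2R (all forward reads first) does not.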

    5. How are submitted files processed?

      All submitted files will be regularly moved from FTP to a staging area for processing. Thus, it is quite normal that files “disappear” from FTP. If files pass processing successfully, they will be made public or placed under controlled access according to the release date set by the user.

    6. What is an MD5 checksum and how do I compute it?

      MD5 checksums are used to verify the integrity of transmitted data. An MD5 checksum is a 32-character hexadecimal string like "e3b5dd475c449300dd11f258538ff494".

      ♦  For Linux users, use: $ md5sum

      ♦  For Mac users, use: $ md5

      ♦  Windows users need to use a third-party tool.
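
      On any platform with Python installed, the checksum can also be computed with the standard hashlib module. The sketch below reads the file in chunks so that large sequence files do not have to fit in memory; the function name and chunk size are arbitrary choices.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a 32-character hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

      The result should match the output of md5sum or md5 for the same file.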

  4. Data Release and Citation
    1. How do I set the release date or make data publicly available?

      When you submit data, you will find a button named “Set release date” at the bottom of the web page. After you specify the release date, the data release will be triggered or extended according to the date entered. It is suggested that you set the release date of the Experiment/Run later than that of the BioProject or BioSample.

    2. Which accession numbers should be cited in my publication?

      A GSA submission is organized into the following objects, each with a unique prefix and standard naming.

      ♦  BioProject (Project): PRJC

      ♦  BioSample (Sample): SAMC

      ♦  Experiment: CRX

      ♦  Run: CRR

      Please cite the accession number(s) of the objects of interest in your publication. To provide more detailed information about your submission, it is recommended that the BioProject accession number be used in your publication.
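
      Since each object type has its own prefix, the type of an accession can be recognised from its first letters. The lookup below is a minimal sketch built only from the four prefixes listed above; the function name and the sample accessions in the note after it are hypothetical, and the digits that follow a prefix are not validated.

```python
PREFIX_TO_OBJECT = {
    "PRJC": "BioProject",
    "SAMC": "BioSample",
    "CRX": "Experiment",
    "CRR": "Run",
}


def object_type(accession):
    """Return the GSA object type for an accession, or None if the prefix is unknown."""
    upper = accession.strip().upper()
    for prefix, kind in PREFIX_TO_OBJECT.items():
        if upper.startswith(prefix):
            return kind
    return None
```

      For example, a made-up accession such as "CRR000001" maps to "Run", while a string with an unlisted prefix returns None.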

  5. Help
    1. Contact information

      If you have any questions, suggestions or comments, or would like to report a bug, please feel free to contact us via email or instant messaging software (QQ Group: 548170081).

    2. Collaboration & Visit

      You are also welcome to visit us to explore possible collaboration or to learn more about GSA:


      BIG Data Center

      Beijing Institute of Genomics, Chinese Academy of Sciences

      No.1 Beichen West Road, Chaoyang District

      Beijing 100101, China

      Tel: +86 (10) 8409-7340

      Fax: +86 (10) 8409-7720