Documents for GSA

GSA Handbook

The GSA Handbook (Version 2.0, June 2017), which contains detailed descriptions of the data items, is freely available here.



GSA Submission Quick Start Guide

The GSA Submission Quick Start Guide (Version 2.0, June 2017), which describes the submission procedure, is freely available here.



Tutorial

GSA Data Model

Designed for compatibility, the Genome Sequence Archive (GSA) follows INSDC data standards and structures. All data are organized into four objects, i.e., BioProject, BioSample, Experiment, and Run (Figure 1). A BioProject, which bears an accession number prefixed with "PRJC", provides an overall description of an individual research initiative, including a basic description, organism, data type, submitter, funding information, and publication(s) if available.

Figure 1: Data model in GSA
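
As a rough illustration, the four objects could be modelled as follows. This is only a sketch, not the GSA schema: the class and field names are hypothetical, the links between objects (an Experiment referring to one BioProject and one BioSample, a Run belonging to one Experiment) are assumed from common INSDC practice, and the accession and file names are placeholders. Only the "PRJC" prefix and the list of BioProject fields come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BioProject:
    accession: str  # prefixed with "PRJC" according to the text
    description: str
    organism: str
    data_type: str
    submitter: str
    funding: str = ""
    publications: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.accession.startswith("PRJC"):
            raise ValueError("BioProject accession must start with 'PRJC'")


@dataclass
class BioSample:
    name: str  # hypothetical field


@dataclass
class Run:
    files: List[str]  # sequence files listed in this Run


@dataclass
class Experiment:
    project: BioProject  # assumed link, not confirmed by the text
    sample: BioSample    # assumed link, not confirmed by the text
    runs: List[Run] = field(default_factory=list)


# Placeholder values for illustration only.
project = BioProject(
    accession="PRJCA000000",
    description="Example research initiative",
    organism="Example organism",
    data_type="Genome sequencing",
    submitter="Example submitter",
)
sample = BioSample(name="sample_1")
experiment = Experiment(project=project, sample=sample)
experiment.runs.append(Run(files=["sample_1_reads.fastq.gz"]))
print(experiment.project.accession, len(experiment.runs))
```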



Organization of metadata objects

The following are examples of how metadata can be organized. Submitters can organize metadata objects flexibly.

 Comparative genome sequencing of three strains (paired-end): include the paired-end read files of each strain in a single Run (Figure 2).

Figure 2: Comparative genome sequencing of three strains (paired-end)
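
A minimal sketch of this layout, assuming one Run per strain that lists both read files of the pair. The strain names and file names are hypothetical and chosen only for illustration.

```python
def build_runs(strain_names):
    """Return a mapping from strain name to the two paired-end files of its Run."""
    runs = {}
    for name in strain_names:
        # Forward file first, then reverse, so the pair stays in order.
        runs[name] = [f"{name}_forward.fastq.gz", f"{name}_reverse.fastq.gz"]
    return runs


# Hypothetical strain names.
runs = build_runs(["strainA", "strainB", "strainC"])
for strain, files in runs.items():
    print(strain, files)
```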



 Technical and biological replicates (Figure 3).

Figure 3: Technical and biological replicates



Data submission and retrieval

To create a submission, users need to register and log in to the Genome Sequence Submission (Gsub) System. To simplify the submission procedure as much as possible, GSA provides a user-friendly input wizard for metadata collection (Figure 4). To ease sequence file uploading, GSA provides an FTP server that supports both IPv4 and IPv6.

Figure 4: Graphic illustration of data submissions to GSA



Frequently Asked Questions

Answers to some of the most frequently asked questions submitted to the GSA are listed as follows.
  1. Introduction
    1. What is GSA?
    2. How can I submit data to GSA?
  2. GSA Accounts
    1. How do I acquire a GSA account?
    2. What if I’ve forgotten my GSA username or password?
  3. Data Entry and Transfer
    1. How do I get started?
    2. How do I connect to the GSA data by FTP?
    3. What is your data file format?
    4. How do I name the transmitted data files?
    5. What is the process for submitted files?
    6. What is an MD5 checksum and how do I compute it?
  4. Data Release and Citation
    1. How do I set the release date or make data publicly available?
    2. Which accession numbers should be cited in my publication?
  5. Help
    1. Contact information
    2. Collaboration & Visit

  1. Introduction
    1. What is GSA?

      GSA is short for Genome Sequence Archive, a data repository for raw sequencing data from genome, transcriptome and other omics studies. It archives raw sequence data produced by a wide variety of sequencing platforms. GSA is one of the database resources of the BIG Data Center (BIGD), part of the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS), and serves as a primary archive of genome sequencing data for institutions and laboratories worldwide.

    2. How can I submit data to GSA?

      Only registered users can submit data using the Genome Sequence Submission (Gsub) System. Briefly, data submission requires the following steps.

         a)    Create a BIGD account and/or log in to Gsub;

         b)    Enter metadata information;

         c)    Submit data files;

         d)    Specify the release date.

  2. GSA Accounts
    1. How do I acquire a GSA account?

      Any user can freely register and create a Gsub account. After your registration data is submitted, a confirmation email will be automatically sent to you for activating your account.

    2. What if I’ve forgotten my GSA username or password?

      ♦  If you have forgotten only your password, you can reset it by clicking “Forgot password”. You will receive an e-mail containing a URL; please follow it within 30 minutes to reset your password.

      ♦  If you are already a member and have forgotten both your GSA username and password, please feel free to contact us. We will do our best to help you.

  3. Data Entry and Transfer
    1. How do I get started?

      Data submission requires that you log in to the Genome Sequence Submission (Gsub) System, so you need to create an account if you are not yet a member.

      Please note that fields marked * are required when submitting metadata.

    2. How do I connect to the GSA data by FTP?

      In the current version (2.0) of GSA, it is highly recommended that you submit your files using a dedicated FTP tool (e.g., FileZilla). Please transmit your data files to the Gsub FTP site using the following credentials:

            Address:   ftp://submit.big.ac.cn

            User:         Same as your Gsub login

            Password: Same as your Gsub login

      Please NOTE that you should create a unique folder on the FTP server.
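
      For scripted uploads, the same steps can be sketched with Python's standard ftplib module. This is an illustration rather than an official GSA client: the host is taken from the address above, while the user name, password, folder name and file name are placeholders, and treating a permission error from the folder creation as "folder already exists" is an assumption.

```python
from ftplib import FTP, error_perm
from pathlib import Path

GSUB_FTP_HOST = "submit.big.ac.cn"  # host from the address given above


def upload_files(ftp, folder, paths):
    """Create `folder` on the server if needed, enter it, and upload each file in binary mode."""
    try:
        ftp.mkd(folder)
    except error_perm:
        pass  # assumption: the server refuses because the folder already exists
    ftp.cwd(folder)
    for path in paths:
        path = Path(path)
        with path.open("rb") as handle:
            ftp.storbinary(f"STOR {path.name}", handle)


def main():
    # Placeholders: replace with your own Gsub login, folder name and files.
    with FTP(GSUB_FTP_HOST) as ftp:
        ftp.login(user="your_gsub_user", passwd="your_gsub_password")
        upload_files(ftp, "my_unique_folder", ["sample_1_reads.fastq.gz"])
```

      Calling main() performs the actual connection; upload_files can also be reused with an FTP session that is already logged in.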

    3. What is your data file format?

      In the current version, we recommend that read data be submitted in either FASTQ or BAM format. GSA accepts only the GZIP and BZIP2 compression formats (it does NOT accept 7-ZIP, RAR or TAR). In addition, GSA does not accept multiplexed data.
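
      A quick local check of file names before uploading might look like the sketch below. The mapping from formats to file extensions (.gz, .bz2, .7z, .rar, .tar) is an assumption, as is the decision to treat compressed TAR archives such as .tar.gz as rejected; GSA's own validation may differ.

```python
ACCEPTED_COMPRESSION = (".gz", ".bz2")             # GZIP and BZIP2
REJECTED_ARCHIVES = (".7z", ".rar", ".tar")        # 7-ZIP, RAR and TAR
COMPRESSED_TAR = (".tar.gz", ".tar.bz2")           # assumed to count as TAR


def check_compression(filename):
    """Return 'accepted', 'rejected' or 'unknown' based on the file extension."""
    lower = filename.lower()
    if lower.endswith(COMPRESSED_TAR) or lower.endswith(REJECTED_ARCHIVES):
        return "rejected"
    if lower.endswith(ACCEPTED_COMPRESSION):
        return "accepted"
    return "unknown"


for name in ["reads.fastq.gz", "reads.fastq.bz2", "reads.rar", "reads.tar.gz"]:
    print(name, check_compression(name))
```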

    4. How do I name the transmitted data files?

      Data files are submitted in FASTQ format, listed in a Run, and merged into one or several sequence archive files (please do not exceed 10 GB). Data files from different samples or replicates should therefore not be grouped in the same Run. Single reads must be submitted in a single archive file, which can be named with a suffix appended, such as '_1', '_2', etc. Paired-end data files (forward/reverse), by contrast, MUST be listed in order in a single Run. For example, forward and reverse reads alternate in the file and are named in order with "F" and "R" appended, respectively (i.e., read "1F", followed by read "1R", then read "2F", then "2R").
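
      The alternating order described above can be checked with a short script. This is a sketch under stated assumptions: read names are taken to end in "F" or "R" exactly as in the example, the 10 GB limit is assumed to apply per file, and it is interpreted as 10 × 1000³ bytes.

```python
MAX_ARCHIVE_BYTES = 10 * 1000**3  # assumed meaning of "10 GB", applied per file


def is_interleaved(read_names):
    """True if names follow the pattern 1F, 1R, 2F, 2R, ... with mates adjacent."""
    if len(read_names) % 2 != 0:
        return False
    for i in range(0, len(read_names), 2):
        forward, reverse = read_names[i], read_names[i + 1]
        if not (forward.endswith("F") and reverse.endswith("R")):
            return False
        if forward[:-1] != reverse[:-1]:
            return False  # the two reads do not belong to the same pair
    return True


def within_size_limit(size_bytes):
    """True if a file of this size stays within the assumed 10 GB limit."""
    return size_bytes <= MAX_ARCHIVE_BYTES


print(is_interleaved(["1F", "1R", "2F", "2R"]))
print(is_interleaved(["1F", "2F", "1R", "2R"]))
```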

    5. What is the process for submitted files?

      All submitted files are regularly moved from the FTP server to a staging area for processing, so it is quite normal for files to "disappear" from the FTP server. Files that pass processing will be made public or placed under controlled access according to the release date set by the user.

    6. What is an MD5 checksum and how do I compute it?

      MD5 checksums are used to verify the integrity of transmitted data. An MD5 checksum is a 32-character hexadecimal string like "e3b5dd475c449300dd11f258538ff494".

      ♦  For Linux users, use: $ md5sum <file>

      ♦  For Mac users, use: $ md5 <file>

      ♦  Windows users need to use a third-party tool.
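
      On any platform with Python installed, the checksum can also be computed with the standard hashlib module. The sketch below reads the file in chunks so that large sequence files do not need to fit in memory; the function name and chunk size are arbitrary choices.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a 32-character hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

      For example, md5_of_file("reads.fastq.gz") returns the value to compare with the checksum of the uploaded file, where "reads.fastq.gz" is a placeholder file name.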

  4. Data Release and Citation
    1. How do I set the release date or make data publicly available?

      When you submit data, you will find a button named "Set release date" at the bottom of the web page. Specifying a release date triggers or extends the data release according to the date entered. We suggest setting the release date of an Experiment/Run later than that of its BioProject or BioSample.

    2. Which accession numbers should be cited in my publication?

      Cite the BioProject accession number(s) of your submission. A suggested statement is: "The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (Genomics, Proteomics & Bioinformatics 2017) in BIG Data Center (Nucleic Acids Res 2017), Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, under accession numbers PRJCAxxxxxx, PRJCAyyyyyy that are publicly accessible at http://bigd.big.ac.cn/gsa." Please also cite the following required publications.

       GSA: Genome Sequence Archive. Genomics, Proteomics & Bioinformatics 2017, 15(1): 14-18. [PMID=28387199]

       The BIG Data Center: from deposition to integration to translation. Nucleic Acids Res 2017, 45(D1): D18-D24. [PMID=27899658]

  5. Help
    1. Contact information

      If you have any questions, suggestions or comments, or would like to report a bug, please feel free to contact us via email (gsa@big.ac.cn) or instant messaging (QQ Group: 548170081).

    2. Collaboration & Visit

      You are also welcome to visit us to explore possible collaboration or to learn more about GSA.

      Address:

            BIG Data Center

            Beijing Institute of Genomics, Chinese Academy of Sciences

            No.1 Beichen West Road, Chaoyang District

            Beijing 100101, China

            Tel: +86 (10) 8409-7340

            Fax: +86 (10) 8409-7720