2.1 NCBI Datasets: an overview

mtntsuchiya edited this page Oct 15, 2024 · 4 revisions

NCBI Datasets is a resource that lets users download data and metadata through an API, a web interface, or a command-line tool. In this workshop, we will focus on the command-line tool, its structure and organization.

[Figure: getting_started]

A. Command-line tools: datasets and dataformat

While the web interface is helpful, there are times when it's more convenient to access genomes through a command-line environment. For example, let's say you are working on your institution's high-performance computing (HPC) system and you need to download dozens (or hundreds) of genomes. If you're using the Datasets web interface, this would potentially be a two-step process:

  1. Download the genome data package locally;
  2. Transfer the files to the HPC system.

With the NCBI Datasets command-line interface (CLI), you can do this process in a single step. Our CLI allows users to access not only genomes, but also genes, ortholog sets and virus genomes.
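As a minimal sketch of that single step, the snippet below builds a genome download command and prints it so you can inspect it before running it on the HPC system. The accession (the *E. coli* K-12 MG1655 RefSeq assembly) and the output file name are only illustrative choices, not values used elsewhere in this workshop.

```shell
# Illustrative accession (E. coli K-12 MG1655); substitute your own.
accession="GCF_000005845.2"

# Name of the zip archive that will hold the data package (our choice).
outfile="${accession}.zip"

# Build the download command as a string and print it for inspection.
cmd="datasets download genome accession ${accession} --filename ${outfile}"
echo "$cmd"

# To actually download, run the printed command directly on the machine
# where the datasets CLI is installed (for example, the HPC login node).
```

Because the command runs where the data is needed, there is no separate transfer step.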

The program follows a hierarchy that makes it easier for users to select exactly which options they would like to use. In addition to the program commands, flags are available for filtering the results. We will go over those during this tutorial.

[Figure: datasets-schema]

In this virtual machine, we have the necessary tools installed for you to explore NCBI Datasets without the need to configure anything. When you decide to use NCBI Datasets on your own machine or HPC system, you will need to install it first. More information on how to install NCBI Datasets can be found on our documentation page.

The NCBI Datasets CLI command structure is very intuitive. If you take a look at the diagram below, you will notice that the commands are built by choosing one option from each vertical rectangle. Let's start!

[Figure: datasets-command]
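The "one option per level" idea can be sketched in the shell as follows. The helper `build_datasets_cmd` is hypothetical (it is not part of the CLI); it simply joins an action, a data type, an identifier type and a value into a command line and prints it. The accession, taxon and gene ID are illustrative values only.

```shell
# Hypothetical helper: compose a datasets command from one choice per level.
#   $1 = action (summary | download)
#   $2 = data type (genome | gene)
#   $3 = identifier type (accession | taxon | gene-id)
#   $4 = identifier value
build_datasets_cmd() {
  printf 'datasets %s %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Metadata summary for all genomes under a taxon.
build_datasets_cmd summary genome taxon '"Escherichia coli"'

# Data package for a single genome assembly.
build_datasets_cmd download genome accession GCF_000005845.2

# Metadata summary for a gene, looked up by NCBI Gene ID (672 is human BRCA1).
build_datasets_cmd summary gene gene-id 672
```

Each printed line is a complete command that can be pasted into the terminal; flags for filtering are appended after the identifier value.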

In addition to datasets, we also have dataformat, a companion tool to explore and convert metadata to TSV or Excel formats. We will cover the dataformat command syntax and use in the metadata section.
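As a preview, and only as a sketch, a common pattern is to pipe the JSON Lines output of `datasets summary` into `dataformat`. The snippet below assembles such a pipeline and prints it; the accession and the two selected fields are illustrative choices, and the metadata section covers the full syntax.

```shell
# Illustrative accession and metadata fields; adjust to your needs.
accession="GCF_000005845.2"
fields="accession,organism-name"

# datasets emits one JSON record per line; dataformat turns it into a TSV table.
pipeline="datasets summary genome accession ${accession} --as-json-lines | dataformat tsv genome --fields ${fields}"

# Print the pipeline for inspection; paste it into the terminal to run it.
echo "$pipeline"
```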

Next: 2.2 Retrieving bacterial data and metadata using datasets