Recommendations for the formatting of Variant Call Format (VCF) files to make plant genotyping data FAIR

F1000Res. 2022 Feb 24:11:ELIXIR-231. doi: 10.12688/f1000research.109080.2. eCollection 2022.

Abstract

In this opinion article, we discuss the formatting of files from (plant) genotyping studies, in particular the formatting of metadata in Variant Call Format (VCF) files. The flexibility of the VCF specification facilitates its use as a generic interchange format across domains but can lead to inconsistency between files in the presentation of metadata. To enable fully autonomous, machine-actionable data flow, generic elements need to be further specified. We strongly support the FAIR principles and see the need to facilitate them through technical implementation specifications as well. These principles form the basis for the VCF extensions proposed here. We have learned from existing applications of VCF that two things are particularly necessary: defining relevant metadata using controlled standards and vocabularies, and consistently using cross-references via resolvable (machine-readable) identifiers. We propose an encoding for both. VCF is an established standard for the exchange and publication of genotyping data. Other data formats are also used to capture variant data (for example, the HapMap and the gVCF formats), but none currently has the reach of VCF. For the sake of simplicity, we discuss only VCF and our recommendations for its use, but these recommendations could also be applied to gVCF. However, the part of the VCF standard relating to metadata (as opposed to the actual variant calls) defines a syntactic format but no vocabulary, unique identifiers or recommended content. In practice, often only sparse descriptive metadata is included. When descriptive metadata is provided, proprietary metadata fields that have not been agreed upon within the community are frequently added, which may limit long-term and comprehensive interoperability. To address this, we propose recommendations for supplying and encoding metadata, focusing on use cases from plant sciences. We expect there to be overlap, but also divergence, with the needs of other domains.
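
To illustrate the gap described above, the following is a minimal Python sketch of how a consumer might read the `##key=value` meta-information lines of a VCF header and check them against a list of expected keys. It relies only on the generic VCF header syntax (meta-information lines prefixed with `##`, followed by the `#CHROM` header line). The set `REQUIRED_KEYS`, the function names and the `ExternalID` field in the usage note are assumptions made for illustration; they are not the fields recommended in the article.

```python
import re

# Hypothetical set of header keys a community profile might require.
# These names are illustrative assumptions, not the article's recommendations.
REQUIRED_KEYS = {"fileformat", "reference", "SAMPLE"}


def parse_meta_lines(lines):
    """Collect '##key=value' meta-information lines into a dict of lists."""
    meta = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("##"):
            break  # meta-information lines precede the '#CHROM' header line
        key, sep, value = line[2:].partition("=")
        if sep:
            meta.setdefault(key, []).append(value)
    return meta


def parse_structured(value):
    """Split a structured value such as '<ID=x,Description="a, b">' into a dict.

    Returns None if the value is not wrapped in angle brackets.
    """
    if not (value.startswith("<") and value.endswith(">")):
        return None
    # A field value is either a double-quoted string (which may contain commas)
    # or an unquoted run of characters up to the next comma.
    pairs = re.findall(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)', value[1:-1])
    return {key: val.strip('"') for key, val in pairs}


def missing_keys(meta, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the parsed header."""
    return sorted(set(required) - set(meta))
```

A caller could pass an open file handle to `parse_meta_lines`, report the result of `missing_keys`, and then apply `parse_structured` to each `SAMPLE` entry to look for an identifier field (for instance a hypothetical `ExternalID`) that can be resolved against an external sample registry. The check is purely syntactic: it can confirm that a key is present, but not that its value follows an agreed vocabulary, which is the part the article argues needs further specification.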

Keywords: ELIXIR; FAIR; data management; genotyping; phenotyping; plant; snp; vcf.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Genotype
  • Metadata*
  • Software*

Grants and funding

This study received funding from ELIXIR, the research infrastructure for life-science data, through the ELIXIR Implementation Study: FONDUE - FAIR-ification of Plant Genotyping Data and its linking to Phenotyping using ELIXIR Platforms. ML and SW received funding for the AGENT project from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 862613. US received funding for the de.NBI project from the German BMBF under the FKZ 031A536A.