VARAN - User Guide

User guide

The Varan application is organized into two distinct functionalities, each designed to handle different tasks and requiring specific inputs:

Study Creation: This function allows users to initiate a new study folder from scratch, setting up all the necessary configurations and files.
Study Manipulation: This functionality supports updating, extracting, or removing samples from an already existing study folder.

To ensure transparency and traceability, the system implements the following features:

Versioning: Each study is automatically versioned (e.g., study_v1, study_v2, etc.), allowing clear tracking of changes and updates over time.
Detailed Reports: For each version, a comprehensive report is generated, which includes the filters applied, the configuration used in the study, the number of samples and patients, and a comparison with previous versions to track the study's evolution.
Validation Check: Before uploading a study to cBioPortal, the system performs a validation to confirm that the upload can be executed successfully.

⚠️ Be warned that files succeeding this validation may still fail to load (correctly).

Installation Procedures

Varan can be installed either via Docker or through a local setup. Please refer to the appropriate section below based on your preferred installation method:

Docker installation (recommended): Follow the step-by-step instructions provided in the Quickstart Guide.
Local installation: Detailed procedures for installing Varan locally are available here.

Study Creation

Input Preparation

To create a new study, Varan accepts two types of input:

Folder Input: a folder containing all the necessary data files.
File Input: a file containing the paths to the data files.

Both formats are supported for processing, and you can choose the one that best suits your workflow.

Varan currently accept Mutations and CNA data as input.

Folder

The user must organize an input folder containing all of the vcf and tsv files requested following the structure reported below:

input_folder/
├── CNV
│   ├── 001.vcf
│   ├── 002.vcf
│   └── 003.vcf
├── SNV
│   ├── 001.vcf
│   ├── 002.vcf
│   └── 003.vcf
├── CombinedOutput
│   ├── 111_CombinedVariantOutput.tsv
│   ├── 222_CombinedVariantOutput.tsv
│   └── 333_CombinedVariantOutput.tsv
├── FUSIONS
│   └── Fusions.tsv
├── sample.tsv   
└── patient.tsv

Where:

SNV folder contains all VCF files related to single nucleotide variants.
CNV folder contains all VCF files related to copy number variants.
CombinedVariantOutput folder contains all the Combined Variant Output files in TSV format, named based on the patient ID. These files include information about TMB, MSI, and Fusions.
FUSIONS folder contains the template fusion.tsv, which can be filled with fusion data if the Combined Variant Output files are unavailable.
sample.tsv file is a template for clinical data of the samples.
patient.tsv file is a template for clinical data of the patients.

File(s)

The user must compile several input files (by filling in specific templates):

sample.tsv file is a template for clinical data of the samples; ⚠️ mandatory.
patient.tsv file is a template for clinical data of the patients; optional.
fusions.tsv file is a template for fusions' information; optional.

Templates

For both folder-based and file-based input, users are required to complete the provided templates. The following section outlines the structure of these templates:

sample.tsv

The following table shows the required fields for the sample clinical data (this will generate the data_clinical_sample.txt file).

You can view and download a template version here.

⚠️ This file is required for Varan analysis!

SAMPLE_ID	PATIENT_ID	ONCOTREE_CODE	snv_path	cnv_path	comb_path	MSI	TMB	MSI_THR	TMB_THR	...
0000001_DNA	11111111	BOWEL	path_to_snv	path_to_cnv	path_to_combined_output	1	12			...

The obligatory fields to keep are:

SAMPLE_ID: IDs of all samples of interest.
PATIENT_ID: IDs of all patients of interest.
ONCOTREE_CODE: code to associate for the oncoKB annotation.
snv_path: path to SNV file for every sample.
cnv_path: path to CNV file for every sample.
cnv_path: path to Combined Variant Output file for every sample.
MSI: MSI value for every sample.
TMB: TMB value for every sample.
MSI_THR: MSI categorization based on the threshold set in the conf.ini file.
TMB_THR: TMB categorization based on the threshold set in the conf.ini file.

⚠️ The user can add new columns starting from the last one.

patient.tsv

This template must be filled by the user with all disposable patients' clinical info and will be used to create the data_clinical_patient.txt.

You can view and download a template version here.

⚠️ This file is optional.

PATIENT_ID	AGE	SEX	...
11111111	45	F	...

The obligatory fields to keep are:

PATIENT_ID: IDs of all patients of interest.
AGE: Age all patients of interest.
SEX: Sex all patients of interest.

fusion.tsv

This template must be filled by the user with all disposable fusion info and will be used to create the data_sv.txt.

You can view and download a template version here.

⚠️ This file is optional.

Sample_Id	SV_Status	Site1_Hugo_Symbol	Site2_Hugo_Symbol	...
0000001_DNA	SOMATIC	APC	BRCA1	...

The obligatory fields to keep are:

Sample_Id: IDs of all sample of interest
SV_Status: fusion type
Site1_Hugo_Symbol: first gene involved in fusion
Site2_Hugo_Symbol: second gene involved in fusion

Configuration File

The next step to start using Varan is to properly configure the conf.ini file. You can view and download a template version here.

⚠️ In the next subparagraph, for each field the type of variable requested will be inserted between angle brackets <>. For string, two possible entries can be found: < 'string' > and < string >. In the first case, it is requested to insert text inside quotation marks (i.e. DESCRIPTION='this is the description'), while on the other one, quotation marks are not requested (i.e. PROJECT_NAME=study).

⚠️ For Docker, a partially pre-filled conf.ini file is provided, with some paths already configured.

⚠️ Only the fields from the conf.ini file that are relevant to this functionality will be displayed below.

Paths

This section allows you to specify the paths for vcf2maf (VCF2MAF), ClinVar (CLINV), fasta (REF_FASTA), VEP (VEP_PATH), and its cache (VEP_DATA). In the CACHE section, you can set the Ensembl version (e.g., 111).

[Paths]
VCF2MAF = < string >
CLINV = < string >
REF_FASTA = < string >
VEP_PATH = < string >
VEP_DATA = < string >
CACHE = < string >

⚠️ The reference genome must be the same used for the secondary analysis and the VCFs annotation.

Multiple

In this section, it's possible to specify the paths in case of multiple SNV, CNV, and/or Combined Variant Output files analysis.

[Multiple]
SNV= < string >
CNV= < string >
COMBOUT= < string >

⚠️ This section has to be filled only in case of input by file.

Zip

In this section, you can decide how to manage MAF and SNV_FILTERED files.

[Zip]
ZIP_MAF = < boolean >
ZIP_SNV_FILTERED = < boolean >

OncoKB

In this section, it's possible to insert the personal oncoKB key. This key is required to execute the oncoKB annotation.

[OncoKB]
ONCOKB= < string >

⚠️ You can request the oncoKB key here

Project

In this section, it's possible to specify project info like study name, ID, description, and profile. These info will be inserted in meta files.

[Project]
PROJECT_ID = < string >
PROJECT_NAME = < string >
DESCRIPTION = < 'string' >
PROFILE_MUT = < 'string' >
PROFILE_CNA = < 'string' >
PROFILE_CNA_HG19 = < 'string' >
PROFILE_SV = < 'string' >

Filters

Here, it is possible to specify the filters' threshold to apply to SNV MAF files. For more information, refer to filters section.

[Filters]
t_VAF_min = < float >
t_VAF_max = < float >
t_VAF_min_novel = < float >
AF = < string >
drop_NA_AF = < boolean >
ONCOKB_FILTER = < ['string','string',...] >
CLIN_SIG = < ['string','string',...] >
CONSEQUENCES = < ['string','string',...] >
POLYPHEN = < ['string','string',...] >
IMPACT = < ['string','string',...] >
SIFT = < ['string','string',...] >

⚠️ The AF field can be populated with < / > / <= />= val (i.e., AF = <0.005).

Cna

In this section, the user can insert CNV genotypes of interest, ploidy, and decide whether or not to apply the CNVkit formula for copy number variation analysis.

[Cna]
HEADER_CNV = < ['string','string',...] >
PLOIDY = < int >
CNVkit = < boolean >

⚠️ Ploidy will be used to evaluate copy number discretization using cnvkit formula.

TMB

Here TMB thresholds can be specified.

[TMB]
THRESHOLD_TMB = < {'string':'string', 'string':'string', ...} >

i.e. THRESHOLD = {'Low':'<=5', 'Medium':'<10', 'High':'>=10'} where the string before ':' is the label assigned to TMB value, while the other is the specific threshold (i.e. for a sample with TMB=15, a label 'High' will be reported in the clinical sample data).

MSI

Here MSI thresholds for sites and values can be specified.

[MSI]
THRESHOLD_SITES = < string >
THRESHOLD_MSI = < string >

⚠️ THRESHOLD_SITES value will be used only if MSI information is extracted from Combined Variant Output files. If MSI is directly reported as a value inside the input tsv, only THRESHOLD_MSI will be applied.
⚠️ Both THRESHOLD_SITES and THRESHOLD_MSI can be populated with < / > / <= / >= val (i.e., THRESHOLD_MSI = <20).

FUSION

Here Fusions thresholds can be specified.

[FUSION]
THRESHOLD_FUSION = < string >

⚠️ THRESHOLD can be populated with < / > / <= / >= val (i.e., THRESHOLD_FUSION = >=15).

ClinicalSample

Users can customize column names and data types for the data_clinical_sample.txt file.

[ClinicalSample]
HEADER_SAMPLE_SHORT = < ['string','string',...] >
HEADER_SAMPLE_LONG = < ['string','string',...] >
HEADER_SAMPLE_TYPE = < ['string','string',...] >

⚠️ HEADER_SAMPLE_TYPE accepts only STRING, NUMBER, BOOLEAN. If a different type is inserted, an error will be raised by Varan. ⚠️ If these fields are left empty, a default Header will be produced.

ClinicalPatient

Users can customize column names and data types for the data_clinical_patient.txt file.

[ClinicalPatient]
HEADER_PATIENT_SHORT = < ['string','string',...] >
HEADER_PATIENT_LONG = < ['string','string',...] >
HEADER_PATIENT_TYPE = < ['string','string',...] >

⚠️ HEADER_PATIENT_TYPE accepts only STRING, NUMBER, BOOLEAN. If a different type is inserted, an error will be raised by Varan.
⚠️ If these fields are left empty, a default Header will be produced.

Annotations

In this section, it's possible to insert manual notes that will appear in the report_VARAN.html file.

[Annotations]
ANNOTATIONS = < ['string','string',...] >

Launch Varan

Options

The available options for creating a study with Varan are:

Options	Description	Type	Required
-i --input	Add this option to insert the path of the input (folder or file(s))	string/list	Yes
-o --output_folder	Add this option to insert the path where to save the output folder	string	Yes
-c --cancer	Add this option to specify the cancer type	string	Yes
-f --filter	Add this option to filter out variants from VCF/MAF, followed by the appropriate letters corresponding to the filters you wish to apply. More info here	string	No
-k --onocoKB	Add this option to annotate with oncoKB	boolean	No
-t --analysis_Type	Add this option to specify the type of file (SNV, CNV, Fusion, or Tab) to analyze. If not specified, all analysis will be done	string	No
-w --overWrite	Add this option to overwrite an already existing output folder	boolean	No
-R --resume	Add this option to resume an already started analysis. ⚠️This option must be used with caution, because it assumes that the previous VCF to MAF conversion step was successful	boolean	No
-m --multiple	Add this option to specify that the input is a multi-sample vcf file (a single VCF containing information from multiple patients)	boolean	No

Filtering

Use the following letters after the -f option to apply the corresponding filters described in the table above. You can specify multiple letters at once to apply multiple filters simultaneously:

Letter	Description
d	Filters SNV mutations with ALT="." and FILTER ≠"PASS".
p	Filters MAF mutations with FILTER ≠"PASS".
v	Filters MAF mutations with vaf values (t_VF column) not included in the interval [t_VAF_min; t_VAF_max] specified in the conf.ini file.
n	Applies a specific VAF filter for new mutations (dbSNP_RS = "novel"), excluding new MAF mutations with vaf not included in the interval specified in the conf.ini file, t_VAF_min_novel.
a	Filters MAF mutations with AF values not included in the interval specified in the conf.ini file. You can choose whether to remove or keep mutations without a value for this column using the 'drop_NA_AF' setting.
o	Filters MAF mutations with ONCOGENIC values different from those specified in the conf.ini file under the ONCOKB_FILTER field. This requires the OncoKB annotation (-k).
c	Filters MAF mutations with CLIN_SIG values equal to those specified in the conf.ini file under the CLIN_SIG field.
q	Filters MAF mutations with Consequences values different from those specified in the conf.ini file under the CONSEQUENCES field.
y	Filters MAF mutations with PolyPhen annotations different from those specified in the conf.ini file under the POLYPHEN field.
i	Filters MAF mutations with IMPACT annotations equal to those specified in the conf.ini file under the IMPACT field.
s	Filters MAF mutations with SIFT annotations equal to those specified in the conf.ini file under the SIFT field.

Release filtering configuration

Varan is released with a compiled configuration that filters out common variants with a gnomAD frequency of < 0.04% in at least one gnomAD column. Oncogenic and likely oncogenic variants are retained based on OncoKB classifications. Finally, it retains all variants with an allelic frequency of ≥ 0.05.

Examples

To launch Varan docker version is required to mount several volumes (-v) for granting a correct functioning.

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan <commands>

Ex 1) Launch Varan base analysis with input folder:

Run this command to process the contents of the input folder

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta varan <commands> -v <conf.ini>:/conf.ini varan -i <path_input_folder> -o /output/<output_name> -c <type_of_cancer>

Ex 2) Launch Varan base analysis with input file:

Run one of these commands to process the contents of the input file(s):

If you have sample.tsv and patient.tsv as input:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i sample.tsv patient.tsv -o /output/<output_name> -c <type_of_cancer>

If you have sample.tsv and fusion.tsv as input:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i sample.tsv "" fusion.tsv -o /output/<output_name> -c <type_of_cancer>

If you have sample.tsv, patient.tsv and fusion.tsv as input:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i sample.tsv patient.tsv fusion.tsv -o /output/<output_name> -c <type_of_cancer>

Ex 3) Multiple vcf analysis:

Run this command to specify that your input is a multi-vcf file or folder:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -m

Ex 4) Overwrite analysis:

Run this command to overwrite the output folder:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -w

Ex 5) Resume analysis:

Run this command to resume an already started analysis:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -R

Ex 6) Specify analysis:

Run one of these commands to specify the analysis' type:

snv -> only snv analysis.

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -t snv

cnv -> only cnv analysis.

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -t cnv

fus -> only fusion analysis.

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -t fus

tab -> only meta file creation.

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -t tab

Ex 7) OncoKB annotation:

Execute the command below to enable OncoKB annotation:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -k

Ex 8) Filter VCF/MAF:

The following are a few examples of filters that can be applied, including various possible combinations:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -f q

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -f cav

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output -v <vep_cache_path>:/vep_cache -v <genomes_folder>:/ref_fasta -v <conf.ini>:/conf.ini varan -i <input_folder> -o /output/<output_name> -c <type_of_cancer> -k -f divo

Study Manipulation

Input Preparation

Folder

The input for this type of analysis should be a properly populated study folder. This can either be the output from the previous analysis or an existing study folder downloaded from cBioPortal. Depending on the requested action (Update, Extract, Remove), the additional input may be either another study folder or a sample list in TSV format.

Configuration File

The next step to start using Varan is to properly configure the conf.ini file. You can view and download a template version here.

⚠️ Only the fields from the conf.ini file that are relevant to this functionality will be displayed below.

Zip

In this section, you can decide if you want to copy and zip MAF from the original study.

[Zip]
ZIP_MAF = < boolean >
COPY_MAF = < boolean >

Annotations

In this section, it's possible to insert manual notes that will appear in the report_VARAN.html file.

[Annotations]
ANNOTATIONS = < ['string','string',...] >

Launch Varan

Options

The possible options to launch Varan main for block 2 are:

Options	Input	Type	Required
-u --Update	Add this option if you want to update an existing study folder or merge two studies	boolean	One between -u, -e or -r is required
-e --Extract	Add this option if you want to extract samples from an existing study folder	boolean	One between -u, -e or -r is required
-r --Remove	Add this option if you want to remove samples from an existing study folder	boolean	One between -u, -e or -r is required
-p --Path	Add this option to specify the path of the existing study folder to update, or from which to remove/extract samples	string	Yes
-n --NewPath	Add this option to specify the path of the study folder containing updated/new information	string	Only if the -u option is selected
-s --SampleList	Add this option to insert the path of the .txt file containing the list of samples to remove/extract from the study folder	string	Only if the -e or -r option is selected
-N --Name	Add this option if you want to customize the study name (studyID)	string	No
-o --output_folder	Add this option to specify the path where to save the output folder. If not provided, a new version of the existing study will be created	string	Only if the -N option is selected

Examples

To launch Varan Docker version, it is required to mount at least the input and the output folder for correct functioning.

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output varan <commands>

Ex 1) Update a study folder:

Run this command to update a study folder:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output varan -u -p /input/<path_to_old_study_folder> -n /input/<path_to_new_study_folder> -o /output/<path_to_output_folder>

Ex 2) Extract a study with a subset of samples:

Run this command to extract a list of samples from a study folder and create a new study containing only these samples in the output path:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output varan -e -p /input/<path_to_study_folder> -s /input/<path_to_sample_list_file> -o /output/<path_to_output_folder>

Ex 3) Remove samples from a study:

Run this command to remove a list of samples from a study folder and save a new study without them in the output path, assigning a customized study name:

docker run --rm -it -v <input_folder>:/input -v <output_folder>:/output varan -r -p /input/<path_to_study_folder> -s /input/<path_to_sample_list_file> -o /output/<path_to_output_folder> -N <new_studyID_in_meta>

Output

After varan.py runs successfully, the output folder will have the following structure and contents:

study_name
├── case_lists
│   ├── cases_cna.txt
│   ├── cases_sequenced.txt
│   └── cases_sv.txt
├── data_clinical_patient.txt
├── data_clinical_sample.txt
├── data_cna_hg19.seg
├── data_cna_hg19.seg.fc.txt
├── data_cna.txt
├── data_mutations_extended.txt
├── data_sv.txt
├── img
├── MAF_Filtered or MAF_Onco_filtered(*)
├── MAF_OncoKB (**)
├── maf or maf.zip (***)
├── meta_clinical_patient.txt
├── meta_clinical_sample.txt
├── meta_cna_hg19_seg.txt
├── meta_cna.txt
├── meta_mutations_extended.txt
├── meta_study.txt
├── meta_sv.txt
├── report_validate.html
└── report_VARAN.html

(*) Filtered MAF files will be stored in the MAF_Onco_filtered or MAF_Filtered folders, depending on whether the MAFs were annotated with OncoKB.

(**) MAF_OncoKB folder will be created only if the -k option is set and will contain all the MAF with OncoKB annotations.

(***) A folder containing MAF files. It will be zipped if ZIP_MAF is set as True in conf.ini.

Varan Reports

Validate Report

report_validate.html is an automatically generated file that provides a report on the cBioPortal validation, highlighting any errors or warnings encountered during the validation process.

For more info about cBioPortal offline validation see here

Summary Report

report_VARAN.html is an automatically generated file that provides a summary of the study. It is structured into several sections.

Note: While the structure remains consistent, not all fields appear in every report. Their presence depends on the specific action taken (e.g., creation, update, extraction, or removal).

General Information

This section provides essential details about the study such as:

Command line: Full command used to generate or modify the study, including all relevant parameters.
Study identifiers: Names associated with the study, which may include the current study, references to existing studies, or intermediate datasets depending on the workflow.
Cancer type: Influences variant annotation based on cancer-specific context and guidelines.
Sample and patient count: Number of samples and patients included in the resulting dataset.

Detailed Overview

This section provides a breakdown of sample-level changes applied to cases_cna, cases_sequenced and cases_sv.

For each dataset, the report indicates:

Which samples were extracted or removed.
The number and identifiers of affected samples.
The resulting size of each dataset after the operation, if applicable.

Filters & Configuration

This section summarizes the filters and configuration settings applied to the study. Filters control which data are included in the analysis. Configuration parameters define how molecular features like CNA, TMB, MSI, and gene fusions are processed and interpreted.

For create operations, the section shows all filters and configurations explicitly defined for the newly created study.
For update operations, only filters and configurations shared between the two studies are shown here. Any differences, including parameters present in one study but not the other, are listed separately under the section Differences in Filters & Configurations.
For extract and remove operations, this section reports the filters and configuration settings from the original study used as the basis for extraction or removal.

Graphical Overview

This section provides a visual summary of the study history over time. It includes two main plots:

A barplot showing the number of samples and patients across different study versions.
A barplot showing the total number of SNVs, CNVs, and fusions detected in each version.

The number of versions displayed depends on the version of the current study. Up to the last 5 versions are shown to highlight the evolution of the dataset over time.

Annotations

In this section you can find the manual annotations added in the specific section of the conf.ini file.