
Analysis flow

Here, we’ll track typical data transformations like subsetting that occur during analysis.

# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./analysis-flow --schema bionty
 initialized lamindb: testuser1/analysis-flow
import lamindb as ln
import bionty as bt
 connected lamindb: testuser1/analysis-flow

Save an initial dataset

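The script analysis-flow-scripts/register_example_file.py registers an example dataset:

import lamindb as ln
import bionty as bt

ln.track()  # track this script run

# an example dataset that has a few cell type, tissue and disease annotations
adata = ln.core.datasets.anndata_with_obs()

# validate and register features
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "cell_type_id": bt.CellType.ontology_id,
        "tissue": bt.Tissue.name,
        "disease": bt.Disease.name,
    },
    organism="human",
)
curate.add_new_from("cell_type")  # register 'my new cell type'
curate.validate()
curate.save_artifact(description="anndata with obs")

ln.finish()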

!python analysis-flow-scripts/register_example_file.py
 connected lamindb: testuser1/analysis-flow
 created Transform('K4wsS5DTYdFp0000'), started new Run('qNH7xFT9...') at 2025-03-10 13:32:08 UTC
 added 4 records with Feature.name for "columns": 'cell_type', 'cell_type_id', 'tissue', 'disease'
 saving validated records of 'cell_type'
 added 3 records from public with CellType.name for "cell_type": 'hepatocyte', 'hematopoietic stem cell', 'T cell'
 added 1 record with CellType.name for "cell_type": 'my new cell type'
 saving validated records of 'var_index'
 added 99 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', ...
 "var_index" is validated against Gene.ensembl_gene_id
 saving validated records of 'tissue'
 added 4 records from public with Tissue.name for "tissue": 'liver', 'kidney', 'brain', 'heart'
 saving validated records of 'disease'
 added 4 records from public with Disease.name for "disease": 'liver lymphoma', 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease'
 "cell_type" is validated against CellType.name
 "cell_type_id" is validated against CellType.ontology_id
 "tissue" is validated against Tissue.name
 "disease" is validated against Disease.name
 finished Run('qNH7xFT9') after 6s at 2025-03-10 13:32:14 UTC
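
The log above distinguishes values found in a public ontology from values that had to be added as new records ('my new cell type'). The following is a minimal plain-Python sketch of that split, not LaminDB's implementation: `PUBLIC_CELL_TYPES` and `split_validated` are hypothetical names, and the set is a tiny hand-written stand-in for a real ontology.

```python
# Illustrative stand-in for ontology-based validation; not LaminDB's implementation.
# PUBLIC_CELL_TYPES is a tiny hand-written excerpt standing in for a public ontology.
PUBLIC_CELL_TYPES = {"hepatocyte", "hematopoietic stem cell", "T cell"}


def split_validated(values, known):
    """Return (validated, non_validated) unique values in first-seen order."""
    validated, non_validated = [], []
    for value in dict.fromkeys(values):  # de-duplicate while keeping order
        (validated if value in known else non_validated).append(value)
    return validated, non_validated


observed = ["T cell", "hepatocyte", "my new cell type", "T cell"]
validated, non_validated = split_validated(observed, PUBLIC_CELL_TYPES)
print("validated:", validated)
print("needs registration:", non_validated)
```

Values in the second list are the ones that would need to be registered explicitly before the whole column can pass validation.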

Open a dataset, subset it, and register the result

Track the current notebook:

ln.track()
 created Transform('eNef4Arw8nNM0000'), started new Run('latJ4jpf...') at 2025-03-10 13:32:16 UTC
 notebook imports: bionty==1.1.2 lamindb==1.2.0
artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = '6rZYePeBeXY11mKc0000'
│   ├── .size = 46992
│   ├── .hash = 'IJORtcQUSS11QBqD-nTD0A'
│   ├── .n_observations = 40
│   ├── .path = 
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-flow/.lamindb/6rZYePeBeXY11mKc0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-03-10 13:32:14
│   └── .transform = 'register_example_file.py'
├── Dataset features/.feature_sets
│   ├── var99                    [bionty.Gene]                                                       
│   │   TSPAN6                      float                                                               
│   │   TNMD                        float                                                               
│   │   DPM1                        float                                                               
│   │   SCYL3                       float                                                               
│   │   FIRRM                       float                                                               
│   │   FGR                         float                                                               
│   │   CFH                         float                                                               
│   │   FUCA2                       float                                                               
│   │   GCLC                        float                                                               
│   │   NFYA                        float                                                               
│   │   STPG1                       float                                                               
│   │   NIPAL3                      float                                                               
│   │   LAS1L                       float                                                               
│   │   ENPP4                       float                                                               
│   │   SEMA3F                      float                                                               
│   │   CFTR                        float                                                               
│   │   ANKIB1                      float                                                               
│   │   CYP51A1                     float                                                               
│   │   KRIT1                       float                                                               
│   │   RAD52                       float                                                               
│   └── obs4                     [Feature]                                                           
│       cell_type                   cat[bionty.CellType]       T cell, hematopoietic stem cell, hepatoc…
│       cell_type_id                cat[bionty.CellType]       T cell, hematopoietic stem cell, hepatoc…
│       disease                     cat[bionty.Disease]        Alzheimer disease, cardiac ventricle dis…
│       tissue                      cat[bionty.Tissue]         brain, heart, kidney, liver
└── Labels
    └── .tissues                    bionty.Tissue              liver, kidney, brain, heart              
        .cell_types                 bionty.CellType            hepatocyte, hematopoietic stem cell, T c…
        .diseases                   bionty.Disease             liver lymphoma, cardiac ventricle disord…

Get a backed AnnData object

adata = artifact.open()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object 6rZYePeBeXY11mKc0000.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

Subset dataset to specific cell types and diseases

cell_types = artifact.cell_types.all().distinct().lookup(return_field="name")
diseases = artifact.diseases.all().distinct().lookup(return_field="name")
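
The lookup objects above expose each record under an attribute-friendly key, which is why 'T cell' is later accessed as `cell_types.t_cell`. A rough sketch of that idea in plain Python follows; `make_lookup` and its key normalization rule are assumptions for illustration and may differ from what LaminDB does internally.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Map names to attribute-friendly keys, e.g. 'T cell' -> 't_cell'."""

    def to_key(name):
        # lowercase, then collapse runs of non-word characters into underscores
        return re.sub(r"\W+", "_", name.strip().lower()).strip("_")

    return SimpleNamespace(**{to_key(name): name for name in names})


lookup = make_lookup(["hepatocyte", "hematopoietic stem cell", "T cell"])
print(lookup.t_cell)
print(lookup.hematopoietic_stem_cell)
```

Going through a lookup rather than typing string literals lets an editor auto-complete the valid values and avoids typos in category names.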

Create the subset:

subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
Name: count, dtype: int64
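
The subset above keeps a row only when both conditions hold: the cell type is in one list and the disease is in another. The same logic can be sketched without pandas; the `obs` rows below are made up for illustration and do not reproduce the example dataset.

```python
from collections import Counter

# Made-up observation table standing in for `adata.obs`: one dict per cell.
obs = (
    [{"cell_type": "T cell", "disease": "chronic kidney disease"}] * 2
    + [{"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"}] * 2
    + [{"cell_type": "hepatocyte", "disease": "liver lymphoma"}] * 2
    + [{"cell_type": "T cell", "disease": "Alzheimer disease"}] * 2
)

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# A row is kept only if BOTH conditions hold, mirroring `isin(...) & isin(...)`.
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count the (cell_type, disease) combinations that survive the filter.
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
print(counts)
```

Hepatocytes with liver lymphoma and T cells with Alzheimer disease each satisfy only one of the two conditions, so they are dropped.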

Register the subsetted AnnData:

curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()
/opt/hostedtoolcache/Python/3.12.9/x64/lib/python3.12/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
 "var_index" is validated against Gene.ensembl_gene_id
 "cell_type" is validated against CellType.name
 "disease" is validated against Disease.name
 "tissue" is validated against Tissue.name
artifact = curate.save_artifact(description="anndata with obs subset")
artifact.describe()
    returning existing schema with same hash: Schema(uid='JTGbFlMb0zuM1iwFoOxS', n=99, dtype='float', itype='bionty.Gene', is_type=False, hash='-frOq7J0bik-J7Ad9DX7HA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, run_id=1, created_at=2025-03-10 13:32:14 UTC)
    returning existing schema with same hash: Schema(uid='wpiIxOOHmJgHCs3RtLqe', n=4, itype='Feature', is_type=False, otype='DataFrame', hash='c1ODB5BNA52JXBD3d-AbRA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, run_id=1, created_at=2025-03-10 13:32:14 UTC)
Artifact .h5ad/AnnData
├── General
│   ├── .uid = '7bpcbn95e7RMs8Ya0000'
│   ├── .size = 38992
│   ├── .hash = 'RgGUx7ndRplZZSmalTAWiw'
│   ├── .n_observations = 20
│   ├── .path = 
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-flow/.lamindb/7bpcbn95e7RMs8Ya0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-03-10 13:32:17
│   └── .transform = 'Analysis flow'
├── Dataset features/.feature_sets
│   ├── var99                    [bionty.Gene]                                                       
│   │   TSPAN6                      float                                                               
│   │   TNMD                        float                                                               
│   │   DPM1                        float                                                               
│   │   SCYL3                       float                                                               
│   │   FIRRM                       float                                                               
│   │   FGR                         float                                                               
│   │   CFH                         float                                                               
│   │   FUCA2                       float                                                               
│   │   GCLC                        float                                                               
│   │   NFYA                        float                                                               
│   │   STPG1                       float                                                               
│   │   NIPAL3                      float                                                               
│   │   LAS1L                       float                                                               
│   │   ENPP4                       float                                                               
│   │   SEMA3F                      float                                                               
│   │   CFTR                        float                                                               
│   │   ANKIB1                      float                                                               
│   │   CYP51A1                     float                                                               
│   │   KRIT1                       float                                                               
│   │   RAD52                       float                                                               
│   └── obs4                     [Feature]                                                           
│       cell_type                   cat[bionty.CellType]       T cell, hematopoietic stem cell
│       disease                     cat[bionty.Disease]        chronic kidney disease, liver lymphoma
│       tissue                      cat[bionty.Tissue]         kidney, liver
│       cell_type_id                cat[bionty.CellType]
└── Labels
    └── .tissues                    bionty.Tissue              liver, kidney                            
        .cell_types                 bionty.CellType            hematopoietic stem cell, T cell          
        .diseases                   bionty.Disease             liver lymphoma, chronic kidney disease   
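
The "returning existing schema with same hash" messages above suggest that a schema is identified by a hash of its content, so the subset reuses the schemas already saved for the full dataset. How LaminDB computes that hash is not shown here; the sketch below only illustrates the general idea of an order-independent content hash. `feature_set_hash` is a hypothetical helper, and the choice of SHA-256 is arbitrary.

```python
import hashlib


def feature_set_hash(feature_names):
    """Order-independent hash of a set of feature names (illustrative only)."""
    # Sorting the unique names makes the result independent of column order.
    canonical = "\n".join(sorted(set(feature_names)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


full = ["cell_type", "cell_type_id", "tissue", "disease"]
reordered = ["disease", "tissue", "cell_type_id", "cell_type"]

# Same members in a different order give the same hash, so the record can be reused.
print(feature_set_hash(full) == feature_set_hash(reordered))
# A different set of features gives a different hash.
print(feature_set_hash(full) == feature_set_hash(["cell_type"]))
```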

Examine data lineage

Query a subsetted .h5ad artifact containing “hematopoietic stem cell” and “T cell”:

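cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[cell_types.hematopoietic_stem_cell, cell_types.t_cell],
).first()
my_subset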
Artifact(uid='7bpcbn95e7RMs8Ya0000', is_latest=True, description='anndata with obs subset', suffix='.h5ad', kind='dataset', otype='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_observations=20, space_id=1, storage_id=1, run_id=2, created_by_id=1, created_at=2025-03-10 13:32:17 UTC)

Common questions that might arise are:

  • What is the history of this artifact?

  • Which features and labels are associated with it?

  • Which notebook analyzed and registered this artifact?

  • By whom?

  • And which artifact is its parent?
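
All of these questions can be answered because each artifact links to the run that created it, and each run records its transform, its user and its input artifacts. A toy model of those links is sketched below; `ToyRun`, `ToyArtifact` and `ancestors` are hypothetical names for illustration and are not LaminDB classes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRun:
    transform: str  # name of the script or notebook that ran
    created_by: str  # user who started the run
    inputs: list = field(default_factory=list)  # artifacts read by the run


@dataclass
class ToyArtifact:
    description: str
    run: ToyRun  # the run that created this artifact


def ancestors(artifact):
    """Yield all upstream artifacts by following run inputs depth-first."""
    for parent in artifact.run.inputs:
        yield parent
        yield from ancestors(parent)


register_run = ToyRun(transform="register_example_file.py", created_by="testuser1")
full = ToyArtifact(description="anndata with obs", run=register_run)

notebook_run = ToyRun(transform="Analysis flow", created_by="testuser1", inputs=[full])
subset = ToyArtifact(description="anndata with obs subset", run=notebook_run)

print(subset.run.transform)  # which notebook saved it
print(subset.run.created_by)  # by whom
print([a.description for a in ancestors(subset)])  # upstream artifacts
```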

Let’s answer these using LaminDB:

print("--> What is the lineage of this artifact?\n")
my_subset.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
print(my_subset.features)
print(my_subset.labels)

print("\n\n--> Which notebook analyzed and saved this artifact?\n")
print(my_subset.transform)

print("\n\n--> Who saved this artifact?\n")
print(my_subset.created_by)

print("\n\n--> Which artifacts were inputs?\n")
display(my_subset.run.input_artifacts.df())
--> What is the lineage of this artifact?
--> Which features and labels are associated with it?
Artifact .h5ad/AnnData
└── Dataset features/.feature_sets
    ├── var99                    [bionty.Gene]
    │   TSPAN6                      float
    │   TNMD                        float
    │   DPM1                        float
    │   SCYL3                       float
    │   FIRRM                       float
    │   FGR                         float
    │   CFH                         float
    │   FUCA2                       float
    │   GCLC                        float
    │   NFYA                        float
    │   STPG1                       float
    │   NIPAL3                      float
    │   LAS1L                       float
    │   ENPP4                       float
    │   SEMA3F                      float
    │   CFTR                        float
    │   ANKIB1                      float
    │   CYP51A1                     float
    │   KRIT1                       float
    │   RAD52                       float
    └── obs4                     [Feature]
        cell_type                   cat[bionty.CellType]       T cell, hematopoietic stem cell
        disease                     cat[bionty.Disease]        chronic kidney disease, liver lymphoma
        tissue                      cat[bionty.Tissue]         kidney, liver
        cell_type_id                cat[bionty.CellType]

Artifact .h5ad/AnnData
└── Labels
    └── .tissues                    bionty.Tissue              liver, kidney
        .cell_types                 bionty.CellType            hematopoietic stem cell, T cell
        .diseases                   bionty.Disease             liver lymphoma, chronic kidney disease
--> Which notebook analyzed and saved this artifact?

Transform(uid='eNef4Arw8nNM0000', is_latest=True, key='analysis-flow.ipynb', description='Analysis flow', type='notebook', space_id=1, created_by_id=1, created_at=2025-03-10 13:32:16 UTC)

--> Who saved this artifact?

User object (1)

--> Which artifacts were inputs?
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
1 6rZYePeBeXY11mKc0000 None anndata with obs .h5ad dataset AnnData 46992 IJORtcQUSS11QBqD-nTD0A None 40 md5 True False 1 1 None None True 1 2025-03-10 13:32:14.899000+00:00 1 None 1
!rm -r ./analysis-flow
!lamin delete --force analysis-flow
 deleting instance testuser1/analysis-flow