lamindb.Collection¶

Bases: Record, IsVersioned, TracksRun, TracksUpdates

Collections of artifacts.

Collections provide a simple way of versioning collections of artifacts.

Parameters:

artifacts – list[Artifact] A list of artifacts.
key – str A file-path like key, analogous to the key parameter of Artifact and Transform.
description – str | None = None A description.
revises – Collection | None = None An old version of the collection.
run – Run | None = None The run that creates the collection.
meta – Artifact | None = None An artifact that defines metadata for the collection.
reference – str | None = None A simple reference, e.g. an external ID or a URL.
reference_type – str | None = None A way to indicate the type of the simple reference, e.g. "url".

See also

Artifact

Examples

Create a collection from a list of Artifact objects:

>>> collection = ln.Collection([artifact1, artifact2], key="my_project/my_collection")

Create a collection that groups a data & a metadata artifact (e.g., here RxRx: cell imaging):

>>> collection = ln.Collection(data_artifact, key="my_project/my_collection", meta=metadata_artifact)

Attributes¶

property data_artifact: Artifact | None¶

Access to a single data artifact.

If the collection has a single data & metadata artifact, this allows access via:

collection.data_artifact  # first & only element of collection.artifacts
collection.meta_artifact  # metadata

property name: str¶

Name of the collection.

Splits key on / and returns the last element.
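As a minimal sketch of the rule above, the name can be derived from a key with plain string operations. The helper name_from_key is hypothetical and only mirrors the documented behavior; it is not part of lamindb.

```python
def name_from_key(key: str) -> str:
    """Return the last element of a slash-separated key (hypothetical helper)."""
    return key.split("/")[-1]


print(name_from_key("my_project/my_collection"))  # my_collection
```

A key without any / is returned unchanged.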

property ordered_artifacts: QuerySet¶

Ordered QuerySet of .artifacts.

Accessing the many-to-many field collection.artifacts directly gives you non-deterministic order.

The property .ordered_artifacts lets you iterate through the artifacts in order of creation.

property stem_uid: str¶

Universal id characterizing the version family.

The full uid of a record is obtained via concatenating the stem uid and version information:

stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"  # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
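The snippet above is schematic. A runnable sketch of the same idea follows, assuming a base62 alphabet of digits followed by ASCII letters (the alphabet and its order are assumptions) and a hypothetical helper split_uid that recovers the two parts of a uid.

```python
import secrets
import string

# Assumed alphabet: 10 digits + 52 ASCII letters = 62 characters.
BASE62 = string.digits + string.ascii_letters


def random_base62(n_char: int) -> str:
    """Return a random base62 string of length n_char."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


def split_uid(uid: str, n_version_chars: int = 4) -> tuple[str, str]:
    """Split a full uid into (stem_uid, version_uid). Hypothetical helper."""
    return uid[:-n_version_chars], uid[-n_version_chars:]


stem_uid = random_base62(16)  # 16 characters for artifacts and collections
version_uid = "0000"          # first version of the family
uid = f"{stem_uid}{version_uid}"

print(len(uid))  # 20
```

All versions of a family then share the first 16 characters of their uid and differ only in the trailing 4.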

property transform: Transform | None¶: Transform whose run created the collection.

property versions: QuerySet¶

Lists all records of the same version family.

>>> new_artifact = ln.Artifact(df2, revises=artifact).save()
>>> new_artifact.versions.df()

Simple fields¶

uid: str¶: Universal id, valid across DB instances.

key: str¶: Name or path-like key.

description: str | None¶: A description or title.

hash: str | None¶: Hash of collection content.

reference: str | None¶: A reference like URL or external ID.

reference_type: str | None¶: Type of reference, e.g., cellxgene Census collection_id.

meta_artifact: Artifact | None¶

An artifact that stores metadata that indexes a collection.

It has a 1:1 correspondence with an artifact. If needed, you can access the collection from the artifact via a private field: artifact._meta_of_collection.

version: str | None¶

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

is_latest: bool¶: Boolean flag that indicates whether a record is the latest in its version family.

created_at: datetime¶: Time of creation of record.

updated_at: datetime¶: Time of last update to record.

Relational fields¶

space: Space¶: The space in which the record lives.

created_by: User¶: Creator of record.

run: Run | None¶: Run that created the collection.

ulabels: ULabel¶: ULabels sampled in the collection (see Feature).

input_of_runs: Run¶: Runs that use this collection as an input.

artifacts: Artifact¶: Artifacts in collection.

references: Reference¶: Linked references.

projects: Project¶: Linked projects.

Class methods¶

classmethod df(include=None, features=False, limit=100)¶

Convert to pd.DataFrame.

By default, shows all direct fields, except updated_at.

Use the arguments include or features to include other data.

Parameters:

include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.

Return type:

DataFrame

Examples

Include the name of the creator in the DataFrame:

>>> ln.ULabel.df(include="created_by__name")

Include display of features for Artifact:

>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations

Only include select features:

>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])

classmethod filter(*queries, **expressions)¶

Query records.

Parameters:

queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

Returns:

A QuerySet.

See also

Guide: Query & search registries
Django documentation: Queries

Examples

>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
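To illustrate how an expression such as name__startswith="my" reads, here is a toy matcher over plain dictionaries. It is only a sketch of the field__lookup naming convention; the function matches is hypothetical, supports just a handful of lookups, and does not traverse relations the way Django does.

```python
def matches(record: dict, **expressions) -> bool:
    """Check a dict against Django-style field__lookup expressions (toy version)."""
    for expression, expected in expressions.items():
        # "name__startswith" -> field "name", lookup "startswith"
        field, _, lookup = expression.partition("__")
        lookup = lookup or "exact"  # a bare field name means an exact match
        value = record[field]
        if lookup == "exact":
            ok = value == expected
        elif lookup == "startswith":
            ok = value.startswith(expected)
        elif lookup == "contains":
            ok = expected in value
        elif lookup == "in":
            ok = value in expected
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
        if not ok:
            return False
    return True


records = [{"name": "my label"}, {"name": "other label"}]
hits = [r for r in records if matches(r, name__startswith="my")]
print(hits)  # [{'name': 'my label'}]
```

Multiple expressions are combined with a logical AND, which is also how keyword arguments to filter() behave in Django.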

classmethod get(idlike=None, **expressions)¶

Get a single record.

Parameters:

idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.

Return type:

Record

Returns:

A record.

Raises:

lamindb.errors.DoesNotExist – In case no matching record is found.

See also

Guide: Query & search registries
Django documentation: Queries

Examples

>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")

classmethod lookup(field=None, return_field=None)¶

Return an auto-complete object for a field.

Parameters:

field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
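The attribute access in the example above (lookup.adgb_dt for the symbol "ADGB-DT") can be pictured with a namedtuple whose field names are normalized versions of the values. The sketch below assumes a simple normalization (lowercase, non-word characters replaced by underscores); the function make_lookup is hypothetical and the actual normalization rules in lamindb may differ.

```python
import re
from collections import namedtuple


def make_lookup(values: list[str]):
    """Build a namedtuple with one attribute per value (assumed normalization)."""
    # "ADGB-DT" -> "adgb_dt": valid Python identifier for auto-complete
    fields = [re.sub(r"\W", "_", value).lower() for value in values]
    Lookup = namedtuple("Lookup", fields)
    return Lookup(*values)


symbols = ["ADGB-DT", "TP53"]
lookup = make_lookup(symbols)
print(lookup.adgb_dt)  # ADGB-DT

# A dictionary view keeps the original, un-normalized values as keys.
lookup_dict = dict(zip(symbols, lookup))
print(lookup_dict["ADGB-DT"])  # ADGB-DT
```

In this sketch the tuple holds plain strings; the documented lookup() returns whole records unless return_field is set.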

classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶

Search.

Parameters:

string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum number of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A QuerySet of search results, sorted by match score.

See also

filter() lookup()

Examples

>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")

classmethod using(instance)¶

Use a non-default LaminDB instance.

Parameters:

instance (str | None) – An instance identifier of form “account_handle/instance_name”.

Return type:

QuerySet

Examples

>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
            uid    score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0

Methods¶

append(artifact, run=None)¶

Append an artifact to the collection.

This does not modify the original collection in-place, but returns a new version of the original collection with the appended artifact.

Parameters:

artifact (Artifact) – An artifact to add to the collection.
run (Run | None, default: None) – The run that creates the new version of the collection.

Return type:

Collection

Examples:

collection_v1 = ln.Collection(artifact, key="My collection").save()
collection_v2 = collection_v1.append(another_artifact)  # returns a new version of the collection
collection_v2.save()  # save the new version
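The non-mutating behavior described above can be pictured with a toy stand-in. ToyCollection below is hypothetical and unrelated to lamindb's implementation; it only shows that the original object stays unchanged while append returns a new one.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyCollection:
    """Immutable stand-in for a collection (illustration only)."""

    key: str
    artifacts: tuple[str, ...]
    version: int = 1

    def append(self, artifact: str) -> "ToyCollection":
        # Return a copy with the extra artifact and a bumped version;
        # self is frozen and therefore never modified.
        return replace(
            self,
            artifacts=self.artifacts + (artifact,),
            version=self.version + 1,
        )


v1 = ToyCollection(key="My collection", artifacts=("artifact_a",))
v2 = v1.append("artifact_b")

print(v1.artifacts)  # ('artifact_a',)
print(v2.artifacts)  # ('artifact_a', 'artifact_b')
```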

cache(is_run_input=None)¶

Download cloud artifacts in collection to local cache.

Follows syncing logic: only caches outdated artifacts.

Returns paths to locally cached on-disk artifacts.

Parameters:

is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

list[UPath]

delete(permanent=None)¶

Delete collection.

Parameters:

permanent (bool | None, default: None) – Whether to permanently delete the collection record (skips trash).

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.delete()

describe()¶

Describe relations of record.

Return type:

None

Examples

>>> collection.describe()

load(join='outer', is_run_input=None, **kwargs)¶

Stage and load to memory.

Returns an in-memory representation if possible, such as a concatenated DataFrame or AnnData object.

Return type:

Any

mapped(layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)¶

Return a map-style dataset.

Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.

If the AnnData artifacts in your collection are in the cloud, move them into a local cache first via cache().

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
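The indexing scheme described above can be sketched without AnnData or PyTorch. ToyMappedDataset below is a hypothetical stand-in that uses plain lists in place of AnnData arrays; it maps a global integer index to a store and a local row, and returns a dictionary that includes "_store_idx". It is not the MappedCollection implementation.

```python
from bisect import bisect_right
from itertools import accumulate


class ToyMappedDataset:
    """Virtually concatenate several stores behind one integer index."""

    def __init__(self, stores):
        # Each store is a dict with "X" (list of rows) and "obs" (column -> list).
        self.stores = stores
        # Cumulative row counts, e.g. sizes [2, 1] -> offsets [2, 3].
        self.offsets = list(accumulate(len(s["X"]) for s in stores))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # First store whose cumulative count exceeds idx.
        store_idx = bisect_right(self.offsets, idx)
        start = self.offsets[store_idx - 1] if store_idx else 0
        local = idx - start
        store = self.stores[store_idx]
        item = {"X": store["X"][local], "_store_idx": store_idx}
        for key, column in store["obs"].items():
            item[key] = column[local]
        return item


stores = [
    {"X": [[1.0, 0.0], [0.0, 2.0]], "obs": {"cell_type": ["T cell", "B cell"]}},
    {"X": [[3.0, 3.0]], "obs": {"cell_type": ["NK cell"]}},
]
ds = ToyMappedDataset(stores)
print(ds[2])  # {'X': [3.0, 3.0], '_store_idx': 1, 'cell_type': 'NK cell'}
```

Because only offsets are stored, no data is copied when the stores are combined; rows are fetched from the underlying store on access.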

Note

For a guide, see Train a machine learning model on a collection.

This method currently only works for collections of AnnData artifacts.

Parameters:

layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.
obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.
obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.
obs_filter (dict[str, str | list[str]] | None, default: None) – Select only observations with these values for the given obs columns. Should be a dictionary with obs column names as keys and filtering values (a string or a list of strings) as values.
join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.
encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.
unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.
cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.
parallel (bool, default: False) – Enable sampling with multiple processes.
dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.
stream (bool, default: False) – Whether to stream data from the array backend.
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
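As a rough illustration of encode_labels and unknown_label, the sketch below maps categorical labels to integers and sends the unknown label to -1. The function encode_labels and the choice to number categories in sorted order are assumptions made for this example; the actual code assignment in MappedCollection may differ.

```python
def encode_labels(labels, unknown_label=None):
    """Encode labels as integers; the unknown label, if given, becomes -1."""
    # Assumption: categories are numbered in sorted order, starting at 0.
    categories = sorted(set(labels) - {unknown_label})
    code_of = {category: code for code, category in enumerate(categories)}
    if unknown_label is not None:
        code_of[unknown_label] = -1
    return [code_of[label] for label in labels], code_of


codes, code_of = encode_labels(
    ["T cell", "unknown", "B cell", "T cell"], unknown_label="unknown"
)
print(codes)  # [1, -1, 0, 1]
```

Keeping the unknown label at -1 makes it easy to mask such samples out of a loss function downstream.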

Return type:

MappedCollection

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.get(description="my collection")
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)

open(is_run_input=None)¶

Return a cloud-backed pyarrow Dataset.

Works for pyarrow compatible formats.

Return type:

Dataset

Notes

For more info, see tutorial: Slice arrays.

restore()¶

Restore collection record from trash.

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.restore()

save(using=None)¶

Save the collection and underlying artifacts to database & storage.

Parameters:

using (str | None, default: None) – The database to which you want to save.

Return type:

Collection

Examples

>>> collection = ln.Collection(artifact, key="my_project/my_collection")
>>> collection.save()

view_lineage(with_children=True)¶

Graph of data flow.

Return type:

None

Notes

For more info, see use cases: Data lineage.

Examples

>>> collection.view_lineage()
>>> artifact.view_lineage()