Glossary

artifact

Stores a dataset or model as a file or folder.

curator
  • Object class designed to ensure your dataset conforms to a desired schema.

  • Helps with validation, standardization (e.g., by fixing typos or mapping synonyms), and annotation (linking it against metadata entities so that it becomes queryable).
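As a rough illustration of validation and standardization, the sketch below uses plain Python rather than LaminDB's own curator classes. The names `ALLOWED`, `SYNONYMS`, `standardize`, and `validate` are hypothetical and exist only for this example.

```python
# Hypothetical controlled vocabulary and synonym map (not part of LaminDB).
ALLOWED = {"T cell", "B cell"}
SYNONYMS = {"t-cell": "T cell", "T cells": "T cell", "b cell": "B cell"}


def standardize(values):
    """Map known synonyms and typos onto their canonical term."""
    return [SYNONYMS.get(v, v) for v in values]


def validate(values):
    """Return the values that are not in the controlled vocabulary."""
    return [v for v in values if v not in ALLOWED]


raw = ["t-cell", "B cell", "T cells", "neuron"]
clean = standardize(raw)    # synonyms replaced by canonical terms
invalid = validate(clean)   # terms that still fail validation
print(clean)
print(invalid)
```

Values that remain in `invalid` after standardization are the ones a person would need to fix or add to the vocabulary before the dataset can be annotated.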

FAIR

FAIR data is data that meets the principles of findability, accessibility, interoperability, and reusability [Wikipedia].

feature

A feature is a property of a measurement [Wikipedia]. It’s equivalent to a variable in statistics and is typically equated with a dimension of a dataset.

LaminDB comes with a Feature registry to organize dataset dimensions and equates them with statistical variables.

instance

Shorthand for “LaminDB instance”, a database that manages metadata for datasets in different storage locations.

label

A label refers to a descriptor or tag that is assigned to something to describe, identify, or categorize it.

lakehouse

A data lakehouse combines the flexibility and cost-effectiveness of a data lake with the data management and ACID transaction support of a data warehouse, enabling both structured and unstructured data analytics in a single framework. Lakehouse frameworks include Databricks’ Delta Lake, Google’s BigLake, Amazon’s Lake Formation, Dremio, Starburst, and others. For further reading, see a blog post from Google, a blog post from AWS, and a glossary entry and a paper from Databricks.

ORM

Object-relational mapper [Wikipedia]. In LaminDB, every subclass of Record (every instance of Registry) is an ORM that corresponds to a SQL table in the underlying metadata database.
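To make the class-to-table correspondence concrete, here is a minimal sketch built only on the standard library (`sqlite3` and `dataclasses`). It is not LaminDB's implementation: the `Label` class and its `create_table`, `save`, and `filter` methods are hypothetical stand-ins for what an ORM provides.

```python
import sqlite3
from dataclasses import astuple, dataclass, fields


@dataclass
class Label:
    """Hypothetical record class: one instance corresponds to one table row."""

    name: str
    description: str

    @classmethod
    def table(cls):
        # The table name is derived from the class name.
        return cls.__name__.lower()

    @classmethod
    def create_table(cls, conn):
        cols = ", ".join(f"{f.name} TEXT" for f in fields(cls))
        conn.execute(f"CREATE TABLE IF NOT EXISTS {cls.table()} ({cols})")

    def save(self, conn):
        placeholders = ", ".join("?" for _ in fields(self))
        conn.execute(
            f"INSERT INTO {self.table()} VALUES ({placeholders})", astuple(self)
        )

    @classmethod
    def filter(cls, conn, **conditions):
        # Column names come from trusted keyword arguments; values are bound.
        where = " AND ".join(f"{k} = ?" for k in conditions) or "1"
        rows = conn.execute(
            f"SELECT * FROM {cls.table()} WHERE {where}", tuple(conditions.values())
        )
        return [cls(*row) for row in rows]


conn = sqlite3.connect(":memory:")
Label.create_table(conn)
Label("treated", "sample received the drug").save(conn)
Label("control", "sample received vehicle only").save(conn)
matches = Label.filter(conn, name="control")
print(matches)
```

The point of the sketch is that callers work with Python objects (`Label(...)`, `Label.filter(...)`) while the SQL statements stay hidden inside the class.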

observation

In statistics (machine learning), an observation refers to a particular measured instance of a set of random variables.

In biology, an observation typically corresponds to measuring (reading out) a set of properties from a biological sample.

record

A record is a data structure that consists of a sequence of typed fields that hold values [Wikipedia].

In LaminDB, a metadata record is modeled as a Record and represents a row in a registry (a table in the SQL database).

It automatically sets up important behaviors and methods (like filtering, querying, and converting records to DataFrames) needed to interact with the metadata database.

sample

In biology, a sample is an instance or part of a biological system.

In statistics (machine learning), a sample is an observation of a set of random variables (features, labels, metadata).

Depending on the observational unit chosen for representing data, the statistical sample might correspond 1:1 to a biological sample. Often, this choice presents an interesting case, as variation across physical samples (targeted in the experimental design) can directly be explained by variation across statistical (digital) samples.
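The relationship between statistical and biological samples can be checked directly on a table of observations. The sketch below assumes a made-up single-cell layout in which each observation is a cell; the field names `cell`, `biosample`, and `gene_a` and all values are hypothetical.

```python
from collections import Counter

# Each dict is one observation (one statistical sample).
# Keys other than the identifiers are variables (features).
observations = [
    {"cell": "c1", "biosample": "biopsy_1", "gene_a": 1.2},
    {"cell": "c2", "biosample": "biopsy_1", "gene_a": 0.7},
    {"cell": "c3", "biosample": "biopsy_2", "gene_a": 2.1},
]

# Count how many observations were read out from each biological sample.
per_biosample = Counter(obs["biosample"] for obs in observations)

# The mapping is 1:1 only if every biological sample yields one observation.
one_to_one = all(count == 1 for count in per_biosample.values())
print(dict(per_biosample), one_to_one)
```

Here the observational unit is the cell, so one biopsy contributes several statistical samples and the mapping is not 1:1. Choosing the biopsy as the observational unit instead (for example, by averaging over cells) would restore the 1:1 correspondence.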

variable

When we say “variable”, we almost always mean “random variable”.

Random variables and their observations are core to statistics [Wikipedia].

An independent variable is sometimes called a feature, “predictor variable”, “regressor”, “covariate”, “explanatory variable”, “risk factor”, “input variable”, among others [Wikipedia].

A dependent variable is sometimes called a “response variable”, “regressand”, “criterion”, “predicted variable”, “measured variable”, “explained variable”, “experimental variable”, “responding variable”, “outcome variable”, “output variable”, “target” or “label”.

schema

Blueprint for your data’s structure. Tool for curating and validating the organization of your data, helping maintain data integrity as it evolves through various processing steps.

registry

A table in a SQL database (SQLite/Postgres) holding records.

transform

A piece of code (script, notebook, pipeline, function) that can be applied to input data to produce output data.
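In its simplest form, a transform is just a function from input data to output data. The sketch below is a generic illustration and not tied to LaminDB's Transform registry; the function name `normalize` is hypothetical.

```python
def normalize(values):
    """Transform: scale a list of non-negative numbers so they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


input_data = [1, 1, 2]
output_data = normalize(input_data)  # input data in, output data out
print(output_data)
```

A script, notebook, or pipeline plays the same role at a larger scale: it reads input datasets and writes output datasets.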

UI

Graphical user interface, for instance, a browser-based data catalog.