Glossary¶
- artifact¶
Stores a dataset or model as a file or folder.
- curator¶
Object class designed to ensure your dataset conforms to a desired schema.
Helps with validation, standardization (e.g., by fixing typos or mapping synonyms), and annotation (linking it against metadata entities so that it becomes queryable).
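The three curator tasks can be illustrated with a minimal, library-free sketch. It is not LaminDB's API: the names `SCHEMA`, `SYNONYMS`, `standardize`, and `validate` are hypothetical and only show how standardization (mapping synonyms) and validation (checking against allowed values) relate.

```python
# Hypothetical, library-free sketch of what a curator does; not LaminDB's API.
SCHEMA = {"cell_type": {"T cell", "B cell"}}  # allowed values per column
SYNONYMS = {"t-cell": "T cell", "b cells": "B cell"}  # known synonym mappings


def standardize(value: str) -> str:
    """Map a known synonym to its canonical label; leave other values unchanged."""
    cleaned = value.strip()
    return SYNONYMS.get(cleaned.lower(), cleaned)


def validate(rows: list[dict]) -> list[tuple[int, str, str]]:
    """Return (row index, column, value) for every value outside the schema."""
    problems = []
    for i, row in enumerate(rows):
        for column, allowed in SCHEMA.items():
            if row.get(column) not in allowed:
                problems.append((i, column, row.get(column)))
    return problems


rows = [{"cell_type": "t-cell"}, {"cell_type": "B cell"}, {"cell_type": "neuron"}]
curated = [{k: standardize(v) for k, v in row.items()} for row in rows]
issues = validate(curated)  # values that still fail after standardization
```

Standardizing before validating means that only values with no known mapping are reported as problems.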
- FAIR¶
FAIR data is data that meets the principles of findability, accessibility, interoperability, and reusability [Wikipedia].
- feature¶
A feature is a property of a measurement [Wikipedia]. It’s equivalent to a variable in statistics and is typically equated with a dimension of a dataset.
LaminDB comes with a Feature registry to organize dataset dimensions and equates them with statistical variables.
- instance¶
Shorthand for “LaminDB instance”, a database that manages metadata for datasets in different storage locations.
- label¶
A label refers to a descriptor or tag that is assigned to something to describe, identify, or categorize it.
- lakehouse¶
A data lakehouse combines the flexibility and cost-effectiveness of a data lake with the data management and ACID transaction support of a data warehouse, enabling both structured and unstructured data analytics in a single framework. Lakehouse frameworks include Databricks’ Delta Lake, Google’s BigLake, Amazon’s Lake Formation, Dremio, Starburst and others. For background, see a blog post from Google, a blog post from AWS, and a glossary entry and a paper from Databricks.
- ORM¶
Object-relational mapper. In LaminDB, every sub-class of Record (every instance of Registry) is an ORM that corresponds to a SQL table in the underlying metadata database [Wikipedia].
- observation¶
In statistics (machine learning), an observation refers to a particular measured instance of a set of random variables.
In biology, an observation typically corresponds to measuring (reading out) a set of properties from a biological sample.
- record¶
A record is a data structure that consists of a sequence of typed fields that hold values [Wikipedia].
In LaminDB, a metadata record is modeled as a Record and represents a row in a registry (a table in the SQL database). The Record class automatically sets up important behaviors and methods (like filtering, querying, and converting records to DataFrames) needed to interact with the metadata database.
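The relationship between a record and its table can be sketched with Python's standard `sqlite3` module. This is a generic illustration and does not use LaminDB; the table name `label` and its columns are hypothetical.

```python
import sqlite3

# Hypothetical table and column names; standard library only, not LaminDB.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by field name
conn.execute("CREATE TABLE label (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO label (name) VALUES (?)", [("control",), ("treated",)])

# Filtering the table returns the matching record (one row).
record = conn.execute(
    "SELECT id, name FROM label WHERE name = ?", ("treated",)
).fetchone()
fields = dict(record)  # the record's fields and the values they hold
conn.close()
```

An ORM wraps this pattern so that each table is exposed as a class and each row as an object of that class.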
- sample¶
In biology, a sample is an instance or part of a biological system.
In statistics (machine learning), a sample is an observation of a set of random variables (features, labels, metadata).
Depending on the observational unit chosen for representing data, the statistical sample might correspond 1:1 to a biological sample. Often, this choice presents an interesting case, as variation across physical samples, which is targeted in the experimental design, can be directly explained by variation across statistical (digital) samples.
- variable¶
We almost always mean “random variable” when we say “variable”.
Random variables and their observations are core to statistics [Wikipedia].
An independent variable is sometimes called a feature, “predictor variable”, “regressor”, “covariate”, “explanatory variable”, “risk factor”, “input variable”, among others [Wikipedia].
A dependent variable is sometimes called a “response variable”, “regressand”, “criterion”, “predicted variable”, “measured variable”, “explained variable”, “experimental variable”, “responding variable”, “outcome variable”, “output variable”, “target” or “label”.
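As a small illustration of these terms, the sketch below splits a handful of observations into independent variables (features) and a dependent variable (label). The measurements and the names `gene_a`, `gene_b`, and `condition` are made up for the example.

```python
# Made-up measurements: each dict is one observation, each key one variable.
observations = [
    {"gene_a": 1.2, "gene_b": 0.4, "condition": "control"},
    {"gene_a": 3.1, "gene_b": 0.9, "condition": "treated"},
    {"gene_a": 2.8, "gene_b": 1.1, "condition": "treated"},
]

feature_names = ["gene_a", "gene_b"]  # independent variables (features)
target_name = "condition"  # dependent variable (label)

# One row per observation, one column per feature.
X = [[obs[name] for name in feature_names] for obs in observations]
# One label per observation.
y = [obs[target_name] for obs in observations]
```

Here `X` has one dimension per feature, which matches the glossary's equating of features with dataset dimensions.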
- schema¶
Blueprint for your data’s structure. Tool for curating and validating the organization of your data, helping maintain data integrity as it evolves through various processing steps.
- registry¶
A table in a SQL database (SQLite/Postgres) holding records.
- transform¶
A piece of code (script, notebook, pipeline, function) that can be applied to input data to produce output data.
- UI¶
Graphical user interface, for instance, a browser-based data catalog.