
Deep Lake (Deep Learning)

From EverybodyWiki Bios & Wiki









Deep Lake is a system or repository of data resembling a data lake, where the raw data includes images, videos, audio, and other unstructured data.[1] Deep Lake allows conversion of the raw data into a deep learning-native format (NumPy-like arrays, also known as tensors). The data can then be easily manipulated and streamed to a machine learning model training process across the network. As such, Deep Lake retains the same properties as data lakes, such as data version control, SQL queries, ingestion of data with ACID transactions, and visualization of terabyte-scale datasets for analytical workloads. One notable difference of a deep lake is its focus on storing complex unstructured data in a deep learning-native format, and the ability to stream Deep Lake datasets over the network to (a) be queried, (b) be visualized in-browser, or (c) be used with machine learning frameworks such as PyTorch[2], TensorFlow[3] or JAX[4][5], as well as various MLOps tools[6]. Deep Lake can also be used as a vector database for training Large Language Models, as well as for developing artificial intelligence applications with LangChain.

A Deep Lake can be used locally (on an organisation's premises) or "in the cloud" (using cloud services from vendors such as Amazon or Google).

Background[edit]

The term has been used by Assaf Pinhasi[7] and Activeloop[6][8] to refer to an architectural blueprint for managing deep learning data. Deep Lake was released on September 30, 2022[9].

Features[edit]

Deep Lake has the following features:

Dataset version control[edit]

Typical data lakes offer time travel, a linear timeline of the changes made to a dataset. However, linear time travel does not accommodate multiple versions of the same asset, such as multiple annotations from different data annotators (specialists who codify machine learning datasets). In contrast, Deep Lake enables git-like dataset version control, which resolves this issue.
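The difference between linear time travel and git-like branching can be sketched in plain Python. The following is an illustrative toy, not Deep Lake's actual API: with branches, two annotators can hold diverging versions of the same asset side by side, which a single linear history cannot express.

```python
# Toy sketch of git-like dataset versioning (illustrative only, not Deep Lake's API).
class VersionedDataset:
    def __init__(self):
        self.branches = {"main": []}  # branch name -> list of commits
        self.head = "main"

    def commit(self, labels, message):
        """Append a new commit to the current branch."""
        self.branches[self.head].append({"labels": labels, "message": message})

    def checkout(self, branch, create=False):
        """Switch branches; a new branch starts from the current history."""
        if create:
            self.branches[branch] = list(self.branches[self.head])
        self.head = branch

    def latest(self):
        return self.branches[self.head][-1]["labels"]


ds = VersionedDataset()
ds.commit(["cat"], "first annotator's labels")
ds.checkout("annotator-b", create=True)
ds.commit(["dog"], "second annotator's diverging labels")

# Both versions of the same asset coexist on separate branches:
ds.checkout("main")
print(ds.latest())        # ['cat']
ds.checkout("annotator-b")
print(ds.latest())        # ['dog']
```

A linear "time travel" log would force one of these annotations to overwrite the other; branching keeps both reachable.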

In-browser visualization engine[edit]

Deep Lake enables users to visualize their complex datasets alongside the metadata (such as bounding boxes, masks, annotations, etc.).

Rapid queries with Tensor Query language[edit]

Thanks to the underlying data format, Deep Lake enables users to query their datasets with complex queries involving NumPy-like array manipulations as well as standard SQL operations and expressions, including arithmetic and logical composition.

SELECT images [100:500, 100:500], boxes + ARRAY[-100, -100, 0, 0] 
WHERE contains(categories, 'bicycle') and weather == 'raining' 
ORDER BY AOI(boxes, prediction) desc
LIMIT 1000

The above query, for instance, would allow a user to build a dataset of 1,000 images and labels where the weather was “raining” and “bicycles” were captured on the camera.
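The array manipulations in the query can be illustrated with NumPy on hypothetical data (the image size and box values below are assumptions for illustration): `images[100:500, 100:500]` crops a window out of each image, and adding `ARRAY[-100, -100, 0, 0]` shifts each bounding box's top-left corner by the same offset.

```python
import numpy as np

# A hypothetical 600x600 RGB image and one bounding box (x1, y1, x2, y2).
image = np.zeros((600, 600, 3), dtype=np.uint8)
boxes = np.array([[150, 200, 300, 350]])

# images[100:500, 100:500] in the query: crop a 400x400 window.
cropped = image[100:500, 100:500]

# boxes + ARRAY[-100, -100, 0, 0] in the query: shift each box's
# top-left corner by (-100, -100), element-wise over the array.
shifted = boxes + np.array([-100, -100, 0, 0])

print(cropped.shape)    # (400, 400, 3)
print(shifted[0])       # [ 50 100 300 350]
```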

Materialize[edit]

Materialization of a dataset transforms the "virtual view" produced by a query into a deep learning-ready tensorial format. It also enables efficient data streaming from Deep Lake directly to the GPU.

Streaming data loader[edit]

Data loaders are the pieces of software in charge of moving data from storage to the GPUs while training machine learning models. After running a query such as the one above, a user may save the materialized dataset and then stream it to a machine learning model.

An example[10] of streaming for CIFAR-100 could be:

$ pip3 install deeplake  # install the deeplake package

import deeplake
from torchvision import transforms

dataset_path = 'hub://activeloop/cifar100-train'
ds = deeplake.load(dataset_path)  # returns a Deep Lake Dataset; the data is not downloaded locally

tform = transforms.Compose([
    transforms.ToPILImage(),        # convert to a PIL image for the subsequent transforms
    transforms.RandomRotation(20),  # image augmentation
    transforms.ToTensor(),          # convert to a PyTorch tensor
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])

# PyTorch data loader that streams the dataset over the network
dataloader = ds.pytorch(batch_size=16, num_workers=2,
    transform={'images': tform, 'labels': None}, shuffle=True)

Deep Lake Performance against alternatives[edit]

In an independent survey and benchmark of open-source data loaders by Ofeidis et al. (2022)[11], including TorchData, WebDataset[12], Squirrel[13], PyTorch[2], FFCV[14], and Deep Lake's predecessor Activeloop Hub[15], the Deep Lake data loader was shown to be highly performant against its alternatives. Although loading data over the public internet is naturally slower than loading from a local disk, some libraries, such as Deep Lake, showed "remarkable results" (only a 13% increase in time compared to loading from a local disk). Deep Lake also outperformed alternatives for networked loading, whilst FFCV led the pack for multi-GPU training.[11]

Criticism[edit]

An alternative to Deep Lake would be to extend standard formats such as Parquet or Arrow to support deep learning data. This could be beneficial because of the existing analytical tooling around those formats, including Spark and Kafka. However, proponents of Deep Lake argue that such tools are optimized for tabular, time-series, and event-stream processing and break down on complex, unstructured data.[16]

References[edit]

  1. Deep Lake: Data Lake for Deep Learning, Activeloop, 2022-09-30, retrieved 2022-09-30
  2. Paszke, Adam; Gross, Sam; Massa, Francisco; Lerer, Adam; Bradbury, James; Chanan, Gregory; Killeen, Trevor; Lin, Zeming; Gimelshein, Natalia; Antiga, Luca; Desmaison, Alban; Köpf, Andreas; Yang, Edward; DeVito, Zach; Raison, Martin (2019-12-03). "PyTorch: An Imperative Style, High-Performance Deep Learning Library". arXiv:1912.01703 [cs.LG].
  3. Abadi, Martín; Agarwal, Ashish; Barham, Paul; Brevdo, Eugene; Chen, Zhifeng; Citro, Craig; Corrado, Greg S.; Davis, Andy; Dean, Jeffrey; Devin, Matthieu; Ghemawat, Sanjay; Goodfellow, Ian; Harp, Andrew; Irving, Geoffrey; Isard, Michael (2016-03-16). "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems". arXiv:1603.04467 [cs.DC].
  4. JAX: Autograd and XLA, Google, 2022-09-30, retrieved 2022-09-30
  5. "Using JAX to accelerate our research". www.deepmind.com. Retrieved 2022-09-30.
  6. Hambardzumyan, Sasun; Tuli, Abhinav; Ghukasyan, Levon; Rahman, Fariz; Topchyan, Hrant; Isayan, David; Harutyunyan, Mikayel; Hakobyan, Tatevik; Stranic, Ivo; Buniatyan, Davit (2022-09-22). "Deep Lake: a Lakehouse for Deep Learning". arXiv:2209.10785 [cs.DC].
  7. Pinhasi, Assaf (2022-06-13). "Deep Lake — an architectural blueprint for managing Deep Learning data at scale — part I". Medium. Retrieved 2022-09-30.
  8. "Deep Lake". www.deeplake.ai. Retrieved 2022-09-30.
  9. "Releases · activeloopai/deeplake". GitHub. Retrieved 2022-09-30.
  10. "Step 7: Connecting Deep Lake Datasets to ML Frameworks". docs.activeloop.ai. Retrieved 2022-09-30.
  11. Ofeidis, Iason; Kiedanski, Diego; Tassiulas, Leandros (2022-09-27). "An Overview of the Data-Loader Landscape: Comparative Performance Analysis". arXiv:2209.13705 [cs.DC].
  12. The WebDataset Format, webdataset, 2022-09-29, retrieved 2022-09-30
  13. Sohofi, Alireza; Yu, Tiansu; Aribal, Alp; Loetzsch, Winfried; Team, Squirrel Developer; Wollmann, Thomas (2022-09-25), Squirrel: A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way, retrieved 2022-09-30
  14. ffcv ImageNet Training, FFCV, 2022-09-04, retrieved 2022-09-30
  15. "A New Way of Managing Deep Learning Datasets". KDnuggets. Retrieved 2022-09-30.
  16. "From Oracle to Databases for AI: The Evolution of Data Storage". KDnuggets. Retrieved 2022-09-30.


This article "Deep Lake (Deep Learning)" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:Deep Lake (Deep Learning). Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.