Descriptive documentation¶
Quickstart¶
The easiest way to create a dataset is to use the
dtoolcore.DataSetCreator
context manager.
The code below creates a frozen dataset without any metadata or data.
>>> from dtoolcore import DataSetCreator
>>> with DataSetCreator("my-awesome-dataset", "/tmp") as ds_creator:
...     uri = ds_creator.uri
Clearly, this dataset is not very interesting. However, we can use it to show
how one can load a dtoolcore.DataSet
instance from a dataset’s URI,
using the dtoolcore.DataSet.from_uri()
method.
>>> from dtoolcore import DataSet
>>> dataset = DataSet.from_uri(uri)
>>> print(dataset.name)
my-awesome-dataset
Creating a dataset¶
A dtool dataset packages data and metadata. Descriptive metadata is stored in a “README” file at the top level of the dataset. It is best practice to use the YAML file format for the README.
The code below creates a variable that holds a string with descriptive metadata in YAML format.
>>> readme_content = "---\ndescription: three text files with animals"
The readme_content
can be added to the dataset when creating a
dtoolcore.DataSetCreator
context manager.
>>> with DataSetCreator("animal-dataset", "/tmp", readme_content) as ds_creator:
...     animal_ds_uri = ds_creator.uri
...     for animal in ["cat", "dog", "parrot"]:
...         handle = animal + ".txt"  # Unix-like relpath
...         fpath = ds_creator.prepare_staging_abspath_promise(handle)
...         with open(fpath, "w") as fh:
...             fh.write(animal)
...
The code above does several things. It stores the URI of the dataset in the
variable animal_ds_uri. It loops over the strings cat, dog, and parrot and
creates a so-called “handle” for each one of them. A handle is a
human-readable label for an item in a dataset. It has to be unique and look like
a Unix-style relpath. The handle is then passed into the
dtoolcore.DataSetCreator.prepare_staging_abspath_promise() method, which
returns the absolute path to a file that needs to be created within the
lifetime of the context manager. If the file is not created, a
dtoolcore.DtoolCoreBrokenStagingPromise exception is raised. The code then
uses these absolute paths to create the files. When the context
manager exits, these files are added to the dataset, the temporary location
where the files were created is deleted, and the dataset is frozen.
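The promise mechanism described above can be illustrated with a rough sketch that uses only the standard library. The StagingArea class, its promise() method and the BrokenPromise exception are hypothetical names invented for this illustration. They are not part of dtoolcore and are not meant to reflect how dtoolcore is implemented.

```python
import os
import shutil
import tempfile


class BrokenPromise(Exception):
    """Hypothetical stand-in for dtoolcore.DtoolCoreBrokenStagingPromise."""


class StagingArea:
    """Toy context manager mimicking the staging-promise idea (not dtoolcore code)."""

    def __init__(self):
        self.collected = {}  # handle -> content, filled when the block exits
        self._promised = {}  # handle -> absolute path handed out to the caller
        self._tmpdir = None

    def __enter__(self):
        self._tmpdir = tempfile.mkdtemp()
        return self

    def promise(self, handle):
        """Return an absolute path at which the caller promises to create a file."""
        abspath = os.path.join(self._tmpdir, *handle.split("/"))
        os.makedirs(os.path.dirname(abspath), exist_ok=True)
        self._promised[handle] = abspath
        return abspath

    def __exit__(self, exc_type, exc_value, traceback):
        try:
            if exc_type is None:
                # First check that every promised file was actually created.
                missing = [h for h, p in self._promised.items()
                           if not os.path.isfile(p)]
                if missing:
                    raise BrokenPromise("no file created for: {}".format(missing))
                # Then collect the staged content.
                for handle, abspath in self._promised.items():
                    with open(abspath, "r") as fh:
                        self.collected[handle] = fh.read()
        finally:
            shutil.rmtree(self._tmpdir)  # the temporary location is always removed
        return False  # never swallow exceptions raised inside the block


with StagingArea() as staging:
    for animal in ["cat", "dog", "parrot"]:
        fpath = staging.promise(animal + ".txt")
        with open(fpath, "w") as fh:
            fh.write(animal)

print(sorted(staging.collected))
```

In this sketch, requesting a path and then leaving the block without writing to it raises BrokenPromise, which mirrors the behaviour described for the real context manager.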
Working with items in a dataset¶
Below, a dataset is loaded from animal_ds_uri.
>>> animal_dataset = DataSet.from_uri(animal_ds_uri)
The readme content can be accessed using the
dtoolcore.DataSet.get_readme_content()
method.
>>> print(animal_dataset.get_readme_content())
---
description: three text files with animals
Items in a dataset are accessed using their identifiers. The item identifiers
can be accessed using the dtoolcore.DataSet.identifiers
property.
>>> for i in animal_dataset.identifiers:
...     print(i)
...
e55aada093b34671ec2f9467fe83f0d3d8c31f30
d25102a700e072b528db79a0f22b3a5ffe5e8f5d
26f0d76fb3c3e34f0c7c8b7c3461b7495761835c
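Identifiers are opaque 40-character hexadecimal strings. As an illustration only, the sketch below assumes that an identifier is the SHA-1 hex digest of the item's handle, which is our understanding of what dtoolcore.utils.generate_identifier() does. Treat this as an assumption rather than a documented guarantee. The function name identifier_from_handle is hypothetical.

```python
import hashlib


def identifier_from_handle(handle):
    """Assumed scheme: SHA-1 hex digest of the UTF-8 encoded handle."""
    return hashlib.sha1(handle.encode("utf-8")).hexdigest()


for handle in ["cat.txt", "parrot.txt", "dog.txt"]:
    print(identifier_from_handle(handle), handle)
```

Under this assumption, the same handle always maps to the same identifier, regardless of the item's content.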
To view information about each item one can use the
dtoolcore.DataSet.item_properties() method, which returns a dictionary with
the item's hash, size_in_bytes, and relpath (also known as “handle”).
>>> for i in animal_dataset.identifiers:
...     item_props = animal_dataset.item_properties(i)
...     info_str = "{hash} {size_in_bytes} {relpath}".format(**item_props)
...     print(info_str)
...
d077f244def8a70e5ea758bd8352fcd8 3 cat.txt
68238cd792d215bdfdddc7bbb6d10db4 6 parrot.txt
06d80eb0c50b49a509b49f2424e8c805 3 dog.txt
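The hash values in the listing above have the length of MD5 hex digests. Assuming, for illustration, that hash is an MD5 digest of the item's content and that size_in_bytes is the length of that content (the hashing scheme may differ between storage backends), the two properties could be recomputed with the standard library alone. The helper name content_properties is hypothetical.

```python
import hashlib


def content_properties(content):
    """Return an assumed (MD5 hex digest, size in bytes) pair for some text content."""
    data = content.encode("utf-8")
    return hashlib.md5(data).hexdigest(), len(data)


for animal in ["cat", "parrot", "dog"]:
    digest, size = content_properties(animal)
    print("{} {} {}.txt".format(digest, size, animal))
```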
To get the content of an item one can use the
dtoolcore.DataSet.item_content_abspath()
method. The method guarantees
that the content of the item will be available at the absolute path it returns. This is
important when working with datasets stored in the cloud, for example
in an AWS S3 bucket, where an item may first have to be fetched to local disk.
>>> for i in animal_dataset.identifiers:
...     fpath = animal_dataset.item_content_abspath(i)
...     with open(fpath, "r") as fh:
...         print(fh.read())
...
cat
parrot
dog
Annotating a dataset with key/value pairs¶
The descriptive metadata in the readme is not ideally suited for programmatic access. If one needs to interact with metadata programmatically, it is much easier to do so using so-called “annotations”. These are key/value pairs that can be added to a dataset.
In the code below the dtoolcore.DataSet.put_annotation()
method is used
to add the key/value pair “category”/“pets” to the dataset.
>>> animal_dataset.put_annotation("category", "pets")
The dtoolcore.DataSet.get_annotation()
method can then be used to access the
value of the “category” annotation.
>>> print(animal_dataset.get_annotation("category"))
pets
It is also possible to add an annotation to a dataset inside a
dtoolcore.DataSetCreator
context manager using the
dtoolcore.DataSetCreator.put_annotation()
method.
Working with item metadata¶
It is also possible to add per-item metadata. This is useful, for example, if one wants to access only a subset of the items in a dataset. Below is a dictionary that can be used to look up the family of each animal in a set.
>>> family = {
...     "tiger": "felidae",
...     "lion": "felidae",
...     "horse": "equidae"
... }
The code below creates a new dataset and adds the “family” of the animal
as a piece of metadata to each item using the
dtoolcore.DataSetCreator.add_item_metadata()
method.
>>> with DataSetCreator("animal-2-dataset", "/tmp") as ds_creator:
...     animal2_ds_uri = ds_creator.uri
...     for animal in ["tiger", "lion", "horse"]:
...         handle = animal + ".txt"  # Unix-like relpath
...         fpath = ds_creator.prepare_staging_abspath_promise(handle)
...         with open(fpath, "w") as fh:
...             fh.write(animal)
...         ds_creator.add_item_metadata(handle, "family", family[animal])
...
Per-item metadata is stored in so-called “overlays”. It is
possible to get back the content of an overlay using the
dtoolcore.DataSet.get_overlay()
method.
>>> animal2_dataset = DataSet.from_uri(animal2_ds_uri)
>>> family_overlay = animal2_dataset.get_overlay("family")
The family_overlay is a Python dictionary, where the keys correspond to the
item identifiers.
>>> for key, value in family_overlay.items():
...     print("{} {}".format(key, value))
...
85b263904920cc18caa5630e4124f4311847e6b8 felidae
433635d53dae167009941349491abf7aae9becbd felidae
f480009aa5a5c43d09f40f39df7a5a3ec5f42237 equidae
The code below uses this per-item metadata to process only the cats (“felidae”).
>>> for i in animal2_dataset.identifiers:
...     if family_overlay[i] != "felidae":
...         continue
...     fpath = animal2_dataset.item_content_abspath(i)
...     with open(fpath, "r") as fh:
...         print(fh.read())
...
lion
tiger
To add an overlay to an existing dataset one can use the dtoolcore.DataSet.put_overlay() method. This takes as input a dictionary with one entry per item, keyed by item identifier.
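A minimal sketch of building such a dictionary follows. Rather than a live dataset, it uses plain Python data copied from the animal-dataset listings earlier in this document, assuming that the identifier listing and the item properties listing appear in the same order. The names relpath_by_identifier, family_by_animal and build_overlay are hypothetical. The commented-out last line shows where dtoolcore.DataSet.put_overlay() would be called.

```python
import posixpath

# Identifier -> relpath pairs, taken from the animal-dataset listings above
# (assumes both listings iterate over the identifiers in the same order).
relpath_by_identifier = {
    "e55aada093b34671ec2f9467fe83f0d3d8c31f30": "cat.txt",
    "d25102a700e072b528db79a0f22b3a5ffe5e8f5d": "parrot.txt",
    "26f0d76fb3c3e34f0c7c8b7c3461b7495761835c": "dog.txt",
}

# Hypothetical lookup table used only for this illustration.
family_by_animal = {"cat": "felidae", "dog": "canidae", "parrot": "psittacidae"}


def build_overlay(relpaths, lookup):
    """Return a dict with one entry per item, keyed by item identifier."""
    overlay = {}
    for identifier, relpath in relpaths.items():
        # Strip the directory part and the extension to recover the animal name.
        animal, _ext = posixpath.splitext(posixpath.basename(relpath))
        overlay[identifier] = lookup[animal]
    return overlay


new_overlay = build_overlay(relpath_by_identifier, family_by_animal)
for identifier, value in new_overlay.items():
    print(identifier, value)

# With a loaded dataset one would then call:
# animal_dataset.put_overlay("family", new_overlay)
```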
Creating a derived dataset¶
In data processing it can be useful to track the provenance of the input. This
is most easily done by making use of the dtoolcore.DerivedDataSetCreator
context manager.
>>> from dtoolcore import DerivedDataSetCreator
Suppose we wanted to transform the animals from the animal_dataset into the
sounds that they make. Let’s create a dictionary to help us do this.
>>> animal_sounds = {
...     "dog": "bark",
...     "cat": "meow",
...     "parrot": "squak"
... }
...
The code below creates a dataset derived from the animal_dataset.
>>> with DerivedDataSetCreator("animal-sounds-dataset", "/tmp", animal_dataset) as ds_creator:
...     animal_sounds_ds_uri = ds_creator.uri
...     for i in animal_dataset.identifiers:
...         input_abspath = animal_dataset.item_content_abspath(i)
...         with open(input_abspath, "r") as fh:
...             animal = fh.read()
...         handle = animal_dataset.item_properties(i)["relpath"]
...         output_abspath = ds_creator.prepare_staging_abspath_promise(handle)
...         with open(output_abspath, "w") as fh:
...             fh.write(animal_sounds[animal])
...
The derived dataset has now been created and it can be loaded using the
dtoolcore.DataSet.from_uri
method.
>>> animal_sounds_dataset = DataSet.from_uri(animal_sounds_ds_uri)
The derived dataset has been automatically annotated with source_dataset_name,
source_dataset_uuid, and source_dataset_uri.
>>> print(animal_sounds_dataset.get_annotation("source_dataset_name"))
animal-dataset
The code example below looks at the data in the animal-sounds-dataset dataset.
>>> for i in animal_sounds_dataset.identifiers:
...     handle = animal_sounds_dataset.item_properties(i)["relpath"]
...     fpath = animal_sounds_dataset.item_content_abspath(i)
...     with open(fpath, "r") as fh:
...         content = fh.read()
...     print("{} - {}".format(handle, content))
...
cat.txt - meow
parrot.txt - squak
dog.txt - bark
Tagging a dataset¶
It is possible to add “tags” to datasets and protodatasets. Tags are labels that can be used to organise datasets into groups. A tag is basically a short piece of text describing a dataset. It is possible to label a dataset with several tags.
In the code below we label the animal_sounds_dataset with the tags
“animal” and “sound”.
>>> animal_sounds_dataset.put_tag("animal")
>>> animal_sounds_dataset.put_tag("sound")
The code below iterates over all the tags in the dataset and prints them.
>>> for tag in animal_sounds_dataset.list_tags():
...     print(tag)
...
animal
sound
It is possible to delete a tag.
>>> animal_sounds_dataset.delete_tag("sound")
If the tag does not exist, the call above does nothing and does not raise an exception.