Schema and Data Types

The Gantry schema tells Gantry what type of data each field in an application holds.

Data types in Gantry encode the way data should be interpreted, regardless of the format it is in when it is logged. You can view the schema for your application on the Schema tab.

How are data types used in Gantry?

The data type for each field determines how that field can be used in Gantry. For example:

  • How that data is displayed in the data table
  • What alerts, metrics, and projections can be computed on the field

Available data types

The following types are available in Gantry:

Type | Example
Float | {"x": 1.0}
Text | {"x": "Hello, this is a random string"}
Integer | {"x": 2}
Boolean | {"x": True}
Categorical | [{"x": "cat"}, {"x": "dog"}]
ID | {"x": "d68da1f6-b7f9-43d2-8aa4-4ad35a8c7ef7"}
Datetime | {"datetime": 2020-07-10 15:00:00.000}
UnixTime | {"timestamp": 1672937012000}
Array | {"x": ["apple", "cat", "banana", "dog"]}
Array | {"x": [0.12, 0.34, 0.9, 0.23]}
Array | {"x": [1, 2, 3, 4]}
Array | {"x": [True, False, False, True]}
Array | {"x": ["d68da1f6-b7f9-43d2-8aa4-4ad35a8c7ef7", "d68da1f6-b7f9-43d2-8aa4-4ad35a8c7ef7", "d68da1f6-b7f9-43d2-8aa4-4ad35a8c7ef7", "d68da1f6-b7f9-43d2-8aa4-4ad35a8c7ef7"]}
NERTag | {"x": ["I-PER", "B-PER", "O"]}
Image | {"x": "s3://bucket/image.jpg"}
Audio | {"x": "https://storage.googleapis.com/example-presigned-audio.mp3"}
Json | {"frame_1": {"bbox1": {"class": "dog", "bbox": [1,2,3,4]}}}
Embedding | {"x": [1.0, 2.0, ..., 512.0]}
Unknown |

Array types

Gantry has built-in support for one-dimensional Array types, which are logged as Python lists of native Python types: str, float, int, bool. Arrays are inherently high-dimensional, like Text, so Gantry often uses Projections to better understand their content. For example, length can be monitored for all arrays using the vector.length Projection. For numerical arrays, the vector.<min, max, norm> Projections can be computed to observe other scalar properties such as statistics (mean), extrema (min, max), and norms (L1, L2).

Arrays can also be filtered using the "Array contains" filter, which returns any record whose Array field contains an exact or partial match for at least one element. For example, an "Array contains" filter for the substring "is" would return any record with an Array<str> field in which any element contains "is", such as ["this is", "an array", "of strings"]. This is useful for filtering multi-label lists or tag attributes that might be part of an application.
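
As a rough local illustration of the behaviour described above (not Gantry's implementation), the following sketch mimics a few scalar array summaries and the partial-match filter in plain Python. The helper names vector_summary and array_contains are hypothetical and exist only for this example.

```python
import math


def vector_summary(values):
    """Scalar summaries of a non-empty numerical array.

    Hypothetical helper, similar in spirit to the vector.length,
    min, max, and norm Projections.
    """
    return {
        "length": len(values),
        "min": min(values),
        "max": max(values),
        "l2_norm": math.sqrt(sum(v * v for v in values)),
    }


def array_contains(values, needle):
    """True if any string element contains `needle` (exact or partial match)."""
    return any(needle in element for element in values)


summary = vector_summary([3.0, 4.0])
matches = array_contains(["this is", "an array", "of strings"], "is")
```

Here matches is True because the element "this is" contains the substring "is".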

Schema inference

Gantry infers the schema for an application based on the first batch of data that is logged. As more data is logged, new fields that appear are automatically added to the schema. To change the type of a field, the schema needs to be edited.

Supported types and inferencing logic

The following table details the supported types and the logic used to infer each type. Note that for primitive and array primitive types, Gantry relies on the type checking logic built into pyarrow.

Gantry Data Type | Inferencing Logic
Float | If the value is a pyarrow floating point numeric type.
Text | If the value is a pyarrow string type.
Integer | If the value is a pyarrow integer type.
Boolean | If the value is a pyarrow boolean type.
Categorical | If the set of values in the batch consists of strings where the average string has fewer than 2 words in it (that is, the values are mostly one-word strings).
ID | If the value is an instance of a Python uuid object, or is a string representation of a uuid object.
Datetime | If the value is a pandas timestamp object, or a string that can be parsed using pandas.to_datetime.
UnixTime | Time from the Unix Epoch in ms or seconds. Since this is impossible to distinguish from large integers, it must be manually set as UnixTime.
ArrayString | If the value is a pyarrow list with the list's value_type being a pyarrow string.
ArrayFloat | If the value is a pyarrow list with the list's value_type being a pyarrow float.
ArrayInteger | If the value is a pyarrow list with the list's value_type being a pyarrow integer.
ArrayBoolean | If the value is a pyarrow list with the list's value_type being a pyarrow boolean.
ArrayID | If the values contain arrays which consist of values that can be inferred as ID types.
Image | Currently, we assume files stored in s3 are images, so if the value is an s3 url, we infer the column to be an Image. We plan to support other sources in the future.
Audio | Currently, we assume files stored in GCS are audio, so if the value is a GCS presigned url, we infer the column to be Audio. We plan to support other sources in the future.
Json | Currently unsupported for inference; this must be manually set.
Embedding | If every value in the batch is a pyarrow list whose value_type is float, all lists are the same length, and that length is at least 512.
Unknown | If the value does not match any of the other types. This serves as a catch-all default value for inference.
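
To make two of the rules above concrete, here is a minimal sketch of the Categorical and ID checks as they are described in the table. It is an approximation written with the Python standard library, not Gantry's code; the function names looks_like_id and looks_categorical are hypothetical, and counting words by splitting on whitespace is an assumption.

```python
import uuid


def looks_like_id(value):
    """True for uuid.UUID instances or strings that parse as a UUID."""
    if isinstance(value, uuid.UUID):
        return True
    if isinstance(value, str):
        try:
            uuid.UUID(value)
            return True
        except ValueError:
            return False
    return False


def looks_categorical(values):
    """True if all values are strings averaging fewer than 2 words each.

    Assumption: words are counted by splitting on whitespace.
    """
    if not values or not all(isinstance(v, str) for v in values):
        return False
    average_words = sum(len(v.split()) for v in values) / len(values)
    return average_words < 2


id_check = looks_like_id("d68da1f6-b7f9-43d2-8aa4-4ad35a8c7ef7")
categorical_check = looks_categorical(["cat", "dog"])
```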

Mixed columns

If a column contains mixed primitives or mixed array primitives, then the type of the column will be set to Unknown. For more complex string types, such as Image or Audio, the column will only be inferred as that type if the majority of the values are that type. Otherwise, the column will be set to Unknown.
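
The majority rule for complex string types can be sketched as follows. This is only an illustration under stated assumptions: "majority" is taken to mean strictly more than half of the values, an s3 url is detected by the "s3://" prefix, and a GCS url by the "https://storage.googleapis.com/" prefix. The helper names are hypothetical.

```python
from collections import Counter


def classify_string(value):
    """Return "Image", "Audio", or None for a single value (assumed prefixes)."""
    if isinstance(value, str) and value.startswith("s3://"):
        return "Image"
    if isinstance(value, str) and value.startswith("https://storage.googleapis.com/"):
        return "Audio"
    return None


def infer_media_column(values):
    """Infer Image or Audio only if a strict majority of values match."""
    counts = Counter(classify_string(v) for v in values)
    for type_name in ("Image", "Audio"):
        if counts[type_name] > len(values) / 2:
            return type_name
    return "Unknown"


mostly_images = infer_media_column(["s3://bucket/a.jpg", "s3://bucket/b.jpg", "hello"])
mostly_other = infer_media_column(["s3://bucket/a.jpg", "hello", "world"])
```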

Nested schema

Gantry supports nested schema inference. The "." character is reserved to represent a nested field, so any "." characters in a field name are converted to "_".

Here is an example of data with nested fields:

<pre>{
  "integer": {
    "feature.1": 1,
    "feature.2": 2,
    "feature_3": 3
  },
  "categorical": {
    "feature_1": "category.1",
    "feature.2": "category.2"
  },
  "text.feature.1": "Hello, World"
}</pre>

Gantry infers the above schema as follows:

Field | Type
integer.feature_1 | Integer
integer.feature_2 | Integer
integer.feature_3 | Integer
categorical.feature_1 | Categorical
categorical.feature_2 | Categorical
text_feature_1 | Text
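
The naming rule can be illustrated with a short sketch that flattens the example record: nested keys are joined with ".", and any "." inside an individual key is replaced with "_". The function flatten_fields is a hypothetical helper written for this example, not part of Gantry, and it assumes nesting is expressed as Python dicts.

```python
def flatten_fields(record, parent=""):
    """Flatten nested dicts into dotted field names.

    A "." inside a key is replaced with "_" because "." is reserved
    as the separator between nesting levels.
    """
    flat = {}
    for key, value in record.items():
        safe_key = key.replace(".", "_")
        path = f"{parent}.{safe_key}" if parent else safe_key
        if isinstance(value, dict):
            flat.update(flatten_fields(value, path))
        else:
            flat[path] = value
    return flat


record = {
    "integer": {"feature.1": 1, "feature.2": 2, "feature_3": 3},
    "categorical": {"feature_1": "category.1", "feature.2": "category.2"},
    "text.feature.1": "Hello, World",
}
flat = flatten_fields(record)
```

Only field names are rewritten; values such as "category.1" are left untouched.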

Handling bad data

Gantry cleans bad data for primitive types. Support for cleaning more complex types is planned.

Type | Data Cleaning
Integer | If the bad value is a:
  - Float: the value is set to null.
  - Boolean: True is set to 1 and False is set to 0.
  - String: if the value can be parsed as an Integer or Float, it is parsed, and Floats are then cast to Integers. If not, the value is set to null.
Float | If the bad value is a:
  - Integer: the value is cast to Float.
  - Boolean: True is set to 1.0 and False is set to 0.0.
  - String: if the value can be parsed as a Float, it is parsed. Otherwise, it is set to null.
Boolean | If the bad value is a:
  - Integer: if the value is equal to 0, it is set to False. If it is equal to 1, it is set to True.
  - Float: if the value is equal to 0.0, it is set to False. If it is equal to 1.0, it is set to True.
  All other values and strings are set to null.
Text | All values are cast to strings, and null values stay null.
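
The cleaning rules in the table can be approximated in plain Python as shown below. This is a sketch, not Gantry's implementation: the function names are hypothetical, null is represented as None, and casting a parsed Float to an Integer is assumed to truncate toward zero, which the table does not specify.

```python
def clean_integer(value):
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return int(value)
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            try:
                return int(float(value))  # assumption: truncates, "2.7" becomes 2
            except (ValueError, OverflowError):
                return None
    return None  # floats and anything else are set to null


def clean_float(value):
    if isinstance(value, (bool, int, float)):
        return float(value)  # True becomes 1.0, False becomes 0.0
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return None
    return None


def clean_boolean(value):
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)) and value in (0, 1):
        return bool(value)
    return None  # all other values and strings are set to null


def clean_text(value):
    return None if value is None else str(value)
```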

Editing the schema

Sometimes the schema will be inferred incorrectly, or the data type of a field will change. To edit a schema manually:

  1. Use the sidebar to navigate to the ingestion page.

  2. Click the three dots on the right side of the page to change the data type.

Example: data types and schemas

<pre>import gantry

# Log one record whose inputs cover several Gantry data types.
gantry.log_record(
    application="my_app",
    inputs={
        "numerical_feature_1": 1.1,
        "numerical_feature_2": 87,
        "categorical_feature_1": "cat_A",
        "text_feature_1": "cat_A is one of our categories",
        "image_feature_1": "s3://bucket/image.jpg",
        "audio_feature_1": "https://storage.googleapis.com/example-presigned-audio.mp3",
    },
    # ... (remaining arguments omitted)
)</pre>

In this example, there are a few ambiguities in how the data will be handled:

  • Both of the inputs numerical_feature_* are numbers, but they each have different Python data types.
  • The inputs categorical_feature_1 and text_feature_1 are both represented as strings in Python, but the former is meant to be interpreted as a category and the latter as text.
  • All of text_feature_1, image_feature_1, and audio_feature_1 are strings, but the first is plain text, whereas the latter two should be treated specially as Image and Audio.
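
The ambiguity is easy to see by inspecting the native Python types of the example inputs. The snippet below is a small illustration that only uses built-in Python; the variable name python_types is chosen for this example.

```python
inputs = {
    "numerical_feature_1": 1.1,
    "numerical_feature_2": 87,
    "categorical_feature_1": "cat_A",
    "text_feature_1": "cat_A is one of our categories",
    "image_feature_1": "s3://bucket/image.jpg",
    "audio_feature_1": "https://storage.googleapis.com/example-presigned-audio.mp3",
}

# Map each field to the name of its native Python type.
python_types = {field: type(value).__name__ for field, value in inputs.items()}
```

Four of the six fields come out as str, so the Python type alone cannot tell a category from free text, an image reference, or an audio reference.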

Data types are stored in an application’s Schema. The Schema maps fields to Gantry data types. The following is an example schema for the example application above:

Field | Type
numerical_feature_1 | Float
numerical_feature_2 | Integer
categorical_feature_1 | Categorical
text_feature_1 | Text
image_feature_1 | Image
audio_feature_1 | Audio