Data Types in Records
AirbyteRecords are required to conform to the Airbyte type system. This means that all sources must produce schemas and records within these types, and all destinations must handle records that conform to this type system.
Because Airbyte's interfaces are JSON-based, this type system is realized using JSON schemas. In order to work around some limitations of JSON schemas, we add an additional airbyte_type
parameter to define more narrow types.
This type system does not constrain values. However, destinations may not fully support all values - for example, Avro-based destinations may reject numeric values outside the standard 64-bit representations, or databases may reject timestamps in the BC era.
The types
This table summarizes the available types. See the Specific Types section for explanation of optional parameters.
Airbyte type | JSON Schema | Examples |
---|---|---|
String | {"type": "string"} | "foo bar" |
Boolean | {"type": "boolean"} | true or false |
Date | {"type": "string", "format": "date"} | "2021-01-23" , "2021-01-23 BC" |
Timestamp without timezone | {"type": "string", "format": "date-time", "airbyte_type": "timestamp_without_timezone"} | "2022-11-22T01:23:45" , "2022-11-22T01:23:45.123456 BC" |
Timestamp with timezone | {"type": "string", "format": "date-time"} ; optionally "airbyte_type": "timestamp_with_timezone" | "2022-11-22T01:23:45.123456+05:00" , "2022-11-22T01:23:45Z BC" |
Time without timezone | {"type": "string", "format": "time", "airbyte_type": "time_without_timezone"} | "01:23:45.123456" , "01:23:45" |
Time with timezone | {"type": "string", "format": "time", "airbyte_type": "time_with_timezone"} | "01:23:45.123456+05:00" , "01:23:45Z" |
Integer | {"type": "integer"} or {"type": "number", "airbyte_type": "integer"} | 42 |
Number | {"type": "number"} | 1234.56 |
Array | {"type": "array"} ; optionally items | [1, 2, 3] |
Object | {"type": "object"} ; optionally properties | {"foo": "bar"} |
Union | {"oneOf": [...]} |
Record structure
As a reminder, sources expose a discover
command, which returns a list of AirbyteStreams
, and a read
method, which emits a series of AirbyteRecordMessages
. The type system determines what a valid json_schema
is for an AirbyteStream
, which in turn dictates what messages read
is allowed to emit.
For example, a source could produce this AirbyteStream
(remember that the json_schema
must declare "type": "object"
at the top level):
{
"name": "users",
"json_schema": {
"type": "object",
"properties": {
"username": {
"type": "string"
},
"age": {
"type": "integer"
},
"appointments": {
"type": "array",
"items": {
"type": "string",
"format": "date-time",
"airbyte_type": "timestamp_with_timezone"
}
}
}
}
}
Along with this AirbyteRecordMessage
(observe that the data
field conforms to the json_schema
from the stream):
{
"stream": "users",
"data": {
"username": "someone42",
"age": 84,
"appointments": ["2021-11-22T01:23:45+00:00", "2022-01-22T14:00:00+00:00"]
},
"emitted_at": 1623861660
}
The top-level object
must conform to the type system. This means that all of the fields must also conform to the type system.
Nulls
Many sources cannot guarantee that all fields are present on all records. In these cases, sources should not list them as required
fields, and add that the property can be null in the jsonSchema, e.g. [null, string]
. If a null property is found for a non-nullable schema, a validation error may occur in the platform or the destination may have trouble storing the record.
Unsupported types
Destinations must have handling for all types, but they are free to cast types to a convenient representation. For example, let's say a source discovers a stream with this schema:
{
"type": "object",
"properties": {
"appointments": {
"type": "array",
"items": {
"type": "string",
"format": "date-time",
"airbyte_type": "timestamp_with_timezone"
}
}
}
}
Along with records which contain data that looks like this:
{ "appointments": ["2021-11-22T01:23:45+00:00", "2022-01-22T14:00:00+00:00"] }
The user then connects this source to a destination that cannot natively handle array
fields. The destination connector is free to simply JSON-serialize the array back to a string when pushing data into the end platform. In other words, the destination connector could behave as though the source declared this schema:
{
"type": "object",
"properties": {
"appointments": {
"type": "string"
}
}
}
And emitted this record:
{
"appointments": "[\"2021-11-22T01:23:45+00:00\", \"2022-01-22T14:00:00+00:00\"]"
}
Of course, destinations are free to choose the most convenient/reasonable representation for any given value. JSON serialization is just one possible strategy. For example, many SQL destinations will fall back to a native JSON type (e.g. Postgres' JSONB type, or Snowflake's VARIANT).
Specific types
These sections explain how each specific type should be used.
Boolean
Boolean values are represented as native JSON booleans (i.e. true
or false
, case-sensitive). Note that "truthy" and "falsy" values are not acceptable: "true"
, "false"
, 1
, and 0
are not valid booleans.