Archived documentation version rendered and hosted by DevNetExpertTraining.com

InfluxDB schema design and data layout

This page documents an earlier version of InfluxDB. InfluxDB v2.1 is the latest stable version.

Each InfluxDB use case is unique and your schema reflects that uniqueness. In general, a schema designed for querying leads to simpler and more performant queries. We recommend the following design guidelines for most use cases:

Where to store data (tag or field)

Your queries should guide what data you store in tags and what you store in fields:

  • Store commonly-queried and grouping (group() or GROUP BY) metadata in tags.
  • Store data in fields if each data point contains a different value.
  • Store numeric values as fields (tag values only support string values).
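As a rough illustration of these guidelines, the following Python sketch assembles a line protocol point with string metadata in tags and a numeric reading in a field. The helper name to_line_protocol is hypothetical, and the sketch assumes that keys and values contain no spaces, commas, or equals signs (so no escaping is needed) and that every field value is a float.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line protocol point (hypothetical helper, no escaping)."""
    # Tags hold string metadata that queries filter or group by.
    # Sorting the tag keys keeps the output deterministic.
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # Fields hold the values that differ from point to point.
    field_part = ",".join(f"{k}={v}" for k, v in fields.items())
    head = f"{measurement},{tag_part}" if tag_part else measurement
    return f"{head} {field_part} {timestamp_ns}"


point = to_line_protocol(
    "weather_sensor",
    tags={"crop": "blueberries", "plot": "1", "region": "north"},
    fields={"temp": 50.1},
    timestamp_ns=1472515200000000000,
)
print(point)
```

Note that the plot number is written as the string "1" because it is stored as a tag value, while temp stays numeric because it is stored as a field.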

Avoid too many series

InfluxDB indexes the following data elements to speed up reads:

  • Measurement
  • Tags

Tag values are indexed and field values are not. This means that querying by tags is more performant than querying by fields. However, when too many indexes are created, both writes and reads may start to slow down.

Each unique set of indexed data elements forms a series key. Tags containing highly variable information like unique IDs, hashes, and random strings lead to a large number of series, also known as high series cardinality. High series cardinality is a primary driver of high memory usage for many database workloads. Therefore, to reduce memory consumption, consider storing high-cardinality values in field values rather than in tags or field keys.
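To make the effect of a high-cardinality tag concrete, here is a minimal Python sketch that counts series for two hypothetical layouts of the same data. It assumes a series is identified by the measurement plus the tag set and ignores field keys for simplicity; the measurement, tag, and field names are made up for illustration.

```python
def series_cardinality(points):
    """Count unique (measurement, tag set) combinations (simplified model)."""
    return len({(m, frozenset(tags.items())) for m, tags, _fields in points})


# 1,000 requests spread over 2 hosts; every request has a unique ID.
id_as_tag = [
    ("requests", {"host": f"h{i % 2}", "request_id": f"req-{i}"}, {"ms": 12.0})
    for i in range(1000)
]
id_as_field = [
    ("requests", {"host": f"h{i % 2}"}, {"ms": 12.0, "request_id": f"req-{i}"})
    for i in range(1000)
]

print(series_cardinality(id_as_tag))    # one series per request
print(series_cardinality(id_as_field))  # one series per host
```

In this model, moving the unique ID from a tag to a field reduces the series count from 1,000 to 2.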

If reads and writes to InfluxDB start to slow down, you may have high series cardinality (too many series). See how to find and reduce high series cardinality.

Use the following conventions when naming your tag and field keys:

Avoid reserved keywords in tag and field keys

Although it is not required, avoiding reserved keywords in your tag and field keys simplifies writing queries because you won’t have to wrap your keys in double quotes. See InfluxQL and Flux keywords to avoid.

Also, if a tag or field key contains characters other than ASCII letters, digits, and underscores ([A-Za-z0-9_]), or begins with a digit, you must wrap it in double quotes in InfluxQL or use bracket notation in Flux.
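The character rule can be checked before data is written. The sketch below is one way to do that in Python, assuming the InfluxQL rule for unquoted identifiers (start with an ASCII letter or underscore, then ASCII letters, digits, or underscores). The function name needs_quoting is hypothetical, and the check does not cover reserved keywords.

```python
import re

# Unquoted identifier: ASCII letter or underscore first,
# then ASCII letters, digits, or underscores.
_UNQUOTED_KEY = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def needs_quoting(key):
    """Return True if the key would need double quotes in InfluxQL."""
    return _UNQUOTED_KEY.fullmatch(key) is None


for key in ["temp", "_internal", "1st_reading", "blueberries.plot-1.north.temp"]:
    print(key, needs_quoting(key))
```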

Avoid the same name for a tag and a field

Avoid using the same name for a tag and field key. This often results in unexpected behavior when querying data.

If you inadvertently add the same name for a tag and a field, see Frequently asked questions for information about how to query the data predictably and how to fix the issue.

Avoid encoding data in measurements and keys

Store data in tag values or field values, not in tag keys, field keys, or measurements. If you design your schema to store data in tag and field values, your queries will be easier to write and more efficient.

In addition, you’ll keep cardinality low by not creating measurements and keys as you write data. To learn more about the performance impact of high series cardinality, see how to find and reduce high series cardinality.

Compare schemas

Compare the following valid schemas represented by line protocol.

Recommended: the following schema stores metadata in separate crop, plot, and region tags. The temp field contains variable numeric data.

Good Measurements schema - Data encoded in tags (recommended)
-------------
weather_sensor,crop=blueberries,plot=1,region=north temp=50.1 1472515200000000000
weather_sensor,crop=blueberries,plot=2,region=midwest temp=49.8 1472515200000000000

Not recommended: the following schema stores multiple attributes (crop, plot and region) concatenated (blueberries.plot-1.north) within the measurement, similar to Graphite metrics.

Bad Measurements schema - Data encoded in the measurement (not recommended)
-------------
blueberries.plot-1.north temp=50.1 1472515200000000000
blueberries.plot-2.midwest temp=49.8 1472515200000000000

Not recommended: the following schema stores multiple attributes (crop, plot and region) concatenated (blueberries.plot-1.north) within the field key.

Bad Keys schema - Data encoded in field keys (not recommended)
-------------
weather_sensor blueberries.plot-1.north.temp=50.1 1472515200000000000
weather_sensor blueberries.plot-2.midwest.temp=49.8 1472515200000000000
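If you already have data in the Bad Measurements layout, the encoded attributes can be moved into tags before writing. The following Python sketch shows one possible conversion; the function name is hypothetical, and it assumes every measurement has exactly the form crop.plot-N.region with no escaped characters.

```python
def measurement_to_tags(line, measurement="weather_sensor"):
    """Rewrite 'crop.plot-N.region <fields> <timestamp>' using separate tags."""
    encoded, fields, timestamp = line.split(" ")
    crop, plot, region = encoded.split(".")
    # "plot-1" -> "1": keep only the plot number as the tag value.
    plot_number = plot.split("-", 1)[1]
    return (
        f"{measurement},crop={crop},plot={plot_number},region={region} "
        f"{fields} {timestamp}"
    )


bad = "blueberries.plot-1.north temp=50.1 1472515200000000000"
print(measurement_to_tags(bad))
```

The output matches the first line of the Good Measurements schema.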

Compare queries

Compare the following queries of the Good Measurements and Bad Measurements schemas. The Flux queries calculate the average temp for blueberries in the north region.

Easy to query: Good Measurements data is easily filtered by region tag values, as in the following example.

// Query *Good Measurements*, data stored in separate tag values (recommended)
from(bucket: "<database>/<retention_policy>")
  |> range(start:2016-08-30T00:00:00Z)
  |> filter(fn: (r) =>  r._measurement == "weather_sensor" and r.region == "north" and r._field == "temp")
  |> mean()

Difficult to query: Bad Measurements requires regular expressions to extract plot and region from the measurement, as in the following example.

// Query *Bad Measurements*, data encoded in the measurement (not recommended)
from(bucket: "<database>/<retention_policy>")
  |> range(start:2016-08-30T00:00:00Z)
  |> filter(fn: (r) =>  r._measurement =~ /\.north$/ and r._field == "temp")
  |> mean()

Complex measurements make some queries impossible. For example, calculating the average temperature of both plots is not possible with the Bad Measurements schema.

InfluxQL example to query schemas
# Query *Bad Measurements*, data encoded in the measurement (not recommended)
> SELECT mean("temp") FROM /\.north$/

# Query *Good Measurements*, data stored in separate tag values (recommended)
> SELECT mean("temp") FROM "weather_sensor" WHERE "region" = 'north'

Avoid putting more than one piece of information in one tag

Splitting a single tag with multiple pieces into separate tags simplifies your queries and improves performance by reducing the need for regular expressions.

Consider the following schema represented by line protocol.

Example line protocol schemas

Schema 1 - Multiple data encoded in a single tag
-------------
weather_sensor,crop=blueberries,location=plot-1.north temp=50.1 1472515200000000000
weather_sensor,crop=blueberries,location=plot-2.midwest temp=49.8 1472515200000000000

The Schema 1 data encodes two separate parameters, the plot and the region, in a single long tag value (plot-1.north). Compare this to the following schema represented in line protocol.

Schema 2 - Data encoded in multiple tags
-------------
weather_sensor,crop=blueberries,plot=1,region=north temp=50.1 1472515200000000000
weather_sensor,crop=blueberries,plot=2,region=midwest temp=49.8 1472515200000000000

Use Flux or InfluxQL to calculate the average temp for blueberries in the north region. Schema 2 is preferable because, with multiple tags, you don’t need a regular expression.

Flux example to query schemas

// Schema 1 -  Query for multiple data encoded in a single tag
from(bucket:"<database>/<retention_policy>")
  |> range(start:2016-08-30T00:00:00Z)
  |> filter(fn: (r) =>  r._measurement == "weather_sensor" and r.location =~ /\.north$/ and r._field == "temp")
  |> mean()

// Schema 2 - Query for data encoded in multiple tags
from(bucket:"<database>/<retention_policy>")
  |> range(start:2016-08-30T00:00:00Z)
  |> filter(fn: (r) =>  r._measurement == "weather_sensor" and r.region == "north" and r._field == "temp")
  |> mean()

InfluxQL example to query schemas

# Schema 1 - Query for multiple data encoded in a single tag
> SELECT mean("temp") FROM "weather_sensor" WHERE location =~ /\.north$/

# Schema 2 - Query for data encoded in multiple tags
> SELECT mean("temp") FROM "weather_sensor" WHERE region = 'north'

Shard group duration management

Shard group duration overview

InfluxDB stores data in shard groups. Shard groups are organized by retention policy (RP) and store data with timestamps that fall within a specific time interval called the shard group duration.

If no shard group duration is provided, the shard group duration is determined by the RP duration at the time the RP is created. The default values are:

RP Duration                  Shard Group Duration
< 2 days                     1 hour
>= 2 days and <= 6 months    1 day
> 6 months                   7 days

The shard group duration is also configurable per RP. To configure the shard group duration, see Retention Policy Management.
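The default durations listed in the table above can be expressed as a small lookup. This Python sketch is only an approximation of that table, not InfluxDB's own code: it assumes "6 months" means 180 days and treats an infinite RP (passed as None) like any RP longer than six months. The function name is hypothetical.

```python
from datetime import timedelta

SIX_MONTHS = timedelta(days=180)  # assumption: "6 months" taken as 180 days


def default_shard_group_duration(rp_duration):
    """Default shard group duration per the table; None means an infinite RP."""
    if rp_duration is None or rp_duration > SIX_MONTHS:
        return timedelta(days=7)
    if rp_duration >= timedelta(days=2):
        return timedelta(days=1)
    return timedelta(hours=1)


print(default_shard_group_duration(timedelta(days=1)))    # RP shorter than 2 days
print(default_shard_group_duration(timedelta(days=30)))   # RP between 2 days and 6 months
print(default_shard_group_duration(timedelta(days=365)))  # RP longer than 6 months
```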

Shard group duration tradeoffs

Determining the optimal shard group duration requires finding the balance between:

  • Better overall performance with longer shards
  • Flexibility provided by shorter shards

Long shard group duration

Longer shard group durations let InfluxDB store more data in the same logical location. This reduces data duplication, improves compression efficiency, and improves query speed in some cases.

Short shard group duration

Shorter shard group durations allow the system to drop data and record incremental backups more efficiently. When InfluxDB enforces an RP, it drops entire shard groups, not individual data points, even if the points are older than the RP duration. A shard group is removed only once its end time is older than the RP duration.

For example, if your RP has a duration of one day and uses the default shard group duration of one hour, InfluxDB drops an hour’s worth of data every hour and always has 25 shard groups: one for each hour in the day, plus an extra shard group that is partially expired but isn’t removed until the whole shard group is older than 24 hours.
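The arithmetic behind this example can be sketched in Python. The function name is hypothetical, and the result is a steady-state estimate that assumes a one-hour shard group duration for the one-day RP: the groups needed to cover the RP duration, plus one group that is only partially expired.

```python
import math
from datetime import timedelta


def steady_state_shard_groups(rp_duration, shard_group_duration):
    """Estimate how many shard groups exist at once for a finite RP."""
    # Groups needed to cover the whole retention period ...
    covering = math.ceil(rp_duration / shard_group_duration)
    # ... plus one older group that is kept until it has fully expired.
    return covering + 1


print(steady_state_shard_groups(timedelta(days=1), timedelta(hours=1)))
```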

Note: A special use case to consider: filtering queries on schema data (such as tags, series, measurements) by time. For example, if you want to filter schema data within a one hour interval, you must set the shard group duration to 1h. For more information, see filter schema data by time.

Shard group duration recommendations

The default shard group durations work well for most cases. However, high-throughput or long-running instances will benefit from using longer shard group durations. Here are some recommendations for longer shard group durations:

RP Duration                 Shard Group Duration
<= 1 day                    6 hours
> 1 day and <= 7 days       1 day
> 7 days and <= 3 months    7 days
> 3 months                  30 days
infinite                    52 weeks or longer

Note: INF (infinite) is not a valid shard group duration. In extreme cases where data covers decades and will never be deleted, a long shard group duration like 1040w (20 years) is perfectly valid.

Other factors to consider before setting shard group duration:

  • Shard groups should be twice as long as the longest time range of the most frequent queries.
  • Each shard group should contain more than 100,000 points.
  • Each shard group should contain more than 1,000 points per series.
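These rules of thumb can be checked against a candidate duration. The Python sketch below is a rough estimate that assumes a constant write rate spread evenly across all series; the function and parameter names are hypothetical.

```python
from datetime import timedelta


def check_shard_group_duration(duration, longest_common_query,
                               points_per_second, series_count):
    """Return warnings for the three rules of thumb; empty means all pass."""
    warnings = []
    if duration < 2 * longest_common_query:
        warnings.append("shorter than twice the longest common query range")
    # Estimated points written during one shard group's time interval.
    points_per_group = points_per_second * duration.total_seconds()
    if points_per_group <= 100_000:
        warnings.append("100,000 or fewer points per shard group")
    if points_per_group / series_count <= 1_000:
        warnings.append("1,000 or fewer points per series")
    return warnings


# Hypothetical workload: 10 points per second across 100 series,
# with frequent queries covering at most 6 hours.
workload = dict(longest_common_query=timedelta(hours=6),
                points_per_second=10, series_count=100)
print(check_shard_group_duration(timedelta(hours=1), **workload))
print(check_shard_group_duration(timedelta(days=1), **workload))
```

For this made-up workload, a one-hour duration fails all three checks, while a one-day duration passes them.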

Shard group duration for backfilling

Bulk insertion of historical data covering a large time range in the past will trigger the creation of a large number of shards at once. The concurrent access and overhead of writing to hundreds or thousands of shards can quickly lead to slow performance and memory exhaustion.

When writing historical data, we highly recommend temporarily setting a longer shard group duration so fewer shards are created. Typically, a shard group duration of 52 weeks works well for backfilling.
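To see why a longer duration helps, the Python sketch below estimates how many shard groups a backfill touches and builds an InfluxQL ALTER RETENTION POLICY statement that sets the shard group duration. The database name mydb and both function names are hypothetical, the RP name autogen is assumed to be the policy in use, and the count is a rough estimate that ignores how shard group boundaries are aligned.

```python
import math
from datetime import timedelta


def shard_groups_for_backfill(time_range, shard_group_duration):
    """Rough number of shard groups needed to cover a backfill time range."""
    return math.ceil(time_range / shard_group_duration)


def alter_shard_duration_statement(database, retention_policy, shard_duration):
    """Build an InfluxQL statement that changes an RP's shard group duration."""
    return (
        f'ALTER RETENTION POLICY "{retention_policy}" ON "{database}" '
        f"SHARD DURATION {shard_duration}"
    )


ten_years = timedelta(days=3650)
print(shard_groups_for_backfill(ten_years, timedelta(days=1)))
print(shard_groups_for_backfill(ten_years, timedelta(weeks=52)))
print(alter_shard_duration_statement("mydb", "autogen", "52w"))
```

Under these assumptions, ten years of history spans thousands of one-day shard groups but only about a dozen 52-week shard groups.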

