Getting Started with Polars

Polars
Author

Adam Cseresznye

Published

August 11, 2023

In the ever-evolving field of data science, effective data manipulation tools are essential. Enter Polars, a Rust-based library garnering attention within the data community. Boasting impressive speed and versatile capabilities, Polars is redefining our data management practices. In this blog, we delve into Polars’ core functions and practical applications, shedding light on how it empowers data professionals to efficiently tackle complex tasks.

For those well-versed in Pandas, Polars offers a blend of familiarity and innovation. Although this document is not meant to replace the official documentation, it should give you a good sense of the capabilities Polars offers. We aim to keep it updated as the library evolves.

Our exploration of Polars is guided by insights from the Polars User Guide, Kevin Heavey’s perspectives in Modern Polars, and Matt Harrison’s engaging tutorial on Polars at PyCon. We kick things off with the dataset Matt Harrison shared: the US Department of Energy’s Automobile Fuel Economy data. Let’s begin this journey!

Setup

Code
import pandas as pd
import polars as pl
import polars.selectors as cs

import seaborn as sns
import matplotlib.pyplot as plt

pl.Config.set_tbl_rows(2)  # limit the numbers of rows printed
polars.config.Config

Polars Version

We are using Polars 0.18.8 for this demonstration.

Code
pl.__version__
'0.18.8'

Given the large size of the dataset and the fact that some columns aren’t really useful for us, we’ll start by picking out the information we actually need. Feel free to use the code below to work with your dataframe.

Code
df = pl.read_csv("vehicles.csv", null_values="NA")

# process and save parquet for later use
(
    df.select(
        pl.col(
            "make",
            "model",
            "drive",
            "cylinders",
            "displ",
            "fuelType",
            "mpgData",
            "city08",
            "highway08",
            "comb08",
            "year",
        )
    )
    .with_columns(
        pl.col("make").cast(pl.Categorical, strict=False),
        pl.col("model").cast(pl.Categorical, strict=False),
        pl.col("drive").cast(pl.Categorical, strict=False),
        pl.col("fuelType").cast(pl.Categorical, strict=False),
        pl.col("cylinders").cast(pl.Int8, strict=False),
        pl.col("displ").cast(pl.Float32, strict=False),
        pl.col("mpgData").map_dict({"N": False, "Y": True}),
        pl.col("city08").cast(pl.Int8, strict=False),
        pl.col("highway08").cast(pl.Int8, strict=False),
        pl.col("comb08").cast(pl.Int8, strict=False),
        pl.col("year").cast(pl.Utf8).str.to_datetime("%Y").cast(pl.Date),
    )
    .write_parquet("vehicles.parquet")
)
df = pl.read_parquet("vehicles.parquet")

Inspect the polars namespace

Taking a closer look at the Polars namespace can be really helpful as you start to get to know the library. By exploring the functions, methods, and attributes available within the namespace, you’ll get a better sense of what the library can do. This exploration sets the groundwork for you to use the library more effectively and make the most of its features.

By issuing print(dir(pl)), we can explore the accessible functions within the namespace.

Code
# available functions

print(dir(pl))
['Array', 'ArrowError', 'Binary', 'Boolean', 'Categorical', 'ChronoFormatWarning', 'ColumnNotFoundError', 'ComputeError', 'Config', 'DATETIME_DTYPES', 'DURATION_DTYPES', 'DataFrame', 'DataType', 'Date', 'Datetime', 'Decimal', 'DuplicateError', 'Duration', 'Expr', 'FLOAT_DTYPES', 'Field', 'Float32', 'Float64', 'INTEGER_DTYPES', 'Int16', 'Int32', 'Int64', 'Int8', 'InvalidOperationError', 'LazyFrame', 'List', 'NUMERIC_DTYPES', 'NoDataError', 'Null', 'Object', 'PolarsDataType', 'PolarsPanicError', 'SQLContext', 'SchemaError', 'SchemaFieldNotFoundError', 'Series', 'ShapeError', 'StringCache', 'Struct', 'StructFieldNotFoundError', 'TEMPORAL_DTYPES', 'Time', 'UInt16', 'UInt32', 'UInt64', 'UInt8', 'Unknown', 'Utf8', '__all__', '__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__register_startup_deps', '__spec__', '__version__', '_reexport', 'align_frames', 'all', 'all_horizontal', 'any', 'any_horizontal', 'api', 'apply', 'approx_unique', 'arange', 'arg_sort_by', 'arg_where', 'avg', 'build_info', 'coalesce', 'col', 'collect_all', 'concat', 'concat_list', 'concat_str', 'config', 'contextlib', 'convert', 'corr', 'count', 'cov', 'cumfold', 'cumreduce', 'cumsum', 'cumsum_horizontal', 'dataframe', 'datatypes', 'date', 'date_range', 'datetime', 'dependencies', 'duration', 'element', 'enable_string_cache', 'exceptions', 'exclude', 'expr', 'first', 'fold', 'format', 'from_arrow', 'from_dataframe', 'from_dict', 'from_dicts', 'from_epoch', 'from_numpy', 'from_pandas', 'from_records', 'from_repr', 'functions', 'get_idx_type', 'get_index_type', 'groups', 'head', 'implode', 'int_range', 'int_ranges', 'io', 'last', 'lazyframe', 'lit', 'map', 'max', 'max_horizontal', 'mean', 'median', 'min', 'min_horizontal', 'n_unique', 'ones', 'os', 'polars', 'quantile', 'read_avro', 'read_csv', 'read_csv_batched', 'read_database', 'read_delta', 'read_excel', 'read_ipc', 'read_ipc_schema', 'read_json', 'read_ndjson', 'read_parquet', 'read_parquet_schema', 'reduce', 'repeat', 'rolling_corr', 'rolling_cov', 'scan_csv', 'scan_delta', 'scan_ds', 'scan_ipc', 'scan_ndjson', 'scan_parquet', 'scan_pyarrow_dataset', 'select', 'selectors', 'series', 'show_versions', 'slice', 'sql', 'sql_expr', 'std', 'string_cache', 'struct', 'sum', 'sum_horizontal', 'tail', 'threadpool_size', 'time', 'time_range', 'toggle_string_cache', 'type_aliases', 'using_string_cache', 'utils', 'var', 'when', 'wrap_df', 'wrap_s', 'zeros']

The same applies to the methods available for DataFrames.

Code
# methods defined on dataframes

print(dir(df))
['__add__', '__annotations__', '__array__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '_accessors', '_cast_all_from_to', '_comp', '_compare_to_non_df', '_compare_to_other_df', '_df', '_div', '_from_arrow', '_from_dict', '_from_dicts', '_from_numpy', '_from_pandas', '_from_pydf', '_from_records', '_ipython_key_completions_', '_pos_idx', '_read_avro', '_read_csv', '_read_ipc', '_read_json', '_read_ndjson', '_read_parquet', '_repr_html_', '_take_with_series', 'apply', 'bottom_k', 'clear', 'clone', 'columns', 'corr', 'describe', 'drop', 'drop_in_place', 'drop_nulls', 'dtypes', 'estimated_size', 'explode', 'extend', 'fill_nan', 'fill_null', 'filter', 'find_idx_by_name', 'fold', 'frame_equal', 'get_column', 'get_columns', 'glimpse', 'groupby', 'groupby_dynamic', 'groupby_rolling', 'hash_rows', 'head', 'height', 'hstack', 'insert_at_idx', 'interpolate', 'is_duplicated', 'is_empty', 'is_unique', 'item', 'iter_rows', 'iter_slices', 'join', 'join_asof', 'lazy', 'limit', 'max', 'mean', 'median', 'melt', 'merge_sorted', 'min', 'n_chunks', 'n_unique', 'null_count', 'partition_by', 'pipe', 'pivot', 'product', 'quantile', 'rechunk', 'rename', 'replace', 'replace_at_idx', 'reverse', 'row', 'rows', 'rows_by_key', 'sample', 'schema', 'select', 'set_sorted', 'shape', 'shift', 'shift_and_fill', 'shrink_to_fit', 'slice', 'sort', 'std', 'sum', 'tail', 'take_every', 'to_arrow', 'to_dict', 'to_dicts', 'to_dummies', 'to_init_repr', 'to_numpy', 'to_pandas', 'to_series', 'to_struct', 'top_k', 'transpose', 'unique', 'unnest', 'unstack', 'update', 'upsample', 'var', 'vstack', 'width', 'with_columns', 'with_row_count', 'write_avro', 'write_csv', 'write_database', 'write_delta', 'write_excel', 'write_ipc', 'write_json', 'write_ndjson', 'write_parquet']

…and the methods available on expressions (the Expr returned by pl.col):

Code
# available expressions

print(dir(pl.col("foo")))
['__abs__', '__add__', '__and__', '__annotations__', '__array_ufunc__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__invert__', '__le__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_from_pyexpr', '_pyexpr', '_repr_html_', '_to_expr', '_to_pyexpr', 'abs', 'add', 'agg_groups', 'alias', 'all', 'and_', 'any', 'append', 'apply', 'approx_unique', 'arccos', 'arccosh', 'arcsin', 'arcsinh', 'arctan', 'arctanh', 'arg_max', 'arg_min', 'arg_sort', 'arg_true', 'arg_unique', 'arr', 'backward_fill', 'bin', 'bottom_k', 'cache', 'cast', 'cat', 'cbrt', 'ceil', 'clip', 'clip_max', 'clip_min', 'cos', 'cosh', 'count', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'cumulative_eval', 'cut', 'degrees', 'diff', 'dot', 'drop_nans', 'drop_nulls', 'dt', 'entropy', 'eq', 'eq_missing', 'ewm_mean', 'ewm_std', 'ewm_var', 'exclude', 'exp', 'explode', 'extend_constant', 'fill_nan', 'fill_null', 'filter', 'first', 'flatten', 'floor', 'floordiv', 'forward_fill', 'from_json', 'ge', 'gt', 'hash', 'head', 'implode', 'inspect', 'interpolate', 'is_between', 'is_duplicated', 'is_finite', 'is_first', 'is_in', 'is_infinite', 'is_nan', 'is_not', 'is_not_nan', 'is_not_null', 'is_null', 'is_unique', 'keep_name', 'kurtosis', 'last', 'le', 'len', 'limit', 'list', 'log', 'log10', 'log1p', 'lower_bound', 'lt', 'map', 'map_alias', 'map_dict', 'max', 'mean', 'median', 'meta', 'min', 'mod', 'mode', 'mul', 'n_unique', 'nan_max', 'nan_min', 'ne', 'ne_missing', 'null_count', 'or_', 'over', 'pct_change', 'pipe', 'pow', 'prefix', 'product', 'qcut', 'quantile', 'radians', 'rank', 'rechunk', 'reinterpret', 'repeat_by', 'reshape', 'reverse', 'rle', 'rle_id', 'rolling_apply', 'rolling_max', 'rolling_mean', 'rolling_median', 'rolling_min', 'rolling_quantile', 'rolling_skew', 'rolling_std', 'rolling_sum', 'rolling_var', 'round', 'sample', 'search_sorted', 'set_sorted', 'shift', 'shift_and_fill', 'shrink_dtype', 'shuffle', 'sign', 'sin', 'sinh', 'skew', 'slice', 'sort', 'sort_by', 'sqrt', 'std', 'str', 'struct', 'sub', 'suffix', 'sum', 'tail', 'take', 'take_every', 'tan', 'tanh', 'to_physical', 'top_k', 'truediv', 'unique', 'unique_counts', 'upper_bound', 'value_counts', 'var', 'where', 'xor']

Access documentation

If you’re familiar with Python, you might recognize that you can access documentation by either using __doc__ or by calling the help() function.

Code
print(df["year"].dt.convert_time_zone.__doc__)

        Convert to given time zone for a Series of type Datetime.

        Parameters
        ----------
        time_zone
            Time zone for the `Datetime` Series.

        Examples
        --------
        >>> from datetime import datetime
        >>> start = datetime(2020, 3, 1)
        >>> stop = datetime(2020, 5, 1)
        >>> date = pl.date_range(start, stop, "1mo", time_zone="UTC", eager=True)
        >>> date
        shape: (3,)
        Series: 'date' [datetime[μs, UTC]]
        [
                2020-03-01 00:00:00 UTC
                2020-04-01 00:00:00 UTC
                2020-05-01 00:00:00 UTC
        ]
        >>> date = date.dt.convert_time_zone("Europe/London").alias("London")
        >>> date
        shape: (3,)
        Series: 'London' [datetime[μs, Europe/London]]
        [
            2020-03-01 00:00:00 GMT
            2020-04-01 01:00:00 BST
            2020-05-01 01:00:00 BST
        ]
        
Code
help(pl.head)
Help on function head in module polars.functions.lazy:

head(column: 'str | Series', n: 'int' = 10) -> 'Expr | Series'
    Get the first `n` rows.
    
    Parameters
    ----------
    column
        Column name or Series.
    n
        Number of rows to return.
    
    Examples
    --------
    >>> df = pl.DataFrame({"a": [1, 8, 3], "b": [4, 5, 2], "c": ["foo", "bar", "foo"]})
    >>> df.select(pl.head("a"))
    shape: (3, 1)
    ┌─────┐
    │ a   │
    │ --- │
    │ i64 │
    ╞═════╡
    │ 1   │
    │ 8   │
    │ 3   │
    └─────┘
    >>> df.select(pl.head("a", 2))
    shape: (2, 1)
    ┌─────┐
    │ a   │
    │ --- │
    │ i64 │
    ╞═════╡
    │ 1   │
    │ 8   │
    └─────┘

Indexing

If you’ve spent some time working with Pandas, you might find it interesting that Polars doesn’t rely on the concept of an index like Pandas does. But no need to fret: there are several other ways to access rows in Polars. Some of the most straightforward options, although I won’t delve too deeply into them here, are the head, tail, and sample methods.
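
For completeness, here’s a minimal sketch of those three methods using the df we loaded earlier (outputs omitted):

Code
df.head(3)    # first three rows
df.tail(3)    # last three rows
df.sample(3)  # three randomly sampled rows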

Use take for Series or Expressions

An interesting approach is Polars’ take method, which is available on expressions and Series and accepts the row indices as a list, Series, or np.ndarray. However, it’s worth noting that if you intend to use it starting from a DataFrame, a preliminary conversion to a Series is necessary.

Code
df.select(pl.col("make")).to_series().take(list(range(0, 5)))
shape: (5,)
make
cat
"Alfa Romeo"
"Subaru"

Using filter

Similar to Pandas, Polars features a filter method that assists in the process of selectively filtering rows based on one or multiple conditions.

Evaluate a single condition

Code
df.filter(pl.col("model") == "Testarossa")
shape: (14, 11)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
cat cat cat i8 f32 cat bool i8 i8 i8 date
"Ferrari" "Testarossa" "Rear-Wheel Dri… 12 4.9 "Regular" false 9 14 11 1985-01-01
"Ferrari" "Testarossa" "Rear-Wheel Dri… 12 4.9 "Premium" false 10 15 11 1992-01-01

Combine multiple conditions

Code
df.filter((pl.col("model") == "Testarossa") & (pl.col("make") == "Ferrari"))
shape: (9, 11)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
cat cat cat i8 f32 cat bool i8 i8 i8 date
"Ferrari" "Testarossa" "Rear-Wheel Dri… 12 4.9 "Regular" false 9 14 11 1985-01-01
"Ferrari" "Testarossa" "Rear-Wheel Dri… 12 4.9 "Premium" false 10 15 11 1992-01-01

Imitate pandas’ index using with_row_count

Even in the absence of a traditional index, you can make use of the with_row_count method in Polars. This clever method comes to the rescue for tasks like indexing and filtering, providing an alternative approach.

Code
(df.with_row_count().filter(pl.col("row_nr") == 5))
shape: (1, 12)
row_nr make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
u32 cat cat cat i8 f32 cat bool i8 i8 i8 date
5 "Subaru" "Loyale" "Front-Wheel Dr… 4 1.8 "Regular" false 21 24 22 1993-01-01

Use is_in

The is_in function can be employed to verify whether elements of a given expression are present within another Series. We can use this to filter our rows.

Code
df.filter(pl.col("cylinders").is_in([i for i in range(6, 8)]))
shape: (14_284, 11)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
cat cat cat i8 f32 cat bool i8 i8 i8 date
"Audi" "100" "Front-Wheel Dr… 6 2.8 "Premium" true 17 22 19 1993-01-01
"Pontiac" "Grand Am" "Front-Wheel Dr… 6 3.3 "Regular" false 18 26 21 1993-01-01

Select columns

Before we dive into the realm of selecting columns, it’s important to introduce the concepts of with_columns and select, while also highlighting the distinctions between these two methods.

Difference between select and with_columns

As evident from the examples below, both select and with_columns can be used for column selection and column assignment. However, there’s a crucial distinction between the two. After performing column operations, the select method drops the unselected columns and retains only the columns that underwent operations. Conversely, the with_columns method appends the new columns to the dataframe, preserving the original ones.

Code
# everything else is dropped

(
    df.select(
        (pl.col("city08") + 10).alias("added"),
        (pl.col("city08") - 10).alias("substracted"),
    )
)
shape: (41_144, 2)
added substracted
i8 i8
29 9
26 6
Code
# columns are kept

(
    df.with_columns(
        (pl.col("city08") + 10).alias("added"),
        (pl.col("city08") - 10).alias("substracted"),
    )
)
shape: (41_144, 13)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year added substracted
cat cat cat i8 f32 cat bool i8 i8 i8 date i8 i8
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… 4 2.0 "Regular" true 19 25 21 1985-01-01 29 9
"Subaru" "Legacy AWD Tur… "4-Wheel or All… 4 2.2 "Premium" false 16 21 18 1993-01-01 26 6

Now that we’ve clarified this point, let’s proceed to explore the fundamental methods for selecting columns.

Select all columns

A particularly handy tool is pl.all(), which refers to all columns of the dataframe in its current state, somewhat like the lambda passed to pd.assign(foo=lambda df: ...) receives the current dataframe in Pandas. This can prove useful, particularly when dealing with operations like groupby and aggregation.

Code
# this is analogous to df.select(pl.col("*")), where * represents the wildcard component
df.select(pl.all())
shape: (41_144, 11)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
cat cat cat i8 f32 cat bool i8 i8 i8 date
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… 4 2.0 "Regular" true 19 25 21 1985-01-01
"Subaru" "Legacy AWD Tur… "4-Wheel or All… 4 2.2 "Premium" false 16 21 18 1993-01-01

Select all columns except for…

Code
df.select(pl.col("*").exclude("year", "comb08", "highway08"))
shape: (41_144, 8)
make model drive cylinders displ fuelType mpgData city08
cat cat cat i8 f32 cat bool i8
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… 4 2.0 "Regular" true 19
"Subaru" "Legacy AWD Tur… "4-Wheel or All… 4 2.2 "Premium" false 16

Select a subset of columns

Code
df.select(pl.col("year", "comb08", "highway08"))
shape: (41_144, 3)
year comb08 highway08
date i8 i8
1985-01-01 21 25
1993-01-01 18 21

Select columns based on regular expression

Code
df.select(pl.col("^.*(mo|ma).*$"))
shape: (41_144, 2)
make model
cat cat
"Alfa Romeo" "Spider Veloce …
"Subaru" "Legacy AWD Tur…

Select columns based on dtypes

This may remind you of pandas’ select_dtypes method.

Code
df.select(pl.col(pl.Int8, pl.Boolean))
shape: (41_144, 5)
cylinders mpgData city08 highway08 comb08
i8 bool i8 i8 i8
4 true 19 25 21
4 false 16 21 18

Using selectors

Selectors offer a more intuitive approach to selecting columns from DataFrame or LazyFrame objects, taking into account factors like column names, data types, and other properties. They consolidate and enhance the functionality that’s accessible through the col() expression. More on them can be found in the Polars documentation.

By dtypes

Code
df.select(cs.integer(), cs.float(), cs.temporal())
shape: (41_144, 6)
cylinders city08 highway08 comb08 displ year
i8 i8 i8 i8 f32 date
4 19 25 21 2.0 1985-01-01
4 16 21 18 2.2 1993-01-01
Code
# all columns except the numeric ones
df.select(cs.all() - cs.numeric())
shape: (41_144, 6)
make model drive fuelType mpgData year
cat cat cat cat bool date
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… "Regular" true 1985-01-01
"Subaru" "Legacy AWD Tur… "4-Wheel or All… "Premium" false 1993-01-01
Code
# same as the one above but it uses the tilde
df.select(~cs.numeric())
shape: (41_144, 6)
make model drive fuelType mpgData year
cat cat cat cat bool date
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… "Regular" true 1985-01-01
"Subaru" "Legacy AWD Tur… "4-Wheel or All… "Premium" false 1993-01-01

By column names

Code
df.select(cs.contains("08"))
shape: (41_144, 3)
city08 highway08 comb08
i8 i8 i8
19 25 21
16 21 18
Code
df.select(cs.starts_with("d"))
shape: (41_144, 2)
drive displ
cat f32
"Rear-Wheel Dri… 2.0
"4-Wheel or All… 2.2

The possibilities here are incredibly vast! I’m pretty sure you’ll find a function that suits your needs just right.

dtype conversions

Understanding and managing data types, as well as converting them, are very important—particularly when dealing with larger datasets. Luckily, with Polars, this process is wonderfully straightforward. Here’s a useful tip: in many cases, the cast method will be your trusty companion for data type conversions. It’s quite versatile and often does the job seamlessly. When it comes to converting dates, things can get a bit more intricate, but rest assured, it’s a hurdle you can easily overcome!

More info on the available dtypes can be found in the Polars documentation.

Code
(
    df.select(
        pl.col("cylinders").cast(pl.Int32),
        pl.col("displ").cast(pl.Int64),
        pl.col("mpgData").cast(pl.Int64),
    )
)
shape: (41_144, 3)
cylinders displ mpgData
i32 i64 i64
4 2 1
4 2 0
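
As for the slightly more intricate date conversions mentioned above, here’s a rough sketch that mirrors the pattern already used in the setup section (cast to Utf8 if needed, parse with a format string, then cast to Date); the tiny dataframe is just for illustration:

Code
# parse a string column into a Date, mirroring the setup section's pattern
pl.DataFrame({"day": ["2023-08-11", "1985-01-01"]}).with_columns(
    pl.col("day").str.to_datetime("%Y-%m-%d").cast(pl.Date).alias("day_as_date")
)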

Working with str

We can examine the available attributes when working with strings. You might notice that it boasts some familiar attributes, like extract, replace, split, and strip, among others.

Code
# attributes of str
print(dir(pl.col("*").str))
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_accessor', '_pyexpr', 'concat', 'contains', 'count_match', 'decode', 'encode', 'ends_with', 'explode', 'extract', 'extract_all', 'json_extract', 'json_path_match', 'lengths', 'ljust', 'lstrip', 'n_chars', 'parse_int', 'replace', 'replace_all', 'rjust', 'rstrip', 'slice', 'split', 'split_exact', 'splitn', 'starts_with', 'strip', 'strptime', 'to_date', 'to_datetime', 'to_decimal', 'to_lowercase', 'to_time', 'to_titlecase', 'to_uppercase', 'zfill']

Basic operations

Code
(
    df.select(
        pl.col("make").cast(pl.Utf8).str.contains("Hyundai|Kia").alias("contains"),
        pl.col("make").cast(pl.Utf8).str.starts_with("Alfa").alias("starts_with"),
        pl.col("make").cast(pl.Utf8).str.ends_with("ari").alias("ends_with"),
    )
)
shape: (41_144, 3)
contains starts_with ends_with
bool bool bool
false true false
false false false

We can extract the numbers from the different car models:

Code
(
    df.select(
        pl.col("model")
        .cast(pl.Utf8)
        .str.extract(r"(\d+)", group_index=1)
        .cast(pl.Int32)
        .alias("extracted_number"),
    )
)
shape: (41_144, 1)
extracted_number
i32
2000
null

As per usual, we can replace values in a given column:

Code
(
    df.select(
        pl.col("make").cast(pl.Utf8).str.replace("r", 0).alias("replaced"),
    )
)
shape: (41_144, 1)
replaced
str
"Alfa Romeo"
"Suba0u"

Changing column names

Altering column names is quite reminiscent of the process in Pandas:

Code
df.rename({"make": "car maker", "model": "car model"})
shape: (41_144, 11)
car maker car model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
cat cat cat i8 f32 cat bool i8 i8 i8 date
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… 4 2.0 "Regular" true 19 25 21 1985-01-01
"Subaru" "Legacy AWD Tur… "4-Wheel or All… 4 2.2 "Premium" false 16 21 18 1993-01-01

In case you would like to alter multiple column names all at once:

Code
df.select(pl.all().map_alias(lambda name: name.upper().replace("I", "0000")))
shape: (41_144, 11)
MAKE MODEL DR0000VE CYL0000NDERS D0000SPL FUELTYPE MPGDATA C0000TY08 H0000GHWAY08 COMB08 YEAR
cat cat cat i8 f32 cat bool i8 i8 i8 date
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… 4 2.0 "Regular" true 19 25 21 1985-01-01
"Subaru" "Legacy AWD Tur… "4-Wheel or All… 4 2.2 "Premium" false 16 21 18 1993-01-01

str.split with expand=True

I often find myself using the str.split method, and it caught me off guard that there isn’t a direct equivalent in Polars. Fingers crossed, we might come across an implementation like expand=True from Pandas, which would be a real game-changer here too!

Code
(
    df.select(
        pl.col("drive")
        .cast(pl.Utf8)
        .str.split_exact("-", n=1)
        .struct.rename_fields(["first_part", "second_part"])
        .alias("fields"),
    ).unnest("fields")
)
shape: (41_144, 2)
first_part second_part
str str
"Rear" "Wheel Drive"
"4" "Wheel or All"

Aggregation

Split-apply-combine operations are where Polars truly excels. This is largely due to its foundation in Rust; lazy evaluation, parallel execution, and efficient memory management also contribute to its speed. As a result, split-apply-combine operations are significantly faster in Polars than with conventional tools.
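
This post sticks to the eager API, but as a minimal sketch, assuming the vehicles.parquet file written in the setup section, the same kind of aggregation can be expressed lazily so the query optimizer and parallel execution can do their work:

Code
# nothing is executed until .collect() is called
lazy_counts = (
    pl.scan_parquet("vehicles.parquet")   # lazily scan the file instead of reading eagerly
    .groupby("make")
    .agg(pl.col("comb08").mean())
    .collect()                            # run the optimized query
)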

A simple split-apply-combine operation

Code
(df.groupby("make").count().sort(by="count", descending=True).limit(5))
shape: (5, 2)
make count
cat u32
"Chevrolet" 4003
"Toyota" 2071

Using agg

You can use agg to calculate statistics over either a single or multiple columns all at once.

Code
(df.groupby("make").agg(pl.count(), pl.col("model")))
shape: (136, 3)
make count model
cat u32 list[cat]
"TVR Engineerin… 4 ["TVR 280i/350i Convertible", "TVR 280i/350i Coupe", … "TVR 280i Coupe"]
"Grumman Allied… 1 ["LLV"]

Here’s an illustration of how to utilize the is_not_nan method. In Polars, NaN is a floating-point value, just as it is in pandas, and it’s important not to mix it up with missing values, which are indicated by null in Polars. Additionally, take note of the use of limit instead of head, a concept quite akin to SQL.

Code
(
    df.groupby("make")
    .agg(
        pl.col("city08").mean().alias("mean_cyl"),
        pl.col("displ").mean().alias("mean_disp"),
    )
    .filter(pl.col("mean_cyl").is_not_nan())
    .sort(by=["mean_cyl", "mean_disp"], descending=[True, True])
    .limit(5)
)
shape: (5, 3)
make mean_cyl mean_disp
cat f64 f64
"Tesla" 94.925373 NaN
"Azure Dynamics… 62.0 NaN

Aggregated columns can be renamed immediately with the alias method.

Code
(
    df.groupby("make", "fuelType")
    .agg(pl.col("comb08").mean().alias("filtered"))
    .filter((pl.col("fuelType") == "CNG") | (pl.col("fuelType") == "Diesel"))
)
shape: (38, 3)
make fuelType filtered
cat cat f64
"Mazda" "Diesel" 29.714286
"Ford" "Diesel" 29.542857

Here’s a scenario for performing aggregation based on the year. In this case, the “year” column holds the year-related data, which can be extracted using the dt.year attribute. Keep in mind that Polars doesn’t have native plotting capabilities like pandas. To visualize the data, as a final step, you can convert the dataframe to pandas and make use of its .plot method.

Code
(
    df.groupby(
        [
            pl.col("year").dt.year().alias("year"),
        ]
    )
    .agg([pl.col("comb08").mean(), pl.col("city08").mean()])
    .sort(
        [
            "year",
        ]
    )
    .to_pandas()
    .set_index("year")
    .plot()
)
<Axes: xlabel='year'>

Here’s another fantastic option for you: the groupby_dynamic feature. It’s really impressive, by the way. You can employ this when grouping based on years or any other time-related information. Additionally, you can make use of the every parameter for resampling, which closely resembles what you can do with pandas.

Code
(
    df.sort(by="year")
    .groupby_dynamic(index_column="year", every="2y")
    .agg([pl.col("comb08").mean(), pl.col("city08").mean()])
    .sort(
        [
            "year",
        ]
    )
    .to_pandas()
    .set_index("year")
    .plot()
)
<Axes: xlabel='year'>

Pivot

You can also accomplish split-apply-combine tasks seamlessly using the pivot function. It’s remarkably straightforward.

Code
(
    df.with_columns(
        pl.col("fuelType").cast(pl.Utf8)
    )  # conversion to str is needed for the next step
    .filter(pl.col("fuelType").is_in(["Regular", "Premium"]))
    .pivot(values="city08", index="make", columns="fuelType", aggregate_function="mean")
)
shape: (126, 3)
make Regular Premium
cat f64 f64
"Alfa Romeo" 17.142857 18.902439
"PAS Inc - GMC" null 14.0

Handling missing data

Getting missing value count per column

The null_count method shows the count of missing values per column of the dataframe. Remember, these are not the same as np.nan values; it’s important to be aware of this distinction. This functionality is akin to the pandas df.isna().sum() operation if you’re familiar with pandas.

Code
df.null_count()
shape: (1, 11)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32
0 0 1189 206 204 0 0 27 0 7 0
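
To make the NaN-versus-null distinction concrete, here’s a small sketch of my own that counts both on the float column displ:

Code
(
    df.select(
        pl.col("displ").is_null().sum().alias("null_count"),  # missing values
        pl.col("displ").is_nan().sum().alias("nan_count"),    # floating-point NaNs
    )
)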

Getting boolean array of missing values

If you ever need to manipulate a series based on its null values, you can also obtain a boolean mask using the is_null function. This can be really handy for targeted data manipulation.

Code
df.select(pl.col("city08").is_null())
shape: (41_144, 1)
city08
bool
false
false

Filling missing values

You have a variety of options at your disposal for filling in missing values. Here’s a list of some of the most common ones, although it’s not an exhaustive rundown.

With literals

Code
(df.with_columns(pl.col("cylinders").fill_null(pl.lit(2))))
shape: (41_144, 11)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
cat cat cat i8 f32 cat bool i8 i8 i8 date
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… 4 2.0 "Regular" true 19 25 21 1985-01-01
"Subaru" "Legacy AWD Tur… "4-Wheel or All… 4 2.2 "Premium" false 16 21 18 1993-01-01
Code
(df.with_columns(pl.col("cylinders").fill_null(strategy="zero")))
shape: (41_144, 11)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
cat cat cat i8 f32 cat bool i8 i8 i8 date
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… 4 2.0 "Regular" true 19 25 21 1985-01-01
"Subaru" "Legacy AWD Tur… "4-Wheel or All… 4 2.2 "Premium" false 16 21 18 1993-01-01
Code
(df.with_columns(pl.col("cylinders").fill_null(strategy="forward")))
shape: (41_144, 11)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
cat cat cat i8 f32 cat bool i8 i8 i8 date
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… 4 2.0 "Regular" true 19 25 21 1985-01-01
"Subaru" "Legacy AWD Tur… "4-Wheel or All… 4 2.2 "Premium" false 16 21 18 1993-01-01

With an expression

This can be very useful for machine learning pipelines. When addressing missing values, you can set them to the mode, median, mean, or any other value that suits your needs.

Code
(df.with_columns(pl.col("cylinders").fill_null(pl.col("cylinders").mode())))
shape: (41_144, 11)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year
cat cat cat i8 f32 cat bool i8 i8 i8 date
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… 4 2.0 "Regular" true 19 25 21 1985-01-01
"Subaru" "Legacy AWD Tur… "4-Wheel or All… 4 2.2 "Premium" false 16 21 18 1993-01-01

Using the over function

The over function in Polars is used to perform window operations, akin to what you’d achieve with window functions in SQL or the transform function in pandas. This function enables you to compute aggregations over a specified window or range of rows defined by a window specification. It’s perfect for tasks like calculating rolling averages, cumulative sums, and other operations that involve working with a specific subset of rows within the dataset.

Let’s begin by filtering the car makes down to the most common brands (the top three by count). Following that, we can utilize the over function to derive several statistics.

Code
top5 = (
    df.select(pl.col("make"))
    .to_series()
    .value_counts()
    .sort(by="counts", descending=True)
    .limit(3)
    .select(pl.col("make"))
    .to_series()
)

(
    df.filter(pl.col("make").is_in(top5)).select(
        "make",
        "model",
        pl.col("city08").mean().over("make").alias("avg_city_mpg_by_make"),
        pl.col("city08").mean().over("model").alias("avg_city_mpg_by_model"),
        pl.col("comb08").mean().over(["make", "model"]).alias("avg_comb_mpg_by_model"),
    )
)
shape: (9_957, 5)
make model avg_city_mpg_by_make avg_city_mpg_by_model avg_comb_mpg_by_model
cat cat f64 f64 f64
"Dodge" "Charger" 15.462253 18.197531 21.320988
"Dodge" "B150/B250 Wago… 15.462253 11.654321 12.925926

Other common operations

In the upcoming section, you’ll discover a mix of handy operations that I’ve ended up using quite often. These don’t exactly fit into the categories we’ve covered earlier, but they’re still quite useful!

Operating on multiple columns and renaming them

You have the flexibility to conduct column operations and then easily rename them. Additionally, you can make use of the prefix and suffix functions to streamline your workflow.

Code
(df.with_columns((cs.numeric() * 2).prefix("changed_")))
shape: (41_144, 16)
make model drive cylinders displ fuelType mpgData city08 highway08 comb08 year changed_cylinders changed_displ changed_city08 changed_highway08 changed_comb08
cat cat cat i8 f32 cat bool i8 i8 i8 date i8 f32 i8 i8 i8
"Alfa Romeo" "Spider Veloce … "Rear-Wheel Dri… 4 2.0 "Regular" true 19 25 21 1985-01-01 8 4.0 38 50 42
"Subaru" "Legacy AWD Tur… "4-Wheel or All… 4 2.2 "Premium" false 16 21 18 1993-01-01 8 4.4 32 42 36

Count unique values

You can also count the number of unique values within a series in Polars, much like you would with pandas’ nunique method. This can be really useful for understanding the distribution and diversity of data within a specific column.

Code
df.select(pl.col("make")).n_unique()
136

value_counts to get the number of occurrences per item

On a single column

Code
(df.select(pl.col("make")).to_series().value_counts(sort=True))
shape: (136, 2)
make counts
cat u32
"Chevrolet" 4003
"Isis Imports L… 1
Code
(df.select(pl.col("make")).groupby("make").count().sort(by="count", descending=True))
shape: (136, 2)
make count
cat u32
"Chevrolet" 4003
"Panoz Auto-Dev… 1

On multiple columns

I noticed that Polars’ value_counts, much like the pandas Series method, operates on a single Series, so there isn’t a direct way to count occurrences across multiple columns. However, I’ve found that combining a groupby operation with a count effectively achieves the same outcome for multiple columns. It’s all about finding alternative approaches that get the job done!

Code
(
    df.select(pl.col("make", "model"))
    .groupby(["make", "model"])
    .count()
    .sort(by="count", descending=True)
)
shape: (4_127, 3)
make model count
cat cat u32
"Ford" "F150 Pickup 2W… 214
"Mercedes-Benz" "400SE" 1

Conditionals / if-else statements

If you’re looking to integrate conditional if-else statements into your Polars chain, you can make use of the when, then, and otherwise functions. However, keep in mind that a series of chained when, then statements should be interpreted as if, elif, … elif, rather than if, if, … if. In other words, the first condition that evaluates to True will be selected. It’s crucial to understand this distinction when crafting your conditions.

Code
(
    df.select(
        pl.col("make"),
        pl.when(pl.col("comb08") < 15)
        .then("bad")
        .when(pl.col("comb08") < 30)
        .then("good")
        .otherwise("very good")
        .alias("fuel_economy"),
    )
)
shape: (41_144, 2)
make fuel_economy
cat str
"Alfa Romeo" "good"
"Subaru" "good"

Alright, folks, that’s it for this introduction to Polars. I know it wasn’t exactly short, but I hope it was informative! I’ll be adding more content to this article in the future, so stay tuned. Until then, happy coding 😎💻🔎

Document version

Version Date Comment
0.1 8/11/2023 Document created