
How to Get Started with Apache Spark 4.1

Arkzero Research · Apr 29, 2026 · 6 min read

Last updated Apr 29, 2026

Apache Spark 4.1, released in December 2025, is the second release in the 4.x series and resolved over 1,800 issues from more than 230 contributors. It introduces Spark Declarative Pipelines for defining data pipelines without managing execution graphs, a 1.5 MB pyspark-client for lightweight Spark Connect access, sub-second streaming via Real-Time Mode, and SQL Scripting enabled by default. You can install the Python client in under a minute with pip and start running queries against a local or remote Spark Connect server without configuring a JVM.

What Is Apache Spark 4.1

Apache Spark 4.1.0 is the second release in the 4.x line and shipped in December 2025. The release resolved over 1,800 Jira tickets from more than 230 contributors. It focuses on four areas: declarative data pipelines, lower-latency streaming, easier Python development, and a more complete SQL surface.

This guide walks through installing Spark 4.1, connecting from Python, and using the features most relevant to analysts and data teams in 2025 and 2026.

Prerequisites

You need Python 3.10 or later; Spark 4.1 no longer supports Python 3.9. For lightweight usage with Spark Connect, no Java installation is required on the client machine. Check your version:

python3 --version

If you want to run a full local Spark server, you need Java 17 or later.
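
A small preflight script can check both requirements at once. This is a minimal sketch and not part of Spark: python_ok and java_on_path are hypothetical helper names, and MIN_PYTHON is an assumption you should set to the minimum your Spark release documents.

```python
import shutil
import sys

# Assumption: adjust to the minimum Python version your Spark release documents.
MIN_PYTHON = (3, 10)


def python_ok(min_version=MIN_PYTHON):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= tuple(min_version)


def java_on_path():
    """Return True if a `java` executable is discoverable on PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} ok: {python_ok()}")
    # Java only matters if you plan to run a local Spark server.
    print(f"java on PATH: {java_on_path()}")
```

The Java check only confirms that an executable exists; it does not verify the Java version.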

Installing the Python Client

Spark 4.0 introduced a split between the full PySpark package and a lightweight pyspark-client. The client is 1.5 MB and connects to any Spark Connect server remotely without bundling a JVM runtime.

Install the lightweight client:

pip install pyspark-client==4.1.0

Install the full local runtime if you want to start your own Spark server:

pip install pyspark==4.1.0

For analysts connecting to a shared Spark cluster, Databricks, or a cloud Spark endpoint, pyspark-client is the right choice. You get the same DataFrame and SQL API at a fraction of the install size.

Connecting to a Spark Connect Server

Once installed, connect with a single line:

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

Replace localhost with your cluster hostname. Run a quick test:

df = spark.range(10)
df.show()
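
Remote endpoints usually need more than a hostname. The sketch below assembles a Spark Connect URL in the sc://host:port/;key=value form. connect_url is a hypothetical helper, not part of PySpark, and spark.example.com is a placeholder hostname. 15002 is the default Spark Connect port, and use_ssl is one of the connection parameters the client accepts.

```python
def connect_url(host, port=15002, **params):
    """Build a Spark Connect URL of the form sc://host:port/;key=value;..."""
    url = f"sc://{host}:{port}"
    if params:
        # Connection parameters follow "/;" and are separated by semicolons.
        url += "/;" + ";".join(f"{key}={value}" for key, value in params.items())
    return url


local_url = connect_url("localhost")
remote_url = connect_url("spark.example.com", use_ssl="true")

print(local_url)   # sc://localhost:15002
print(remote_url)  # sc://spark.example.com:15002/;use_ssl=true
```

Pass the resulting string to SparkSession.builder.remote(...) in place of the literal "sc://localhost".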

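For local development, start a Spark Connect server from a Spark 4.1 binary distribution (this machine needs Java):

./sbin/start-connect-server.sh

By default the server listens on port 15002, which is the port sc://localhost connects to.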

Spark Connect uses gRPC under the hood. The 4.1 release adds zstd-compressed protobuf plans and chunked Arrow result streaming, which improves performance on queries returning large result sets compared to earlier 4.x builds.

Spark Declarative Pipelines

The largest new feature in 4.1 is Spark Declarative Pipelines (SDP). Instead of writing step-by-step code that orchestrates reads, transforms, and writes in sequence, SDP lets you declare what each dataset contains. Spark resolves the dependency graph, runs stages in parallel where possible, handles checkpointing, and manages retries on failures automatically.

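Define each dataset with a decorator from pyspark.pipelines. Batch datasets use @dp.materialized_view (@dp.table declares a streaming table), and a dataset depends on another by reading it by name:

from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

# The pipeline runner supplies the active session.
spark = SparkSession.active()

@dp.materialized_view
def raw_sales() -> DataFrame:
    return spark.read.parquet("s3://my-bucket/sales/raw/")

@dp.materialized_view
def cleaned_sales() -> DataFrame:
    # Reading raw_sales by name is what registers the dependency.
    return spark.read.table("raw_sales").filter("amount > 0").dropDuplicates(["order_id"])

@dp.materialized_view
def daily_revenue() -> DataFrame:
    return spark.read.table("cleaned_sales").groupBy("date").sum("amount")

Pipelines are launched with the spark-pipelines command-line tool from a project directory that contains a pipeline spec file, not from inside the Python module:

spark-pipelines run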

Spark tracks lineage across these tables and reruns only the stages affected by upstream changes. For teams running multi-step SQL transformations on Spark, SDP removes the need for a separate orchestrator in many scenarios. It is conceptually similar to dbt models but executes directly on the Spark engine without an external transformation layer.
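
To make "resolves the dependency graph" concrete, here is a plain-Python illustration using the standard library's graphlib. It is not how SDP is implemented. It only shows how declared dependencies are enough to derive an execution order and to spot datasets that can run side by side. refund_rate is a hypothetical fourth dataset added so that one wave contains parallel work.

```python
from graphlib import TopologicalSorter

# Each dataset maps to the set of datasets it reads from.
dependencies = {
    "raw_sales": set(),
    "cleaned_sales": {"raw_sales"},
    "daily_revenue": {"cleaned_sales"},
    "refund_rate": {"raw_sales"},  # hypothetical dataset, for illustration only
}


def execution_waves(graph):
    """Group datasets into waves; everything in one wave could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all datasets whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

Here cleaned_sales and refund_rate land in the same wave because both depend only on raw_sales. This is the kind of parallelism a declarative engine can find without being told.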

Structured Streaming Real-Time Mode

Spark 4.1 adds Real-Time Mode (RTM) to Structured Streaming. Previous versions processed data in micro-batches, which introduced latency of hundreds of milliseconds to several seconds depending on trigger interval. RTM brings continuous stateless processing with single-digit millisecond latency for eligible queries.

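The query below is a stateless Kafka-to-console pipeline of the kind RTM targets. Its trigger(continuous=...) call is PySpark's continuous-processing trigger, which predates 4.1; check the Structured Streaming guide for the RTM-specific trigger in your language: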

query = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
        .selectExpr("CAST(value AS STRING)")
        .writeStream
        .format("console")
        .trigger(continuous="1 second")
        .start()
)

RTM in 4.1 covers stateless Scala workloads. For teams doing event enrichment, log filtering, or format conversion in streaming pipelines, RTM can reduce end-to-end latency by an order of magnitude compared to a 5-second micro-batch trigger. Python support for stateful RTM workloads is planned for a future minor release.
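
The latency gap follows from simple arithmetic. The sketch below estimates how long an event waits before its micro-batch even starts. It assumes events arrive uniformly across the trigger interval and it ignores processing time, so the figures are a lower bound. micro_batch_wait_ms is a hypothetical helper name.

```python
def micro_batch_wait_ms(trigger_interval_ms):
    """Average and worst-case wait before an event's micro-batch starts.

    Assumes uniform arrivals within the interval and zero processing time.
    """
    return {
        "average_ms": trigger_interval_ms / 2,
        "worst_ms": trigger_interval_ms,
    }


wait = micro_batch_wait_ms(5000)
print(wait)  # {'average_ms': 2500.0, 'worst_ms': 5000}
```

With a 5-second trigger, the average event waits about 2.5 seconds before processing starts. That wait is the overhead continuous processing removes.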

VARIANT Type Is Now Generally Available

The VARIANT type stores semi-structured data such as JSON and XML natively in Spark without flattening to a fixed schema upfront. It is generally available in 4.1 and useful when input data is inconsistent: API responses with missing fields, Parquet files with evolving schemas, or CSVs with mixed types.

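Read JSON into a single VARIANT column and extract fields from it:

df = spark.read.option("singleVariantColumn", "payload").json("s3://my-bucket/api-responses/")
customer_id = "variant_get(payload, '$.customer_id', 'string') AS customer_id"
amount = "variant_get(payload, '$.amount', 'double') AS amount"
df.selectExpr(customer_id, amount).show()

The singleVariantColumn option loads each JSON record into one VARIANT column, here named payload. variant_get then extracts a field by JSON path and casts it to the requested type. On Parquet write, Spark 4.1 shreds the VARIANT data into a typed representation. According to Apache benchmark data cited in the release notes, this speeds up subsequent reads by 30 to 50 percent on typical nested-JSON workloads.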
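
The variant_get SQL function takes a VARIANT column, a JSON path, and a target type. try_variant_get has the same arguments but returns NULL instead of failing when a value cannot be cast. When many fields need extracting, the expressions can be generated instead of typed by hand. The sketch below is one way to do that. variant_field_exprs is a hypothetical helper, and the column and field names are illustrative.

```python
def variant_field_exprs(column, fields, tolerant=False):
    """Build SQL expressions that pull typed fields out of a VARIANT column.

    fields maps a top-level field name to a Spark SQL type name.
    tolerant=True uses try_variant_get, which yields NULL on a failed cast.
    """
    function = "try_variant_get" if tolerant else "variant_get"
    return [
        f"{function}({column}, '$.{name}', '{sql_type}') AS {name}"
        for name, sql_type in fields.items()
    ]


exprs = variant_field_exprs("payload", {"customer_id": "string", "amount": "double"})
for expr in exprs:
    print(expr)

# With a DataFrame that has a VARIANT column named payload:
#   df.selectExpr(*exprs).show()
```

The tolerant form suits the inconsistent inputs described above, where a field may hold a string in one record and a number in the next.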

SQL Scripting Is Now On by Default

SQL Scripting, which allows procedural logic including IF, FOR, WHILE, and variable declarations directly in SQL, is enabled by default in Spark 4.1. Analysts who run multi-step transformations through SQL without switching to Python can now use conditional logic and loops in the same SQL session.

BEGIN
  DECLARE total DOUBLE;
  SET total = (SELECT SUM(amount) FROM daily_revenue WHERE date = CURRENT_DATE);
  IF total > 1000000 THEN
    INSERT INTO alerts VALUES ('high_revenue', total, CURRENT_TIMESTAMP);
  END IF;
END;

Execute this directly via spark.sql() in Python or through the Spark JDBC driver.

JDBC Driver for Spark Connect

Spark 4.1 ships a JDBC driver for Spark Connect, which means any BI tool that speaks JDBC (Tableau, DBeaver, or standard JDBC Python libraries) can connect directly to a Spark Connect server. This offers an alternative to the Thrift server setup that Spark 2.x and 3.x users relied on for BI connectivity.

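Connection string (it follows the Spark Connect sc:// scheme; confirm the exact format against the driver documentation for your build):

jdbc:sc://<host>:<port>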

For analytics teams querying Spark from non-Python tools, this is a practical shortcut that avoids maintaining a separate Thrift server process.

PySpark Arrow-Native UDFs

Spark 4.1 adds Arrow-native UDF decorators that process columnar data using PyArrow directly, removing the Pandas conversion step. For UDFs over large datasets, the @arrow_udf decorator hands each batch to your function as a PyArrow array, avoiding the PyArrow-to-Pandas round-trip overhead. The older @udf(useArrow=True) form still calls your function one row at a time.

import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql.functions import arrow_udf

@arrow_udf("double")
def apply_discount(prices: pa.Array) -> pa.Array:
    # Vectorized multiply over the whole Arrow batch.
    return pc.multiply(prices, 0.9)

Benchmarks from the Spark community show 20 to 40 percent reductions in UDF execution time for columnar workloads using Arrow-native decorators versus Pandas-based UDFs in Spark 3.x.

What to Do Next

Install Spark 4.1 with pip install pyspark==4.1.0 or download the binary from the Apache Spark downloads page. The Declarative Pipelines feature requires the full PySpark package. The lightweight pyspark-client covers all other features described here.

If you want to query data connected to a Spark cluster without writing pipeline code, VSLZ lets you connect a Spark endpoint and run natural language queries to get summaries and charts from a single prompt.

FAQ

What is new in Apache Spark 4.1?

Apache Spark 4.1 introduces Spark Declarative Pipelines (SDP) for defining data transformation graphs declaratively, Structured Streaming Real-Time Mode (RTM) for sub-second stateless streaming, general availability of the VARIANT type for semi-structured data, SQL Scripting enabled by default, a JDBC driver for Spark Connect, and Arrow-native PySpark UDF decorators. The release resolved over 1,800 issues from more than 230 contributors.

How do I install Apache Spark 4.1 in Python?

Run pip install pyspark==4.1.0 for the full local runtime, or pip install pyspark-client==4.1.0 for the 1.5 MB lightweight client that connects to a remote Spark Connect server. The lightweight client does not require Java on the client machine and works with the same DataFrame and SQL API.

What are Spark Declarative Pipelines?

Spark Declarative Pipelines (SDP) is a new framework in Spark 4.1 that lets you define datasets and transformation logic using Python decorators. Spark automatically resolves the dependency graph, handles execution ordering, manages parallelism, and retries failed stages. You define what each table contains; Spark handles how to compute and persist it.

What is Spark Connect and how does it work in 4.1?

Spark Connect is a client-server architecture that separates the Spark driver from the client application. In 4.1, the pyspark-client package weighs just 1.5 MB and connects to any Spark Connect server over gRPC. The 4.1 release improves performance with zstd-compressed protobuf plans, chunked Arrow result streaming, and a new JDBC driver that lets BI tools like Tableau connect directly.

What is Structured Streaming Real-Time Mode in Spark 4.1?

Real-Time Mode (RTM) is a new execution mode for Structured Streaming that processes data continuously rather than in micro-batches. For stateless Scala workloads, RTM achieves single-digit millisecond latency. The .trigger(continuous='1 second') setting shown in this guide is PySpark's continuous-processing trigger, which predates 4.1; check the Structured Streaming guide for the RTM trigger in your language. Stateful and Python RTM support are planned for future releases.
