For years I shipped the same compromise: one large Apache Beam pipeline crammed into a single Python file, packaged as a Dataflow Flex Template. Every stage of the pipeline lived in one module, because that was the only arrangement I could reliably get to run on the workers. I tried custom containers more than once to break it apart, and they failed every time. The job would start, sit there, and fail with "Problem communicating with workers," so I would give up and go back to the one big file.
What I never tried was building a custom launcher alongside the custom worker, and that turned out to be the whole problem. A Flex Template uses two different container images, and I had only ever been building one.
Here are the three implementations, with the benefits and trade-offs of each.
1. The monolith: one file plus save_main_session
This is the default Flex Template path. Everything lives in __main__, and you set save_main_session=True, which pickles the main session and ships it to the workers. The workers run the stock Beam SDK image that Dataflow already caches on the worker pool.
# Dockerfile — launcher only; workers use the stock SDK image
FROM gcr.io/dataflow-templates-base/python312-template-launcher-base:latest
ENV FLEX_TEMPLATE_PYTHON_PY_FILE=/template/pipeline.py
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE=/template/requirements.txt
COPY pipeline.py requirements.txt /template/
RUN pip install -r /template/requirements.txt
options.view_as(SetupOptions).save_main_session = True
The benefit is simplicity, and worker startup is fast because there is nothing custom to pull. The trade-off shows up the moment you try to import a second module: the workers raise ModuleNotFoundError, because only __main__ was pickled. You end up inlining everything into one file or duplicating logic between the pipeline and a testable module, and neither of those holds up as the codebase grows.
2. The custom container that never works
The obvious fix is to build a custom image, install your pipeline as a package, and point the workers at it. The mistake I made every time was building a single image for both roles: Beam SDK base, the Flex launcher copied in, my package installed, the launcher set as the entrypoint. I built the template from that image and passed the same image as --sdk_container_image.
# The trap: one image used for BOTH the template and the workers
FROM apache/beam_python3.12_sdk:2.58.0
COPY --from=gcr.io/dataflow-templates-base/python312-template-launcher-base \
/opt/google/dataflow/python_template_launcher /opt/google/dataflow/python_template_launcher
COPY . /template
RUN pip install -e /template
ENV FLEX_TEMPLATE_PYTHON_PY_FILE=/template/my_pipeline/pipeline.py
ENTRYPOINT ["/opt/google/dataflow/python_template_launcher"] # wrong entrypoint for a worker
Dataflow runs your --sdk_container_image as the SDK worker harness, and it starts that container using the image's own entrypoint. When that entrypoint is the Flex launcher rather than the Beam boot, the launcher runs, does nothing useful as a worker, and exits, so the worker pod drops into CrashLoopBackOff. The job reports "Running" while the data lag climbs and nothing is processed, which is exactly the failure I kept hitting. The custom container approach is sound, but a single image cannot serve both roles.
3. The fix: a custom launcher and a custom worker
A Flex Template needs two images because Dataflow runs each one with a different entrypoint. The launcher image builds the pipeline graph at submit time and keeps the Flex launcher entrypoint. The worker image runs the SDK harness and must keep the Beam boot entrypoint at /opt/apache/beam/boot.
# Dockerfile — the LAUNCHER (the template's --image)
FROM gcr.io/dataflow-templates-base/python312-template-launcher-base:latest
ENV FLEX_TEMPLATE_PYTHON_PY_FILE=/template/my_pipeline/pipeline.py
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE=/template/setup.py
COPY . /template
RUN pip install -e /template
# inherits the launcher entrypoint from the base
# Dockerfile.worker — the WORKER (--sdk_container_image)
FROM apache/beam_python3.12_sdk:2.58.0
COPY . /template
RUN pip install -e /template
# inherits /opt/apache/beam/boot — do not override the entrypoint
Both images install the same package, so both can import your code, but the worker keeps the boot entrypoint, so the harness actually starts. You build the template from the launcher image and pass the worker image at run time.
# build the template from the LAUNCHER image
gcloud dataflow flex-template build gs://BUCKET/template.json \
--image REGION-docker.pkg.dev/PROJECT/REPO/launcher:latest \
--sdk-language PYTHON --metadata-file metadata.json
# run it, pointing the workers at the WORKER image
gcloud dataflow flex-template run "job-$(date +%s)" \
--template-file-gcs-location gs://BUCKET/template.json \
--region REGION \
--parameters sdk_container_image=REGION-docker.pkg.dev/PROJECT/REPO/worker:latest
Once the images were split, the worker booted from the custom image, imported the modules I had finally been able to separate, and ran the pipeline on the first attempt. The benefit is real modularity: every component is its own file, independently testable, with no inlining and no save_main_session. The trade-off is that there are now two images to build and push, and the workers pull your image instead of using the cached stock one, which makes startup slower and image size something you have to care about.
The size trade-off, and how to fix it
My first worker image, built on the full apache/beam_python3.12_sdk base, came out at 2.8 GB. That size is the real cost of a custom container, because every worker pulls it before processing anything. The base image carries a lot you do not need, so the fix is to start from a slim Python image and copy only the Beam boot binary across.
# Dockerfile.worker — slim variant (~794 MB instead of ~2.8 GB)
FROM apache/beam_python3.12_sdk:2.58.0 AS sdk
FROM python:3.12-slim
COPY --from=sdk /opt/apache/beam /opt/apache/beam
COPY . /template
RUN pip install -e /template
ENTRYPOINT ["/opt/apache/beam/boot"]
Same pinned Beam version, none of the extra baggage, and it ran identically on Dataflow. That change cut the worker image by roughly 72%. The apache-beam[gcp] dependency set is the floor you cannot easily go below, but everything above it is removable.
If you would rather not maintain a custom worker image at all, you can keep the launcher image with FLEX_TEMPLATE_PYTHON_SETUP_FILE set and run the template without --sdk_container_image. Dataflow then stages your package onto the stock, pre-cached worker through setup.py, which gives you the modular code without the image pull.
Which approach to use
+---------------------------------------------+------------------------------------+
| What you want | Approach |
+---------------------------------------------+------------------------------------+
| Smallest, simplest, throwaway pipeline | Monolith with save_main_session |
| Modular code, no image to maintain | Stock worker plus setup.py staging |
| Modular code with reproducible dependencies | Custom launcher plus custom worker |
+---------------------------------------------+------------------------------------+
The lesson I wish I had years ago is that the custom container was never the thing that was broken. The launcher was the missing half. A Flex Template is two images, and once you build both, the monolith finally comes apart.
0 Comments
Leave a Comment