image translation project

hello, so this is the second task from the previous blog. it's the same recruitment take home test. i think this one is quite interesting because it's not about LLM. so without further ado, let's do this!

the assignment sounded straightforward: build a FastAPI service that takes an image, translates the text, and returns a translated image. not translated text in a text box. the actual image (magazine cover, journal article, formulas) with the words replaced in place.

test case: anything → Chinese. demo images: a French magazine cover, an academic paper with math.

no model restrictions. no response time limit. which sounds freeing until you realize nobody handed you a "translate this PNG" button. you have to build the whole pipeline yourself.

understanding the limitations first

before writing any code, i sat with what makes this hard:

1. this is not one model problem.

Google Translate takes text and gives text. we need: find text → understand layout → translate → erase original → draw new text → return PNG. that's at least five separate jobs.

2. magazine covers lie.

a cover has text in the left sidebar, a big headline in the center, credits on the right. naive OCR merging would concatenate all of it into one string "FASHION SPRING ISSUE VOGUE PARIS 2024" and NLLB would translate gibberish into more gibberish. layout matters before translation even starts.

3. academic papers have things i must not touch.

formulas, author names, affiliations, page numbers. "PAGE 12" and "Check for updates" are not body text. equations and names are something that i shouldn't touch (i suppose)

4. OCR is messy.

low-confidence fragments, garbled long tokens, lone short words — these trigger translation hallucinations. NLLB loves to invent formal boilerplate when you feed it "ART" or a 22-character no-space OCR merge error. garbage in, confident garbage out.

5. Chinese text is longer (usually) and needs different fonts.

translated Chinese often needs more pixels than the original English. you can't just paste it back in the same bounding box with the same font size. CJK rendering needs Noto Sans CJK. font sizing needs a binary search. multi-line distribution needs to respect the original line boxes.

once i mapped these walls, the architecture wrote itself:

the plan

we broke it into stages with clear responsibilities:

stage	job	tool
detect & read	find text + bounding boxes	PaddleOCR / SuryaOCR
layout	group boxes into lines/paragraphs, respect columns	custom heuristics
protect	skip formulas, authors, affiliations, noise	rule-based filters + spaCy NER
translate	multilingual text → target language	NLLB-200-distilled-600M
erase	remove original text pixels cleanly	LaMa inpainting
render	draw translated text, fit font, pick color	Pillow + Noto Sans CJK

building it (stage by stage)

OCR: finding the text

PaddleOCR with angle classification (cls=True) gives us quadrilateral boxes converted to axis-aligned rectangles. each region carries text, confidence score, bounding box, and optional multi-line layout metadata.

first test on a French magazine cover: 30+ raw boxes. good start. also immediately obvious that 30 boxes ≠ 30 things to translate.

layout: the part nobody warns you about

this is where i spent more time than expected.

magazine covers have three columns. left sidebar (~38% of width), center headline, right sidebar (~62%+). we classify each OCR box by horizontal position, group into lines (similar vertical center, ±15px), split lines with large horizontal gaps (>50px), then merge consecutive lines into paragraphs (within 18px vertical gap).

the goal: translatable chunks that respect reading order, with original line bounding boxes preserved so translated text can spread across multiple lines instead of getting crammed into one box.

without column-aware merging, the center headline eats the sidebar text and translation quality collapses.

protection: what NOT to translate

three layers of "just leave it alone":

entity protection — spaCy NER + heuristics skip author names, affiliations, university addresses. on the ChatGPT paper test, "André Barcaui" and "Universidade Federal Do Rio de Janeiro..." stay as original pixels.

formula protection — regions with math symbols (∑, ∫, ∂, √), LaTeX patterns (\frac{, \sum{), or dense operator sequences get marked as formulas and skipped entirely.

region filter — low confidence (<0.5), too short (<3 chars), page references, dates, issue numbers, lone fragments ≤7 chars, garbled tokens ≥22 chars. skipped regions keep their original pixels, no inpainting, no re-rendering.

translation: NLLB-200

Meta's NLLB-200-distilled-600M supports 200 languages via FLORES codes (fra_Latn, zho_Hans, etc.). source language auto-detected with langdetect when not provided.

we strip page references before translation. translate region by region for translatable chunks only.

inpainting: erasing without ugly boxes

solid color fill looks terrible on a photographic magazine cover. LaMa (Large Mask Inpainting) fills masked text regions with plausible background texture.

binary mask built from all translatable bounding boxes with 6px padding. per-line boxes when available for tighter masks. the inpainted image is the canvas for rendering.

rendering: putting text back

binary search font sizing (8–72px) to find the largest font that fits each box. contrast-aware text color.. sample background brightness, pick black or white. multi-line greedy wrapping across original line boxes. CJK-aware token splitting (character-level for Chinese/Japanese).

RENDER_MODE supports overlay vs block styles. Noto Sans CJK for proper Chinese characters.

testing (the fun/scary part)

test images:

French magazine cover (english-magazine.jpg, france-magazine.jpg) column layout, decorative text, photographic background
ChatGPT academic paper (chatgpt-cognitive-crutch.png) —title, abstract labels, author block, body text
paper with formulas body text translates, math stays untouched

workflow: upload → select target zh → wait (first request ~30–60s for model loading, then ~5–15s) → compare original vs translated side by side.

things we checked:

test	what we looked for
magazine cover	headline translated, sidebar not merged into nonsense, background clean after inpainting
academic paper	title + abstract labels translated, author names preserved, `"Check for updates"` handled
formulas	equations untouched, surrounding prose translated
OCR noise	page numbers and garbled fragments skipped, not hallucinated into formal Chinese
CJK rendering	Chinese characters render properly, fit inside original boxes

the debug log for the ChatGPT paper was satisfying:

raw ocr boxes: 27  merged regions: 10

[03] type=title
     EN: ChatGPT as a cognitive crutch: Evidence from...
     ZH: ChatGPT 作为认知拐杖：来自随机对照试验的证据...

[04]

raw ocr boxes: 27  merged regions: 10

[03] type=title
     EN: ChatGPT as a cognitive crutch: Evidence from...
     ZH: ChatGPT 作为认知拐杖：来自随机对照试验的证据...

[04]

raw ocr boxes: 27  merged regions: 10

[03] type=title
     EN: ChatGPT as a cognitive crutch: Evidence from...
     ZH: ChatGPT 作为认知拐杖：来自随机对照试验的证据...

[04]

27 boxes became 10 meaningful regions. authors skipped. title translated. that's the pipeline working as designed.

things that broke:

first NLLB load on CPU takes forever. GPU helps if available (USE_GPU=true).
magazine cover with aggressive merging before column logic → nonsense strings → hallucinated translations. fixed with column-aware grouping.
short OCR fragments → NLLB inventing entire formal paragraphs. fixed with MAX_SINGLE_FRAGMENT_LEN = 7 filter.
translated Chinese overflowing boxes → binary search font sizing + multi-line distribution.

the result

upload a French magazine cover, get back the same cover in Chinese — headline, sidebar, credits in the right places, background texture intact.

upload an academic paper page, get back translated prose with formulas and author names still in their original form.

it's a pipeline, not a magic button. OCR → layout → protect → translate → erase → render. each stage has its own failure mode and its own debug log line. but when the side-by-side preview finally looks right — original on the left, translated on the right, layout preserved — that's the payoff.

the assignment said "output: translated image." we took that literally. every pixel decision is intentional: what to translate, what to skip, what to erase, where to draw.

this assignment took me more than 3 days to finish (then i guess i already failed the assignment because they gave me 7 days to do the rag task and image translation task) because it's very challenging. it's not the domain that i've been learning for the past year, so i definitely have a lot of things to catch up.

until next time,

indira