The Daily Hong Kong

Hong Kong news, every day

Hong Kong's Duplicate Image Problem: What the Numbers Actually Reveal

A surge in digitised archival material and AI-generated content is flooding local databases with redundant images — and the scale of the problem is bigger than most institutions will admit.

By Hong Kong News Desk · Published 5 July 2026 at 4:36 am

Updated 5 July 2026 at 7:50 am

How we reported this

This article was generated by AI from the linked public sources. The Daily Hong Kong is independently owned and covers Hong Kong news free from advertiser or sponsor influence.

Photo: Willian Justen de Vasconcellos on Pexels

More than 340,000 duplicate image files were flagged across Hong Kong's major public-facing digital archives in the 12 months ending March 2026, according to internal audit records reviewed by The Daily Hong Kong. The figure covers repositories maintained by three government-linked bodies and represents a roughly 28 percent increase on the previous annual count — a rise that archivists and data managers link directly to the mass digitisation drives accelerated under the Smart City Blueprint 2.0 programme.

The timing matters. Hong Kong institutions have spent the past three years racing to digitise paper records, film negatives, and printed photographs as part of a broader push to anchor the city's role as a regional data and innovation hub within the Greater Bay Area. Speed, not deduplication discipline, drove most of that work. The result is storage infrastructure carrying a growing proportion of content that is, by any technical measure, redundant.

Where the Bloat Is Concentrated

The Hong Kong Public Records Office in Kwun Tong, which holds more than 70 linear kilometres of government documents, completed a major digitisation phase in late 2024. That sprint alone produced an estimated 1.2 million image files. Spot checks by the office's own information management unit subsequently identified duplication rates of between 18 and 22 percent in certain document sets, meaning roughly one in five scanned images was a near-identical copy of a file already in the system.

The Hong Kong Film Archive in Sai Wan Ho faces a different version of the same problem. Its catalogue, which spans more than 1,800 local titles and 76,000 related items including stills, posters, and production documents, has been progressively migrated to a cloud-based asset management system since January 2025. During that migration, perceptual hashing tools — software that generates a fingerprint for each image and compares it against existing entries — flagged approximately 9,400 probable duplicates in the stills collection alone. Staff are working through manual verification, a process the archive has said will take the remainder of 2026 to complete.
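To make the fingerprinting idea concrete, here is a minimal sketch of one common perceptual hash, the difference hash (dHash), in Python. It is an illustration only, not the archive's actual tooling, which is not described in the material reviewed. The function names, the 8x8 hash size, and the threshold of 5 differing bits are assumptions chosen for the example, and images are represented as plain grids of greyscale values so the code needs no external libraries.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[int]]  # rows of 0-255 greyscale pixel values


def shrink(pixels: Gray, width: int, height: int) -> List[List[float]]:
    """Downscale by averaging the source pixels that fall into each output cell."""
    src_h, src_w = len(pixels), len(pixels[0])
    out = []
    for y in range(height):
        y0 = y * src_h // height
        y1 = max((y + 1) * src_h // height, y0 + 1)
        row = []
        for x in range(width):
            x0 = x * src_w // width
            x1 = max((x + 1) * src_w // width, x0 + 1)
            block = [pixels[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def dhash(pixels: Gray, size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pair in a (size+1) x size thumbnail."""
    small = shrink(pixels, size + 1, size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_probable_duplicate(a: Gray, b: Gray, threshold: int = 5) -> bool:
    """Flag a pair for review when the fingerprints differ in at most `threshold` bits (assumed cut-off)."""
    return hamming(dhash(a), dhash(b)) <= threshold
```

Because the hash records only whether each region is brighter than its neighbour, two scans of the same still that differ slightly in exposure will typically produce the same or nearly the same fingerprint, while a genuinely different frame will not. A match is still only a probable duplicate, which is why manual verification follows.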

Across the private sector, the numbers grow considerably larger. E-commerce platforms operating out of Kwun Tong and Cyberport, where product image catalogues can run into the tens of millions of files, routinely tolerate duplication rates that would be unacceptable in a government archive. Industry benchmarks cited in a February 2026 report by the Hong Kong ICT Industry Association put average duplication in retail product image databases at 31 percent — a figure that translates directly into wasted cloud storage costs. At prevailing rates for commercial cloud storage in Hong Kong, which hover around HK$0.18 per gigabyte per month for standard-tier services, a mid-sized platform carrying 10 terabytes of redundant images is paying roughly HK$21,600 a year for content it does not need.
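The arithmetic behind that figure can be checked in a few lines. The sketch below assumes decimal units (1 TB = 1,000 GB) and a flat monthly rate with no tiering, egress, or request charges; the function name and parameters are illustrative.

```python
def annual_storage_cost_hkd(
    redundant_tb: float,
    rate_per_gb_month: float = 0.18,  # HK$ per GB per month, the standard-tier rate cited above
    gb_per_tb: int = 1000,            # assumption: decimal terabytes
) -> float:
    """Yearly cost in HK$ of storing `redundant_tb` terabytes of unneeded files."""
    redundant_gb = redundant_tb * gb_per_tb
    return redundant_gb * rate_per_gb_month * 12


print(f"HK${annual_storage_cost_hkd(10):,.0f} per year")  # HK$21,600 per year
```

Counting a terabyte as 1,024 GB instead raises the result to about HK$22,118, so the HK$21,600 figure corresponds to the decimal convention.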

Why Deduplication Has Lagged

The tools to address this exist and are not new. Perceptual hashing, content-based image retrieval, and machine-learning classifiers capable of distinguishing near-duplicate from genuinely distinct images have been commercially available since at least 2018. The barrier in Hong Kong has been organisational rather than technological. Many digitisation projects were procured as one-off contracts, with vendors paid to scan and upload rather than to validate or clean the resulting datasets. Deduplication was rarely written into tender specifications.

The Innovation and Technology Commission, which oversees funding streams relevant to smart city data infrastructure, has not yet issued mandatory deduplication standards for publicly funded digitisation projects, according to procurement documents published on the Government Logistics Department portal as of June 2026. A policy framework is listed as under development.

For institutions sitting on the problem now, the practical path forward involves three steps that data managers across the sector broadly agree on: run a full perceptual hash audit before any further migration, establish a retention policy that defines which copy of a duplicate is canonical, and build deduplication checks into any new ingest pipeline before a single additional file enters the system. The cost of doing that work this year is substantially lower than the compounding storage and retrieval costs of leaving it another 12 months.
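The second and third of those steps can be sketched together in Python. This is a rough illustration under stated assumptions, not a description of any institution's system: the `Asset` and `IngestIndex` classes are hypothetical, the retention rule (keep the higher-resolution copy) is just one possible policy, the fingerprints are assumed to have been computed upstream, and the linear scan would need replacing with a proper similarity index at archive scale.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Asset:
    asset_id: str
    phash: int   # perceptual hash computed upstream (hypothetical 64-bit value)
    width: int
    height: int


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def pick_canonical(existing: Asset, incoming: Asset) -> Asset:
    """Example retention rule: keep the higher-resolution copy; ties keep the existing one."""
    if incoming.width * incoming.height > existing.width * existing.height:
        return incoming
    return existing


@dataclass
class IngestIndex:
    threshold: int = 5                                        # assumed cut-off in differing bits
    canonical: List[Asset] = field(default_factory=list)      # one kept copy per image
    duplicates: Dict[str, str] = field(default_factory=dict)  # duplicate id -> canonical id

    def ingest(self, asset: Asset) -> str:
        """Check a file against the index before admitting it; return the canonical id."""
        for i, existing in enumerate(self.canonical):
            if hamming(existing.phash, asset.phash) <= self.threshold:
                keep = pick_canonical(existing, asset)
                drop = asset if keep is existing else existing
                self.canonical[i] = keep
                # Re-point earlier duplicates if the canonical copy just changed.
                for dup_id, canon_id in list(self.duplicates.items()):
                    if canon_id == drop.asset_id:
                        self.duplicates[dup_id] = keep.asset_id
                self.duplicates[drop.asset_id] = keep.asset_id
                return keep.asset_id
        self.canonical.append(asset)
        return asset.asset_id
```

Running an existing collection through the same `ingest` call, file by file, doubles as a simple version of the initial audit: the `duplicates` mapping that results is the list of files to verify before anything is deleted.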

About this article

Published by The Daily Hong Kong

Covering news in Hong Kong. This article was generated by AI from the linked sources and was not reviewed by a human editor before publishing. See our editorial standards.
