Hong Kong's public digital archives contain hundreds of thousands of duplicate image files — photographs, scanned documents, and heritage records stored multiple times across incompatible systems — a problem that took the better part of a decade to accumulate and is only now being addressed in any coordinated way. The Government Records Service, based in the Kwun Tong Government Offices complex, confirmed earlier this year that a structured deduplication programme is underway, though no completion date has been publicly committed to.
The timing matters. With Greater Bay Area integration accelerating cross-border data flows, and Hong Kong pushing to position itself as a digital economy hub in competition with Singapore, the integrity and efficiency of public-sector data infrastructure have moved from a back-office concern to a policy priority. Redundant image files are not merely a storage inconvenience: they cause version-control failures, slow down retrieval systems, and in some cases have surfaced contradictory records in legal and planning proceedings.
How the Problem Built Up
The roots of the duplication crisis trace back to the mid-2000s, when individual government bureaux digitised their own holdings independently, with no unified file-naming convention or central metadata standard. The Leisure and Cultural Services Department, which oversees the Hong Kong Public Libraries network including the flagship Central Library on Causeway Bay's Moreton Terrace, ran its own digitisation track. The Planning Department, headquartered in North Point, ran another. The two systems did not speak to each other.
A 2014 audit — the last publicly available comprehensive review of government digital storage — found that at least three separate agencies held overlapping photographic records of the same heritage sites, including buildings along Central's Pottinger Street and structures in the Kowloon Walled City Park precinct. Storage costs were already being flagged as unsustainable even then. By the time cloud migration contracts were signed in the early 2020s, deduplication had been deferred repeatedly, meaning redundant files were simply moved offshore at additional expense rather than resolved.
The National Security Law period after June 2020 added a separate layer of complexity. Certain categories of government imagery (protest documentation, public-order records) were reclassified or access-restricted, but the underlying file structures were rarely cleaned up. Speaking in general terms at public records management conferences, archivists working within the system have described a situation in which restricted and unrestricted versions of the same image coexist in different database nodes with no automated reconciliation.
What Deduplication Actually Involves — And Where It Stands
The current programme, which the Government Records Service has described in broad terms in its 2025–26 annual work plan, involves perceptual hashing — a technique that identifies visually identical or near-identical images even when file names or metadata differ. The Science Park campus in Pak Shek Kok, Sha Tin, is hosting some of the computational workload through a partnership arrangement with Hong Kong Cyberport's affiliated technology tenants, though the specific contractual details have not been disclosed in public documents.
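For readers unfamiliar with the technique, the sketch below shows one of the simplest forms of perceptual hashing, usually called an average hash. It is an illustration only: the Government Records Service has not said which algorithm or software it uses, and the function names, the 8x8 grid size and the `max_distance` threshold are all assumptions made for this example. Images are represented as plain grids of grayscale values so the code needs no imaging library.

```python
from typing import List, Sequence

# Assumption for this sketch: an image is a grid of 0-255 grayscale values.
Gray = Sequence[Sequence[int]]


def shrink(pixels: Gray, size: int = 8) -> List[List[float]]:
    """Downsample to size x size by averaging rectangular blocks of pixels."""
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image must be at least size x size pixels")
    out = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        row = []
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(pixels: Gray, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: 1 where a cell is brighter than the mean."""
    cells = [v for row in shrink(pixels, size) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def likely_duplicates(a: Gray, b: Gray, max_distance: int = 5) -> bool:
    """Hypothetical rule: treat images as duplicates if few fingerprint bits differ."""
    return hamming(average_hash(a), average_hash(b)) <= max_distance
```

Because the fingerprint depends on the coarse pattern of light and dark rather than on file bytes, a resized or slightly brightened copy of a photograph produces the same or a very similar value, while file names and metadata play no part. A production system would typically use a more robust variant and would still need human review for borderline matches.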
The scale is significant. Government estimates cited in the 2025 Policy Address supporting documentation reference more than 40 terabytes of image data held across legacy departmental systems, a figure that does not include the separate holdings of the Hong Kong Film Archive in Sai Wan Ho or the Hong Kong Heritage Museum in Sha Tin. Each of those institutions runs its own deduplication cycle on a different schedule.
For institutions and researchers who rely on public records — law firms in Central, academic departments at the University of Hong Kong in Pok Fu Lam, journalists working from the Foreign Correspondents' Club on Lower Albert Road — the practical effect of the cleanup will be faster, more reliable search results and fewer instances of conflicting document versions surfacing in the same query. The Government Records Service has indicated that public-facing archive portals should reflect the improvements by the second quarter of 2027, though departments are migrating to the cleaned datasets on a rolling basis. Anyone with pending records requests is advised to check directly with the relevant bureau whether the files they need sit in an already-processed tranche or are still awaiting deduplication review.