자바(Java) 기반의 애플리케이션을 개발할 때, 우리는 데이터의 빠른 검색과 저장을 위해 HashMap이나 HashSet 구조를 빈번하게 사용합니다. 이때 가장 흔하게 사용되는 키(Key) 타입은 단연 문자열(String)입니다. 개발자들은 보통 해시 테이블 기반의 자료구조가

O(1)

의 탐색 성능을 보장할 것이라 믿고, 사용자 ID나 세션 토큰 같은 중요한 식별자를 문자열 형태로 해시 맵에 저장합니다.

하지만 이러한 믿음의 기저에 있는 String 클래스의 hashCode() 메서드는 완벽한

O(1)

을 보장하지 못하며, 설계상의 수학적 한계로 인해 예기치 않은 성능 저하나 심지어 보안 취약점을 유발할 수 있습니다. 독자가 겪을 수 있는 시스템 지연이나 데이터 병목 현상의 원인이 바로 이 작은 해시 함수 하나에서 시작될 수 있습니다. 본 포스트에서는 다항식 해싱(Polynomial Hashing)의 원리부터 생일 문제(Birthday Paradox)까지, String.hashCode()가 가진 구현상의 한계와 잠재적 문제점들을 깊이 있게 고찰해 봅니다.

Java String hashCode()의 내부 구현과 지연 초기화

문자열 해싱(String hashing)은 개별 문자를 비교하는 대신, 문자열을 대표하는 하나의 숫자(해시값)를 계산하여 비교 연산의 효율성을 극대화하는 기법입니다[1]. 자바의 String 클래스 역시 이 기법을 도입하여 내부적으로 다항식 해싱(Polynomial Hashing) 알고리즘을 사용합니다.

길이가

n

인 문자열

s

에 대한 다항식 해시값은 일반적으로 다음과 같은 수학적 공식을 통해 계산됩니다[2].

(sA^{n-1} + s[3]A^{n-2} + \dots + s[n-1]A^0) \bmod B

이 공식에서

s[i]

는 문자열 내 각 문자의 문자 코드(아스키 또는 유니코드 값)를 의미하며,

A

와

B

는 사전에 정의된 상수입니다[2]. 자바의 구현체에서는 승수

A

의 값으로 31을 사용하며, 모듈러 연산을 위한 상수

B

는 명시적으로 선언되지 않고 32비트 정수(int)의 오버플로우 연산을 활용하여 자연스럽게

2^{32}

로 나누어진 나머지를 취하게 됩니다.

흥미로운 점은 String 객체가 해시값을 언제, 어떻게 계산하느냐는 것입니다. 자바 스펙상 String은 상태가 변하지 않는 불변 객체(Immutable Object)입니다. 객체의 상태가 생성 이후 변경될 수 없으므로, 모든 필드가 final 조건을 만족하고 안전하게 발행(Safe publication)된다면 동기화 없이도 여러 스레드에서 안전하게 공유할 수 있습니다[4], [5].

이러한 불변성이라는 강력한 특징 덕분에, 자바의 String은 해시코드를 매번 새로 계산하는 대신 지연 초기화(Lazy Initialization) 방식을 사용합니다. hashCode()가 처음 호출될 때 한 번만 다항식 연산을 수행하여 그 결과를 non-final 필드에 캐싱(Caching)합니다[6]. 비록 이 필드가 final은 아니지만, 불변하는 문자열 상태로부터 결정론적으로(Deterministically) 항상 동일한 결과값이 도출됨이 보장되므로, 데이터 경합(Benign data race)이 발생하더라도 동시성 환경에서 완벽하게 스레드 안전(Thread-safe)하게 작동합니다[6].

String hashCode가 내포한 치명적인 문제점들

자바의 문자열 해싱 로직은 연산 속도와 캐싱 효율성 면에서는 우수하지만, 해시 충돌(Hash Collision) 측면에서는 수학적으로 큰 한계를 안고 있습니다.

2^32 모듈러 연산의 수학적 취약점

앞서 언급했듯 자바는 32비트 int 자료형의 오버플로우를 이용해 사실상

B = 2^{32}

인 모듈러 연산을 수행합니다[7]. 컴퓨터의 비트 연산 구조상 32비트나 64비트 정수를 활용해

2^{32}

및

2^{64}

에 대한 모듈러를 자동 처리하는 것은 매우 편리합니다[7].

그러나 알고리즘 관점에서 이는 결코 좋은 선택이 아닙니다. 자료에 따르면,

2^x

형태의 상수를 모듈러로 사용할 경우 항상 해시 충돌을 일으키는 특정 입력값들을 인위적으로 조합하고 생성해 낼 수 있기 때문입니다[7]. 즉, 자바의 String.hashCode() 구조는 악의적인 목적을 가진 사용자가 동일한 해시값을 가지는 수만 개의 서로 다른 문자열을 수학적으로 쉽게 역산하여 만들어낼 수 있다는 치명적인 결함을 가집니다.

생일 문제(Birthday Paradox)와 급격한 충돌 확률 증가

악의적인 공격이 아니더라도, 해시 테이블 내에서 다른 두 데이터가 같은 버킷에 배정되는 현상은 일상적인 상황에서도 직관보다 훨씬 빠르게 발생합니다.

생일 문제(Birthday Paradox) 이론에 따르면, 해시 충돌의 확률은 데이터가 증가함에 따라 우리가 상상하는 것 이상으로 기하급수적으로 높아집니다[8]. 예를 들어, 자바의 해시코드와 동일하게 32비트 정수 공간(약 43억 개의 경우의 수)을 반환하는 CRC32() 해시 함수의 경우, 불과 93,000개의 값만 해시 테이블에 입력되어도 해시값이 겹칠 확률이 1%에 도달하게 됩니다[8]. 실제 사전 데이터인 약 98,569개의 영어 단어만 해시 테이블에 넣어도 이미 충돌이 발생할 정도입니다[8].

결론적으로, 무한히 많은 조합을 가진 문자열의 세계에서 고작 32비트 공간의 정수로 해시값을 매핑한다는 것은 본질적으로 한계가 있으며, 시스템 규모가 커질수록 서로 다른 키(Key)임에도 같은 해시값을 갖는 빈도가 필연적으로 상승하게 됩니다[9].

해시 충돌이 시스템과 탐색 알고리즘에 미치는 파장

그렇다면 위와 같은 원인으로 String 키 간에 충돌이 발생하면 내부적으로 어떤 동작이 일어날까요?

해시 인덱스 구조에서 새로운 레코드를 삽입할 때, 이미 해당 해시 버킷(Bucket)이 가득 차 있거나 같은 해시 코드를 가진 다른 값이 존재한다면 오버플로우 체이닝(Overflow Chaining) 방식을 통해 충돌을 해결합니다[10]. 즉, 동일한 주소를 부여받은 모든 데이터 엔트리들이 연결 리스트(Linked List) 형태로 길게 이어지게 됩니다[11].

문제는 데이터를 조회(get)할 때 발생합니다. 특정 식별자를 찾기 위해 해시값을 계산하고 해당 버킷에 접근했는데, 그곳에 수많은 데이터가 연결 리스트로 묶여 있다면 어떻게 될까요? 시스템은 해시 테이블이 제공하는

O(1)

의 빠른 속도의 이점을 전혀 누리지 못합니다. 대신, 원하는 레코드를 정확히 찾을 때까지 체인에 묶인 모든 레코드의 탐색 키(Search-key) 값을 일일이 순차적으로 비교(Sequential Check)해야만 합니다[12]. 해싱을 이용하면 부분 문자열 비교의 시간 복잡도를

O(N^2)

에서

O(N)

으로 줄일 수 있지만[13], 이는 충돌이 없다는 전제 하의 이점입니다. 체이닝이 길어지면 조회 성능은 데이터 개수

N

에 비례하여 선형적으로 악화됩니다.

(이하 단락은 제공된 출처 외부의 일반적인 컴퓨터 과학 및 보안 지식을 바탕으로 작성되었습니다. 필요시 추가 검증을 권장합니다.)

이러한 특성을 악용한 보안 취약점이 바로 해시 충돌 서비스 거부 공격(HashDOS)입니다. 공격자가 HTTP 요청의 파라미터나 JSON 페이로드의 키값으로 자바 String.hashCode()에서 의도적으로 충돌이 발생하는 수만 개의 문자열을 전송한다고 가정해 봅시다. 서버 내부의 HashMap 버킷 하나에 수만 개의 데이터가 연결 리스트 형태로 쌓이게 되고, 이후 데이터를 조회하거나 추가할 때마다 서버의 CPU 자원을 막대하게 소모하여 애플리케이션 전체가 마비되는 심각한 장애를 초래할 수 있습니다.

설계의 한계를 극복하기 위한 최적화 전략과 결론

지금까지 살펴본 바와 같이, String.hashCode()는 다항식 해싱의 간결함과 불변 객체 캐싱을 통한 성능적 이점이 돋보이는 반면, 32비트 오버플로우 기반의 모듈러 연산 구조로 인해 의도적이거나 우연한 해시 겹침에 매우 취약하다는 분명한 한계가 존재합니다.

이러한 문제를 극복하고 더 견고한 시스템을 설계하기 위해서는 다음과 같은 대안적 접근 방식을 고려해 볼 수 있습니다.

안전한 모듈러 상수 선택: 문자열 해싱 알고리즘을 직접 구현해야 하는 시스템 설계라면, 2322^{32}232와 같은 배수 형태를 피해야 합니다. 대신 A=911382323A = 911382323A=911382323, B=972663749B = 972663749B=972663749와 같이 10910^9109 근처의 무작위 소수(Prime number) 상수를 사용하여 충돌 확률을 획기적으로 낮출 수 있습니다[9]. 나아가 파라미터가 다른 여러 개의 해시값을 조합하면 안전성은 더욱 증대됩니다[14].

암호학적 해시 함수의 활용: 보안이나 무결성이 절대적으로 보장되어야 하는 식별자 관리라면, 연산 비용은 다소 증가하더라도 사전 원상 공격(Preimage resistance)이나 충돌을 효과적으로 방어할 수 있는 MD5 (128비트 다이제스트)나 SHA-1, SHA-2 계열의 암호학적 해시 함수(Cryptographic Hash Function) 적용을 고려해야 합니다[15], [16].

최신 자바 버전의 트리화(Treeify) 도입: (외부 지식 참조) 다행히 Java 8 이후 버전부터는 HashMap 내에서 동일 버킷에 데이터가 일정 개수(기본 8개) 이상 모일 경우, 내부 구조를 연결 리스트에서 탐색 성능이 O(log⁡N)O(\log N)O(logN)인 균형 이진 트리(Red-Black Tree)로 변환하여 성능 저하를 방어하는 안전장치가 추가되었습니다.

자바 개발자로서 String.hashCode()의 내부가 단순히 "잘 분산된 고유값"을 보장할 것이라는 맹신은 위험할 수 있습니다. 구현의 수학적 원리를 이해하고 데이터 볼륨과 보안 민감도에 따라 적절한 자료구조 및 해시 전략을 채택하는 것이야말로 고성능 애플리케이션을 완성하는 진정한 전문가의 자세일 것입니다.

참고문헌

[1] book — String hashing String hashing is a technique that allows us to efficiently check whether two strings are equal1. The idea in string hashing is to compare hash values of strings instead of their individual characters. Calculating hash values A hash value of a string is a number that is calculated fro…

[2] book — (s[0]An−1 +s[1]An−2 +·· ·+s[n−1]A0) mod B, where s[0], s[1], . . . , s[n−1] are interpreted as the codes of the characters of s, and A and B are pre-chosen constants. For example, the codes of the characters of ALLEY are: A L L E Y 65 76 76 69 89 Thus, if A = 3 and B = 97, the hash value of ALLEY is…

[3] Core Technologies-all — collaborator is a singleton bean, it may already be initialized by the container.) All references are ultimately a reference to another object. Scoping and validation depend on whether you specify the ID or name of the other object through the bean or parent attribute. Specifying the target bean th…

[4] Java Concurrency in Practice — Neither the Java Language Specification nor the Java Memory Model formally defines immutability, but immutability is not equivalent to simply declaring all fields of an object final. An object whose fields are all final may still be mutable, since final fields can hold references to mutable objects.…

[5] Java Concurrency in Practice — To make things even less predictable, a thread may see a stale value the first time it reads a field and then a more up-to-date value the next time, which is why assertSanity can throw AssertionError. At the risk of repeating ourselves, some very strange things can happen when data is shared across …

[6] Java Concurrency in Practice — @Immutable public final class ThreeStooges { private final Set<String> stooges = new HashSet<String>(); public ThreeStooges() { stooges.add("Moe"); stooges.add("Larry"); stooges.add("Curly"); } public boolean isStooge(String name) { return stooges.contains(name); } } Listing 3.11. Immutable class bu…

[7] book — Some people use constants B = 232 and B = 264, which is convenient, because operations with 32 and 64 bit integers are calculated modulo 232 and 264. How-ever, this is not a good choice, because it is possible to construct inputs that always generate collisions when constants of the form 2x are used…

[8] O'Reilly.High.Performance.MySQL.3rd.Edition.Mar.2012 (1) — mysql> SELECT id FROM url WHERE url_crc=CRC32("http://www.mysql.com"); The probability of a hash collision grows much faster than you might think, due to the so-called Birthday Paradox. CRC32() returns a 32-bit integer value, so the probability of a collision reaches 1% with as few as 93,000 values.…

[9] book — Collisions are always possible, because the number of different strings is larger than the number of different hash values. However, the probability of a collision is small if the constants A and B are carefully chosen. A usual way is to choose random constants near 109, for example as follows: A = …

[10] Database-System-Concepts-7th-Edition — Formally, let K denote the set of all search-key values, and let B denote the set of all bucket addresses. A hash function h is a function from K to B. Let h denote a hash function. With in-memory hash indices, the set of buckets is simply an array of pointers, with the ith bucket at offset i. Each …

[11] Database-System-Concepts-7th-Edition — Unlike B+-tree indices, hash indices do not support range queries; for example, a query that wishes to retrieve all search key values v such that l ≤ v ≤ u cannot be efficiently answered using a hash index. Deletion is equally straightforward. If the search-key value of the record to be deleted is K…

[12] Database-System-Concepts-7th-Edition — Hash indexing using overflow chaining is also called closed addressing (or, less commonly, closed hashing). An alternative hashing scheme called open addressing is used in some applications, but is not suitable for most database indexing applications since open addressing does not support deletes ef…

[13] book — Using hashing, we can often make a brute force algorithm efficient. As an example, consider the pattern matching problem: given a string s and a pattern p, find the positions where p occurs in s. A brute force algorithm goes through all positions where p may occur and compares the strings character …

[14] book — The phenomenon in scenario 3 is known as the birthday paradox: if there are n people in a room, the probability that some two people have the same birthday is large even if n is quite small. In hashing, correspondingly, when all hash values are compared with each other, the probability that some two…

[15] tcpip-illustrated-volume-1-2nd-edition — A checksum or FCS can be used to verify message integrity against an adver-sary like Mallory if properly constructed using special functions. Such functions are called cryptographic hash functions and often resemble portions of encryption algorithms. The output of a cryptographic hash function H, wh…

[16] tcpip-illustrated-volume-1-2nd-edition — Second preimage resistance: Given H(M1), it should be difficult to deter-mine an M2 ≠ M1 such that H(M1) = H(M2). Collision resistance: It should be difficult to find any pair M1, M2 where H(M1) = H(M2) when M2 ≠ M1. If a hash function has all of these properties, then if two messages have the same …