【Google 面试真题面经】Log Stream Processor：两行接口，考穿你的数据结构细节（RegisterEvent / GetMostActiveUser）

You're given a log stream of a chat application.
Every log entry has the following fields:

We want to implement a log stream processor which supports two methods:

RegisterEvent(timestamp, sender_username, receiver_username, message_text)
- registers the event that the message have been sent
GetMostActiveUser()
- Returns the username with the largest number of conversations.
- We count any amount of messages exchanged between two unique users as a single conversation.

面试现场

面试官： 你怎么定义 conversation？
我：两个 unique users 之间只要出现过任意消息（不管方向、多少条），就算 1 个 conversation，对吧？
面试官： 对。

我：那 GetMostActiveUser 的 “largest number of conversations”，我理解成：某个 user 和多少个不同的人聊过天（unique counterpart count），而不是他参与过多少“pair”？
面试官： 是这个意思。

我：还有两个边界我想确认：
1）用户给自己发消息算不算 conversation？
2）如果出现并列第一，返回谁？
面试官： 自己发自己我们可以忽略；并列你可以返回任意一个，或者说清楚 tie-break 规则。

这一步非常关键：你要把“会话数=unique 对端数量”说出来，否则你后面写得再快也可能统计口径错。

每来一条消息，就把（sender, receiver）这对用户归一化成一个 unordered pair；如果这对之前没出现过，就给双方的 conversation count 各 +1，并维护当前的最大值用户。

你需要解决两件事：

用一个 seenPairs 集合，存“用户对”。

归一化方法：把 pair 变成 (min(userA,userB), max(userA,userB))
这样 A->B 和 B->A 会落到同一个 key。

别每次遍历全体用户。你在 RegisterEvent 的时候就维护：

追问 1：timestamp / message_text 有用吗？
这版需求里基本没用——面试官是在看你能不能识别“干扰字段”。当然你也可以补一句：如果未来要做“最近 10 分钟最活跃”，timestamp 才会变关键（滑动窗口 + 过期处理）。

追问 2：并列第一怎么处理更合理？
我会在面试里直接说清楚一个规则，比如：

追问 3：用户量很大怎么办？
hash 方案是标准解；真到超大规模，就会走分片/一致性 hash、或者把 pair 计入外部 KV/stream processor，但那已经是下一层系统设计了。

如果你正在准备 LinkedIn、Meta、Google 这类偏系统思维的题目，这种“数据依赖 + 影响评估”类型其实非常常见。建议联系我