- Scenario
- Service
- Storage
- Initial solution
- Scale
- Additional feature: Follow and unfollow
- Additional feature: Likes and dislikes
- Register
- Login
- Post a tweet
- Newsfeed
- Timeline
- Upload image
- Upload video
- Follow
- Unfollow
- User table
- Social graph - followers (SQL/NoSQL)
- Need to support multiple index
- Tweets (Tweet service)
- Images
- Videos
Columns | Type |
---|---|
id | Integer |
username | varchar |
varchar | |
password | varchar |
- Select * from friendship_table where from_user_id = user_id
Columns | Type |
---|---|
id | Integer |
from_user_id | foreign key |
to_user_id | foreign key |
Columns | Type |
---|---|
id | Integer |
user_id | foreign_key |
content | text |
created_at | timestamp |
- Client asks to add a new tweet record
- Web server asks tweet service to add a new record
postTweet(request, tweet)
DB.insertTweet(request.User, tweet)
return success
- Post a tweet
- Post a tweet: 1 DB write
- Client asks web server for news feed.
- Web server asks friendship service to get all followings.
- Web server asks tweet service to get tweets from followings.
- Web server merges each N tweets from each tweet service and merge them together.
- Web server returns merged results to client.
// each following's first 100 tweets, merge with a key way sort
getNewsFeed(request)
followings = DB.getFollowings(user=request.user)
newsFeed = empty
for follow in followings:
tweets = DB.getTweets(follow.toUser, 100)
newsFeed.merge(tweets)
sort(newsFeeds)
return newsFeed
- Algorithm level:
- 100 KlogK ( K is the number of friends)
- System leveL:
- Get news feed: N DB reads + K way merge
- Bottleneck is in N DB reads, although they could be integrated into one big DB query.
- Get news feed: N DB reads + K way merge
- High latency
- Need to wait until N DB reads finish
- Need to have an additional newsfeed table. The newsfeed table contains newsfeed for each user. The newsFeed table schema is as follows:
- Everyone's newsfeed info is stored in the same newsFeed table.
- Select * from newsFeed Table where owner_id = XX orderBy createdAt desc limit 20;
Column | Type |
---|---|
id | integer |
ownerId | foreign key |
tweetId | foreign key |
createdAt | timestamp |
- Client asks web server to post a tweet.
- Web server asks the tweet service to insert the tweet into tweet table.
- Web server asks the tweet service to initiate an asynchronous task.
- The asynchronous task gets followers from friendship table.
- The asynchronus task fanout new tweet to followers' news feed table.
postTweet(request, tweetInfo)
tweet = DB.insertTweet(request.user, tweetInfo)
// Do not need to be blocked until finished. RabbitMQ/Kafka
AsyncService.fanoutTweet(request.user, tweet)
return success
AsyncService::fanoutTweet(user, tweet)
followers = DB.getFollowers(user)
for follower in followers:
DB.insertNewsFeed(tweet, follower)
- Post a tweet: N followers, N DB writes. Executed asynchronously.
- Get newsfeed from newsFeed Table.
// Each time after a user tweet, fanout his tweets to all followers' feed list
getNewsFeed(request)
return DB.getNewsFeed(request.user)
- 1 DB query
- When number of followers is really large, the number of asynchronous task will have high latency.
- Add cache before visiting DB, faster than 1000 times
- What to cache
- Cache each user's timeline
- N DB query request -> N cache requests
- Trade off: Cache all timeline? Only cache the latest 1000 timeline
- Cache each user's newsFeed
- For users without newsfeed cache: Merge N followers' 100 latest tweets, sort and take the latest 100 tweets.
- For users with newsfeed cache: Merge N followers' tweets after a specific timestamp. And then merge with the cache.
- Cache each user's timeline
- Push-based approach stores news feed in disk, much better than the optimized pull approach.
- For inactive users, do not push
- Rank followers by weight (for example, last login time)
- When number of followers >> number of following
- Lady Gaga has 62.5M followers on Twitter. Justin Bieber has 77.6M on Instagram. Asynchronous task may takes hours to finish.
- For users with a lot of followers, use pull; For other users, use push.
- Define a threshold (number of followers)
- Below threshold use push
- Above threshold use pull
- For popular users, do not push. Followers fetch from their timeline and integrate into news feed.
- May miss updates.
- Solutions:
- Star users: Pull not push
- Half star user: Pull + Push
- Normal user: Push
- Bi-direction relationship
- No star users: Users do not have a lot of followers
- Low latency
- Single direction relationship
- Star users
- High latency
- Cache (Facebook lease get problem)
- Asynchronously executed
- Follow a user: Merge users' timeline into news feed asynchronously
- Unfollow a user: Pick out tweets from news feed asynchronously
- Benefits:
- Fast response to users.
- Disadvantages:
- Consistency. After unfollow and refreshing newsfeed, users' info still there.
- Tweet table
Columns | Type |
---|---|
id | integer |
userId | foreign key |
content | text |
createdAt | timestamp |
likeNums | integer |
commentNums | integer |
retweetNums | integer |
- Like table
Columns | Type |
---|---|
id | integer |
userId | foreignKey |
tweetId | foreignKey |
createdAt | timestamp |
- Select Count in Like Table where tweet id == 1
- Denormalize:
- Store like_numbers within Tweet Table
- Need distributed transactions.
- Might resulting in inconsistency, but not a big problem.
- Could keep consistency with a background process.