Search Architecture

1 Comment

Instagram is in the fortunate position to be a small company within the infrastructure of a much larger one. When it makes sense, we leverage resources to leapfrog into experiences that have taken Facebook ten years to build. Facebook’s search infrastructure, Unicorn, is a social-graph-aware search engine that has scaled to indexes containing trillions of documents. In early 2015, Instagram migrated all search infrastructure from Elasticsearch into Unicorn. In the same period, we saw a 65% increase in search traffic as a result of both user growth and a 12% jump in the number of people who are using search every time they use Instagram.

These gains have come in part from leveraging Unicorn’s ability to rank queries using social features and second-order connections. By indexing every part of the Instagram graph, we powered the ability to search for anything you want - people, places, hashtags, media - faster and more easily as part of the new Search and Explore experience in our 7.0 update. 

What Is Search?

Instagram’s search infrastructure consists of a denormalized store of all entities of interest: hashtags, locations, users and media. In typical search literature these are called documents. Documents are grouped together into sets which can be queried using extremely efficient set operations such as AND, OR and NOT. The results of these operations are efficiently ranked and trimmed to only the most relevant documents for a given query.  When an Instagram user enters a search query, our backend encodes it into set operations and then computes a ranked set of the best results. 
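As a rough illustration of the set operations described above, the sketch below models each term as a Python set of document ids. The index contents and the term names (`hashtag:sunset`, `location:nyc`, `private:1`) are made up for this example and are not taken from Instagram's index.

```python
# Hypothetical toy index: term -> set of document ids.
index = {
    "hashtag:sunset": {1, 2, 3, 5},
    "location:nyc": {2, 3, 4},
    "private:1": {3},
}


def term(name):
    """Return the set of document ids indexed under a term (empty if unknown)."""
    return index.get(name, set())


# AND + NOT: documents tagged #sunset in NYC, excluding private ones.
matches = (term("hashtag:sunset") & term("location:nyc")) - term("private:1")

# OR: documents matching either term.
either = term("hashtag:sunset") | term("location:nyc")
```

A production engine would work on compressed, ordered posting lists rather than in-memory sets, but the algebra is the same.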

Getting Data In 

Instagram serves millions of requests per second. Many of these, such as signups, likes, and uploads, modify existing records and append new rows to our master  PostgreSQL databases. To maintain the correct set of searchable documents, our search infrastructure needs to be notified of these changes. Furthermore, search typically needs more information than a single row in PostgreSQL — for example, the author’s account vintage is used as a search feature after a photo is uploaded. 

To solve the problem of denormalization, we introduced a system called Slipstream where events on Instagram are encoded into a large Thrift structure containing more information than typical consumers would use. These events are binary-serialized and sent over an asynchronous pub/sub channel we call the Firehose. Consumers, such as search, subscribe to the Firehose, filter out irrelevant events and react to remaining events. The Firehose is implemented on top of Facebook's Scribe which makes the messaging process asynchronous. The figure below shows the  architecture:

[Figure: Slipstream events flowing through the Firehose to consumers such as search]

Since Thrift is schematized, we re-use objects across requests and have consumers consume messages without the need for custom deserializers. A subset of our Slipstream schema, corresponding to a photo like, is shown below:

struct User {
  1: required i64 id;
  2: string username;
  3: string fullname;
  4: bool is_private;
  ...
}
struct Media {
  1: required i64 id;
  2: required i64 owner_id;
  3: required MediaContentType content_type;
  ...
}
struct LikeEvent {
  1: required i64 liker_id;
  2: required i64 media_id;
  3: required i64 media_owner_id;
  4: Media media;
  5: User liker;
  6: User media_owner;
  ...
  8: bool is_following_media_owner;
}
union InstagramEvent {
  ...
  2: LikeEvent like;
  ...
}
struct FirehoseEvent {
  1: required i64 server_time_millis;
  2: required InstagramEvent event;
}
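
To make the consumer side concrete, here is a minimal Python sketch of a subscriber that filters a stream down to like events. The dataclasses only mirror a few fields of the schema above; the `like_events` helper and the in-memory `stream` list are assumptions for illustration and stand in for real Thrift-generated classes and a Scribe subscription.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class LikeEvent:
    liker_id: int
    media_id: int
    media_owner_id: int


@dataclass
class InstagramEvent:
    # Stand-in for the Thrift union: at most one field is set.
    # Only `like` is modeled here.
    like: Optional[LikeEvent] = None


@dataclass
class FirehoseEvent:
    server_time_millis: int
    event: InstagramEvent


def like_events(stream: Iterable[FirehoseEvent]) -> Iterator[LikeEvent]:
    """Yield only like events, skipping every other event type."""
    for message in stream:
        if message.event.like is not None:
            yield message.event.like


stream = [
    FirehoseEvent(1000, InstagramEvent(like=LikeEvent(7, 42, 9))),
    FirehoseEvent(1001, InstagramEvent()),  # some other event type
]
liked_media = [event.media_id for event in like_events(stream)]
```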

Firehose messages are treated as best-effort and a small percentage of data loss is expected in messaging. We establish eventual consistency in search by a process of reconciliation or a base build. Each night, we scrape a snapshot of all Instagram PostgreSQL databases to Hive for data archiving. Periodically, we query these Hive tables and construct all appropriate documents for each search vertical. The base build is merged against data derived from Slipstream to allow our systems to be eventually consistent even in the event of data loss.
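
The merge step can be pictured with a small Python sketch. The source does not say how conflicts are resolved, so the "newest timestamp wins" rule, the `merge` function and the document shapes below are all assumptions.

```python
def merge(base_build, live_updates):
    """Merge two {doc_id: (timestamp_millis, document)} maps, keeping the newest version."""
    merged = dict(base_build)
    for doc_id, (timestamp, document) in live_updates.items():
        if doc_id not in merged or timestamp > merged[doc_id][0]:
            merged[doc_id] = (timestamp, document)
    return merged


# Snapshot-derived documents (the base build).
base_build = {
    1: (100, {"username": "maxime"}),
    2: (100, {"username": "thomas"}),  # its streaming update was lost
}
# Documents derived from the event stream since the snapshot.
live_updates = {
    1: (150, {"username": "maxime_b"}),  # newer than the snapshot
    3: (160, {"username": "justin"}),    # created after the snapshot
}
merged_index = merge(base_build, live_updates)
```

Document 2 survives even though its streaming update never arrived, which is the property the base build is there to provide.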

Getting Data Out

Processing Queries

Assuming that we have ingested our data correctly, our search infrastructure enables an efficient path to extracting relevant documents given a constraint. We call this constraint a query, which is typically a derived form of user-supplied text (e.g. “Justin” with the intent of searching for Justin Bieber). Behind the scenes, queries to Unicorn are rewritten into S-expressions that express clear intent, for example:

(and
  user:maxime
  (apply followed_by: followed_by:me)
)

which translates to “people named maxime followed by people I follow”. Our search infrastructure proceeds in two (intermixed) steps:

  • Candidate generation: finding a set of documents that match a given query. Our backend dives into a structure called a reverse index (also known as an inverted index), which maps each term to the set of document ids indexed under it. For example, we may find the set of users with the name “justin” in the “name:justin” term.
  • Ranking: choosing the best documents from all the candidates. After getting candidate documents, we look up features which encode metadata about a document. For example, one feature for the user justinbieber would be his number of followers (32.3MM). These features are used to compute a “goodness” score, which is used to order the candidates. The “goodness” score can be either machine learned or hand-tuned — in the machine learning case, we may engineer features that discriminate for clicks or follows to a given candidate.

The result of the two steps is an ordered list of the best documents for a given query.
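
The two steps can be sketched in a few lines of Python. The follower counts other than the one quoted above, the document ids and the hand-tuned `goodness` function (log of follower count) are invented for the example; the real scoring function is not described in this post.

```python
import math

# Hypothetical reverse index and feature store.
reverse_index = {"name:justin": {1, 2, 3}}
features = {
    1: {"followers": 32_300_000},
    2: {"followers": 1_200},
    3: {"followers": 85_000},
}


def candidates(term):
    """Candidate generation: look up the document ids indexed under a term."""
    return reverse_index.get(term, set())


def goodness(doc_id):
    """Hand-tuned score (an assumption): dampened follower count."""
    return math.log1p(features[doc_id]["followers"])


def search(term, limit=2):
    """Ranking: order candidates by score and keep the best `limit`."""
    return sorted(candidates(term), key=goodness, reverse=True)[:limit]


results = search("name:justin")
```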

Graph-Aware Searches 

As part of our search improvements, Instagram now takes into account who you follow and who they follow in order to provide a more personalized set of results. This means that it is easier for you to find someone based on the people you follow.

Using Unicorn allowed us to index all the accounts, media, hashtags and places on Instagram and the various relationships between these entities. For example, by indexing a user’s followers, Unicorn can provide answers to questions such as:

“Which accounts does user X follow that are also followed by user Y?”

Equally, by indexing the locations tagged in media Unicorn can provide responses for questions such as:

“Media taken in New York City from accounts I follow”
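
Both questions reduce to set intersections once the relationships are indexed. The Python sketch below uses small in-memory dictionaries as a stand-in for the graph; the account names, media ids and helper functions are hypothetical.

```python
# Hypothetical graph data.
follows = {"me": {"alice", "bob"}, "y": {"bob", "carol"}}
media_by_owner = {"alice": {10, 11}, "bob": {12}, "carol": {13}}
media_by_location = {"nyc": {11, 12, 13}}


def followed_by_both(x, y):
    """Accounts that user x follows and that user y also follows."""
    return follows.get(x, set()) & follows.get(y, set())


def media_at_from_followed(user, location):
    """Media tagged at a location and posted by accounts the user follows."""
    from_followed = set()
    for owner in follows.get(user, set()):
        from_followed |= media_by_owner.get(owner, set())
    return from_followed & media_by_location.get(location, set())
```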

Improving Account Search 

While utilizing the Instagram graph alone may provide signals that improve the search experience, it may not be sufficient to find the account you are looking for. The search ranking infrastructure of Unicorn had to be adapted to work well on Instagram.

One way we did this was to model existing connections within Instagram. On Facebook, the basic relationship between accounts is non-directional (friending is always reciprocal). On Instagram, people can follow each other without having to follow back. Our team had to adapt the search ranking algorithms used to store and retrieve accounts to Instagram’s follow graph. For Instagram, accounts are retrieved from Unicorn by going through different mixes of:

“people followed by people you follow”

and

“people followed by people who follow you”

In addition, on Instagram, people can follow each other for various reasons. It doesn’t necessarily mean that a user has the same amount of interest in all the accounts they follow. Our team built a model to rank the accounts followed by each user. This allows us to prioritize showing people followed by people that are more important to the searcher.
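
A toy Python version of this retrieval is sketched below. The graph, the per-account `importance` values and the flat 0.1 weight for the "people who follow you" direction are all assumptions; the actual model that ranks followed accounts is not described here.

```python
from collections import defaultdict

# Hypothetical directed follow graph: user -> accounts they follow.
follows = {
    "me": {"alice", "bob"},
    "alice": {"maxime", "thomas"},
    "bob": {"maxime"},
    "carol": {"me", "thomas"},
}
# Hypothetical importance of each followed account to the searcher.
importance = {"alice": 0.9, "bob": 0.2}


def followers_of(user):
    """Accounts that follow `user` (the reverse direction of the graph)."""
    return {source for source, targets in follows.items() if user in targets}


def second_order(user):
    """Score accounts reachable through both second-order relationships."""
    scores = defaultdict(float)
    # People followed by people you follow, weighted by importance.
    for friend in follows.get(user, set()):
        for target in follows.get(friend, set()):
            scores[target] += importance.get(friend, 0.1)
    # People followed by people who follow you, with a flat weight.
    for fan in followers_of(user):
        for target in follows.get(fan, set()):
            scores[target] += 0.1
    scores.pop(user, None)  # never suggest the searcher to themselves
    return dict(scores)


scores = second_order("me")
```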

A Unified Search Box

Sometimes, the best answer for a search query can be a hashtag or a place. In the previous search experience, Instagram users had to explicitly choose between searching for accounts or hashtags. We made it easier to search for hashtags and places by removing the necessity to select between the different types of results. Instead, we built a ranking framework that allows us to predict which type of results we think the user is looking for. We found in tests that blending hashtags with accounts was such a better experience that clicks on hashtags went up by more than 20%! This increase fortunately didn’t come at the cost of significantly impacting account search.

Our classifiers are both personalized and machine-learned on the logs of searches that users are doing on Instagram. The query logs are aggregated per country to determine if a given search term such as “#tbt” would most likely result in a hashtag search or an account search. Those signals are combined with other signals, such as past searches by a given user and the quality of the results available to show, in order to produce a final blended list of results.
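
One way to picture the per-country aggregation is a simple click-share estimate, sketched below in Python. The log format, the `hashtag_prior` function and the 0.5 fallback for unseen queries are assumptions made for the example; the real classifiers combine this kind of signal with several others.

```python
from collections import Counter, defaultdict

# Hypothetical log rows: (country, query, type of result the user clicked).
query_log = [
    ("US", "#tbt", "hashtag"),
    ("US", "#tbt", "hashtag"),
    ("US", "#tbt", "account"),
    ("BR", "#tbt", "account"),
]

# Aggregate clicks per (country, query).
clicks = defaultdict(Counter)
for country, query, clicked_type in query_log:
    clicks[(country, query)][clicked_type] += 1


def hashtag_prior(country, query):
    """Share of past clicks on hashtags for this query in this country."""
    counts = clicks[(country, query)]
    total = sum(counts.values())
    return counts["hashtag"] / total if total else 0.5
```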

Media Search

Instagram’s search  infrastructure is used to power discovery features far away from user-input search. Our largest search vertical, media, contains the billions of posts on Instagram indexed by the trillions of likes. Unlike our other tiers, media search is purely infrastructure — users never enter any explicit media search queries in the app. Instead, we use it to power features that display media: explore, hashtags, locations and our newly launched editorial clusters.

Candidate Generation 

Lacking an explicit query, we get creative with our media reverse index terms to enable slicing along different axes. The table below shows a list of some term types currently supported in our media index:

[Table: term types supported in the media index]

Within each posting list, our media is ordered (“statically ranked”) reverse-chronologically to encourage a strong recency bias for results. For example, we can serve Instagram’s profile page for @thomas with a single query: (term owner:181861901). Extending to hashtags, we can serve recent media from #hyperlapse through (term hashtag:#hyperlapse). Composing Unicorn’s operators enables us to find @thomas’ Hyperlapses by issuing (and hashtag:#hyperlapse owner:181861901).
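
Because both posting lists share the same static order, an AND can be computed with a single merge-style pass. The Python sketch below assumes that media ids grow over time, so that descending id order means newest first; the ids themselves are made up.

```python
def intersect(a, b):
    """Intersect two posting lists sorted by descending media id (newest first)."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] > b[j]:
            i += 1  # a[i] is newer than anything left in b
        else:
            j += 1  # b[j] is newer than anything left in a
    return result


# Hypothetical posting lists, newest media first.
posting_lists = {
    "owner:181861901": [90, 75, 40, 12],
    "hashtag:#hyperlapse": [95, 90, 60, 40, 3],
}
thomas_hyperlapses = intersect(
    posting_lists["hashtag:#hyperlapse"], posting_lists["owner:181861901"]
)
```

The output keeps the newest-first order, so the first results returned are also the most recent ones.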

Many of the terms exist to encourage diversity in our search results. For example, we may be interested in making sure that some #hyperlapse candidates are posted by verified accounts. Through the use of Unicorn’s weak AND (WAND) operator, we can guarantee that at least 30% of candidates come from verified accounts:

(wand
  (term hashtag:#hyperlapse)
  (term verified:1 :optional-weight 0.3)
)

We exploit diversity to serve better content in the “top” sections of hashtags and locations.
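
The sketch below mimics only the outcome described above: a share of the candidate slots is reserved for documents that also match the optional term. It is not how Unicorn implements weak AND, and the `weak_and` function, its parameters and the sample ids are assumptions.

```python
import math


def weak_and(required, optional, optional_weight, limit):
    """Pick up to `limit` ids from `required` (newest first), reserving
    a share of the slots for ids that also appear in `optional`."""
    quota = math.ceil(optional_weight * limit)
    preferred = [doc for doc in required if doc in optional][:quota]
    chosen = set(preferred)
    # Fill the remaining slots with the newest documents.
    for doc in required:
        if len(chosen) >= limit:
            break
        chosen.add(doc)
    # Restore the static (newest-first) order of the required list.
    return [doc for doc in required if doc in chosen]


hyperlapse = [100, 99, 98, 97, 96, 95, 94, 93, 92, 91]  # newest first
verified = {93, 91, 50}  # media posted by verified accounts
picked = weak_and(hyperlapse, verified, optional_weight=0.3, limit=5)
```

If fewer matching documents exist than the quota asks for, this sketch simply fills the remaining slots with the newest media.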

Features 

Although posting lists are ordered reverse-chronologically, we often want to surface the top media for a given query (hashtag, location, etc.). After candidate generation, we go through a process of ranking which chooses the best media by assigning a score to each document. The scoring function consumes a list of features and outputs a score representing the “goodness” of a given document for our query.

Features in our index can be divided broadly into three categories:

  • Visual: features that look at the visual content of the image itself. Concretely, we run each of Instagram’s photos through a deep neural net (DNN) image classifier in an attempt to categorize the content of the photo. Afterwards, we perform face detection in order to determine the number and size of the faces in the photo.
  • Post metadata: features that look at non-visual content of a given post. Many Instagram posts contain captions, location tags, hashtags and/or mentions which aid in determining search relevancy. For example, the FEATURE_IG_MEDIA_IS_LOCATION_TAGGED is an indicator feature determining whether a post contains a location tag.
  • Author: features that look at the person who made a given post. Some of the richest information about a post is determined by the person that made it. For example, FEATURE_IG_MEDIA_AUTHOR_VERIFIED is an indicator feature determining whether the author of a post is verified.

Depending on the use case, we tune feature weights differently. On the “top” section of location pages we may wish to differentiate between photos of a location and photos in a location and down-rank photos containing large faces. Instagram uses a per-query-type ranking model that allows for modeling choices appropriate to a particular app view.
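
A per-query-type model can be as simple as one weight vector per app view. The Python sketch below uses a linear score; the query type names, feature names and weights are invented for illustration and only loosely echo the indicator features mentioned above.

```python
# Hypothetical hand-tuned weights, one set per query type.
WEIGHTS = {
    "location_top": {
        "is_location_tagged": 1.0,
        "author_verified": 0.5,
        "largest_face_ratio": -2.0,  # down-rank photos dominated by a face
    },
    "hashtag_top": {
        "is_location_tagged": 0.1,
        "author_verified": 1.0,
        "largest_face_ratio": 0.0,
    },
}


def score(features, query_type):
    """Weighted sum of features using the weights for this query type."""
    weights = WEIGHTS[query_type]
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


landscape = {"is_location_tagged": 1.0, "author_verified": 0.0, "largest_face_ratio": 0.0}
selfie = {"is_location_tagged": 1.0, "author_verified": 1.0, "largest_face_ratio": 0.6}
```

With these weights the landscape outranks the selfie on a location page, while the order flips on a hashtag page.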

Case study: Explore 

Our media search infrastructure also extends into discovery, where we serve interesting content that users aren’t explicitly looking for. Instagram’s Explore Posts feature showcases interesting content from people near you in the Instagram graph. Concretely, one source of explore candidates is “photos liked by people whose photos you have liked”. We can encode this into a single Unicorn query with:

(apply liker:(extract owner: liker:<userid>))

This is evaluated from the inside out:

  1. liker:<userid>:  posts that you’ve liked
  2. (extract owner:...):  the owners of those posts
  3. (apply liker:...):  media liked by those owners
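
The three steps above can be traced with a small Python model. The `likes` and `owner` tables and the helper names are hypothetical, and the helpers only imitate what the operators return, not how Unicorn evaluates them.

```python
# Hypothetical data: media id -> users who liked it, and media id -> owner.
likes = {
    1: {"me"},
    2: {"me", "bob"},
    3: {"alice"},
    4: {"bob", "alice"},
    5: {"carol"},
}
owner = {1: "alice", 2: "bob", 3: "dave", 4: "erin", 5: "alice"}


def liker(user_id):
    """Step 1: media liked by a user (the liker:<userid> term)."""
    return {media for media, users in likes.items() if user_id in users}


def extract_owner(media_ids):
    """Step 2: the owners of a set of media."""
    return {owner[media] for media in media_ids}


def apply_liker(user_ids):
    """Step 3: the union of media liked by each of the given users."""
    result = set()
    for user_id in user_ids:
        result |= liker(user_id)
    return result


explore = apply_liker(extract_owner(liker("me")))
```

Note that the result can include media the searcher has already liked (media 2 here); a real system would presumably filter those out before ranking.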

After this query generates candidates, we are able to leverage our existing ranking infrastructure to determine the top posts for you. Unlike top posts on hashtag and location pages, the scoring function for explore is machine-learned instead of hand-tuned.

Acknowledgements

By Maxime Boucher and Thomas Dimson

This project wouldn’t be possible without the contributions of Tom Jackson, Peter DeVries, Weiyi Liu, Lucas Ou-Yang, Felipe Sodre da Silva and Manoli Liodakis.

rafeco
10 days ago
Really nice post on how search works on large scale sites.

More on Hacking Team

1 Comment and 4 Shares

Read this:

Hacking Team asked its customers to shut down operations, but according to one of the leaked files, as part of Hacking Team's "crisis procedure," it could have killed their operations remotely. The company, in fact, has "a backdoor" into every customer's software, giving it ability to suspend it or shut it down -- something that even customers aren't told about.

To make matters worse, every copy of Hacking Team's Galileo software is watermarked, according to the source, which means Hacking Team, and now everyone with access to this data dump, can find out who operates it and who they're targeting with it.

It's one thing to have dissatisfied customers. It's another to have dissatisfied customers with death squads. I don't think the company is going to survive this.

acdha
20 days ago
“It's one thing to have dissatisfied customers. It's another to have dissatisfied customers with death squads.”
Washington, DC
rafeco
15 days ago

Ding dong, the witch is dead: Microsoft AV gets tough on Ask Toolbar

2 Comments and 3 Shares

Microsoft has started classifying most versions of the Ask Toolbar as unwanted software and has updated its malware programs to automatically remove them.

The move drew applause from security and support professionals because the Ask Toolbar has long been a source of performance problems that can sometimes be hard to correct. Making the toolbar more vexing is its ability to sneak its way on to computers when end users aren't paying attention. Oracle's Java software framework, for instance, has long installed it automatically unless users remember to uncheck a hard-to-see box during updates. Even after unchecking the box during one update, the box would be checked during subsequent updates, requiring end users to remain vigilant each time they installed frequent security fixes for Java.

In a recent addition to Microsoft's Malware Protection Center, the company said all but the most recent version of the Ask Toolbar will be classified as unwanted software. As a result, Windows Defender, Microsoft Security Essentials, and Microsoft Security Scanner will automatically remove it when detected.

rosskarchner
36 days ago
I think Jeeves would be with Microsoft on this one.
aaronwe
46 days ago
Good on Microsoft.
Denver
wreichard
45 days ago
Java ought to be ashamed. Even to be able to refuse it once--but to have it checked by default every time it updates...
aaronwe
45 days ago
Yes, but for that to be true, Larry Ellison would have to give a shit.
rafeco
37 days ago

Recurse Center

1 Share

Coding requires collaboration. As Andrew Bosworth said recently: doing anything meaningful past a certain point requires more than one person. So if you want to build, it’s important to do so as part of a welcoming, collaborative environment.

One environment I’ve long admired is that of the Recurse Center (formerly known as Hacker School). They’ve been unusually thoughtful about the dynamics of their culture. I’ve always thought that if I had three months to spare, I would attend a batch to experience the community directly (and hopefully contribute back however I can).

And, well, now I have that kind of time.


I was initially surprised by how many experienced engineers told me that they too would attend a Recurse Center batch, if only they could make the timing and logistics work. But I think it shows that no matter how long you’ve been coding, there will always be areas of programming you’ve been meaning to try, and it’s best to do so around other people.

I’m applying to the Summer 2 batch (starting in July). If you have the time to spare, you should apply too!

rafeco
69 days ago

Evolution

1 Comment and 2 Shares

Mike Isaac and David Gelles, writing for the NYT Bits blog:

After the flurry of attention and just a few months later, Secret opted to raise another round of financing, this time seeking $25 million. Bill Maris, managing partner of Google Ventures, did not think it was a good idea and the company did not participate.

“We advised them against it,” Mr. Maris said in an interview, referring to Secret’s leaders. “We told them they didn’t need the money. And raising that much money that soon, it was going to be impossible to meet the expectations in the future.” […]

The company completed its $25 million financing led by Index Ventures and Redpoint Ventures, along with a variety of individual angel investors. In that round, the two founders each wanted to take $3 million off the table for themselves, a practice that is commonplace for more mature companies, but less so for very young start-ups.

“It’s like a bank heist,” Mr. Maris said. “That’s not how you do a start-up.”

Later in the day, in an email to Isaac he posted publicly on Medium, Bill Maris wrote:

I want to correct and amend a few things. I wanted to let you know how my views had evolved since we spoke. […] I do want to make clear that this was not a “bank heist,” and that was a poor choice of words on my part.

That implies that the founders were trying to line their pockets at the expense of others. After having a heart to heart with David, I don’t think that’s true. David rightly pointed out to me that he and Chrys worked extremely hard. They built something that captured the imagination of a lot of people and had a huge amount of users. The tone and content of my comments as printed don’t pay the appropriate respect to that fact.

I don’t know what motivated him to speak so openly to The Times, but I know which one of his views sounds more honest to me, and it isn’t the “evolved” one.

rafeco
82 days ago
Taking that money was a bank heist.
petrilli
56 days ago
A fool and his money are soon parted, and VCs are some of the biggest fools in the world.

Consumer Reports’s Initial Apple Watch Test Results

2 Comments

Impressive scratch-resistance results, especially for the sapphire crystal on the steel Apple Watch. Water resistance was as good as or better than promised, and the heart rate monitor was as accurate as their highest-rated dedicated chest-strap monitor.

More details here.

rafeco
91 days ago
Most of what people don't like about the Apple Watch is software issues. The hardware seems great.