Finding similar travelers

What up world,

Taking a fun detour from my usual work, I have been playing around with graph DBs and publicly available travel datasets on Kaggle. I have been really interested in learning how to connect groups of people who share similar behavior on platforms, be it traveling to the same places, enjoying the same book, etc.

I have been exploring some fun ways to "recommend" similar nodes to each other recently using LLMs, and figured it would be a good idea to write about my process.

The problem:

How would you find users that have similar travel patterns from some randomly sampled data?

The solution:

Create a knowledge graph of travelers with the following schema:

| From | Relationship | To | Description |
|---|---|---|---|
| (trav:Traveler) | [:TOOK] | (trip:Trip) | Traveler went on Trip |
| (trip:Trip) | [:AT_DESTINATION] | (dest:Destination) | Trip's destination |
| (trip:Trip) | [:STAYED_IN] | (acc:Accommodation) | Accommodation used on the trip |
| (trip:Trip) | [:TRAVELED_BY] | (trans:Transportation) | Transportation mode for the trip |

Take this data and create profiles by passing JSON representations of each user to LLMs in order to create embedding profiles. Then, using cosine similarity, find travelers that have similar embedding profiles.
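For anyone unfamiliar, cosine similarity measures the angle between two vectors rather than their magnitudes. A toy sketch with made-up 4-dimensional "profiles" (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical profiles -- real embedding vectors are much longer.
alice = np.array([0.9, 0.1, 0.8, 0.2])
bob   = np.array([0.8, 0.2, 0.9, 0.1])   # similar travel patterns to alice
carol = np.array([0.1, 0.9, 0.1, 0.9])   # very different patterns

print(cosine_similarity(alice, bob))    # close to 1.0
print(cosine_similarity(alice, carol))  # much lower
```

A score near 1.0 means "points in nearly the same direction"; near 0 means "unrelated".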

What's the value of travelers grouped by similarity?

If you are asking this, you probably like your food pre-chewed, but I'll indulge you (jk, this is a fair question). The value of grouping people online by their interests is as much a social question as a business one.

Travel focused social media:

Picture yourself in a room full of 100 strangers. You might feel the very human tendency to try and relate to one another. In attempting to do so, you might also land on the conclusion that not everyone is your cup of tea, be it for communication style, religious beliefs, or maybe a lack of shared interest and experience... What would the value be in suddenly knowing who you had the most in common with? You might be able to strike up a conversation and make a friend based on commonalities.

Now picture this: you're a traveler in the middle of the country looking to connect with people that have been to similar places and had similar life experiences. How would you recommend these users to each other?

Recommending new trips

Okay, social media isn't your Utah Jazz™️ - that's fine. Well, do you like travel at all? Like, are you opposed to the idea of getting out of your comfort zone and becoming slightly more cultured 🤔? If not, you sound LAME 🤡🤡🤡. Candidly, re-evaluate your taste and come back to me in a few months. Listen to a Barry White album, and if you're over 26, go to Japan (or Korea) since that seems to be the thing to do.

If the answer is yes, then great! We can use similarity among past travel patterns to cluster similar users. My hunch is that these clusters of similar users can be used as a basis to recommend past trips <similar user a/> has taken that <similar user b/> has not yet taken.

Once we have a candidate trip and destination, we can find activities near it. So, starting with similar users, you can build entire travel itineraries.


What I have made so far is (a very naive) knowledge graph of a small dataset of users, trips and locations. My goal is to group people with similar travel patterns together by taking relevant relationships between connected nodes (I'll outline these below for more clarity) and passing them through a text-generation transformer instructed to create a summary of their patterns. After I create these summaries, I compute their embeddings and set the computed embeddings back on the user node. At a high level, I am creating profiles for comparison.

Edit: rereading this in retrospect, this was a cool idea, but I also want to combine it with embeddings created from raw data for a more consistent / relevant format. This is important so that I can make the best comparisons possible. I'm thinking about changing my prompt to try and capture qualitative features which are harder to get directly from the data (a traveler is vagabonding, or couch surfing, or living large, or going to destinations that consistently have access to the ocean) as a way to complement the embeddings I create manually, but I digress.

I'm including some of my code below for my driver function:

Getting the data and the prompt

db = MemgraphDriver()
query = """
	MATCH (traveler:Traveler {id: $user_id})-[:TOOK]-(trip:Trip)
	OPTIONAL MATCH (trip)-[:AT_DESTINATION]->(destination:Destination)
	OPTIONAL MATCH (trip)-[:STAYED_IN]->(accommodation:Accommodation)
	OPTIONAL MATCH (trip)-[:TRAVELED_BY]->(transport:Transportation)
	RETURN traveler, trip, destination, accommodation, transport
"""

params = {"user_id": user_id}
result = db.execute_query(query, params)

if not result:
	print('no result found for this query')
	return

# We don't want to add any PII to a model.
# Soooo we should strip all that prior to creation.
graph_summary = []

# Track if we've already added the traveler
traveler_node_added = False

for record in result:
	traveler = record.get("traveler")
	trip = record.get("trip")
	destination = record.get("destination")
	accommodation = record.get("accommodation")
	transport = record.get("transport")

	if traveler and not traveler_node_added:
		graph_summary.append({
			"type": "node",
			"labels": list(traveler.labels),
			"properties": {
				k: v for k, v in traveler.items() if k not in ['email', 'name', 'traveller_name']
			}
		})

		traveler_node_added = True

	if trip:
		trip_properties = {k: get_serializable_value(v) for k, v in trip.items()}
		graph_summary.append({
			"type": "node",
			"labels": list(trip.labels),
			"properties": trip_properties
		})
		# These relationships are implied from the query shape

	if destination:
		destination_properties = {k: get_serializable_value(v) for k, v in destination.items()}

		graph_summary.append({
			"type": "relationship",
			"relationship_type": "AT_DESTINATION"
		})

		graph_summary.append({
			"type": "node",
			"labels": list(destination.labels),
			"properties": destination_properties
		})

	if accommodation:
		accommodation_properties = {k: get_serializable_value(v) for k, v in accommodation.items()}

		graph_summary.append({
			"type": "relationship",
			"relationship_type": "STAYED_IN"
		})

		graph_summary.append({
			"type": "node",
			"labels": list(accommodation.labels),
			"properties": accommodation_properties
		})

	if transport:
		transport_properties = {
			k: get_serializable_value(v) for k, v in transport.items()
		}

		graph_summary.append({
			"type": "relationship",
			"relationship_type": "TRAVELED_BY"
		})

		graph_summary.append({
			"type": "node",
			"labels": list(transport.labels),
			"properties": transport_properties
		})

graph_text = json.dumps({"user_data": graph_summary}, indent=2)

What we have in memory is a JSON representation of a user in our knowledge graph, including the relationships this user has to "trips", "locations", etc. The next thing to do is to create a prompt.

prompt = f"""
	You are an ai agent for a travel recommendation app.
	Using this user's graph data create a short summary that describes their behavior.
	Your summary should include descriptions of each trip as its own entity as well as an overview of the traveller's habits with frequency of travel, locations visited, and activities they seem to like.

	Base your summary exclusively on their behavior.

	Note: Do not preface your response with any kind of message e.g: 'Based on the user's graph data, here's a summary of their travel behavior'

	Only return the summary

	User's Graph Data:

	--------------
	{graph_text}
	--------------
"""

return prompt

return prompt

For all you prompting wizards out there, I'm sure this could be improved - go touch grass, nerd.

**Update: After getting my results - I know it can be improved - still touch grass.**

UPDATE FOR THE UPDATE: I changed it to this 🍃🍃🍃

#promptv2

You are an expert travel behavior analyst. Based solely on the graph data provided, create a structured profile of this traveler that quantifies their travel preferences and archetypes.

INSTRUCTIONS:

1. Analyze the provided graph data carefully

2. Rate each travel archetype on a scale of 0-10 (where 0=not at all, 10=extremely strong match)

3. Base your ratings ONLY on evidence from the provided data

4. If insufficient evidence exists for a category, assign it a rating of 0-2

5. Include only factual observations in your summary

REQUIRED OUTPUT FORMAT:

Traveler is a [GENDER], aged [AGE] who has traveled to [NUMBER] destinations across [NUMBER] continents.

They show the following ratings for travel archetypes (0-10 scale):

- Budget Backpacker: Enjoys inexpensive travel, hostels, low-cost transportation. [RATING] - [brief evidence if >2]

- Luxury Traveler: Prefers 5-star accommodations, fine dining, private tours. [RATING] - [brief evidence if >2]

- Adventure Seeker: Interested in outdoor adventures, sports, thrill-seeking. [RATING] - [brief evidence if >2]

- Cultural Explorer: Values museums, historical sites, local culture. [RATING] - [brief evidence if >2]

- Relaxation-Focused: Seeks beach resorts, spas, wellness retreats. [RATING] - [brief evidence if >2]

- Social Traveler: Enjoys group tours, nightlife, meeting new people. [RATING] - [brief evidence if >2]

- Family-Oriented Traveler: Travels with family, chooses kid-friendly destinations. [RATING] - [brief evidence if >2]

- Nature Lover: Prioritizes parks, hiking, natural scenery. [RATING] - [brief evidence if >2]

- Urban Explorer: Loves major cities, architecture, shopping, dining. [RATING] - [brief evidence if >2]

- Remote Retreat Enthusiast: Prefers quiet, remote, off-grid locations. [RATING] - [brief evidence if >2]

- Digital Nomad: Loves new places with social scenes and nature with access to good internet. [RATING] - [brief evidence if >2]

Key travel patterns:

[2-3 sentences identifying core patterns in their travel behavior, destinations, accommodations, transportation choices, trip duration, and spending patterns]

GRAPH DATA:

{graph_text}


Getting a summary from our prompt

Now you have a pretty good prompt! Let's pass that into an Ollama instance of llama3.2, a transformer model. I am using Ollama because I am poor-ish in an expensive city and can't afford a $4000 Nvidia graphics card or mini supercomputer to blast my fledgling ideas into digital oblivion.

import requests

# Function that generates a summary of the prompt about a user.
def create_summary_of_prompt(prompt: str):
	res = requests.post("http://localhost:11434/api/generate", json={
		"model": "llama3.2:latest",
		"prompt": prompt,
		"stream": False
	})

	res.raise_for_status()
	return res.json()["response"]

Next, we need to generate an embedding from this summary, so once again I am asking Ollama for professional advice. Here I am using all-minilm. This model is professionally known as a tiny, real chill guy whose sole purpose in life is to generate embeddings from text. He is kind to my shitbook VRAM specs. I appreciate him for that.

# Store an embedding on the user node.
def store_summary_on_user_node(user_id: str, summary: str):
	res = requests.post("http://localhost:11434/api/embeddings", json={
		"model": "all-minilm",
		"prompt": summary
	})

	res.raise_for_status()
	embedding = res.json()["embedding"]

	if embedding:
		db = MemgraphDriver()
		query = """
			MATCH (t:Traveler {id: $user_id})
			SET t.embedding = $embedding
			RETURN t
		"""

		db.execute_query(query, {
			"user_id": user_id,
			"embedding": embedding
		})

		return {
			"user_id": user_id,
			"embedding": embedding
		}

Voila, you have now generated some insights about your users via a knowledge graph and some high dimensional space. The next thing to do is to compare these embeddings (and some additional metadata, thanks to Qdrant's neat payload capabilities) in a vector DB to find which users are similar to each other based on their travel preferences.

Once I have the ability to find similar nodes, I can pull each user into a high dimensional space for similarity comparison of travel behavior. I have been looking at Qdrant specifically as a vector DB to do this.

A moment on cosine vs euclidean metrics: cosine is more representative of behavior / preferences, while euclidean is better for actual distances. I could try to find a way to use euclidean on the actual geographic data for the locations visited - maybe I enrich some of the location nodes with GPS data. ^ source - trust me dude (chat gippity)
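To make that concrete, here's a toy example (hypothetical destination-count vectors, nothing from my dataset) showing why cosine captures preference while euclidean captures magnitude:

```python
import numpy as np

# Same *mix* of destinations, but one traveler simply travels 10x as often.
occasional = np.array([1.0, 2.0, 1.0])
frequent   = occasional * 10
homebody   = np.array([2.0, 0.0, 0.0])

def cosine(a, b):
    # Angle between vectors: magnitude cancels out.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    # Straight-line distance: magnitude matters a lot.
    return float(np.linalg.norm(a - b))

print(cosine(occasional, frequent))     # ~1.0 -- identical preferences
print(euclidean(occasional, frequent))  # large -- very different volumes
print(euclidean(occasional, homebody))  # smaller, despite different tastes
```

So for "do these two people like the same kind of trips", cosine is the right lens; euclidean would punish the frequent traveler just for traveling more.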


Finding similarity from our data

So now we have nodes with embeddings, we know we want to rely on cosine distance for similarity comparison, and we know we have Qdrant Cloud to make our lives a little bit easier. Let's make some comparisons!

With Qdrant you create collections, and collections are comprised of points. Points are objects that have an id and vectors of a specified shape; in our case, 384 was the size of our semantic embeddings.


SIDEBAR: #multiembedding Qdrant now supports multivector search, where you can search for different vectors at the same time and assign weights to each. This is so fucking cool, because I can update existing points to include new vectors and adjust the weights to tailor my results.

https://qdrant.tech/articles/storing-multiple-vectors-per-object-in-qdrant/

# This is example code, not relevant to my project yet
qdrant_client.search(
    collection_name="travelers",
    query_vector={
        "semantic_embedding": (transformer_query_vector, 0.8),
        "feature_embedding": (raw_query_vector, 0.2),
    },
    top=10
)

^ In v2 I'm trying to do weighted search, but it is proving to be a pain in the asshole due to my shit box feature embeddings.

DIGRESSING, back to blog


point = models.PointStruct(
	id=node['user_id'],
	vector=padded_embedding,
	payload=get_tags_for_node(db=db, node=node),
)

Payloads are rad, since they let you add additional information about each point beyond the embedding, and you can filter by these payloads. In our case, this is good for a few reasons.

  1. Small differences in sentence structure can have a big impact on the similarity score given to each node. This means the difference between "like" and "as" can make two similar users seem more distant and unrelated. (Update: I can sort of get around this by using a more templated prompt.)

  2. We might decide to add new data to each node in Qdrant later to experiment, and we don't want to generate entirely new embeddings. Some factors might be hard to prompt / feature engineer, so using the payload gives us a more deterministic way to balance out our search.

So if:

- our encoder is churning out bullshit (it was ™️)
- our transformer is not accurately representing the data passed into the prompt (pain ™️)
- or my feature embeddings are bullshit (they currently are ™️),

then we can add additional search parameters to each similarity search utilizing the payload on each of our nodes. TL;DR - we get more flexibility by searching beyond vector similarity with more verbose payload info.

Back to the code - this is what get_tags_for_node looks like:

# We want to provide some additional data on the users based on their relationships.
# Qdrant allows us to provide this via a payload on each point. So we are decorating
# our Qdrant points with metadata about the traveler.
def get_tags_for_node(db: MemgraphDriver, node: dict):
	tags = {}
	relationships = ['duration', 'destination', 'transportation_type']
	query = """
		MATCH (t:Traveler {id: $user_id})
		OPTIONAL MATCH (t)-[r:TOOK]-(trip:Trip)
		OPTIONAL MATCH (trip)-[travBy:TRAVELED_BY]-(transportation:Transportation)
		OPTIONAL MATCH (trip)-[atDest:AT_DESTINATION]-(destination:Destination)
		RETURN t as traveler, trip, r, atDest, destination, transportation, travBy
	"""

	result = db.execute_query(query, {"user_id": node['user_id']})
	for record in result:
		# This is what we create our payload info from, so it prob makes sense to expand on it in the future.
		tags[relationships[0]] = record['trip']['duration']
		tags[relationships[1]] = record['destination']['name']
		tags[relationships[2]] = record['transportation']['type']

	return tags

As you can see, we are grabbing a few (probably not enough) relationships from our graph: the number of days a traveler spent on a trip, the name of the destination they visited, and how they traveled. I'm going to go out on a limb and say that this could be improved substantially, but it will give us a starting point at least.
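One caveat worth flagging: because the loop assigns into the same keys on every record, only the last trip's values survive. A hypothetical aggregation sketch (not in my code yet) that collects values across all trips instead:

```python
def aggregate_tags(records: list[dict]) -> dict:
    """Collect payload tags across *all* of a traveler's trips
    instead of keeping only the last record's values."""
    tags = {"durations": [], "destinations": [], "transportation_types": []}
    for record in records:
        tags["durations"].append(record["trip"]["duration"])
        tags["destinations"].append(record["destination"]["name"])
        tags["transportation_types"].append(record["transportation"]["type"])
    return tags

# Fake records shaped like the query results above.
records = [
    {"trip": {"duration": 7}, "destination": {"name": "London, UK"},
     "transportation": {"type": "Flight"}},
    {"trip": {"duration": 5}, "destination": {"name": "Phuket, Thailand"},
     "transportation": {"type": "Flight"}},
]
print(aggregate_tags(records))
```

List-valued payload fields also play nicely with Qdrant's match filters later on.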

Creating our collection

Here's how we create our collection:

if __name__ == "__main__":
	nodes = get_nodes_with_embeddings()
	collection_count = len(nodes)

	collection_name = "travelers"
	client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
	collection_exists = client.collection_exists(collection_name=collection_name)

	# Only create this once michael...
	if not collection_exists:
		client.create_collection(
			collection_name=collection_name,
			vectors_config=models.VectorParams(
				size=500,
				distance=models.Distance.COSINE
			),
		)
	# create the actual points
	payload_for_points = create_payload_for_point(nodes=nodes)
	# insert them into the db
	client.upsert(
		collection_name=collection_name,
		points=payload_for_points
	)

Notice how we check that it doesn't already exist... It's the little things.


Sidebar of future Michael

#multiembedding #recreate_embeddings

After realizing I wanted to do multi-vector search, the setup looks a bit different. I'll include it here for context:

client.recreate_collection(
	collection_name=collection_name,
	vectors_config={
		"feature_embedding": VectorParams(
			size=73,
			distance=models.Distance.COSINE,
		),
		"semantic_embedding": VectorParams(
			size=384,
			distance=models.Distance.COSINE,
		),
	}
)

In this case I'm using recreate_collection because the collection already exists.


Searching for similar users

To search an existing collection you need two things (well, really one):

  1. A traveler's embedding that has been padded to match the 500-dimension shape of our collection.

  2. A reference payload containing additional search parameters.

similar_users = search_users_for_similarity(client=client, target_embedding=target_embedding, reference_payload=reference_payload)

We can obtain these from a node whose behavior we've gippitied™️.
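The padding itself is just zero-filling. Here's a hypothetical pad_embedding helper (the name is mine, not from my codebase) that pads a 384-dim embedding out to the collection's 500-dim config:

```python
import numpy as np

def pad_embedding(embedding: list[float], target_size: int = 500) -> list[float]:
    """Zero-pad an embedding out to the collection's configured vector size."""
    vec = np.asarray(embedding, dtype=np.float32)
    if vec.shape[0] > target_size:
        raise ValueError(f"embedding ({vec.shape[0]}) larger than target ({target_size})")
    padded = np.zeros(target_size, dtype=np.float32)
    padded[: vec.shape[0]] = vec  # original values first, zeros after
    return padded.tolist()

print(len(pad_embedding([0.1] * 384)))  # 500
```

Handy property: zero-padding two vectors the same way changes neither their dot product nor their norms, so cosine similarity between them is unchanged.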

Also, here's our search_users_for_similarity function:

def search_users_for_similarity(client, target_embedding, reference_payload):
	# Search for similar users
	search_result = client.search(
		collection_name="travelers",
		query_vector=target_embedding,
		limit=5,
		with_payload=True # This will include the metadata we stored
	)

	similar_users = []

	for hit in search_result:
		similar_users.append({
			"user_id": hit.id,
			"score": hit.score,
			"payload": hit.payload
		})

	# Sort by score, descending. Note: list.sort() returns None,
	# so sort first and then return the list.
	similar_users.sort(key=lambda x: x["score"], reverse=True)
	return similar_users

This netted us the following:

[ScoredPoint(id='f68fe24e-91a2-42a5-9fa2-c0ca66178a16', version=6, score=1.0, payload={'duration': 7, 'destination': 'London, UK', 'transportation_type': 'Flight'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id='bc3b84c2-bd09-4098-b94d-74f1dd71a514', version=6, score=0.6646669, payload={'duration': 7, 'destination': 'Bali, Indonesia', 'transportation_type': 'Flight'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id='0fa8d56e-ee1f-4b39-a0fe-d2cc8c82b56b', version=6, score=0.5428275, payload={'duration': 5, 'destination': 'Phuket, Thailand', 'transportation_type': 'Flight'}, vector=None, shard_key=None, order_value=None)]

When looking at our results, let's disregard the first: a similarity score of 1.0 means the provided search embedding and payload belong to the same user. Comparing the second result, user_id 'bc3b84c2-bd09-4098-b94d-74f1dd71a514', to the first in our graph, here is what we see.

David Lee graph

David Lee:

age: 45
embedding: Array[384]...
gender: "Male"
id: "bc3b84c2-bd09-4098-b94d-74f1dd71a514"
name: "David Lee"
nationality: "Korean"

Took a trip:

accommodationCost: 1000
duration: 7
endDate: Object {"year":2023,"month":7,"day":1}
id: 153599
startDate: Object {"year":2023,"month":7,"day":1}
transportationCost: 700

to Bali, Indonesia:

name: Bali, Indonesia
# There is a real opportunity to enrich location data for better recommendations

and traveled by:

type: "Flight"

Now let's compare this to the relationships for traveler f68fe24e-91a2-42a5-9fa2-c0ca66178a16:

John Smith Graph

John Smith:

age: 35
embedding: Array[384]...
gender: "Male"
id: "f68fe24e-91a2-42a5-9fa2-c0ca66178a16"
name: "John Smith"
nationality: "American"

Took a trip:

accommodationCost: 1200
duration: 7
endDate: Object {"year":2023,"month":5,"day":1}
id: 140407
startDate: Object {"year":2023,"month":5,"day":1}
transportationCost: 600

To:

name: "London, UK"

and traveled by:

type: "Flight"

Our similarity search found a 0.6646669 similarity between John Smith and David Lee.


Similarities:

Both travelers are early-to-middle aged men who each took a 7-day trip in 2023 for around 1000 USD in accommodation, opting to travel by flight.


Least similar result

The least similar in our list was traveler 0fa8d56e-ee1f-4b39-a0fe-d2cc8c82b56b:

Jane Doe:

age: 28
embedding: Array[384]...
gender: "Female"
id: "0fa8d56e-ee1f-4b39-a0fe-d2cc8c82b56b"
name: "Jane Doe"
nationality: "Canadian"

Took a trip:

accommodationCost: 800
duration: 5
endDate: Object {"year":2023,"month":6,"day":15}
id: 280728
startDate: Object {"year":2023,"month":6,"day":15}
transportationCost: 500

Even more results

Sarah Johnson:

age: 29
embedding: Array[384]...
gender: "Female"
id: "b6a229b9-8b77-46fb-9612-33f5ddfaa2e5"
name: "Sarah Johnson"
nationality: "British"

Took a trip:

accommodationCost: 2000
duration: 14
endDate: Object {"year":2023,"month":8,"day":15}
id: 574047
startDate: Object {"year":2023,"month":8,"day":15}
transportationCost: 1000

to destination:

name: "New York, USA"

Where she stayed in an accommodation:

type: "Hotel"

and traveled by:

type: "Flight"

Emily Davis was the closest neighbor to Sarah Johnson, with a cosine similarity of 0.696097.

cosine similarity qdrant

Emily Davis:

age: 33
embedding: Array[384]...
gender: "Female"
id: "92e6330d-ad13-4451-8e90-b99d91764d38"
name: "Emily Davis"
nationality: "Australian"

Took a trip:

accommodationCost: 500
duration: 10
endDate: Object {"year":2023,"month":11,"day":20}
id: 838803
startDate: Object {"year":2023,"month":11,"day":20}
transportationCost: 1200

to destination:

name: "Sydney, Australia"

Where she stayed in accommodation:

type: "Hostel"

and traveled by:

type: "Flight"

Qualitative check

Referring to my very academic approach of qualitative evaluation (eyeballing it), I notice some latent commonalities between travelers.

For one, both Emily and Sarah are women in a similar age range, traveling by flight for over a week. They are not staying the same number of days, but I'd argue that their trip lengths are more similar than not. The pricing is not similar, though: Emily spent a combined total of $1700 on accommodation and transportation, which means she was traveling on a budget.

Sarah was kind of boujie and spent $3000 in 14 days, which means she was balling out - academically speaking.

I think that could be something I work on with my payload. Attaching the median trip cost for each traveler (not the average, which could be misleading) would be a helpful way to filter our search. I gippitied it, and I could do so by attaching this as our payload reference (assuming every node in Qdrant has a medianTripCost stored on it):

"filter": {
  "must": [
    {
      "key": "medianTripCost",
      "range": { "gte": 1500, "lte": 3500 }
    }
  ]
}
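Computing medianTripCost itself is a one-liner; here's a sketch using the camelCase cost properties from the trip nodes above (the helper and field name are hypothetical, not in my code yet):

```python
import statistics

def median_trip_cost(trips: list[dict]) -> float:
    """Median of (accommodation + transportation) cost per trip -- median,
    not mean, so one blowout vacation doesn't skew the profile."""
    costs = [t["accommodationCost"] + t["transportationCost"] for t in trips]
    return float(statistics.median(costs))

# Fake trips using the same shape as the trip nodes shown earlier.
trips = [
    {"accommodationCost": 1000, "transportationCost": 700},   # 1700
    {"accommodationCost": 1200, "transportationCost": 600},   # 1800
    {"accommodationCost": 2000, "transportationCost": 1000},  # 3000
]
print(median_trip_cost(trips))  # 1800.0
```

Store that value in each point's payload and the range filter above does the rest.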

But wait, our similarity sucks

Well, it's probably worth acknowledging that for a production system we should explore creating embeddings more cost-efficiently, directly from raw data. Why don't we try to add a new set of embeddings with a more deterministically defined feature set? These will be known as our feature embeddings™️. I also have a hunch that I should use the prompt-generated embeddings differently. I want to use transformers to qualify our travelers by prompting them to identify traveler themes or archetypes - like: is boujie, spares no expense, barely travels at all, prefers the coast, currently going through a midlife crisis (we can be kinder to our travelers - this is only to illustrate my point - I promise). If I can capture both the more quantitative similarity and the harder-to-search-for qualitative, contextual identity of our travelers, and weight the two (Qdrant lets me), I might improve our results.
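Stripped of any Qdrant specifics, the weighting idea is just a convex combination of the two cosine scores. A toy numpy sketch with made-up vectors (the 0.8 / 0.2 weights are illustrative):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def weighted_score(sem_a, sem_b, feat_a, feat_b, w_sem=0.8, w_feat=0.2):
    """Blend semantic-summary similarity with deterministic feature similarity."""
    return w_sem * cosine(sem_a, sem_b) + w_feat * cosine(feat_a, feat_b)

# Hypothetical vectors, tiny for illustration (real ones are 384 and 73 dims).
sem_a, sem_b = np.array([0.9, 0.1]), np.array([0.8, 0.2])
feat_a, feat_b = np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])
print(weighted_score(sem_a, sem_b, feat_a, feat_b))
```

Because the weights sum to 1 and each cosine is at most 1, the blended score stays on the same 0-to-1-ish scale as a single cosine.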

Again, the north star is finding the most similar person and recommending trips they have taken in order to create full blown itineraries.

#promptv2 #multiembedding


Our new and more structured embeddings

Manually creating these boys means we need to start by making some encoders.

# Start by defining our encoder functions; we only define these once in our
# script, to be called against multiple travelers.
def create_destination_encoder():
	db = MemgraphDriver()
	query = """
		MATCH (dest:Destination)
		RETURN dest.name as destination
	"""

	result = db.execute_query(query, {})

	# Materialize as a list -- np.array(<generator>) produces a 0-d object array
	all_destinations = [record['destination'] for record in result]
	destination_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
	# OneHotEncoder expects a 2D array, hence reshape(-1, 1)
	dest_encoder = destination_encoder.fit(np.array(all_destinations).reshape(-1, 1))

	return dest_encoder

def create_accomodations_encoder():
	db = MemgraphDriver()
	query = """
		MATCH (accom:Accommodation)
		RETURN accom.type as accomodation
	"""
	result = db.execute_query(query, {})

	all_accoms = [record['accomodation'] for record in result]
	accomodation_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
	accom_encoder = accomodation_encoder.fit(
		np.array(all_accoms).reshape(-1, 1)
	)
	return accom_encoder

def create_transportation_encoder():
	db = MemgraphDriver()
	query = """
		MATCH (transport:Transportation)
		RETURN transport.type as transport
	"""

	result = db.execute_query(query, {})
	all_transports = [record['transport'] for record in result]
	transportation_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')

	transport_encoder = transportation_encoder.fit(
		np.array(all_transports).reshape(-1, 1)
	)

	return transport_encoder

Then we gotta make our traveler features dict.


def create_traveler_features(user_id: str):
	db = MemgraphDriver()
	query = """
		MATCH (t:Traveler {id: $user_id})
		OPTIONAL MATCH (t)-[tr:TOOK]-(trip:Trip)
		OPTIONAL MATCH (trip)-[:AT_DESTINATION]->(destination:Destination)
		OPTIONAL MATCH (trip)-[:STAYED_IN]->(accommodation:Accommodation)
		OPTIONAL MATCH (trip)-[:TRAVELED_BY]->(transport:Transportation)
		RETURN t as traveler, trip, count(trip) as trips_count, destination, accommodation, transport
	"""

	result = db.execute_query(query, {"user_id": user_id})

	t_features = {
		'avg_trip_duration': np.mean([
			record['trip']['duration'] for record in result
		]),
		'total_trips': np.sum([record['trips_count'] or 0 for record in result]),
		'destinations_visited': [record['destination']['name'] for record in result],
		'transportation_used': [record['transport']['type'] for record in result],
		'accommodation_types': [record['accommodation']['type'] for record in result],
		# Median on purpose, so one expensive trip doesn't skew the profile
		'avg_trip_cost': np.median([
			float(record['trip']['transportation_cost']) + float(record['trip']['accomodation_cost']) for record in result
		]),
		'seasonality': [record['trip']['start_date'] for record in result],
	}

	return t_features

Then we encode our features dict using the traveler's data and the encoder references we defined at the start of our script.

# Encode features from our traveler and our reference features.
def encode_categorical_features(
	features,
	destination_encoder,
	transport_encoder,
	accommodation_encoder
):
	# destinations
	dest_array = np.array(features['destinations_visited']).reshape(-1, 1)
	dest_encoded = destination_encoder.transform(dest_array)
	dest_vector = np.sum(dest_encoded, axis=0)  # Sum or mean depending on preference

	# transportation
	trans_array = np.array(features['transportation_used']).reshape(-1, 1)
	trans_encoded = transport_encoder.transform(trans_array)
	trans_vector = np.sum(trans_encoded, axis=0)

	# Accoms
	acc_array = np.array(features['accommodation_types']).reshape(-1, 1)
	acc_encoded = accommodation_encoder.transform(acc_array)
	acc_vector = np.sum(acc_encoded, axis=0)

	combined_vector = np.concatenate([dest_vector, trans_vector, acc_vector])

	return combined_vector

Finally, we run the jaunt and generate some embeddings!

if __name__ == "__main__":
	user_id = "b6a229b9-8b77-46fb-9612-33f5ddfaa2e5"

	destination_encoder = create_destination_encoder()
	accomodations_encoder = create_accomodations_encoder()
	transportation_encoder = create_transportation_encoder()

	traveler_features = create_traveler_features(user_id=user_id)

	cat_vector = encode_categorical_features(
		traveler_features,
		destination_encoder,
		transportation_encoder,
		accomodations_encoder
	)

	numerical_vector = np.array([
		traveler_features['avg_trip_duration'],
		traveler_features['total_trips'],
		traveler_features['avg_trip_cost'],
		np.mean(traveler_features['seasonality']),
	])

	# Reshape to 2D array for MinMaxScaler (required by sklearn)
	numerical_vector = numerical_vector.reshape(-1, 1)

	# Create a MinMaxScaler object to scale features to the range [0, 1]
	scaler = MinMaxScaler()

	# Fit and transform the numerical features
	normalized_numerical_data = scaler.fit_transform(numerical_vector)
	normalized_numerical_vector = normalized_numerical_data.flatten()
	print("Normalized Numerical Vector:", normalized_numerical_vector)

	# Final feature vector: concatenate the categorical one-hot sums with the
	# *normalized*, flattened numerical features (not the raw 2D array)
	final_vector = np.concatenate([cat_vector, normalized_numerical_vector])

	breakpoint()

Results of our deterministically generated embeddings

After playing around with various permutations of the above code, adjusting the weights for the numeric vectors, and normalizing everything, my results were less than stellar. Part of this could be due to my misuse of the sklearn library.
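One likely source of that sklearn misuse: calling `fit_transform` on a single traveler's features reshaped to a column scales the four features relative to *each other*, not each feature across travelers. A sketch of the per-feature version, assuming you can first assemble a matrix with one row per traveler (the numbers here are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical matrix: one row per traveler, columns are
# [avg_trip_duration, total_trips, avg_trip_cost, mean_seasonality].
all_numerical = np.array([
	[7.0, 3, 1200.0, 0.4],
	[10.0, 8, 2500.0, 0.7],
	[3.0, 1, 400.0, 0.1],
])

# Fit once on the whole population so each COLUMN is scaled to
# [0, 1] independently, then transform any single traveler's row.
scaler = MinMaxScaler()
scaler.fit(all_numerical)
one_traveler = scaler.transform(all_numerical[1:2]).flatten()
```

Fitting on the population and transforming individual rows also means two travelers' vectors are scaled consistently, which matters once you start comparing them with cosine similarity.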

Maybe we'd be better off with a different approach? Enter metapath2vec

I also believe, after researching more on graph recommenders (a la suggestion of the gippity), that a better route could be to rely on some heterogeneous graph walking algorithm akin to metapath2vec. For those unfamiliar, it walks the different nodes in a graph, traversing the different edges and gathering the metadata on each node to build up a high-dimensional representation of the neighborhood. An analogy: it's like moving to a new neighborhood. In order to understand the dynamics and complex relationships, you randomly start knocking on doors. Every time you knock on a door (or visit a node, in our graph's case), you have a conversation with the same directed questions.

(Neighbor A) -[HAD_CONVERSATION_ABOUT: { "hey i just moved" -> "is there anything i should know about the neighborhood" -> "I just met neighbor [b], they seem cool" -> "Who else should i talk to on the street" }]-> (Neighbor B)

Then you go to neighbors B, C, D, etc. and do the same thing until you have built up an entire mental map of the neighborhood.

Once you have explored your map, you write down all the information you've learned. You can create a summary (embedding) for each of the nodes in your graph, or each of the neighbors in your neighborhood.

You write these summaries by noticing which neighbors get mentioned together most often in your walk notes - like learning that two people must be close friends because they're always talked about in the same stories. At a high level, that's what metapath2vec is doing.
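The door-knocking analogy maps to a metapath-guided random walk: at each step you only move to neighbors whose node type matches the next entry in the metapath. A toy sketch, assuming a plain adjacency dict instead of a real graph library (node names and the Traveler->Trip->Destination metapath are made up):

```python
import random

# Hypothetical heterogeneous graph: node -> (type, neighbors).
graph = {
	"alice": ("Traveler", ["trip1"]),
	"bob": ("Traveler", ["trip2"]),
	"trip1": ("Trip", ["alice", "bali"]),
	"trip2": ("Trip", ["bob", "bali"]),
	"bali": ("Destination", ["trip1", "trip2"]),
}

def metapath_walk(start, metapath, length, rng):
	"""Random walk that only steps to neighbors whose node type
	matches the next entry in the (cyclic) metapath."""
	walk = [start]
	for i in range(length - 1):
		wanted = metapath[(i + 1) % len(metapath)]
		candidates = [n for n in graph[walk[-1]][1] if graph[n][0] == wanted]
		if not candidates:
			break  # dead end: no neighbor of the required type
		walk.append(rng.choice(candidates))
	return walk

rng = random.Random(0)
walk = metapath_walk("alice", ["Traveler", "Trip", "Destination", "Trip"], 5, rng)
```

Each walk reads like a sentence ("alice took trip1 to bali, which trip2 also went to..."), and those sentences are what the skip-gram model trains on.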

Algorithm paper: https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf

Also, metapath2vec depends on skipgram models. This blog article gives a really good breakdown of how they work at a high level:

blog article: https://leshem-ido.medium.com/skip-gram-word2vec-algorithm-explained-85cd67a45ffa
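For a bit of skip-gram intuition to go with that article: the walks are treated like sentences, and (center, context) training pairs are pulled from a sliding window over each one. A minimal sketch of the pair generation only (not the real word2vec training loop):

```python
def skipgram_pairs(walk, window):
	"""Generate (center, context) pairs from one walk, the same way
	word2vec pulls pairs from a sentence with a context window."""
	pairs = []
	for i, center in enumerate(walk):
		lo, hi = max(0, i - window), min(len(walk), i + window + 1)
		for j in range(lo, hi):
			if j != i:
				pairs.append((center, walk[j]))
	return pairs

pairs = skipgram_pairs(["alice", "trip1", "bali"], window=1)
# → [("alice","trip1"), ("trip1","alice"), ("trip1","bali"), ("bali","trip1")]
```

Nodes that keep landing in each other's windows across many walks end up with similar embeddings - that's the "mentioned together in the same stories" effect from the neighborhood analogy.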

Sounds useful Michael - why don't you stop writing this fucking book and write some code so you can make more dablooms?

GOTCHA. This is a great approach if your graph has bidirectional edges.

Meaning, this algorithm shines when you can go from (traveler)-[:TOOK_TRIP]-(trip) AND (trip)-[:TAKEN_BY]-(traveler). But as I found out (through unsurprising failure), when you have a graph whose nodes have unidirectional edges, (traveler)-[:TOOK]->(trip), you shouldn't just add reverse edges to let you meta walk all over for some embeddings. Mostly because directionality is really important in a semantic sense. Trips don't take travelers; travelers take trips. Misunderstanding that and adding in reverse edges can lead to an incorrect graph topology, ergo netting you shitbox embeddings.

Other approaches to solving a similar problem:

H.O.P.E algorithm: https://www.kdd.org/kdd2016/papers/files/rfp0184-ouA.pdf

L.I.N.E algorithm: TODO: add in link

So where do we go from here? I'm tired of reading about your failures.

Metapath2vec++ handles directionality

We make a few small adjustments to our script and rerun it to compute new similarity scores.

Metapath2Vec++ walk results

Here metapath2vec++ found that Michael Chang and Lisa Chen were most closely related. Turns out their nodes share a ton of edges with each other. The algorithm bases similarity on co-occurrences as well as the properties of each node. The nodes themselves are also similar: gender aside, both are almost 30, both went on week-plus trips (10 and 8 days respectively), stayed in a Hotel in Bali, and traveled by Flight. If we are basing our similarity on co-occurrence, this is as close as we are going to get.

Once I create the embeddings, I do the following for each traveler:

def get_traveler_embedding(
	traveler_id, 
	embeddings_dict, 
	traveler_id_to_idx
):
	# Get the index of the traveler in your graph
	traveler_idx = get_traveler_index(traveler_id, traveler_id_to_idx)

	# Combine embeddings from different meta-paths
	combined_embedding = torch.cat([
		embeddings_dict["metapath_0"][traveler_idx] * 0.4, # destination
		embeddings_dict["metapath_1"][traveler_idx] * 0.3, # accommodation
		embeddings_dict["metapath_2"][traveler_idx] * 0.3  # transportation
	])

	return combined_embedding.numpy().reshape(1, -1)

def store_feature_keys_on_traveler_nodes(
	traveler_id, 
	embeddings_dict, 
	traveler_id_to_idx
):

	embedding_for_traveler = get_traveler_embedding(
		traveler_id, 
		embeddings_dict, 
		traveler_id_to_idx
	)

	base64_embedding = base64.b64encode(
		embedding_for_traveler.tobytes()
	).decode('utf-8')

	db = MemgraphDriver()
	query = """
		MATCH (traveler:Traveler {id: $traveler_id})
		SET traveler.b64_graph_embeddings = $graph_embeddings
		RETURN traveler
	"""

	result = db.execute_query(
		query, 
		{'traveler_id': traveler_id, 'graph_embeddings': base64_embedding}
	)

	return result

This combines the different metapath embeddings into one per traveler. It weights the embeddings based on my idea of what's most important before combining them into a 2D array with exactly 1 row.
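With one combined vector per traveler, finding the "most similar" traveler is just cosine similarity over every pair. A sketch assuming the embeddings are already numpy arrays keyed by traveler (the names and vectors below are illustrative, not real output):

```python
import numpy as np

def most_similar(traveler_id, embeddings):
	"""Return the other traveler whose embedding has the highest
	cosine similarity to traveler_id's embedding."""
	target = embeddings[traveler_id]
	best_id, best_score = None, -1.0
	for other_id, vec in embeddings.items():
		if other_id == traveler_id:
			continue
		score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
		if score > best_score:
			best_id, best_score = other_id, score
	return best_id, best_score

embeddings = {
	"michael_chang": np.array([0.9, 0.1, 0.8]),
	"lisa_chen": np.array([0.8, 0.2, 0.9]),
	"david_lee": np.array([0.1, 0.9, 0.1]),
}
match, score = most_similar("michael_chang", embeddings)
```

A brute-force loop like this is fine at toy scale; for a real user base you'd want a vector index instead of an O(n) scan per query.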

Also, here's how to decode them:

def decode_embedding(b64_str, dtype_str, shape_str):
	dtype = np.dtype(dtype_str)
	shape = tuple(int(x) for x in shape_str.strip('()').split(',') if x)

	bytes_data = base64.b64decode(b64_str)
	array = np.frombuffer(bytes_data, dtype=dtype)

	if len(shape) > 1:
		array = array.reshape(shape)

	return array
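A quick round-trip check of the base64 encode/decode, assuming the embedding is a float32 array shaped (1, dim) as produced above:

```python
import base64
import numpy as np

original = np.arange(6, dtype=np.float32).reshape(1, 6)

# Encode the raw bytes the same way the store step does.
b64_str = base64.b64encode(original.tobytes()).decode("utf-8")

# Decode: the dtype and shape must be stored alongside the blob,
# since tobytes() throws that metadata away.
restored = np.frombuffer(
	base64.b64decode(b64_str), dtype=np.float32
).reshape(1, 6)
```

This is why `decode_embedding` takes `dtype_str` and `shape_str` as arguments - the base64 blob alone isn't enough to rebuild the array.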


Thinking beyond similarity:

We found some similar nodes based on summaries, then similar nodes based on co-occurrences - so what? Why does this matter?

How can we use this to recommend tours and activities based on the embeddings we created? If we think about the problem of recommending tours to a "David Lee", a good start is finding the most similar user in our db, John Smith, then surfacing recent Trip nodes John has a :TOOK relationship with - meaning destinations John has visited recently that David has never been to.

Once we have a candidate destination, let's build an itinerary based on activities that are available nearby and that don't greatly exceed David Lee's median trip cost. This could be an exciting opportunity to introduce some kind of front end that takes in user input, creates an embedding, and spits out a bunch of tours and locations - but I digress. At this point we need to connect Activity nodes to our Destination nodes in order to start creating itineraries.
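A toy version of that recommendation step, assuming we already have the most-similar pair from the embedding search and each traveler's set of visited destinations (the names and data below are illustrative):

```python
def recommend_destinations(target, visited_by):
	"""Recommend destinations the target's nearest neighbor has
	visited but the target has not. In the real pipeline `neighbor`
	would come from the embedding similarity search; here it's
	hardcoded for illustration."""
	neighbor = "john_smith"  # hypothetical most-similar traveler
	return sorted(visited_by[neighbor] - visited_by[target])

visited_by = {
	"david_lee": {"Paris", "Tokyo"},
	"john_smith": {"Paris", "Bali", "Sydney"},
}
recs = recommend_destinations("david_lee", visited_by)
# → ["Bali", "Sydney"]
```

In the graph itself this set difference is a single Cypher pattern over :TOOK and :AT_DESTINATION edges rather than Python sets, but the logic is the same.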