Recommending itineraries from similarity

### Building an itinerary: enter the Activity

Let's introduce a new node type, Activity. An Activity has a schema that looks like the following (this was generated via an AI prompt):

Activity {
  id: string
  name: string
  description: string
  location: LocationNode | string // link to location node
  start_coordinates: { lat: number; lng: number } // for map-based queries
  duration_minutes: number // key for time-based filters
  type: "big water tour" | "land based tour" | "bike tour" | "aerial" | ...
  activity_level: "strenuous" | "leisurely" | "casual" | "relaxed"
  cost_min: number
  cost_max: number
  tags: string[] // e.g., ["sunset", "nature", "wildlife", "family-friendly"]
  seasonality: string[] // ["summer", "fall"], or a date range
  group_size_min: number
  group_size_max: number
  languages_offered: string[] // useful for international users
  accessibility_features: string[] // e.g., ["wheelchair_accessible"]
  vendor_rating: number // average rating from reviews
  cancellation_policy: "strict" | "moderate" | "flexible"
  booking_volume: number // could be a rough proxy for popularity
  image_url: string
}

Using the start_coordinates of an Activity we can find Activities that are close to our candidate Destination from above.
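As a rough illustration of that proximity idea, here is a minimal sketch that ranks activities by great-circle (haversine) distance from a destination. It assumes the destination also has a lat/lng pair, which our Destination nodes may not store yet. The function names, the 50 km default radius, and the second sample activity are made up for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a: dict, b: dict) -> float:
    """Great-circle distance in km between two {"lat": ..., "lng": ...} points."""
    lat1, lng1, lat2, lng2 = map(radians, (a["lat"], a["lng"], b["lat"], b["lng"]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def activities_near(destination_coords: dict, activities: list[dict], radius_km: float = 50.0) -> list[dict]:
    """Return activities whose start_coordinates lie within radius_km, nearest first."""
    with_distance = [
        (haversine_km(destination_coords, act["start_coordinates"]), act)
        for act in activities
    ]
    with_distance.sort(key=lambda pair: pair[0])
    return [act for dist, act in with_distance if dist <= radius_km]


# Illustrative data: the second activity is hypothetical.
sample_activities = [
    {"name": "Rio Carnival Experience", "start_coordinates": {"lat": -22.9068, "lng": -43.1729}},
    {"name": "Colosseum Walk", "start_coordinates": {"lat": 41.9028, "lng": 12.4964}},
]
rio = {"lat": -22.9068, "lng": -43.1729}
nearby = activities_near(rio, sample_activities)
```

A radius cutoff like this is the simplest option; a spatial index would be the next step if the activity catalog grows large.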

### Making a trip recommendation: approach 1

This is great if we already have booking data about which tours our travelers have taken. However, we are starting with two disconnected datasets: travelers and their trips on one side, activities on the other. How do we combine both into a functioning workflow that can suggest tours for our travelers?

We will likely need to use location as the connective tissue between destinations and tours.

Then we might want to factor in latent features from our embeddings, e.g. whether someone is young, vagabonding, and interested in doing risky shit like skydiving. For now, we lack the insights on our fake users to do anything cool, so I will keep it stupid simple: I will fuzzy match activities to destinations and create relationships wherever there are matches, so we can make proximity-based recommendations.

### Recommending naively

First things first, we need to fuzzy match our destination nodes to our newly generated activity nodes. Fuzzy matching is great because we don't need to rely on inference to do some weird and expensive check to see whether an activity takes place at a specific destination. It lets us count non-exact matches. I run a script that does the following across every activity node in the db:

def create_activity_destination_relationships():
	"""Link each Activity to every Destination whose name loosely matches its location."""
	# MemgraphDriver is assumed to be the project's own database wrapper, defined elsewhere.
	db = MemgraphDriver()
	query = """
		MATCH (a:Activity), (d:Destination)
		WHERE
			// Case-insensitive match on the full name (also covers exact matches)
			toLower(a.location) = toLower(d.name)
			OR
			// Match without the country suffix (the part before the first comma)
			toLower(a.location) = toLower(trim(split(d.name, ',')[0]))
			OR
			// Match country names (the part after the first comma, minus its leading space)
			toLower(a.location) = toLower(trim(split(d.name, ',')[1]))
			OR
			// Match after removing spaces and hyphens
			replace(replace(toLower(a.location), ' ', ''), '-', '') = replace(replace(toLower(d.name), ' ', ''), '-', '')
		MERGE (a)-[:AT_LOCATION]->(d)
	"""

	db.execute_query(query, {})

This creates a unidirectional AT_LOCATION edge from each matching activity to its destination. We will use this to create our first naive recommendations, based purely on location matching. No fancy shit yet.
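To sanity-check the matching rules without touching the database, the same logic can be sketched in plain Python. This is only an approximation of the query: the helper names are hypothetical, and it trims whitespace around each comma-separated part, which is an assumption about the intended behavior.

```python
def normalize(text: str) -> str:
    """Lowercase and drop spaces and hyphens, like the nested replace() calls."""
    return text.lower().replace(" ", "").replace("-", "")


def location_matches(activity_location: str, destination_name: str) -> bool:
    """Approximate the WHERE clause: full name, city part, country part, or normalized form."""
    loc = activity_location.strip().lower()
    # "Rio de Janeiro, Brazil" -> ["rio de janeiro", "brazil"]
    parts = [part.strip().lower() for part in destination_name.split(",")]
    candidates = {destination_name.strip().lower(), *parts[:2]}
    return loc in candidates or normalize(activity_location) == normalize(destination_name)


matches = [
    location_matches("Rio de Janeiro", "Rio de Janeiro, Brazil"),  # city part
    location_matches("brazil", "Rio de Janeiro, Brazil"),          # country part
    location_matches("Rio-de-Janeiro", "Rio de Janeiro"),          # spaces and hyphens ignored
    location_matches("Rome", "Rio de Janeiro, Brazil"),            # no rule applies
]
```

Note that the country rule is deliberately loose: an activity whose location is just a country name links to every destination in that country.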

Once we have a node id, we can find similar travelers, then grab their trips like so:

from qdrant_client import QdrantClient

# Starting with a node_id, find the most similar travelers using cosine similarity from our earlier embeddings,
# then fetch the trips those travelers took, along with the activities at each destination.
def recommend_itineraries(node_id: str, collection_name: str):
	# QDRANT_URL and QDRANT_API_KEY are expected to be defined elsewhere (e.g. loaded from config).
	client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

	# Passing a point id as the query searches for points similar to that point's stored vector.
	results = client.query_points(
		collection_name=collection_name,
		query=node_id,
		limit=3,  # only use the top 3 results in order to avoid too many options
	)
	ids = [point.id for point in results.points]

	db = MemgraphDriver()

	# Each (trip, activity) pair becomes its own entry in the collected itinerary.
	query = """
		MATCH (trav:Traveler)
		WHERE trav.id IN $ids
		OPTIONAL MATCH (trav)-[:TOOK]->(trip:Trip)
		OPTIONAL MATCH (trip)-[:AT_DESTINATION]->(dest:Destination)
		OPTIONAL MATCH (trip)-[:STAYED_IN]->(acc:Accommodation)
		OPTIONAL MATCH (trip)-[:TRAVELED_BY]->(trans:Transportation)
		OPTIONAL MATCH (dest)-[:AT_LOCATION]-(activity:Activity)
		WITH trav, trip, dest, acc, trans, activity
		ORDER BY trip.startDate
		RETURN
		trav.id as traveler_id,
		collect(DISTINCT {
			trip_id: trip.id,
			start_date: trip.startDate,
			end_date: trip.endDate,
			duration: trip.duration,
			destination: dest.name,
			accommodation: {
				type: acc.type,
				cost: trip.accommodationCost
			},
			transportation: {
				type: trans.type,
				cost: trip.transportationCost
			},
			activities: activity
		}) as itinerary
	"""

	result = db.execute_query(query, {"ids": ids})

	# Materialize the records so callers get a plain list.
	return list(result)

Running this script will net us a response like this:

<Record traveler_id='e77e92cf-feb8-49da-a6ca-802654558bf9' itinerary=[{'accommodation': {'cost': 800.0, 'type': 'Airbnb'}, 'activities': <Node element_id='640' labels=frozenset({'Activity'}) properties={'accessibility_features': ['[]'], 'activity_level': 'Moderate', 'booking_volume': 2000, 'cancellation_policy': 'Strict', 'cost_max': 120, 'cost_min': 60, 'description': 'Experience the vibrant Carnival atmosphere', 'duration_minutes': 240, 'group_size_max': 15, 'group_size_min': 1, 'id': '51', 'image_url': 'https://images.unsplash.com/photo-1483729558449-99ef09a8c325', 'languages_offered': ['["English"', '"Portuguese"]'], 'location': 'Rio de Janeiro', 'name': 'Rio Carnival Experience', 'seasonality': ['["summer"]'], 'start_coordinates': '{"lat": -22.9068, "lng": -43.1729}', 'tags': ['["cultural"', '"music"', '"dance"]'], 'type': 'Cultural', 'vendor_rating': '4.9'}>, 'destination': 'Rio de Janeiro, Brazil', 'duration': 9, 'end_date': neo4j.time.Date(2024, 1, 15), 'start_date': neo4j.time.Date(2024, 1, 15), 'transportation': {'cost': 150.0, 'type': 'Train'}, 'trip_id': 530813}]>
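Notice that several Activity properties come back as stringified fragments, for example tags as ['["cultural"', '"music"', '"dance"]'] and start_coordinates as a JSON string. Here is a small sketch for cleaning them up before use. It assumes the list fragments are a JSON array that was split on commas during import, which is a guess based on the output above. The helper names are hypothetical.

```python
import json

LIST_FIELDS = ("tags", "seasonality", "languages_offered", "accessibility_features")


def parse_list_property(fragments: list[str]) -> list:
    """Rejoin comma-split fragments such as ['["English"', '"Portuguese"]'] and parse them as JSON."""
    return json.loads(",".join(fragments))


def clean_activity(properties: dict) -> dict:
    """Return a copy of an Activity's properties with the stringified fields decoded."""
    cleaned = dict(properties)
    for key in LIST_FIELDS:
        if key in cleaned:
            cleaned[key] = parse_list_property(cleaned[key])
    if isinstance(cleaned.get("start_coordinates"), str):
        cleaned["start_coordinates"] = json.loads(cleaned["start_coordinates"])
    if "vendor_rating" in cleaned:
        cleaned["vendor_rating"] = float(cleaned["vendor_rating"])
    return cleaned


# A subset of the properties from the record above.
raw = {
    "name": "Rio Carnival Experience",
    "tags": ['["cultural"', '"music"', '"dance"]'],
    "languages_offered": ['["English"', '"Portuguese"]'],
    "accessibility_features": ["[]"],
    "start_coordinates": '{"lat": -22.9068, "lng": -43.1729}',
    "vendor_rating": "4.9",
}
activity = clean_activity(raw)
```

Fixing the import so these fields are stored with proper types would be the cleaner long-term option; this just makes the current data usable.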

Take our traveler Jessica Chen, who took a 7-day trip to Rome:

JessicaChenNode

A similar traveler took a 9-day trip to Rio de Janeiro, Brazil, spent roughly the same amount of money, and shared several of the same travel preferences along the way.

FelipeGraph

FelipeActivity

Granted, Rio is vastly different from Rome, both culturally and in terms of trip planning, but both trips are international and share many similarities. We also now have things to do!

### Looking forward

Looking forward, we should probably find ways to learn more about the latent features of our travelers before recommending tours. Much as we did when generating this knowledge graph, we can keep simulating user data with LLMs to capture more preferences (activity level, geographic preferences, past tours taken) and see whether we can make our recommender less naive.

Handling seasonality, tour costs, and potentially group itineraries should also give us some interesting challenges to tackle!