# Vector Database Introduction

This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.

The demo flow is:
- **Setup**: Import packages and set any required variables
- **Load data**: Load a dataset and embed it using OpenAI embeddings
- **Pinecone**
    - *Setup*: Here we set up the Python client for Pinecone. For more details go [here](https://docs.pinecone.io/docs/quickstart)
    - *Index Data*: We'll create an index with namespaces for __titles__ and __content__
    - *Search Data*: We'll test out both namespaces with search queries to confirm it works
- **Weaviate**
    - *Setup*: Here we set up the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)
    - *Index Data*: We'll create an index with __title__ search vectors in it
    - *Search Data*: We'll run a few searches to confirm it works

Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings.

## Setup

Here we import the required libraries and set the embedding model that we'd like to use

In [1]:
import openai

import tiktoken
from tenacity import retry, wait_random_exponential, stop_after_attempt
from typing import List, Iterator
import concurrent.futures
from tqdm import tqdm
import pandas as pd
from datasets import load_dataset
import numpy as np
import os

# Pinecone's client library for Python
import pinecone

# Weaviate's client library for Python
import weaviate

# I've set this to our new embeddings model; change it to the embedding model of your choice
MODEL = "text-embedding-ada-002"

## Load data

In this section we'll source the data for this task, embed it and format it for insertion into a vector database

*Thanks to Ryan Greene for the template used for the batch ingestion

In [3]:
# Simple function to take in a list of text objects and return them as a list of embeddings.
# Failed calls are retried with random exponential backoff, giving up after 6 attempts.
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embeddings(input: List):
    response = openai.Embedding.create(
        input=input,
        model=MODEL,
    )["data"]
    return [data["embedding"] for data in response]

# Function for batching and parallel processing the embeddings
def embed_corpus(
    corpus: List[str],
    batch_size=64,
    num_workers=8,
    max_context_len=8191,
):
    def batchify(iterable, n=1):
        l = len(iterable)
        for ndx in range(0, l, n):
            yield iterable[ndx : min(ndx + n, l)]

    # Encode the corpus, truncating to max_context_len
    encoding = tiktoken.get_encoding("cl100k_base")
    encoded_corpus = [
        encoded_article[:max_context_len] for encoded_article in encoding.encode_batch(corpus)
    ]

    # Calculate corpus statistics: the number of inputs, the total number of tokens, and the estimated cost to embed
    num_tokens = sum(len(article) for article in encoded_corpus)
    cost_to_embed_tokens = num_tokens / 1_000 * 0.0004
    print(
        f"num_articles={len(encoded_corpus)}, num_tokens={num_tokens}, est_embedding_cost={cost_to_embed_tokens:.2f} USD"
    )

    # Embed the corpus
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(get_embeddings, text_batch)
            for text_batch in batchify(encoded_corpus, batch_size)
        ]

        with tqdm(total=len(encoded_corpus)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(batch_size)

        embeddings = []
        for future in futures:
            data = future.result()
            embeddings.extend(data)
        return embeddings
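
To make the batching step concrete, here is a minimal standalone sketch of the same `batchify` pattern, using plain integers in place of token lists. The corpus size of 50,000 and the batch size of 64 mirror the values used in this notebook; the variable names are illustrative only. It also suggests why the progress bar can report more items than the corpus contains: the bar advances by `batch_size` for every completed batch, even when the final batch is smaller.

```python
from typing import Iterator, List, Sequence


def batchify(items: Sequence, n: int = 1) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `n` elements each."""
    for start in range(0, len(items), n):
        yield items[start : start + n]


# Stand-in for the encoded corpus: 50,000 dummy items
corpus = list(range(50_000))
batches: List[Sequence] = list(batchify(corpus, 64))

num_batches = len(batches)          # number of requests that would be submitted
last_batch_size = len(batches[-1])  # the final batch holds the remainder
progress_total = num_batches * 64   # what a bar advancing by 64 per batch reports

print(num_batches, last_batch_size, progress_total)
```

Under these assumptions the corpus splits into 782 batches, the last of which holds 16 items, so a bar that advances by 64 per batch ends at 50,048 rather than 50,000.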

In [4]:
# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding
dataset = list(load_dataset("wikipedia", "20220301.simple")["train"])
# Limited to 50k articles for demo purposes
dataset = dataset[:50_000]  

Found cached dataset wikipedia (/Users/colin.jarvis/.cache/huggingface/datasets/wikipedia/20220301.simple/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)


  0%|          | 0/1 [00:00<?, ?it/s]

In [5]:
%%time
# Embed the article text
dataset_embeddings = embed_corpus([article["text"] for article in dataset])
# Embed the article titles separately
title_embeddings = embed_corpus([article["title"] for article in dataset])

num_articles=50000, num_tokens=18272526, est_embedding_cost=7.31 USD


50048it [03:05, 269.52it/s]                                                                                                                                                                       


num_articles=50000, num_tokens=202363, est_embedding_cost=0.08 USD


50048it [00:52, 957.36it/s]                                                                                                                                                                       

CPU times: user 42.3 s, sys: 8.47 s, total: 50.8 s
Wall time: 4min 5s
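
The cost estimates printed above can be reproduced by hand. This short sketch applies the same formula as `embed_corpus` to the two token counts reported in the output. The rate of 0.0004 USD per 1,000 tokens is the value hard-coded in this notebook and may not match current pricing, so treat it as an assumption.

```python
def estimate_embedding_cost(num_tokens: int, usd_per_1k_tokens: float = 0.0004) -> float:
    """Return the estimated cost in USD for embedding `num_tokens` tokens."""
    return num_tokens / 1_000 * usd_per_1k_tokens


content_cost = estimate_embedding_cost(18_272_526)  # article text
title_cost = estimate_embedding_cost(202_363)       # article titles

print(f"content={content_cost:.2f} USD, title={title_cost:.2f} USD")
```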





In [13]:
# We then store the result in another dataframe, and prep the data for insertion into a vector DB
article_df = pd.DataFrame(dataset)
article_df['title_vector'] = title_embeddings
article_df['content_vector'] = dataset_embeddings
article_df['vector_id'] = article_df.index
article_df['vector_id'] = article_df['vector_id'].apply(str)
article_df.head()

| | id | url | title | text | title_vector | content_vector | vector_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.00107035250402987, -0.02077057771384716, -0... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0010461278725415468, 0.0008924593566916883,... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.0033627033699303865, 0.006122018210589886, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.015406121499836445, -0.013689860701560974, ... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.022219523787498474, -0.020443666726350784, ... | [0.021524671465158463, 0.018522677943110466, -... | 4 |


## Pinecone

Now we'll look to index these embedded documents in a vector database and search them. The first option we'll look at is **Pinecone**, a managed vector database which offers a cloud-native option.

Before you proceed with this step you'll need to navigate to [Pinecone](https://www.pinecone.io), sign up and then save your API key as an environment variable titled `PINECONE_API_KEY`.

For this section we will:
- Create an index with multiple namespaces for article titles and content
- Store our data in the index with separate searchable "namespaces" for article **titles** and **content**
- Fire some similarity search queries to verify our setup is working

In [11]:
api_key = os.getenv("PINECONE_API_KEY")
pinecone.init(api_key=api_key)

### Create Index

First we need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult [this article](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.).
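
As a purely conceptual illustration of namespaces, the sketch below models an index as one dictionary of vectors per namespace and answers queries by brute-force cosine similarity. `ToyIndex` and its methods are hypothetical names invented for this example; this is not how Pinecone is implemented, and a real vector database would typically use approximate nearest-neighbour search rather than scoring every vector.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyIndex:
    """Hypothetical in-memory index: one dict of id -> vector per namespace."""

    def __init__(self) -> None:
        self.namespaces: Dict[str, Dict[str, Vector]] = {}

    def upsert(self, vectors: List[Tuple[str, Vector]], namespace: str) -> None:
        # Insert new ids or overwrite existing ones within the namespace
        self.namespaces.setdefault(namespace, {}).update(vectors)

    def query(self, vector: Vector, namespace: str, top_k: int = 5) -> List[Tuple[str, float]]:
        # Score only the vectors stored in the requested namespace
        candidates = self.namespaces.get(namespace, {})
        scored = [(vid, cosine_similarity(vector, vec)) for vid, vec in candidates.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


toy = ToyIndex()
# The same id can hold a different vector in each namespace
toy.upsert([("0", [1.0, 0.0]), ("1", [0.0, 1.0])], namespace="title")
toy.upsert([("0", [0.6, 0.8])], namespace="content")

title_hits = toy.query([1.0, 0.1], namespace="title", top_k=1)
content_hits = toy.query([1.0, 0.1], namespace="content", top_k=1)
print(title_hits[0][0], content_hits[0][0])
```

The point of the sketch is that a query only ever sees the vectors of the namespace it names, which is what lets a single index serve both title and content search.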

In [12]:
class BatchGenerator:
    """ Models a simple batch generator that makes chunks out of an input DataFrame. """
    
    def __init__(self, batch_size: int = 10) -> None:
        self.batch_size = batch_size
    
    def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:
        """ Makes chunks out of an input DataFrame. """
        splits = self.splits_num(df.shape[0])
        if splits <= 1:
            yield df
        else:
            for chunk in np.array_split(df, splits):
                yield chunk
    
    def splits_num(self, elements: int) -> int:
        """ Determines how many chunks the DataFrame contains. """
        # Ceiling division, so that no chunk exceeds batch_size rows
        return (elements + self.batch_size - 1) // self.batch_size
    
    __call__ = to_batches

df_batcher = BatchGenerator(300)
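
How the number of splits is chosen determines the chunk sizes. The sketch below reproduces, in plain Python, the sizes that `np.array_split` yields (chunks as equal as possible, with the leading ones one row larger) and compares rounding the ratio with taking its ceiling. The helper names `chunk_sizes` and `ceil_div` are illustrative, and the 400-row frame is a hypothetical case chosen because the two approaches disagree on it.

```python
from typing import List


def chunk_sizes(num_rows: int, num_chunks: int) -> List[int]:
    """Sizes of the chunks `np.array_split` produces: as equal as possible,
    with the first `num_rows % num_chunks` chunks one row larger."""
    base, extra = divmod(num_rows, num_chunks)
    return [base + 1 if i < extra else base for i in range(num_chunks)]


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return (a + b - 1) // b


batch_size = 300

# The 50,000-row DataFrame from this notebook
sizes = chunk_sizes(50_000, ceil_div(50_000, batch_size))
print(len(sizes), max(sizes), min(sizes))

# A smaller hypothetical frame where rounding and ceiling disagree
rounded = chunk_sizes(400, max(round(400 / batch_size), 1))
ceiled = chunk_sizes(400, ceil_div(400, batch_size))
print(rounded, ceiled)
```

For the 50,000 rows used here both approaches give 167 chunks of 299 or 300 rows. For the 400-row case, rounding produces a single chunk of 400 rows, which exceeds the batch size, while ceiling division produces two chunks of 200.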

In [14]:
# Pick a name for the new index
index_name = 'wikipedia-articles'

In [15]:
# Check whether the index with the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)



In [16]:
pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))
index = pinecone.Index(index_name=index_name)



In [17]:
# Confirm our index was created
pinecone.list_indexes()



['wikipedia-articles']

In [18]:
# Upsert content vectors in content namespace
print("Uploading vectors to content namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')

Uploading vectors to content namespace..


In [19]:
# Upsert title vectors in title namespace
print("Uploading vectors to title namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')

Uploading vectors to title namespace..


In [20]:
# Check index size for each namespace
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.2,
 'namespaces': {'content': {'vector_count': 50000},
                'title': {'vector_count': 50000}},
 'total_vector_count': 100000}

### Search data

Now we'll enter some dummy searches and check we get decent results back

In [21]:
# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results
titles_mapped = dict(zip(article_df.vector_id,article_df.title))
content_mapped = dict(zip(article_df.vector_id,article_df.text))

In [22]:
def query_article(query, namespace, top_k=5):
    '''Embeds the query text, searches the specified Pinecone
    namespace and prints the top_k most similar articles.'''

    # Embed the query text with the same model used for the corpus
    embedded_query = openai.Embedding.create(
        input=query,
        model=MODEL,
    )["data"][0]['embedding']

    # Search the namespace passed as a parameter using the query vector
    query_result = index.query(embedded_query,
                               namespace=namespace,
                               top_k=top_k)

    # Print query results
    print(f'\nMost similar results querying {query} in "{namespace}" namespace:\n')
    if not query_result.matches:
        print('no query result')

    # Map the matched vector IDs back to article titles and text
    matches = query_result.matches
    ids = [res.id for res in matches]
    scores = [res.score for res in matches]
    df = pd.DataFrame({'id': ids,
                       'score': scores,
                       'title': [titles_mapped[_id] for _id in ids],
                       'content': [content_mapped[_id] for _id in ids],
                       })

    for counter, (_, row) in enumerate(df.iterrows(), start=1):
        print(f'Result {counter} with a score of {row.score} is {row.title}')

    print('\n')

    return df

In [49]:
query_output = query_article('modern art in Europe','title')
#query_output


Most similar results querying modern art in Europe in "title" namespace:

Result 1 with a score of 0.891034067 is Early modern Europe
Result 2 with a score of 0.87504226 is Museum of Modern Art
Result 3 with a score of 0.867497 is Western Europe
Result 4 with a score of 0.864146471 is Renaissance art
Result 5 with a score of 0.860363305 is Pop art






In [50]:
content_query_output = query_article("Famous battles in Scottish history",'content')
#content_query_output


Most similar results querying Famous battles in Scottish history in "content" namespace:

Result 1 with a score of 0.869324744 is Battle of Bannockburn
Result 2 with a score of 0.861479 is Wars of Scottish Independence
Result 3 with a score of 0.852555931 is 1651
Result 4 with a score of 0.84969604 is First War of Scottish Independence
Result 5 with a score of 0.846192539 is Robert I of Scotland




## Weaviate

The other vector database option we'll explore here is **Weaviate**, which offers both a managed, SaaS option like Pinecone, as well as a self-hosted option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.

For this we will:
- Set up a local deployment of Weaviate
- Create indices in Weaviate
- Store our data there
- Fire some similarity search queries
- Try a real use case

### Setup

To get Weaviate running locally we used Docker and followed the instructions contained in this article: https://weaviate.io/developers/weaviate/current/installation/docker-compose.html

For an example docker-compose.yaml file please refer to `./weaviate/docker-compose.yaml` in this repo.

You can start Weaviate up locally by navigating to this directory and running `docker-compose up -d`.

In [29]:
client = weaviate.Client("http://localhost:8080/")

In [30]:
client.schema.delete_all()
client.schema.get()

{'classes': []}

In [31]:
client.is_ready()

True

### Index data

In Weaviate you create __schemas__ to capture each of the entities you will be searching. 

In this case we'll create a schema called **Article** with the **title** vector from above included for us to search by.

The next few steps closely follow the documents Weaviate provides [here](https://weaviate.io/developers/weaviate/current/tutorials/how-to-use-weaviate-without-modules.htm)

In [32]:
class_obj = {
    "class": "Article",
    "vectorizer": "none", # explicitly tell Weaviate not to vectorize anything, we are providing the OpenAI embedding vectors ourselves
    "properties": [{
        "name": "title",
        "description": "Title of the article",
        "dataType": ["text"]
    },
        {
        "name": "content",
        "description": "Contents of the article",
        "dataType": ["text"]
    }]
}

In [33]:
client.schema.create_class(class_obj)

In [34]:
client.schema.get()

{'classes': [{'class': 'Article',
   'invertedIndexConfig': {'bm25': {'b': 0.75, 'k1': 1.2},
    'cleanupIntervalSeconds': 60,
    'stopwords': {'additions': None, 'preset': 'en', 'removals': None}},
   'properties': [{'dataType': ['text'],
     'description': 'Title of the article',
     'name': 'title',
     'tokenization': 'word'},
    {'dataType': ['text'],
     'description': 'Contents of the article',
     'name': 'content',
     'tokenization': 'word'}],
   'shardingConfig': {'virtualPerPhysical': 128,
    'desiredCount': 1,
    'actualCount': 1,
    'desiredVirtualCount': 128,
    'actualVirtualCount': 128,
    'key': '_id',
    'strategy': 'hash',
    'function': 'murmur3'},
   'vectorIndexConfig': {'skip': False,
    'cleanupIntervalSeconds': 300,
    'maxConnections': 64,
    'efConstruction': 128,
    'ef': -1,
    'dynamicEfMin': 100,
    'dynamicEfMax': 500,
    'dynamicEfFactor': 8,
    'vectorCacheMaxObjects': 2000000,
    'flatSearchCutoff': 40000,
    'distance': 'cos

In [35]:
client.batch.configure(
  # `batch_size` takes an `int` value to enable auto-batching
  # (`None` is used for manual batching)
  batch_size=100, 
  # dynamically update the `batch_size` based on import speed
  dynamic=False,
  # `timeout_retries` takes an `int` value to retry on time outs
  timeout_retries=3,
  # checks for batch-item creation errors
  # this is the default in weaviate-client >= 3.6.0
  callback=weaviate.util.check_batch_result,
)
#result = client.batch.create_objects(batch)

<weaviate.batch.crud_batch.Batch at 0x16ad2fe20>

In [37]:
# Make a list of tuples
data_objects = []
for k,v in article_df.iterrows():
    data_objects.append((v['title'],v['text'],v['title_vector'],v['vector_id']))

In [38]:
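# Template function for setting up a parallel upload process.
# NOTE: this helper is not used anywhere in this notebook, and `call_asr` is not
# defined here, so calling it as-is would raise a NameError.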
def transcription_extractor(audio_filepath):
    response = call_asr(openai.api_key,audio_filepath)
    return(response)

In [39]:
# Upsert into article schema
print("Uploading vectors to article schema..")
uuids = []
for articles in data_objects:
    uuid = client.data_object.create(
                              {
                                  "title": articles[0],
                                  "content": articles[1]
                              },
                              "Article",
                              vector=articles[2]
                            )
    uuids.append(uuid)

Uploading vectors to article schema..


In [48]:
client.data_object.get()['objects'][0]['properties']

{'content': 'Sociedade Esportiva Palmeiras, usually called Palmeiras, is a Brazilian football team. They are from São Paulo, Brazil. The team was founded  by an Italian-speaking community on August 26, 1914, as Palestra Itália. They changed to the name used now on September 14, 1942.\n\nThey play in green shirts, white shorts and green socks and are one of the most popular and traditional Brazilian clubs.\n\nPalmeiras plays at the Palestra Itália stadium, which has seats for 32,000. But in the past, local derbies against São Paulo or Corinthians were usually played in Morumbi stadium. However, the Arena Palestra Itália is under construction with capacity for 45,000 people, expected to be finalized in 2013.\n\nName \n 1914–1942 S.S. Palestra Italia\n 1942–present S.E. Palmeiras\n\nMain titles \n Copa Rio: 1951\n Libertadores Cup: 1999 and 2020\n Copa Mercosul: 1998\n Campeonato Brasileiro: 1960, 1967, 1967, 1969, 1972, 1973, 1993, 1994, 2016 and 2018 – greatest champion\n Copa do Brasil

### Search Data

In [41]:
def query_weaviate(query, schema, top_k=20):
    '''Embeds the query text and returns the top_k most similar
    objects from the specified Weaviate class.'''

    # Embed the query text with the same model used for the corpus
    embedded_query = openai.Embedding.create(
        input=query,
        model=MODEL,
    )["data"][0]['embedding']

    near_vector = {"vector": embedded_query}

    # Run a nearest-neighbour search against the class passed as a parameter
    query_result = client.query.get(schema, ["title", "content", "_additional {certainty}"]) \
        .with_near_vector(near_vector) \
        .with_limit(top_k) \
        .do()

    # Return the raw response; the calling cell prints the results
    return query_result

In [42]:
query_result = query_weaviate('modern art in Europe','Article')
counter = 0
for article in query_result['data']['Get']['Article']:
    counter += 1
    print(f"{counter}. Title: {article['title']} Certainty: {article['_additional']['certainty']}")

1. Title: Early modern Europe Certainty: 0.9454971551895142
2. Title: Museum of Modern Art Certainty: 0.9375567138195038
3. Title: Western Europe Certainty: 0.9336977899074554
4. Title: Renaissance art Certainty: 0.9321110248565674
5. Title: Pop art Certainty: 0.9302356243133545
6. Title: Art exhibition Certainty: 0.9281864166259766
7. Title: History of Europe Certainty: 0.9278470575809479
8. Title: Northern Europe Certainty: 0.9273118078708649
9. Title: Concert of Europe Certainty: 0.9268475472927094
10. Title: Hellenistic art Certainty: 0.9264660775661469
11. Title: Piet Mondrian Certainty: 0.9235712587833405
12. Title: Modernist literature Certainty: 0.9235587120056152
13. Title: European Capital of Culture Certainty: 0.9228664338588715
14. Title: Art film Certainty: 0.9217151403427124
15. Title: Europa Certainty: 0.9216068089008331
16. Title: Art rock Certainty: 0.9212885200977325
17. Title: Central Europe Certainty: 0.9212862849235535
18. Title: Art Certainty: 0.9208334386348724
1

In [44]:
query_result = query_weaviate('Famous battles in Scottish history','Article')
counter = 0
for article in query_result['data']['Get']['Article']:
    counter += 1
    print(f"{counter}. Title: {article['title']} Certainty: {article['_additional']['certainty']}")

1. Title: Historic Scotland Certainty: 0.9464837908744812
2. Title: First War of Scottish Independence Certainty: 0.9461104869842529
3. Title: Battle of Bannockburn Certainty: 0.9455609619617462
4. Title: Wars of Scottish Independence Certainty: 0.944368839263916
5. Title: Second War of Scottish Independence Certainty: 0.9395008385181427
6. Title: List of Scottish monarchs Certainty: 0.9366503059864044
7. Title: Kingdom of Scotland Certainty: 0.935274213552475
8. Title: Scottish Borders Certainty: 0.9317866265773773
9. Title: List of rivers of Scotland Certainty: 0.9296278059482574
10. Title: Braveheart Certainty: 0.9294214248657227
11. Title: John of Scotland Certainty: 0.9292325675487518
12. Title: Duncan II of Scotland Certainty: 0.9291643798351288
13. Title: Bannockburn Certainty: 0.9291241466999054
14. Title: The Scotsman Certainty: 0.9280610680580139
15. Title: Flag of Scotland Certainty: 0.9270428121089935
16. Title: Banff and Macduff Certainty: 0.9267247915267944
17. Title: Gua
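
The certainty values above are not on the same scale as the Pinecone scores from earlier. Assuming both indexes use the cosine metric (this notebook never sets a metric explicitly, so this is an assumption), Weaviate's certainty can be read as the cosine similarity rescaled from [-1, 1] to [0, 1], that is, certainty = (1 + cosine similarity) / 2. The sketch below applies that conversion to the Pinecone score reported earlier for "Early modern Europe"; the function name is illustrative.

```python
def cosine_to_certainty(cosine_similarity: float) -> float:
    """Map a cosine similarity in [-1, 1] to a certainty in [0, 1]."""
    return (1.0 + cosine_similarity) / 2.0


# Pinecone score reported earlier for "Early modern Europe" (title namespace)
pinecone_score = 0.891034067
certainty = cosine_to_certainty(pinecone_score)
print(round(certainty, 4))
```

Rounded to four decimal places this gives 0.9455, which lines up with the certainty Weaviate reports for the same title, so the two databases appear to rank this result on equivalent similarity values.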

Thanks for following along. You're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases, please continue to work through the other cookbook examples.