Compare commits

...

1 Commit

Author              SHA1          Message                  Date
Shyamal H Anadkat   5b6577845b    remove comments - temp   2022-10-17 11:42:05 -07:00


@@ -2,7 +2,11 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Recommendation using embeddings and nearest neighbor search\n",
"\n",
@@ -19,17 +23,23 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 1. Imports\n",
"\n",
"First, let's import the packages and functions we'll need for later. If you don't have these, you'll need to install them. You can install them via your terminal by running `pip install {package_name}`, e.g. `pip install pandas`."
"### 1. Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# imports\n",
@@ -49,7 +59,11 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 2. Load data\n",
"\n",
@@ -59,7 +73,11 @@
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
@@ -161,7 +179,11 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Let's take a look at those same examples, but not truncated by ellipses."
]
@@ -169,7 +191,11 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
@@ -209,29 +235,30 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 3. Build cache to save embeddings\n",
"\n",
"Before getting embeddings for these articles, let's set up a cache to save the embeddings we generate. In general, it's a good idea to save your embeddings so you can re-use them later. If you don't save them, you'll pay again each time you compute them again.\n",
"\n",
"To save you the expense of computing the embeddings needed for this demo, we've provided a pre-filled cache via the URL below. The cache is a dictionary that maps tuples of `(text, engine)` to a `list of floats` embedding. The cache is saved as a Python pickle file."
"### 3. ???"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# establish a cache of embeddings to avoid recomputing\n",
"# cache is a dict of tuples (text, engine) -> embedding, saved as a pickle file\n",
"\n",
"# set path to embedding cache\n",
"embedding_cache_path_to_load = \"https://cdn.openai.com/API/examples/data/example_embeddings_cache.pkl\"\n",
"embedding_cache_path_to_save = \"example_embeddings_cache.pkl\"\n",
"\n",
"# load the cache if it exists, and save a copy to disk\n",
"# load the cache if it exists\n",
"try:\n",
" embedding_cache = pd.read_pickle(embedding_cache_path_to_load)\n",
"except FileNotFoundError:\n",
@@ -239,7 +266,6 @@
"with open(embedding_cache_path_to_save, \"wb\") as embedding_cache_file:\n",
" pickle.dump(embedding_cache, embedding_cache_file)\n",
"\n",
"# define a function to retrieve embeddings from the cache if present, and otherwise request via the API\n",
"def embedding_from_string(\n",
" string: str,\n",
" engine: str = \"text-similarity-babbage-001\",\n",
@@ -256,7 +282,11 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Let's check that it works by getting an embedding."
]
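
A usage sketch for the check described above (the dataframe `df` and its `description` column are assumed from the load-data step):

example_string = df["description"].values[0]  # assumed column name
example_embedding = embedding_from_string(example_string)
print(f"Example embedding: {example_embedding[:10]}...")
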
@@ -264,7 +294,11 @@
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
@@ -289,20 +323,23 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 4. Recommend similar articles based on embeddings\n",
"\n",
"To find similar articles, let's follow a three-step plan:\n",
"1. Get the similarity embeddings of all the article descriptions\n",
"2. Calculate the distance between a source title and all other articles\n",
"3. Print out the other articles closest to the source title"
"### 4. Recommend "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"def print_recommendations_from_strings(\n",
@@ -348,7 +385,11 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 5. Example recommendations\n",
"\n",
@@ -358,7 +399,11 @@
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
@@ -400,14 +445,22 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Pretty good! All 5 of the recommendations look similar to the original article about Tony Blair. Interestingly, note that #4 doesn't mention the words Tony Blair, but is nonetheless recommended by the model, presumably because the model understands that Tony Blair tends to be related to President Bush or European pacts over Iran's nuclear program. This illustrates the potential power of using embeddings rather than basic string matching; our models understand what topics are related to one another, even when their words don't overlap."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Let's see how our recommender does on the second example article about NVIDIA's new chipset with more security."
]
@@ -415,7 +468,11 @@
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
@@ -455,14 +512,22 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"From the printed distances, you can see that the #1 recommendation is much closer than all the others (0.108 vs 0.160+). And the #1 recommendation looks very similar to the starting article - it's another article from PC World about increasing computer security. Pretty good! "
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Appendix: Using embeddings in more sophisticated recommenders\n",
"\n",
@@ -471,14 +536,22 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Appendix: Using embeddings to visualize similar articles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"To get a sense of what our nearest neighbor recommender is doing, let's visualize the article embeddings. Although we can't plot the 2048 dimensions of each embedding vector, we can use techniques like [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) or [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) to compress the embeddings down into 2 or 3 dimensions, which we can chart.\n",
"\n",
@@ -488,7 +561,11 @@
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stderr",
@@ -11466,7 +11543,11 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"As you can see in the chart above, even the highly compressed embeddings do a good job of clustering article descriptions by category. And it's worth emphasizing: this clustering is done with no knowledge of the labels themselves!\n",
"\n",
@@ -11475,7 +11556,11 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Next, let's recolor the points by whether they are a source article, its nearest neighbors, or other."
]
@@ -11483,7 +11568,11 @@
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# create labels for the recommended articles\n",
@@ -11509,7 +11598,11 @@
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
@@ -22453,7 +22546,11 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Looking at the 2D chart above, we can see that the articles about Tony Blair are somewhat close together inside of the World news cluster. Interestingly, although the 5 nearest neighbors (red) were closest in high dimensional space, they are not the closest points in this compressed 2D space. Compressing the embeddings from 2048 dimensions to 2 dimensions discards much of their information, and the nearest neighbors in the 2D space don't seem to be as relevant as those in the full embedding space."
]
@@ -22461,7 +22558,11 @@
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
@@ -33405,14 +33506,22 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"For the chipset security example, the 4 closest nearest neighbors in the full embedding space remain nearest neighbors in this compressed 2D visualization. The fifth is displayed as more distant, despite being closer in the full embedding space."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Should you want to, you can also make an interactive 3D plot of the embeddings with the function `chart_from_components_3D`. (Doing so will require recomputing the t-SNE components with `n_components=3`.)"
]
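
A sketch of the recomputation mentioned above (`chart_from_components_3D` is the notebook's helper and is not reproduced here):

tsne_3d = TSNE(n_components=3, random_state=0)           # three components for a 3D chart
components_3d = tsne_3d.fit_transform(embedding_matrix)  # reuses the matrix from the 2D sketch
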
@@ -33423,7 +33532,7 @@
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
},
"kernelspec": {
"display_name": "Python 3.9.9 64-bit ('openai': virtualenv)",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -33437,9 +33546,8 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
},
"orig_nbformat": 4
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2