Compare commits

shyamal-te...ted/update (35 commits)

Commit SHA1s: fd181ec78f, 7de3d50816, aabbdbe28e, 0009da639d, 5e66437686, 6b6e6323e4, 2072d1a1fd, e811878082, 3c334e70dd, e3395df981, e00797e3e5, 4fd730e78f, 1a8111e0ef, 12ea77eb1b, 1f62a62102, 06ac519c8b, d932a36398, 459afa7d9b, 0d4989245d, fe60d7f2af, c621b46924, 0ad407b75a, 6b536c981a, d968557408, 209c1a12e8, 3ad2df91d8, e383e243c2, 5ce51d7b4d, 75aceae6b8, 0528302f6d, e3d7091d70, 37e0136ce0, 381070fa4e, 401f7c7ef0, b01900d5d9
.gitignore (vendored): 5 changes

@@ -127,3 +127,8 @@ dmypy.json

 # Pyre type checker
 .pyre/
+
+# Data
+*transactions*.jsonl
+/examples/data/transactions*
+*.DS_Store
README.md: 20 changes

@@ -120,7 +120,7 @@ An example of each is shown below.

 ### Instruction prompts

-Instruction-following models (e.g., `text-davinci-002` or any model beginning with `text-`) are specially designed to follow instructions. Write your instruction at the top of the prompt (or at the bottom, or both), and the model will do its best to follow the instruction and then stop. Instructions can be detailed, so don't be afraid to write a paragraph explicitly detailing the output you want.
+Instruction-following models (e.g., `text-davinci-003` or any model beginning with `text-`) are specially designed to follow instructions. Write your instruction at the top of the prompt (or at the bottom, or both), and the model will do its best to follow the instruction and then stop. Instructions can be detailed, so don't be afraid to write a paragraph explicitly detailing the output you want.

 Example instruction prompt:

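A minimal illustrative sketch (not part of this PR) of calling an instruction-following model such as `text-davinci-003` from Python, using the same `openai.Completion.create` pattern that appears in the notebooks in this diff; the prompt text and parameters are assumptions:

```python
import openai  # assumes the OPENAI_API_KEY environment variable is set

# Put the instruction at the top of the prompt, followed by the input text.
prompt = (
    "Summarize the text below as a single sentence.\n\n"
    "Text: OpenAI Cookbook shares example code for accomplishing common tasks with the OpenAI API."
)

response = openai.Completion.create(
    model="text-davinci-003",  # instruction-following model
    prompt=prompt,
    max_tokens=100,
    temperature=0,  # deterministic output suits instructions with one right answer
)
print(response["choices"][0]["text"].strip())
```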
@@ -253,7 +253,7 @@ In general, writing can work with any style of prompt. Experiment to see what wo

 |                                                            | Advantages                                                                     | Disadvantages                                                                      |
 | ---------------------------------------------------------- | ------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------- |
-| Instruction-following models<br>(e.g., `text-davinci-002`) | Easiest to use                                                                 | Less creative; less diverse; harder to control tone, length, etc.                  |
+| Instruction-following models<br>(e.g., `text-davinci-003`) | Easiest to use                                                                 | Less creative; less diverse; harder to control tone, length, etc.                  |
 | Base models<br>(e.g., `davinci`)                            | More creative                                                                  | More expensive (as including examples demonstrations in prompt will cost tokens)   |
 | Fine-tuned models                                           | Can train off of many examples; cheaper than including examples in the prompt  | Hard to gather training data; training makes iteration slower and more expensive   |

@@ -301,7 +301,7 @@ Output:

 One
 ```

-If the text you wish to ask about is longer than the token limit (~4,000 tokens for `text-davinci-002` and ~2,000 tokens for earlier models), we recommend splitting the text into smaller pieces, ranking them by relevance, and then asking the most-relevant-looking pieces.
+If the text you wish to ask about is longer than the token limit (~4,000 tokens for `text-davinci-003` and ~2,000 tokens for earlier models), we recommend splitting the text into smaller pieces, ranking them by relevance, and then asking the most-relevant-looking pieces.

 #### Summarization

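A small sketch (not from the PR) of the splitting step described above. It uses the `transformers` tokenizer that this PR's Obtain_dataset notebook lists among its dependencies; the paragraph-based packing and the 1,500-token budget are illustrative assumptions:

```python
from transformers import GPT2TokenizerFast  # rough proxy for the API's tokenization

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def split_into_chunks(text: str, max_tokens: int = 1500) -> list[str]:
    """Greedily pack paragraphs into chunks that stay under a token budget.

    Illustrative only: a single paragraph longer than the budget still becomes its own chunk.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = (current + "\n\n" + paragraph).strip()
        if current and len(tokenizer.encode(candidate)) > max_tokens:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```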
@@ -446,11 +446,11 @@ Embeddings can be used for search either by themselves or as a feature in a larg

 The simplest way to use embeddings for search is as follows:

 * Before the search (precompute):
-  * Split your text corpus into chunks smaller than the token limit (e.g., ~2,000 tokens)
-  * Embed each chunk using a 'doc' model (e.g., `text-search-curie-doc-001`)
+  * Split your text corpus into chunks smaller than the token limit (e.g., <8,000 tokens)
+  * Embed each chunk
   * Store those embeddings in your own database or in a vector search provider like [Pinecone](https://www.pinecone.io) or [Weaviate](https://weaviate.io)
 * At the time of the search (live compute):
-  * Embed the search query using the correponding 'query' model (e.g. `text-search-curie-query-001`)
+  * Embed the search query
   * Find the closest embeddings in your database
   * Return the top results, ranked by cosine similarity
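A minimal sketch (not part of the PR) of the precompute-then-search flow in the list above, reusing the `openai.Embedding.create` call and the `text-embedding-ada-002` engine that the updated notebooks in this diff use; `chunks_of_text` and the example query are assumed placeholders:

```python
import numpy as np
import openai  # assumes OPENAI_API_KEY is set

def embed(text: str) -> list[float]:
    # Same call pattern as the notebooks in this PR.
    return openai.Embedding.create(input=[text], engine="text-embedding-ada-002")["data"][0]["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Before the search (precompute): embed each chunk of the corpus.
chunks_of_text = ["chunk one ...", "chunk two ..."]  # placeholder corpus
chunk_embeddings = [embed(chunk) for chunk in chunks_of_text]

# At search time (live compute): embed the query and rank chunks by similarity.
query_embedding = embed("example search query")
ranked = sorted(
    zip(chunks_of_text, chunk_embeddings),
    key=lambda pair: cosine_similarity(query_embedding, pair[1]),
    reverse=True,
)
top_results = [chunk for chunk, _ in ranked[:3]]
```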
@@ -460,7 +460,7 @@ In more advanced search systems, the the cosine similarity of embeddings can be

 #### Recommendations

-Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set. And instead of using pairs of doc-query models, you can use a single symmetric similarity model (e.g., `text-similarity-curie-001`).
+Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set.

 An example of how to use embeddings for recommendations is shown in [Recommendation_using_embeddings.ipynb](examples/Recommendation_using_embeddings.ipynb).

@@ -470,7 +470,7 @@ Similar to search, these cosine similarity scores can either be used on their ow

 Although OpenAI's embedding model weights cannot be fine-tuned, you can still use training data to customize embeddings to your application.

-In the following notebook, we provide an example method for customizing your embeddings using training data. The idea of the method is to train a custom matrix to multiply embedding vectors by in order to get new customized embeddings. With good training data, this custom matrix will highlight the features relevant to your training labels and suppress the rest. You can equivalently consider the matrix mulitplication as (a) a modification of the embeddings or (b) a modification of the distance function used to measure the distances between embeddings.
+In the following notebook, we provide an example method for customizing your embeddings using training data. The idea of the method is to train a custom matrix to multiply embedding vectors by in order to get new customized embeddings. With good training data, this custom matrix will highlight the features relevant to your training labels and suppress the rest. You can equivalently consider the matrix multiplication as (a) a modification of the embeddings or (b) a modification of the distance function used to measure the distances between embeddings.

 * [Customizing_embeddings.ipynb](examples/Customizing_embeddings.ipynb)

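A tiny sketch (not from the notebook) of the idea in the paragraph above: the customized embedding is just the original embedding multiplied by a learned matrix, which you can equivalently read as a change of distance function. The matrix here is random purely for illustration; in practice it would be trained on labeled data:

```python
import numpy as np

embedding_dim = 1536  # e.g., the text-embedding-ada-002 dimension used elsewhere in this PR
W = np.random.randn(embedding_dim, embedding_dim) * 0.01  # stand-in for a trained custom matrix

def customize(embedding: np.ndarray) -> np.ndarray:
    # (a) Modification of the embeddings: multiply by the custom matrix.
    return embedding @ W

a = np.random.randn(embedding_dim)  # placeholder embeddings
b = np.random.randn(embedding_dim)

# (b) Equivalently, a modified distance function between the original embeddings.
custom_distance = np.linalg.norm(customize(a) - customize(b))
```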
@@ -486,7 +486,7 @@ Codex powers [more than 70 products][Codex Apps Blog Post], including:

 * [Warp](https://www.warp.dev/) (a smart terminal with AI command search)
 * [Machinet](https://machinet.net/) (writes Java unit test templates)

-Note that unlike instruction-following text models (e.g., `text-davinci-002`), Codex is *not* trained to follow instructions. As a result, designing good prompts can take more care.
+Note that unlike instruction-following text models (e.g., `text-davinci-003`), Codex is *not* trained to follow instructions. As a result, designing good prompts can take more care.

 ### 1. Write code

@@ -523,7 +523,7 @@ Code explanation can be applied to many use cases:

 * Generating in-code documentation (e.g., Python docstrings, git commit messages)
 * Generating out-of-code documentation (e.g., man pages)
 * In an interactive code exploration tool
-* Communicating program results back to users via a natural langauge interface
+* Communicating program results back to users via a natural language interface

 An example prompt for explaining code with `code-davinci-002`:

File diff suppressed because one or more lines are too long

examples/Clustering_for_transaction_classification.ipynb: 289 lines (new file)
File diff suppressed because one or more lines are too long
@@ -1,12 +1,13 @@
 {
 "cells": [
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "## Code search\n",
 "\n",
-"We index our own openai-python code repository, and show how it can be searched. We implement a simple version of file parsing and extracting of functions from python files."
+"We index our own [openai-python code repository](https://github.com/openai/openai-python), and show how it can be searched. We implement a simple version of file parsing and extracting of functions from python files."
 ]
 },
 {
@@ -18,8 +19,8 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"Total number of py files: 40\n",
-"Total number of functions extracted: 64\n"
+"Total number of py files: 51\n",
+"Total number of functions extracted: 97\n"
 ]
 }
 ],
@@ -63,18 +64,24 @@
 "\n",
 "# get user root directory\n",
 "root_dir = os.path.expanduser(\"~\")\n",
+"# note: for this code to work, the openai-python repo must be downloaded and placed in your root directory\n",
 "\n",
 "# path to code repository directory\n",
 "code_root = root_dir + \"/openai-python\"\n",
 "\n",
 "code_files = [y for x in os.walk(code_root) for y in glob(os.path.join(x[0], '*.py'))]\n",
 "print(\"Total number of py files:\", len(code_files))\n",
 "\n",
+"if len(code_files) == 0:\n",
+"    print(\"Double check that you have downloaded the openai-python repo and set the code_root variable correctly.\")\n",
+"\n",
 "all_funcs = []\n",
 "for code_file in code_files:\n",
 "    funcs = list(get_functions(code_file))\n",
 "    for func in funcs:\n",
 "        all_funcs.append(func)\n",
 "\n",
-"print(\"Total number of functions extracted:\", len(all_funcs))\n"
+"print(\"Total number of functions extracted:\", len(all_funcs))"
 ]
 },
 {
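The cell above relies on a `get_functions` helper whose definition sits outside this diff. As a rough sketch (an assumption, not the notebook's actual implementation), it could extract top-level function definitions from a Python file like this:

```python
def get_functions(filepath: str):
    """Yield dicts with the code, name, and filepath of each top-level function in a .py file.

    Hypothetical sketch; the real helper in the notebook may differ.
    """
    with open(filepath, "r") as f:
        lines = f.read().split("\n")
    for i, line in enumerate(lines):
        if line.startswith("def "):
            # Collect the function body: following lines until the next top-level statement.
            body = [line]
            for next_line in lines[i + 1:]:
                if next_line.startswith((" ", "\t")) or next_line == "":
                    body.append(next_line)
                else:
                    break
            yield {
                "code": "\n".join(body).rstrip(),
                "function_name": line[len("def "):].split("(")[0],
                "filepath": filepath,
            }
```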
@ -119,64 +126,57 @@
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>def semantic_search(engine, query, documents):...</td>\n",
|
||||
" <td>semantic_search</td>\n",
|
||||
" <td>/examples/semanticsearch/semanticsearch.py</td>\n",
|
||||
" <td>[-0.038976121693849564, -0.0031428150832653046...</td>\n",
|
||||
" <td>def _console_log_level():\\n if openai.log i...</td>\n",
|
||||
" <td>_console_log_level</td>\n",
|
||||
" <td>/openai/util.py</td>\n",
|
||||
" <td>[0.03389773145318031, -0.004390408284962177, 0...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>def main():\\n parser = argparse.ArgumentPar...</td>\n",
|
||||
" <td>main</td>\n",
|
||||
" <td>/examples/semanticsearch/semanticsearch.py</td>\n",
|
||||
" <td>[-0.024289356544613838, -0.017748363316059113,...</td>\n",
|
||||
" <td>def log_debug(message, **params):\\n msg = l...</td>\n",
|
||||
" <td>log_debug</td>\n",
|
||||
" <td>/openai/util.py</td>\n",
|
||||
" <td>[-0.004034275189042091, 0.004895383026450872, ...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>def get_candidates(\\n prompt: str,\\n sto...</td>\n",
|
||||
" <td>get_candidates</td>\n",
|
||||
" <td>/examples/codex/backtranslation.py</td>\n",
|
||||
" <td>[-0.04161201789975166, -0.0169310811907053, 0....</td>\n",
|
||||
" <td>def log_info(message, **params):\\n msg = lo...</td>\n",
|
||||
" <td>log_info</td>\n",
|
||||
" <td>/openai/util.py</td>\n",
|
||||
" <td>[0.004882764536887407, 0.0033515947870910168, ...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>def rindex(lst: List, value: str) -> int:\\n ...</td>\n",
|
||||
" <td>rindex</td>\n",
|
||||
" <td>/examples/codex/backtranslation.py</td>\n",
|
||||
" <td>[-0.027255680412054062, -0.007931121625006199,...</td>\n",
|
||||
" <td>def log_warn(message, **params):\\n msg = lo...</td>\n",
|
||||
" <td>log_warn</td>\n",
|
||||
" <td>/openai/util.py</td>\n",
|
||||
" <td>[0.002535992069169879, -0.010829543694853783, ...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>def eval_candidate(\\n candidate_answer: str...</td>\n",
|
||||
" <td>eval_candidate</td>\n",
|
||||
" <td>/examples/codex/backtranslation.py</td>\n",
|
||||
" <td>[-0.00999179296195507, -0.01640152558684349, 0...</td>\n",
|
||||
" <td>def logfmt(props):\\n def fmt(key, val):\\n ...</td>\n",
|
||||
" <td>logfmt</td>\n",
|
||||
" <td>/openai/util.py</td>\n",
|
||||
" <td>[0.016732551157474518, 0.017367802560329437, 0...</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" code function_name \\\n",
|
||||
"0 def semantic_search(engine, query, documents):... semantic_search \n",
|
||||
"1 def main():\\n parser = argparse.ArgumentPar... main \n",
|
||||
"2 def get_candidates(\\n prompt: str,\\n sto... get_candidates \n",
|
||||
"3 def rindex(lst: List, value: str) -> int:\\n ... rindex \n",
|
||||
"4 def eval_candidate(\\n candidate_answer: str... eval_candidate \n",
|
||||
" code function_name \\\n",
|
||||
"0 def _console_log_level():\\n if openai.log i... _console_log_level \n",
|
||||
"1 def log_debug(message, **params):\\n msg = l... log_debug \n",
|
||||
"2 def log_info(message, **params):\\n msg = lo... log_info \n",
|
||||
"3 def log_warn(message, **params):\\n msg = lo... log_warn \n",
|
||||
"4 def logfmt(props):\\n def fmt(key, val):\\n ... logfmt \n",
|
||||
"\n",
|
||||
" filepath \\\n",
|
||||
"0 /examples/semanticsearch/semanticsearch.py \n",
|
||||
"1 /examples/semanticsearch/semanticsearch.py \n",
|
||||
"2 /examples/codex/backtranslation.py \n",
|
||||
"3 /examples/codex/backtranslation.py \n",
|
||||
"4 /examples/codex/backtranslation.py \n",
|
||||
"\n",
|
||||
" code_embedding \n",
|
||||
"0 [-0.038976121693849564, -0.0031428150832653046... \n",
|
||||
"1 [-0.024289356544613838, -0.017748363316059113,... \n",
|
||||
"2 [-0.04161201789975166, -0.0169310811907053, 0.... \n",
|
||||
"3 [-0.027255680412054062, -0.007931121625006199,... \n",
|
||||
"4 [-0.00999179296195507, -0.01640152558684349, 0... "
|
||||
" filepath code_embedding \n",
|
||||
"0 /openai/util.py [0.03389773145318031, -0.004390408284962177, 0... \n",
|
||||
"1 /openai/util.py [-0.004034275189042091, 0.004895383026450872, ... \n",
|
||||
"2 /openai/util.py [0.004882764536887407, 0.0033515947870910168, ... \n",
|
||||
"3 /openai/util.py [0.002535992069169879, -0.010829543694853783, ... \n",
|
||||
"4 /openai/util.py [0.016732551157474518, 0.017367802560329437, 0... "
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
@@ -188,12 +188,109 @@
 "from openai.embeddings_utils import get_embedding\n",
 "\n",
 "df = pd.DataFrame(all_funcs)\n",
-"df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, engine='code-search-babbage-code-001'))\n",
+"df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))\n",
 "df['filepath'] = df['filepath'].apply(lambda x: x.replace(code_root, \"\"))\n",
-"df.to_csv(\"output/code_search_openai-python.csv\", index=False)\n",
+"df.to_csv(\"data/code_search_openai-python.csv\", index=False)\n",
 "df.head()"
 ]
 },
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/openai/tests/test_endpoints.py:test_completions score=0.826\n",
|
||||
"def test_completions():\n",
|
||||
" result = openai.Completion.create(prompt=\"This was a test\", n=5, engine=\"ada\")\n",
|
||||
" assert len(result.choices) == 5\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"----------------------------------------------------------------------\n",
|
||||
"/openai/tests/test_endpoints.py:test_completions_model score=0.811\n",
|
||||
"def test_completions_model():\n",
|
||||
" result = openai.Completion.create(prompt=\"This was a test\", n=5, model=\"ada\")\n",
|
||||
" assert len(result.choices) == 5\n",
|
||||
" assert result.model.startswith(\"ada\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"----------------------------------------------------------------------\n",
|
||||
"/openai/tests/test_endpoints.py:test_completions_multiple_prompts score=0.808\n",
|
||||
"def test_completions_multiple_prompts():\n",
|
||||
" result = openai.Completion.create(\n",
|
||||
" prompt=[\"This was a test\", \"This was another test\"], n=5, engine=\"ada\"\n",
|
||||
" )\n",
|
||||
" assert len(result.choices) == 10\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"----------------------------------------------------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from openai.embeddings_utils import cosine_similarity\n",
|
||||
"\n",
|
||||
"def search_functions(df, code_query, n=3, pprint=True, n_lines=7):\n",
|
||||
" embedding = get_embedding(code_query, engine='text-embedding-ada-002')\n",
|
||||
" df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))\n",
|
||||
"\n",
|
||||
" res = df.sort_values('similarities', ascending=False).head(n)\n",
|
||||
" if pprint:\n",
|
||||
" for r in res.iterrows():\n",
|
||||
" print(r[1].filepath+\":\"+r[1].function_name + \" score=\" + str(round(r[1].similarities, 3)))\n",
|
||||
" print(\"\\n\".join(r[1].code.split(\"\\n\")[:n_lines]))\n",
|
||||
" print('-'*70)\n",
|
||||
" return res\n",
|
||||
"\n",
|
||||
"res = search_functions(df, 'Completions API tests', n=3)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/openai/validators.py:format_inferrer_validator score=0.751\n",
|
||||
"def format_inferrer_validator(df):\n",
|
||||
" \"\"\"\n",
|
||||
" This validator will infer the likely fine-tuning format of the data, and display it to the user if it is classification.\n",
|
||||
" It will also suggest to use ada and explain train/validation split benefits.\n",
|
||||
" \"\"\"\n",
|
||||
" ft_type = infer_task_type(df)\n",
|
||||
" immediate_msg = None\n",
|
||||
"----------------------------------------------------------------------\n",
|
||||
"/openai/validators.py:get_validators score=0.748\n",
|
||||
"def get_validators():\n",
|
||||
" return [\n",
|
||||
" num_examples_validator,\n",
|
||||
" lambda x: necessary_column_validator(x, \"prompt\"),\n",
|
||||
" lambda x: necessary_column_validator(x, \"completion\"),\n",
|
||||
" additional_column_validator,\n",
|
||||
" non_empty_field_validator,\n",
|
||||
"----------------------------------------------------------------------\n",
|
||||
"/openai/validators.py:infer_task_type score=0.738\n",
|
||||
"def infer_task_type(df):\n",
|
||||
" \"\"\"\n",
|
||||
" Infer the likely fine-tuning task type from the data\n",
|
||||
" \"\"\"\n",
|
||||
" CLASSIFICATION_THRESHOLD = 3 # min_average instances of each class\n",
|
||||
" if sum(df.prompt.str.len()) == 0:\n",
|
||||
" return \"open-ended generation\"\n",
|
||||
"----------------------------------------------------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"res = search_functions(df, 'fine-tuning input data validation logic', n=3)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
@ -203,48 +300,35 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/openai/tests/test_endpoints.py:test_completions_multiple_prompts score=0.681\n",
|
||||
"def test_completions_multiple_prompts():\n",
|
||||
" result = openai.Completion.create(\n",
|
||||
" prompt=[\"This was a test\", \"This was another test\"], n=5, engine=\"ada\"\n",
|
||||
" )\n",
|
||||
" assert len(result.choices) == 10\n",
|
||||
"\n",
|
||||
"/openai/validators.py:get_common_xfix score=0.793\n",
|
||||
"def get_common_xfix(series, xfix=\"suffix\"):\n",
|
||||
" \"\"\"\n",
|
||||
" Finds the longest common suffix or prefix of all the values in a series\n",
|
||||
" \"\"\"\n",
|
||||
" common_xfix = \"\"\n",
|
||||
" while True:\n",
|
||||
" common_xfixes = (\n",
|
||||
" series.str[-(len(common_xfix) + 1) :]\n",
|
||||
" if xfix == \"suffix\"\n",
|
||||
" else series.str[: len(common_xfix) + 1]\n",
|
||||
"----------------------------------------------------------------------\n",
|
||||
"/openai/tests/test_endpoints.py:test_completions score=0.675\n",
|
||||
"def test_completions():\n",
|
||||
" result = openai.Completion.create(prompt=\"This was a test\", n=5, engine=\"ada\")\n",
|
||||
" assert len(result.choices) == 5\n",
|
||||
"/openai/validators.py:common_completion_suffix_validator score=0.778\n",
|
||||
"def common_completion_suffix_validator(df):\n",
|
||||
" \"\"\"\n",
|
||||
" This validator will suggest to add a common suffix to the completion if one doesn't already exist in case of classification or conditional generation.\n",
|
||||
" \"\"\"\n",
|
||||
" error_msg = None\n",
|
||||
" immediate_msg = None\n",
|
||||
" optional_msg = None\n",
|
||||
" optional_fn = None\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"----------------------------------------------------------------------\n",
|
||||
"/openai/tests/test_api_requestor.py:test_requestor_sets_request_id score=0.635\n",
|
||||
"def test_requestor_sets_request_id(mocker: MockerFixture) -> None:\n",
|
||||
" # Fake out 'requests' and confirm that the X-Request-Id header is set.\n",
|
||||
"\n",
|
||||
" got_headers = {}\n",
|
||||
"\n",
|
||||
" def fake_request(self, *args, **kwargs):\n",
|
||||
" nonlocal got_headers\n",
|
||||
" ft_type = infer_task_type(df)\n",
|
||||
"----------------------------------------------------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from openai.embeddings_utils import cosine_similarity\n",
|
||||
"\n",
|
||||
"def search_functions(df, code_query, n=3, pprint=True, n_lines=7):\n",
|
||||
" embedding = get_embedding(code_query, engine='code-search-babbage-text-001')\n",
|
||||
" df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))\n",
|
||||
"\n",
|
||||
" res = df.sort_values('similarities', ascending=False).head(n)\n",
|
||||
" if pprint:\n",
|
||||
" for r in res.iterrows():\n",
|
||||
" print(r[1].filepath+\":\"+r[1].function_name + \" score=\" + str(round(r[1].similarities, 3)))\n",
|
||||
" print(\"\\n\".join(r[1].code.split(\"\\n\")[:n_lines]))\n",
|
||||
" print('-'*70)\n",
|
||||
" return res\n",
|
||||
"res = search_functions(df, 'Completions API tests', n=3)\n"
|
||||
"res = search_functions(df, 'find common suffix', n=2, n_lines=10)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -256,90 +340,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/openai/validators.py:format_inferrer_validator score=0.655\n",
|
||||
"def format_inferrer_validator(df):\n",
|
||||
" \"\"\"\n",
|
||||
" This validator will infer the likely fine-tuning format of the data, and display it to the user if it is classification.\n",
|
||||
" It will also suggest to use ada and explain train/validation split benefits.\n",
|
||||
" \"\"\"\n",
|
||||
" ft_type = infer_task_type(df)\n",
|
||||
" immediate_msg = None\n",
|
||||
"----------------------------------------------------------------------\n",
|
||||
"/openai/validators.py:long_examples_validator score=0.649\n",
|
||||
"def long_examples_validator(df):\n",
|
||||
" \"\"\"\n",
|
||||
" This validator will suggest to the user to remove examples that are too long.\n",
|
||||
" \"\"\"\n",
|
||||
" immediate_msg = None\n",
|
||||
" optional_msg = None\n",
|
||||
" optional_fn = None\n",
|
||||
"----------------------------------------------------------------------\n",
|
||||
"/openai/validators.py:non_empty_completion_validator score=0.646\n",
|
||||
"def non_empty_completion_validator(df):\n",
|
||||
" \"\"\"\n",
|
||||
" This validator will ensure that no completion is empty.\n",
|
||||
" \"\"\"\n",
|
||||
" necessary_msg = None\n",
|
||||
" necessary_fn = None\n",
|
||||
" immediate_msg = None\n",
|
||||
"----------------------------------------------------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"res = search_functions(df, 'fine-tuning input data validation logic', n=3)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/openai/validators.py:common_completion_suffix_validator score=0.665\n",
|
||||
"def common_completion_suffix_validator(df):\n",
|
||||
" \"\"\"\n",
|
||||
" This validator will suggest to add a common suffix to the completion if one doesn't already exist in case of classification or conditional generation.\n",
|
||||
" \"\"\"\n",
|
||||
" error_msg = None\n",
|
||||
" immediate_msg = None\n",
|
||||
" optional_msg = None\n",
|
||||
" optional_fn = None\n",
|
||||
"\n",
|
||||
" ft_type = infer_task_type(df)\n",
|
||||
"----------------------------------------------------------------------\n",
|
||||
"/openai/validators.py:get_outfnames score=0.66\n",
|
||||
"def get_outfnames(fname, split):\n",
|
||||
" suffixes = [\"_train\", \"_valid\"] if split else [\"\"]\n",
|
||||
" i = 0\n",
|
||||
" while True:\n",
|
||||
" index_suffix = f\" ({i})\" if i > 0 else \"\"\n",
|
||||
" candidate_fnames = [\n",
|
||||
" fname.split(\".\")[0] + \"_prepared\" + suffix + index_suffix + \".jsonl\"\n",
|
||||
" for suffix in suffixes\n",
|
||||
" ]\n",
|
||||
" if not any(os.path.isfile(f) for f in candidate_fnames):\n",
|
||||
"----------------------------------------------------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"res = search_functions(df, 'find common suffix', n=2, n_lines=10)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/openai/cli.py:tools_register score=0.651\n",
|
||||
"/openai/cli.py:tools_register score=0.773\n",
|
||||
"def tools_register(parser):\n",
|
||||
" subparsers = parser.add_subparsers(\n",
|
||||
" title=\"Tools\", help=\"Convenience client side tools\"\n",
|
||||
@ -374,8 +375,9 @@
|
||||
"hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.7.3 64-bit ('base': conda)",
|
||||
"name": "python3"
|
||||
"display_name": "openai-cookbook",
|
||||
"language": "python",
|
||||
"name": "openai-cookbook"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@ -387,7 +389,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.3"
|
||||
"version": "3.9.6"
|
||||
},
|
||||
"orig_nbformat": 4
|
||||
},
|
||||
|
@@ -17,7 +17,7 @@
 {
 "data": {
 "text/plain": [
-"12288"
+"1536"
 ]
 },
 "execution_count": 1,
@@ -29,8 +29,8 @@
 "import openai\n",
 "\n",
 "embedding = openai.Embedding.create(\n",
-"    input=\"Sample document text goes here\",\n",
-"    engine=\"text-similarity-davinci-001\"\n",
+"    input=\"Your text goes here\",\n",
+"    engine=\"text-embedding-ada-002\"\n",
 ")[\"data\"][0][\"embedding\"]\n",
 "len(embedding)\n"
 ]
@ -44,7 +44,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"1024\n"
|
||||
"1536\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -54,7 +54,7 @@
|
||||
"\n",
|
||||
"\n",
|
||||
"@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
|
||||
"def get_embedding(text: str, engine=\"text-similarity-davinci-001\") -> list[float]:\n",
|
||||
"def get_embedding(text: str, engine=\"text-embedding-ada-002\") -> list[float]:\n",
|
||||
"\n",
|
||||
" # replace newlines, which can negatively affect performance.\n",
|
||||
" text = text.replace(\"\\n\", \" \")\n",
|
||||
@ -62,25 +62,7 @@
|
||||
" return openai.Embedding.create(input=[text], engine=engine)[\"data\"][0][\"embedding\"]\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"embedding = get_embedding(\"Sample query text goes here\", engine=\"text-search-ada-query-001\")\n",
|
||||
"print(len(embedding))\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"1024\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"embedding = get_embedding(\"Sample document text goes here\", engine=\"text-search-ada-doc-001\")\n",
|
||||
"embedding = get_embedding(\"Your text goes here\", engine=\"text-embedding-ada-002\")\n",
|
||||
"print(len(embedding))\n"
|
||||
]
|
||||
}
|
||||
|
examples/Multiclass_classification_for_transactions.ipynb: 2192 lines (new file)
File diff suppressed because it is too large
@@ -11,6 +11,14 @@
 "We will combine the review summary and review text into a single combined text. The model will encode this combined text and it will output a single vector embedding."
 ]
 },
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (transformer dep), torchvision, and scipy."
+]
+},
 {
 "cell_type": "code",
 "execution_count": 1,
@@ -131,7 +139,7 @@
 "\n",
 "# remove reviews that are too long\n",
 "df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))\n",
-"df = df[df.n_tokens<2000].tail(1_000)\n",
+"df = df[df.n_tokens<8000].tail(1_000)\n",
 "len(df)"
 ]
 },
@@ -148,20 +156,22 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "import openai\n",
 "from openai.embeddings_utils import get_embedding\n",
+"# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage\n",
 "\n",
-"# This will take just under 10 minutes\n",
-"df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-similarity-babbage-001'))\n",
-"df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-search-babbage-doc-001'))\n",
+"# This will take just between 5 and 10 minutes\n",
+"df['ada_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))\n",
+"df['ada_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))\n",
 "df.to_csv('data/fine_food_reviews_with_embeddings_1k.csv')"
 ]
 }
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3.9.9 ('openai')",
+"display_name": "openai-cookbook",
 "language": "python",
-"name": "python3"
+"name": "openai-cookbook"
 },
 "language_info": {
 "codemirror_mode": {
@ -173,12 +183,12 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.9"
|
||||
"version": "3.9.6"
|
||||
},
|
||||
"orig_nbformat": 4,
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
|
||||
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
|
||||
}
|
||||
}
|
||||
},
|
||||
|
@@ -195,7 +195,7 @@
 "\n",
 "We plan to use document embeddings to fetch the most relevant part of parts of our document library and insert them into the prompt that we provide to GPT-3. We therefore need to break up the document library into \"sections\" of context, which can be searched and retrieved separately. \n",
 "\n",
-"Sections should be large enough to contain enough information to answer a question; but small enough to fit one or several into the GPT-3 prompt. We find that approximately a paragraph of text is usually a good length, but you should experiment for your particular use case. In this example, Wikipedia articles are already grouped into semantically related headers, so we will use these to define our sections. This preprocessing has already been done in [this notebook](examples/fine-tuned_qa/olympics-1-collect-data.ipynb), so we will load the results and use them."
+"Sections should be large enough to contain enough information to answer a question; but small enough to fit one or several into the GPT-3 prompt. We find that approximately a paragraph of text is usually a good length, but you should experiment for your particular use case. In this example, Wikipedia articles are already grouped into semantically related headers, so we will use these to define our sections. This preprocessing has already been done in [this notebook](fine-tuned_qa/olympics-1-collect-data.ipynb), so we will load the results and use them."
 ]
 },
 {
@@ -316,11 +316,11 @@
 "id": "a17b88b9-7ea2-491e-9727-12617c74a77d",
 "metadata": {},
 "source": [
-"We preprocess the document sections by creating an embedding vector for each section. An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar are their contents. See the [documentation on OpenAI embeddings](https://beta.api.openai.org/docs/guides/embeddings/) for more information.\n",
+"We preprocess the document sections by creating an embedding vector for each section. An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar are their contents. See the [documentation on OpenAI embeddings](https://beta.openai.com/docs/guides/embeddings) for more information.\n",
 "\n",
 "This indexing stage can be executed offline and only runs once to precompute the indexes for the dataset so that each piece of content can be retrieved later. Since this is a small example, we will store and search the embeddings locally. If you have a larger dataset, consider using a vector search engine like [Pinecone](https://www.pinecone.io/) or [Weaviate](https://github.com/semi-technologies/weaviate) to power the search.\n",
 "\n",
-"For the purposes of this tutorial we chose to use Curie embeddings, which are 4096-dimensional embeddings at a very good price and performance point. Since we will be using these embeddings for retrieval, we’ll use the \"search\" embeddings (see the [documentation](https://beta.api.openai.org/docs/guides/embeddings/))."
+"For the purposes of this tutorial we chose to use Curie embeddings, which are 4096-dimensional embeddings at a very good price and performance point. Since we will be using these embeddings for retrieval, we’ll use the \"search\" embeddings (see the [documentation](https://beta.openai.com/docs/guides/embeddings))."
 ]
 },
 {
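A short sketch (not from the notebook) of the retrieve-then-prompt idea described in the cells above: rank document sections against the question embedding, then paste the top sections into the prompt sent to a completion model. The helper names, models, and prompt wording are assumptions:

```python
import numpy as np
import openai  # assumes OPENAI_API_KEY is set

def embed(text: str) -> np.ndarray:
    # Hypothetical helper; the notebook uses its own embedding utilities.
    resp = openai.Embedding.create(input=[text], engine="text-embedding-ada-002")
    return np.array(resp["data"][0]["embedding"])

def answer_with_context(question: str, sections: dict[str, str], top_n: int = 3) -> str:
    """sections maps a section title to its text; embeddings are computed on the fly for brevity."""
    q = embed(question)
    scored = sorted(
        sections.items(),
        key=lambda item: float(q @ embed(item[1])),  # dot product; embeddings are near unit norm
        reverse=True,
    )
    context = "\n\n".join(text for _, text in scored[:top_n])
    prompt = f"Answer the question using the provided context.\n\nContext:\n{context}\n\nQ: {question}\nA:"
    completion = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=300, temperature=0
    )
    return completion["choices"][0]["text"].strip()
```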
File diff suppressed because it is too large
@ -20,7 +20,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Babbage similarity embedding performance on 1k Amazon reviews: mse=0.39, mae=0.38\n"
|
||||
"Ada similarity embedding performance on 1k Amazon reviews: mse=0.60, mae=0.51\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -32,11 +32,13 @@
|
||||
"from sklearn.model_selection import train_test_split\n",
|
||||
"from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
|
||||
"\n",
|
||||
"datafile_path = \"https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\" # for your convenience, we precomputed the embeddings\n",
|
||||
"df = pd.read_csv(datafile_path)\n",
|
||||
"df[\"babbage_similarity\"] = df.babbage_similarity.apply(eval).apply(np.array)\n",
|
||||
"# If you have not run the \"Obtain_dataset.ipynb\" notebook, you can download the datafile from here: https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\n",
|
||||
"datafile_path = \"./data/fine_food_reviews_with_embeddings_1k.csv\"\n",
|
||||
"\n",
|
||||
"X_train, X_test, y_train, y_test = train_test_split(list(df.babbage_similarity.values), df.Score, test_size=0.2, random_state=42)\n",
|
||||
"df = pd.read_csv(datafile_path)\n",
|
||||
"df[\"ada_similarity\"] = df.ada_similarity.apply(eval).apply(np.array)\n",
|
||||
"\n",
|
||||
"X_train, X_test, y_train, y_test = train_test_split(list(df.ada_similarity.values), df.Score, test_size=0.2, random_state=42)\n",
|
||||
"\n",
|
||||
"rfr = RandomForestRegressor(n_estimators=100)\n",
|
||||
"rfr.fit(X_train, y_train)\n",
|
||||
@ -45,7 +47,7 @@
|
||||
"mse = mean_squared_error(y_test, preds)\n",
|
||||
"mae = mean_absolute_error(y_test, preds)\n",
|
||||
"\n",
|
||||
"print(f\"Babbage similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")\n"
|
||||
"print(f\"Ada similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -57,7 +59,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Dummy mean prediction performance on Amazon reviews: mse=1.81, mae=1.08\n"
|
||||
"Dummy mean prediction performance on Amazon reviews: mse=1.73, mae=1.03\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@ -70,10 +72,11 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can see that the embeddings are able to predict the scores with an average error of 0.39 per score prediction. This is roughly equivalent to predicting 2 out of 3 reviews perfectly, and 1 out of three reviews by a one star error."
|
||||
"We can see that the embeddings are able to predict the scores with an average error of 0.60 per score prediction. This is roughly equivalent to predicting 1 out of 3 reviews perfectly, and 1 out of two reviews by a one star error."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -86,9 +89,9 @@
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.9.9 ('openai')",
|
||||
"display_name": "openai-cookbook",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
"name": "openai-cookbook"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@ -100,7 +103,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.9"
|
||||
"version": "3.9.6"
|
||||
},
|
||||
"orig_nbformat": 4,
|
||||
"vscode": {
|
||||
|
@@ -18,9 +18,11 @@
 "import pandas as pd\n",
 "import numpy as np\n",
 "\n",
-"datafile_path = \"https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\" # for your convenience, we precomputed the embeddings\n",
+"# If you have not run the \"Obtain_dataset.ipynb\" notebook, you can download the datafile from here: https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\n",
+"datafile_path = \"./data/fine_food_reviews_with_embeddings_1k.csv\"\n",
 "\n",
 "df = pd.read_csv(datafile_path)\n",
-"df[\"babbage_search\"] = df.babbage_search.apply(eval).apply(np.array)\n"
+"df[\"ada_search\"] = df.ada_search.apply(eval).apply(np.array)\n"
 ]
 },
 {
@ -39,7 +41,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Fantastic Instant Refried beans: Fantastic Instant Refried Beans have been a staple for my family now for nearly 20 years. All 7 of us love it and my grown kids are passing on the tradition.\n",
|
||||
"Good Buy: I liked the beans. They were vacuum sealed, plump and moist. Would recommend them for any use. I personally split and stuck them in some vodka to make vanilla extract. Yum!\n",
|
||||
"\n",
|
||||
"Jamaican Blue beans: Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and befor\n",
|
||||
"\n",
|
||||
@ -55,9 +57,9 @@
|
||||
"def search_reviews(df, product_description, n=3, pprint=True):\n",
|
||||
" embedding = get_embedding(\n",
|
||||
" product_description,\n",
|
||||
" engine=\"text-search-babbage-query-001\"\n",
|
||||
" engine=\"text-embedding-ada-002\"\n",
|
||||
" )\n",
|
||||
" df[\"similarities\"] = df.babbage_search.apply(lambda x: cosine_similarity(x, embedding))\n",
|
||||
" df[\"similarities\"] = df.ada_search.apply(lambda x: cosine_similarity(x, embedding))\n",
|
||||
"\n",
|
||||
" res = (\n",
|
||||
" df.sort_values(\"similarities\", ascending=False)\n",
|
||||
@ -84,17 +86,17 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"sooo good: tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.\n",
|
||||
"\n",
|
||||
"Tasty and Quick Pasta: Barilla Whole Grain Fusilli with Vegetable Marinara is tasty and has an excellent chunky vegetable marinara. I just wish there was more of it. If you aren't starving or on a \n",
|
||||
"\n",
|
||||
"Rustichella ROCKS!: Anything this company makes is worthwhile eating! My favorite is their Trenne.<br />Their whole wheat pasta is the best I have ever had.\n",
|
||||
"sooo good: tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.\n",
|
||||
"\n",
|
||||
"Handy: Love the idea of ready in a minute pasta and for that alone this product gets praise. The pasta is whole grain so that's a big plus and it actually comes out al dente. The vegetable marinara\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"res = search_reviews(df, \"whole wheat pasta\", n=3)\n"
|
||||
"res = search_reviews(df, \"whole wheat pasta\", n=3)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -119,7 +121,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"res = search_reviews(df, \"bad delivery\", n=1)\n"
|
||||
"res = search_reviews(df, \"bad delivery\", n=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -144,7 +146,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"res = search_reviews(df, \"spoilt\", n=1)\n"
|
||||
"res = search_reviews(df, \"spoilt\", n=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -158,21 +160,21 @@
|
||||
"text": [
|
||||
"Good food: The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.\n",
|
||||
"\n",
|
||||
"Good product: I like that this is a better product for my pets but really for the price of it I couldn't afford to buy this all the time. My cat isn't very picky usually and she ate this, we usually \n",
|
||||
"The cats like it: My 7 cats like this food but it is a little yucky for the human. Pieces of mackerel swimming in a dark broth. It is billed as a \"complete\" food and contains carrots, peas and pasta.\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"res = search_reviews(df, \"pet food\", n=2)\n"
|
||||
"res = search_reviews(df, \"pet food\", n=2)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.9.9 ('openai')",
|
||||
"display_name": "openai-cookbook",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
"name": "openai-cookbook"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@ -184,12 +186,12 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.9"
|
||||
"version": "3.9.6"
|
||||
},
|
||||
"orig_nbformat": 4,
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
|
||||
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
|
||||
}
|
||||
}
|
||||
},
|
||||
|
examples/Unit_test_writing_using_a_multi-step_prompt.ipynb: 452 lines (new file)

@@ -0,0 +1,452 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Unit test writing using a multi-step prompt\n",
|
||||
"\n",
|
||||
"Complex tasks, such as writing unit tests, can benefit from multi-step prompts. In contrast to a single prompt, a multi-step prompt generates text from GPT-3 and then feeds that text back into subsequent prompts. This can help in cases where you want GPT-3 to explain its reasoning before answering, or brainstorm a plan before executing it.\n",
|
||||
"\n",
|
||||
"In this notebook, we use a 3-step prompt to write unit tests in Python using the following steps:\n",
|
||||
"\n",
|
||||
"1. Given a Python function, we first prompt GPT-3 to explain what the function is doing.\n",
|
||||
"2. Second, we prompt GPT-3 to plan a set of unit tests for the function.\n",
|
||||
" - If the plan is too short, we ask GPT-3 to elaborate with more ideas for unit tests.\n",
|
||||
"3. Finally, we prompt GPT-3 to write the unit tests.\n",
|
||||
"\n",
|
||||
"The code example illustrates a few optional embellishments on the chained, multi-step prompt:\n",
|
||||
"\n",
|
||||
"- Conditional branching (e.g., only asking for elaboration if the first plan is too short)\n",
|
||||
"- Different models for different steps (e.g., `text-davinci-002` for the text planning steps and `code-davinci-002` for the code writing step)\n",
|
||||
"- A check that re-runs the function if the output is unsatisfactory (e.g., if the output code cannot be parsed by Python's `ast` module)\n",
|
||||
"- Streaming output so that you can start reading the output before it's fully generated (useful for long, multi-step outputs)\n",
|
||||
"\n",
|
||||
"The full 3-step prompt looks like this (using as an example `pytest` for the unit test framework and `is_palindrome` as the function):\n",
|
||||
"\n",
|
||||
" # How to write great unit tests with pytest\n",
|
||||
"\n",
|
||||
" In this advanced tutorial for experts, we'll use Python 3.9 and `pytest` to write a suite of unit tests to verify the behavior of the following function.\n",
|
||||
" ```python\n",
|
||||
" def is_palindrome(s):\n",
|
||||
" return s == s[::-1]\n",
|
||||
" ```\n",
|
||||
"\n",
|
||||
" Before writing any unit tests, let's review what each element of the function is doing exactly and what the author's intentions may have been.\n",
|
||||
" - First,{GENERATED IN STEP 1}\n",
|
||||
" \n",
|
||||
" A good unit test suite should aim to:\n",
|
||||
" - Test the function's behavior for a wide range of possible inputs\n",
|
||||
" - Test edge cases that the author may not have foreseen\n",
|
||||
" - Take advantage of the features of `pytest` to make the tests easy to write and maintain\n",
|
||||
" - Be easy to read and understand, with clean code and descriptive names\n",
|
||||
" - Be deterministic, so that the tests always pass or fail in the same way\n",
|
||||
"\n",
|
||||
" `pytest` has many convenient features that make it easy to write and maintain unit tests. We'll use them to write unit tests for the function above.\n",
|
||||
"\n",
|
||||
" For this particular function, we'll want our unit tests to handle the following diverse scenarios (and under each scenario, we include a few examples as sub-bullets):\n",
|
||||
" -{GENERATED IN STEP 2}\n",
|
||||
"\n",
|
||||
" [OPTIONALLY APPENDED]In addition to the scenarios above, we'll also want to make sure we don't forget to test rare or unexpected edge cases (and under each edge case, we include a few examples as sub-bullets):\n",
|
||||
" -{GENERATED IN STEP 2B}\n",
|
||||
"\n",
|
||||
" Before going into the individual tests, let's first look at the complete suite of unit tests as a cohesive whole. We've added helpful comments to explain what each line does.\n",
|
||||
" ```python\n",
|
||||
" import pytest # used for our unit tests\n",
|
||||
"\n",
|
||||
" def is_palindrome(s):\n",
|
||||
" return s == s[::-1]\n",
|
||||
"\n",
|
||||
" #Below, each test case is represented by a tuple passed to the @pytest.mark.parametrize decorator\n",
|
||||
" {GENERATED IN STEP 3}"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# imports needed to run the code in this notebook\n",
|
||||
"import ast # used for detecting whether generated Python code is valid\n",
|
||||
"import openai # used for calling the OpenAI API\n",
|
||||
"\n",
|
||||
"# example of a function that uses a multi-step prompt to write unit tests\n",
|
||||
"def unit_test_from_function(\n",
|
||||
" function_to_test: str, # Python function to test, as a string\n",
|
||||
" unit_test_package: str = \"pytest\", # unit testing package; use the name as it appears in the import statement\n",
|
||||
" approx_min_cases_to_cover: int = 7, # minimum number of test case categories to cover (approximate)\n",
|
||||
" print_text: bool = False, # optionally prints text; helpful for understanding the function & debugging\n",
|
||||
" text_model: str = \"text-davinci-002\", # model used to generate text plans in steps 1, 2, and 2b\n",
|
||||
" code_model: str = \"code-davinci-002\", # if you don't have access to code models, you can use text models here instead\n",
|
||||
" max_tokens: int = 1000, # can set this high, as generations should be stopped earlier by stop sequences\n",
|
||||
" temperature: float = 0.4, # temperature = 0 can sometimes get stuck in repetitive loops, so we use 0.4\n",
|
||||
" reruns_if_fail: int = 1, # if the output code cannot be parsed, this will re-run the function up to N times\n",
|
||||
") -> str:\n",
|
||||
" \"\"\"Outputs a unit test for a given Python function, using a 3-step GPT-3 prompt.\"\"\"\n",
|
||||
"\n",
|
||||
" # Step 1: Generate an explanation of the function\n",
|
||||
"\n",
|
||||
" # create a markdown-formatted prompt that asks GPT-3 to complete an explanation of the function, formatted as a bullet list\n",
|
||||
" prompt_to_explain_the_function = f\"\"\"# How to write great unit tests with {unit_test_package}\n",
|
||||
"\n",
|
||||
"In this advanced tutorial for experts, we'll use Python 3.9 and `{unit_test_package}` to write a suite of unit tests to verify the behavior of the following function.\n",
|
||||
"```python\n",
|
||||
"{function_to_test}\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Before writing any unit tests, let's review what each element of the function is doing exactly and what the author's intentions may have been.\n",
|
||||
"- First,\"\"\"\n",
|
||||
" if print_text:\n",
|
||||
" text_color_prefix = \"\\033[30m\" # black; if you read against a dark background \\033[97m is white\n",
|
||||
" print(text_color_prefix + prompt_to_explain_the_function, end=\"\") # end='' prevents a newline from being printed\n",
|
||||
"\n",
|
||||
" # send the prompt to the API, using \\n\\n as a stop sequence to stop at the end of the bullet list\n",
|
||||
" explanation_response = openai.Completion.create(\n",
|
||||
" model=text_model,\n",
|
||||
" prompt=prompt_to_explain_the_function,\n",
|
||||
" stop=[\"\\n\\n\", \"\\n\\t\\n\", \"\\n \\n\"],\n",
|
||||
" max_tokens=max_tokens,\n",
|
||||
" temperature=temperature,\n",
|
||||
" stream=True,\n",
|
||||
" )\n",
|
||||
" explanation_completion = \"\"\n",
|
||||
" if print_text:\n",
|
||||
" completion_color_prefix = \"\\033[92m\" # green\n",
|
||||
" print(completion_color_prefix, end=\"\")\n",
|
||||
" for event in explanation_response:\n",
|
||||
" event_text = event[\"choices\"][0][\"text\"]\n",
|
||||
" explanation_completion += event_text\n",
|
||||
" if print_text:\n",
|
||||
" print(event_text, end=\"\")\n",
|
||||
"\n",
|
||||
" # Step 2: Generate a plan to write a unit test\n",
|
||||
"\n",
|
||||
" # create a markdown-formatted prompt that asks GPT-3 to complete a plan for writing unit tests, formatted as a bullet list\n",
|
||||
" prompt_to_explain_a_plan = f\"\"\"\n",
|
||||
" \n",
|
||||
"A good unit test suite should aim to:\n",
|
||||
"- Test the function's behavior for a wide range of possible inputs\n",
|
||||
"- Test edge cases that the author may not have foreseen\n",
|
||||
"- Take advantage of the features of `{unit_test_package}` to make the tests easy to write and maintain\n",
|
||||
"- Be easy to read and understand, with clean code and descriptive names\n",
|
||||
"- Be deterministic, so that the tests always pass or fail in the same way\n",
|
||||
"\n",
|
||||
"`{unit_test_package}` has many convenient features that make it easy to write and maintain unit tests. We'll use them to write unit tests for the function above.\n",
|
||||
"\n",
|
||||
"For this particular function, we'll want our unit tests to handle the following diverse scenarios (and under each scenario, we include a few examples as sub-bullets):\n",
|
||||
"-\"\"\"\n",
|
||||
" if print_text:\n",
|
||||
" print(text_color_prefix + prompt_to_explain_a_plan, end=\"\")\n",
|
||||
"\n",
|
||||
" # append this planning prompt to the results from step 1\n",
|
||||
" prior_text = prompt_to_explain_the_function + explanation_completion\n",
|
||||
" full_plan_prompt = prior_text + prompt_to_explain_a_plan\n",
|
||||
"\n",
|
||||
" # send the prompt to the API, using \\n\\n as a stop sequence to stop at the end of the bullet list\n",
|
||||
" plan_response = openai.Completion.create(\n",
|
||||
" model=text_model,\n",
|
||||
" prompt=full_plan_prompt,\n",
|
||||
" stop=[\"\\n\\n\", \"\\n\\t\\n\", \"\\n \\n\"],\n",
|
||||
" max_tokens=max_tokens,\n",
|
||||
" temperature=temperature,\n",
|
||||
" stream=True,\n",
|
||||
" )\n",
|
||||
" plan_completion = \"\"\n",
|
||||
" if print_text:\n",
|
||||
" print(completion_color_prefix, end=\"\")\n",
|
||||
" for event in plan_response:\n",
|
||||
" event_text = event[\"choices\"][0][\"text\"]\n",
|
||||
" plan_completion += event_text\n",
|
||||
" if print_text:\n",
|
||||
" print(event_text, end=\"\")\n",
|
||||
"\n",
|
||||
" # Step 2b: If the plan is short, ask GPT-3 to elaborate further\n",
|
||||
" # this counts top-level bullets (e.g., categories), but not sub-bullets (e.g., test cases)\n",
|
||||
" elaboration_needed = plan_completion.count(\"\\n-\") +1 < approx_min_cases_to_cover # adds 1 because the first bullet is not counted\n",
|
||||
" if elaboration_needed:\n",
|
||||
" prompt_to_elaborate_on_the_plan = f\"\"\"\n",
|
||||
"\n",
|
||||
"In addition to the scenarios above, we'll also want to make sure we don't forget to test rare or unexpected edge cases (and under each edge case, we include a few examples as sub-bullets):\n",
|
||||
"-\"\"\"\n",
|
||||
" if print_text:\n",
|
||||
" print(text_color_prefix + prompt_to_elaborate_on_the_plan, end=\"\")\n",
|
||||
"\n",
|
||||
" # append this elaboration prompt to the results from step 2\n",
|
||||
" prior_text = full_plan_prompt + plan_completion\n",
|
||||
" full_elaboration_prompt = prior_text + prompt_to_elaborate_on_the_plan\n",
|
||||
"\n",
|
||||
" # send the prompt to the API, using \\n\\n as a stop sequence to stop at the end of the bullet list\n",
|
||||
" elaboration_response = openai.Completion.create(\n",
|
||||
" model=text_model,\n",
|
||||
" prompt=full_elaboration_prompt,\n",
|
||||
" stop=[\"\\n\\n\", \"\\n\\t\\n\", \"\\n \\n\"],\n",
|
||||
" max_tokens=max_tokens,\n",
|
||||
" temperature=temperature,\n",
|
||||
" stream=True,\n",
|
||||
" )\n",
|
||||
" elaboration_completion = \"\"\n",
|
||||
" if print_text:\n",
|
||||
" print(completion_color_prefix, end=\"\")\n",
|
||||
" for event in elaboration_response:\n",
|
||||
" event_text = event[\"choices\"][0][\"text\"]\n",
|
||||
" elaboration_completion += event_text\n",
|
||||
" if print_text:\n",
|
||||
" print(event_text, end=\"\")\n",
|
||||
"\n",
|
||||
" # Step 3: Generate the unit test\n",
|
||||
"\n",
|
||||
" # create a markdown-formatted prompt that asks GPT-3 to complete a unit test\n",
|
||||
" starter_comment = \"\"\n",
|
||||
" if unit_test_package == \"pytest\":\n",
|
||||
" starter_comment = \"Below, each test case is represented by a tuple passed to the @pytest.mark.parametrize decorator\"\n",
|
||||
" prompt_to_generate_the_unit_test = f\"\"\"\n",
|
||||
"\n",
|
||||
"Before going into the individual tests, let's first look at the complete suite of unit tests as a cohesive whole. We've added helpful comments to explain what each line does.\n",
|
||||
"```python\n",
|
||||
"import {unit_test_package} # used for our unit tests\n",
|
||||
"\n",
|
||||
"{function_to_test}\n",
|
||||
"\n",
|
||||
"#{starter_comment}\"\"\"\n",
|
||||
" if print_text:\n",
|
||||
" print(text_color_prefix + prompt_to_generate_the_unit_test, end=\"\")\n",
|
||||
"\n",
|
||||
" # append this unit test prompt to the results from step 3\n",
|
||||
" if elaboration_needed:\n",
|
||||
" prior_text = full_elaboration_prompt + elaboration_completion\n",
|
||||
" else:\n",
|
||||
" prior_text = full_plan_prompt + plan_completion\n",
|
||||
" full_unit_test_prompt = prior_text + prompt_to_generate_the_unit_test\n",
|
||||
"\n",
|
||||
" # send the prompt to the API, using ``` as a stop sequence to stop at the end of the code block\n",
|
||||
" unit_test_response = openai.Completion.create(\n",
|
||||
" model=code_model,\n",
|
||||
" prompt=full_unit_test_prompt,\n",
|
||||
" stop=\"```\",\n",
|
||||
" max_tokens=max_tokens,\n",
|
||||
" temperature=temperature,\n",
|
||||
" stream=True\n",
|
||||
" )\n",
|
||||
" unit_test_completion = \"\"\n",
|
||||
" if print_text:\n",
|
||||
" print(completion_color_prefix, end=\"\")\n",
|
||||
" for event in unit_test_response:\n",
|
||||
" event_text = event[\"choices\"][0][\"text\"]\n",
|
||||
" unit_test_completion += event_text\n",
|
||||
" if print_text:\n",
|
||||
" print(event_text, end=\"\")\n",
|
||||
"\n",
|
||||
" # check the output for errors\n",
|
||||
" code_start_index = prompt_to_generate_the_unit_test.find(\"```python\\n\") + len(\"```python\\n\")\n",
|
||||
" code_output = prompt_to_generate_the_unit_test[code_start_index:] + unit_test_completion\n",
|
||||
" try:\n",
|
||||
" ast.parse(code_output)\n",
|
||||
" except SyntaxError as e:\n",
|
||||
" print(f\"Syntax error in generated code: {e}\")\n",
|
||||
" if reruns_if_fail > 0:\n",
|
||||
" print(\"Rerunning...\")\n",
|
||||
" return unit_test_from_function(\n",
|
||||
" function_to_test=function_to_test,\n",
|
||||
" unit_test_package=unit_test_package,\n",
|
||||
" approx_min_cases_to_cover=approx_min_cases_to_cover,\n",
|
||||
" print_text=print_text,\n",
|
||||
" text_model=text_model,\n",
|
||||
" code_model=code_model,\n",
|
||||
" max_tokens=max_tokens,\n",
|
||||
" temperature=temperature,\n",
|
||||
" reruns_if_fail=reruns_if_fail-1, # decrement rerun counter when calling again\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # return the unit test as a string\n",
|
||||
" return unit_test_completion\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\u001b[30m# How to write great unit tests with pytest\n",
|
||||
"\n",
|
||||
"In this advanced tutorial for experts, we'll use Python 3.9 and `pytest` to write a suite of unit tests to verify the behavior of the following function.\n",
|
||||
"```python\n",
|
||||
"def is_palindrome(s):\n",
|
||||
" return s == s[::-1]\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Before writing any unit tests, let's review what each element of the function is doing exactly and what the author's intentions may have been.\n",
|
||||
"- First,\u001b[92m we have a function definition. This is where we give the function a name, `is_palindrome`, and specify the arguments that the function accepts. In this case, the function accepts a single string argument, `s`.\n",
|
||||
"- Next, we have a return statement. This is where we specify the value that the function returns. In this case, the function returns `s == s[::-1]`.\n",
|
||||
"- Finally, we have a function call. This is where we actually call the function with a specific set of arguments. In this case, we're calling the function with the string `\"racecar\"`.\u001b[30m\n",
|
||||
" \n",
|
||||
"A good unit test suite should aim to:\n",
|
||||
"- Test the function's behavior for a wide range of possible inputs\n",
|
||||
"- Test edge cases that the author may not have foreseen\n",
|
||||
"- Take advantage of the features of `pytest` to make the tests easy to write and maintain\n",
|
||||
"- Be easy to read and understand, with clean code and descriptive names\n",
|
||||
"- Be deterministic, so that the tests always pass or fail in the same way\n",
|
||||
"\n",
|
||||
"`pytest` has many convenient features that make it easy to write and maintain unit tests. We'll use them to write unit tests for the function above.\n",
|
||||
"\n",
|
||||
"For this particular function, we'll want our unit tests to handle the following diverse scenarios (and under each scenario, we include a few examples as sub-bullets):\n",
|
||||
"-\u001b[92m The input is a palindrome\n",
|
||||
" - `\"racecar\"`\n",
|
||||
" - `\"madam\"`\n",
|
||||
" - `\"anna\"`\n",
|
||||
"- The input is not a palindrome\n",
|
||||
" - `\"python\"`\n",
|
||||
" - `\"test\"`\n",
|
||||
" - `\"1234\"`\n",
|
||||
"- The input is an empty string\n",
|
||||
" - `\"\"`\n",
|
||||
"- The input is `None`\n",
|
||||
"- The input is not a string\n",
|
||||
" - `1`\n",
|
||||
" - `1.0`\n",
|
||||
" - `True`\n",
|
||||
" - `False`\n",
|
||||
" - `[]`\n",
|
||||
" - `{}`\u001b[30m\n",
|
||||
"\n",
|
||||
"In addition to the scenarios above, we'll also want to make sure we don't forget to test rare or unexpected edge cases (and under each edge case, we include a few examples as sub-bullets):\n",
|
||||
"-\u001b[92m The input is a palindrome with spaces\n",
|
||||
" - `\"race car\"`\n",
|
||||
" - `\" madam \"`\n",
|
||||
" - `\" anna \"`\n",
|
||||
"- The input is not a palindrome with spaces\n",
|
||||
" - `\" python \"`\n",
|
||||
" - `\" test \"`\n",
|
||||
" - `\" 1234 \"`\n",
|
||||
"- The input is a palindrome with punctuation\n",
|
||||
" - `\"racecar!\"`\n",
|
||||
" - `\"Madam, I'm Adam.\"`\n",
|
||||
" - `\"Anna's\"`\n",
|
||||
"- The input is not a palindrome with punctuation\n",
|
||||
" - `\"python!\"`\n",
|
||||
" - `\"test.\"`\n",
|
||||
" - `\"1234!\"`\n",
|
||||
"- The input is a palindrome with mixed case\n",
|
||||
" - `\"Racecar\"`\n",
|
||||
" - `\"Madam\"`\n",
|
||||
" - `\"Anna\"`\n",
|
||||
"- The input is not a palindrome with mixed case\n",
|
||||
" - `\"Python\"`\n",
|
||||
" - `\"Test\"`\n",
|
||||
" - `\"1234\"`\u001b[30m\n",
|
||||
"\n",
|
||||
"Before going into the individual tests, let's first look at the complete suite of unit tests as a cohesive whole. We've added helpful comments to explain what each line does.\n",
|
||||
"```python\n",
|
||||
"import pytest # used for our unit tests\n",
|
||||
"\n",
|
||||
"def is_palindrome(s):\n",
|
||||
" return s == s[::-1]\n",
|
||||
"\n",
|
||||
"#Below, each test case is represented by a tuple passed to the @pytest.mark.parametrize decorator\u001b[92m.\n",
|
||||
"#The first element of the tuple is a name for the test case, and the second element is a list of arguments for the test case.\n",
|
||||
"#The @pytest.mark.parametrize decorator will generate a separate test function for each test case.\n",
|
||||
"#The generated test function will be named test_is_palindrome_<name> where <name> is the name of the test case.\n",
|
||||
"#The generated test function will be given the arguments specified in the list of arguments for the test case.\n",
|
||||
"#The generated test function will be given the fixture specified in the decorator, in this case the function itself.\n",
|
||||
"#The generated test function will call the function with the arguments and assert that the result is equal to the expected value.\n",
|
||||
"@pytest.mark.parametrize(\n",
|
||||
" \"name,args,expected\",\n",
|
||||
" [\n",
|
||||
" # Test the function's behavior for a wide range of possible inputs\n",
|
||||
" (\"palindrome\", [\"racecar\"], True),\n",
|
||||
" (\"palindrome\", [\"madam\"], True),\n",
|
||||
" (\"palindrome\", [\"anna\"], True),\n",
|
||||
" (\"non-palindrome\", [\"python\"], False),\n",
|
||||
" (\"non-palindrome\", [\"test\"], False),\n",
|
||||
" (\"non-palindrome\", [\"1234\"], False),\n",
|
||||
" (\"empty string\", [\"\"], True),\n",
|
||||
" (\"None\", [None], False),\n",
|
||||
" (\"non-string\", [1], False),\n",
|
||||
" (\"non-string\", [1.0], False),\n",
|
||||
" (\"non-string\", [True], False),\n",
|
||||
" (\"non-string\", [False], False),\n",
|
||||
" (\"non-string\", [[]], False),\n",
|
||||
" (\"non-string\", [{}], False),\n",
|
||||
" # Test edge cases that the author may not have foreseen\n",
|
||||
" (\"palindrome with spaces\", [\"race car\"], True),\n",
|
||||
" (\"palindrome with spaces\", [\" madam \"], True),\n",
|
||||
" (\"palindrome with spaces\", [\" anna \"], True),\n",
|
||||
" (\"non-palindrome with spaces\", [\" python \"], False),\n",
|
||||
" (\"non-palindrome with spaces\", [\" test \"], False),\n",
|
||||
" (\"non-palindrome with spaces\", [\" 1234 \"], False),\n",
|
||||
" (\"palindrome with punctuation\", [\"racecar!\"], True),\n",
|
||||
" (\"palindrome with punctuation\", [\"Madam, I'm Adam.\"], True),\n",
|
||||
" (\"palindrome with punctuation\", [\"Anna's\"], True),\n",
|
||||
" (\"non-palindrome with punctuation\", [\"python!\"], False),\n",
|
||||
" (\"non-palindrome with punctuation\", [\"test.\"], False),\n",
|
||||
" (\"non-palindrome with punctuation\", [\"1234!\"], False),\n",
|
||||
" (\"palindrome with mixed case\", [\"Racecar\"], True),\n",
|
||||
" (\"palindrome with mixed case\", [\"Madam\"], True),\n",
|
||||
" (\"palindrome with mixed case\", [\"Anna\"], True),\n",
|
||||
" (\"non-palindrome with mixed case\", [\"Python\"], False),\n",
|
||||
" (\"non-palindrome with mixed case\", [\"Test\"], False),\n",
|
||||
" (\"non-palindrome with mixed case\", [\"1234\"], False),\n",
|
||||
" ],\n",
|
||||
")\n",
|
||||
"def test_is_palindrome(is_palindrome, args, expected):\n",
|
||||
" assert is_palindrome(*args) == expected\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'.\\n#The first element of the tuple is a name for the test case, and the second element is a list of arguments for the test case.\\n#The @pytest.mark.parametrize decorator will generate a separate test function for each test case.\\n#The generated test function will be named test_is_palindrome_<name> where <name> is the name of the test case.\\n#The generated test function will be given the arguments specified in the list of arguments for the test case.\\n#The generated test function will be given the fixture specified in the decorator, in this case the function itself.\\n#The generated test function will call the function with the arguments and assert that the result is equal to the expected value.\\n@pytest.mark.parametrize(\\n \"name,args,expected\",\\n [\\n # Test the function\\'s behavior for a wide range of possible inputs\\n (\"palindrome\", [\"racecar\"], True),\\n (\"palindrome\", [\"madam\"], True),\\n (\"palindrome\", [\"anna\"], True),\\n (\"non-palindrome\", [\"python\"], False),\\n (\"non-palindrome\", [\"test\"], False),\\n (\"non-palindrome\", [\"1234\"], False),\\n (\"empty string\", [\"\"], True),\\n (\"None\", [None], False),\\n (\"non-string\", [1], False),\\n (\"non-string\", [1.0], False),\\n (\"non-string\", [True], False),\\n (\"non-string\", [False], False),\\n (\"non-string\", [[]], False),\\n (\"non-string\", [{}], False),\\n # Test edge cases that the author may not have foreseen\\n (\"palindrome with spaces\", [\"race car\"], True),\\n (\"palindrome with spaces\", [\" madam \"], True),\\n (\"palindrome with spaces\", [\" anna \"], True),\\n (\"non-palindrome with spaces\", [\" python \"], False),\\n (\"non-palindrome with spaces\", [\" test \"], False),\\n (\"non-palindrome with spaces\", [\" 1234 \"], False),\\n (\"palindrome with punctuation\", [\"racecar!\"], True),\\n (\"palindrome with punctuation\", [\"Madam, I\\'m Adam.\"], True),\\n (\"palindrome with punctuation\", [\"Anna\\'s\"], True),\\n (\"non-palindrome with punctuation\", [\"python!\"], False),\\n (\"non-palindrome with punctuation\", [\"test.\"], False),\\n (\"non-palindrome with punctuation\", [\"1234!\"], False),\\n (\"palindrome with mixed case\", [\"Racecar\"], True),\\n (\"palindrome with mixed case\", [\"Madam\"], True),\\n (\"palindrome with mixed case\", [\"Anna\"], True),\\n (\"non-palindrome with mixed case\", [\"Python\"], False),\\n (\"non-palindrome with mixed case\", [\"Test\"], False),\\n (\"non-palindrome with mixed case\", [\"1234\"], False),\\n ],\\n)\\ndef test_is_palindrome(is_palindrome, args, expected):\\n assert is_palindrome(*args) == expected\\n'"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"example_function = \"\"\"def is_palindrome(s):\n",
|
||||
" return s == s[::-1]\"\"\"\n",
|
||||
"\n",
|
||||
"unit_test_from_function(example_function, print_text=True)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.9.9 ('openai')",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.9"
|
||||
},
|
||||
"orig_nbformat": 4,
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
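The notebook above returns the generated test suite only as a string. As a purely illustrative follow-up (not part of the original notebook), here is a minimal sketch of one way to write that string out as a runnable module and execute it with `pytest`. It assumes `unit_test_from_function` and `example_function` from the cells above are already defined in the session; the output file name is arbitrary, and the generated suite is raw model output, so it may need manual fixes before it passes.

```python
import subprocess

# Generate the test-suite completion (requires API access, as in the notebook).
unit_test_completion = unit_test_from_function(example_function, print_text=False)

starter_comment = (
    "Below, each test case is represented by a tuple passed to the "
    "@pytest.mark.parametrize decorator"
)

# Rebuild the full module the same way the notebook's syntax check does:
# the import, the function under test, and the starter comment come from the
# prompt, and the model's completion continues from there.
test_module = (
    "import pytest  # used for our unit tests\n\n"
    + example_function
    + "\n\n#" + starter_comment
    + unit_test_completion
)

with open("test_is_palindrome.py", "w") as f:
    f.write(test_module)

# Run pytest on the generated module; check=False because some generated
# tests may legitimately fail until they are reviewed by hand.
subprocess.run(["pytest", "test_is_palindrome.py"], check=False)
```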
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@ -58,7 +58,7 @@
|
||||
"from azure.identity import DefaultAzureCredential\n",
|
||||
"\n",
|
||||
"default_credential = DefaultAzureCredential()\n",
|
||||
"token = default_credential.get_token(\"https://cognitiveservices.azure.com\")\n",
|
||||
"token = default_credential.get_token(\"https://cognitiveservices.azure.com/.default\")\n",
|
||||
"\n",
|
||||
"openai.api_type = 'azure_ad'\n",
|
||||
"openai.api_key = token.token\n",
|
||||
|
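For context on the one-line change above: it switches the token request to the `/.default` scope. A fuller sketch of the Azure AD flow for the pre-1.0 `openai` Python library is shown below; it is not taken from this diff, and the resource endpoint and API version are placeholders you would replace with your own values.

```python
import openai
from azure.identity import DefaultAzureCredential

# Request a token for the Cognitive Services resource; note the "/.default"
# scope suffix, which is what the change above adds.
default_credential = DefaultAzureCredential()
token = default_credential.get_token("https://cognitiveservices.azure.com/.default")

openai.api_type = "azure_ad"
openai.api_key = token.token
# Placeholders: substitute your own Azure OpenAI endpoint and a valid API version.
openai.api_base = "https://your-resource-name.openai.azure.com/"
openai.api_version = "2022-12-01"
```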
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
362
examples/data/25000_spend_dataset_current.csv
Normal file
@ -0,0 +1,362 @@
|
||||
Date,Supplier,Description,Transaction value (£)
|
||||
21/04/2016,M & J Ballantyne Ltd,George IV Bridge Work,35098
|
||||
26/04/2016,Private Sale,Literary & Archival Items,30000
|
||||
30/04/2016,City Of Edinburgh Council,Non Domestic Rates ,40800
|
||||
09/05/2016,Computacenter Uk,Kelvin Hall,72835
|
||||
09/05/2016,John Graham Construction Ltd,Causewayside Refurbishment,64361
|
||||
09/05/2016,A McGillivray,Causewayside Refurbishment,53690
|
||||
16/05/2016,John Graham Construction Ltd,Causewayside Refurbishment,365344
|
||||
23/05/2016,Computacenter Uk,Kelvin Hall,26506
|
||||
23/05/2016,ECG Facilities Service,Facilities Management Charge,32777
|
||||
23/05/2016,ECG Facilities Service,Facilities Management Charge,32777
|
||||
30/05/2016,ALDL,ALDL Charges,32317
|
||||
10/06/2016,Wavetek Ltd,Kelvin Hall,87589
|
||||
10/06/2016,John Graham Construction Ltd,Causewayside Refurbishment,381803
|
||||
28/06/2016,ECG Facilities Service,Facilities Management Charge,32832
|
||||
30/06/2016,Glasgow City Council,Kelvin Hall,1700000
|
||||
11/07/2016,Wavetek Ltd,Kelvin Hall,65692
|
||||
11/07/2016,John Graham Construction Ltd,Causewayside Refurbishment,139845
|
||||
15/07/2016,Sotheby'S,Literary & Archival Items,28500
|
||||
18/07/2016,Christies,Literary & Archival Items,33800
|
||||
25/07/2016,A McGillivray,Causewayside Refurbishment,30113
|
||||
31/07/2016,ALDL,ALDL Charges,32317
|
||||
08/08/2016,ECG Facilities Service,Facilities Management Charge,32795
|
||||
15/08/2016,Creative Video Productions Ltd,Kelvin Hall,26866
|
||||
15/08/2016,John Graham Construction Ltd,Causewayside Refurbishment,196807
|
||||
24/08/2016,ECG Facilities Service,Facilities Management Charge,32795
|
||||
05/09/2016,John Graham Construction Ltd,Causewayside Refurbishment,36359
|
||||
12/09/2016,Flexiform,Kelvin Hall,42623
|
||||
12/09/2016,City Of Edinburgh Council,Non Domestic Rates ,144330
|
||||
12/09/2016,City Of Edinburgh Council,Non Domestic Rates ,49827
|
||||
12/09/2016,John Graham Construction Ltd,Causewayside Refurbishment,228689
|
||||
19/09/2016,Jisc Services Ltd Subscription Account,Literary & Archival Items,42629
|
||||
26/09/2016,Senator International,Kelvin Hall,35706
|
||||
26/09/2016,ECG Facilities Service,Facilities Management Charge,32795
|
||||
26/09/2016,John Graham Construction Ltd,Causewayside Refurbishment,28378
|
||||
30/09/2016,A McGillivray,Causewayside Refurbishment,44392
|
||||
10/10/2016,Cengage Learning (Emea )Ltd,Literary & Archival Items,86604
|
||||
10/10/2016,John Graham Construction Ltd,Causewayside Refurbishment,303999
|
||||
24/10/2016,ECG Facilities Service,Facilities Management Charge,32795
|
||||
24/10/2016,ALDL,ALDL Charges,32317
|
||||
31/10/2016,John Graham Construction Ltd,Causewayside Refurbishment,74245
|
||||
07/11/2016,CBRE,Kelvin Hall,83736
|
||||
14/11/2016,University Of Glasgow,Kelvin Hall,188682
|
||||
14/11/2016,John Graham Construction Ltd,Causewayside Refurbishment,362326
|
||||
08/12/2016,Sothebys,Literary & Archival Items,166000
|
||||
08/12/2016,Private Sale,Literary & Archival Items,87500
|
||||
08/12/2016,ECG Facilities Service,Facilities Management Charge,32795
|
||||
12/12/2016,John Graham Construction Ltd,Causewayside Refurbishment,385310
|
||||
30/12/2016,ECG Facilities Service,Facilities Management Charge,32795
|
||||
30/12/2016,John Graham Construction Ltd,Causewayside Refurbishment,253618
|
||||
30/12/2016,John Graham Construction Ltd,Causewayside Refurbishment,45127
|
||||
23/01/2017,ALDL,ALDL Charges,27730
|
||||
07/02/2017,ECG Facilities Service,Facilities Management Charge,32795
|
||||
07/02/2017,John Graham Construction Ltd,Causewayside Refurbishment,52404
|
||||
13/02/2017,John Graham Construction Ltd,Causewayside Refurbishment,272390
|
||||
27/02/2017,Cengage Learning (Emea )Ltd,Literary & Archival Items,43302
|
||||
27/02/2017,ECG Facilities Service,Facilities Management Charge,32795
|
||||
06/03/2017,Private Sale,Literary & Archival Items,72500
|
||||
06/03/2017,John Graham Construction Ltd,Causewayside Refurbishment,31781
|
||||
06/03/2017,John Graham Construction Ltd,Causewayside Refurbishment,198048
|
||||
27/03/2017,ECG Facilities Service,Facilities Management Charge,32795
|
||||
31/03/2017,NLS Foundation,Grant Payment,177500
|
||||
31/03/2017,Private Sale,Literary & Archival Items,3422500
|
||||
31/03/2017,Nicholson Bros(Electrical Contractors) Ltd,Causewayside Refurbishment,33666
|
||||
31/03/2017,John Graham Construction Ltd,Causewayside Refurbishment,222090
|
||||
31/03/2017,John Graham Construction Ltd,Causewayside Refurbishment,63971
|
||||
31/03/2017,XMA Scotland Ltd,IT equipment,33450
|
||||
31/03/2017,XMA Scotland Ltd,IT equipment,84524
|
||||
24/04/2017,Cengage Learning (Emea )Ltd,Literary & Archival Items,43302
|
||||
24/04/2017,Scottish Historic Buildings Trust,Lawnmarket Work,50057
|
||||
24/04/2017,Insight Direct (UK) Ltd,IT equipment,56768
|
||||
30/04/2017,Morris & Spottiswood Ltd,George IV Bridge Work,63716
|
||||
08/05/2017,Anglian Water Business,Water,26832
|
||||
15/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,245381
|
||||
22/05/2017,ECG Facilities Service,Facilities Management Charge,33386
|
||||
22/05/2017,ALDL,Legal Deposit Services,27067
|
||||
29/05/2017,ECG Facilities Service,Facilities Management Charge,33386
|
||||
29/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,74806
|
||||
29/05/2017,Morris & Spottiswood Ltd,George IV Bridge Work,56448
|
||||
31/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,164691
|
||||
26/06/2017,ECG Facilities Service,Facilities Management Charge,33386
|
||||
26/06/2017,British Library,Legal Deposit Services,50056
|
||||
24/07/2017,John Graham Construction Ltd,Causewayside Refurbishment,27926
|
||||
24/07/2017,John Graham Construction Ltd,Causewayside Refurbishment,212690
|
||||
24/07/2017,ALDL,Legal Deposit Services,27067
|
||||
24/07/2017,AM Phillip,Vehicle Purchase,26604
|
||||
16/08/2017,ECG Facilities Service,Facilities Management Charge,33386
|
||||
16/08/2017,John Graham Construction Ltd,Causewayside Refurbishment,59021
|
||||
16/08/2017,John Graham Construction Ltd,Causewayside Refurbishment,136379
|
||||
16/08/2017,Ex Libris,IT equipment,76610
|
||||
23/08/2017,Culture And Sport Glasgow,Kelvin Hall,60503
|
||||
23/08/2017,XMA Scotland Ltd,Kelvin Hall,31830
|
||||
23/08/2017,ECG Facilities Service,Facilities Management Charge,33386
|
||||
31/08/2017,John Graham Construction Ltd,Causewayside Refurbishment,36313
|
||||
31/08/2017,Insight Direct (UK) Ltd,Causewayside Refurbishment,68222
|
||||
31/08/2017,Mark Finn Laboratory,George IV Bridge Work,53884
|
||||
11/09/2017,John Graham Construction Ltd,Causewayside Refurbishment,189483
|
||||
15/09/2017,City Of Edinburgh Council,Non Domestic Rates ,57662
|
||||
15/09/2017,City Of Edinburgh Council,Non Domestic Rates ,142680
|
||||
09/10/2017,Frost And Sullivan Ltd,Literary & Archival Items,28125
|
||||
09/10/2017,JISC Services Ltd ,Literary & Archival Items,43481
|
||||
23/10/2017,John Graham Construction Ltd,Causewayside Refurbishment,151659
|
||||
23/10/2017,City Building LLP,Causewayside Refurbishment,53147
|
||||
30/10/2017,ECG Facilities Service,Facilities Management Charge,35758
|
||||
30/10/2017,ECG Facilities Service,Facilities Management Charge,35758
|
||||
06/11/2017,John Graham Construction Ltd,Causewayside Refurbishment,134208
|
||||
06/11/2017,ALDL,Legal Deposit Services,27067
|
||||
27/11/2017,Maggs Bros Ltd,Literary & Archival Items,26500
|
||||
30/11/2017,Glasgow City Council,Kelvin Hall,42345
|
||||
11/12/2017,ECG Facilities Service,Facilities Management Charge,35758
|
||||
11/12/2017,John Graham Construction Ltd,Causewayside Refurbishment,159275
|
||||
08/01/2018,ECG Facilities Service,Facilities Management Charge,35758
|
||||
15/01/2018,Proquest Information And Learn,Literary & Archival Items,42199
|
||||
15/01/2018,John Graham Construction Ltd,Causewayside Refurbishment,123244
|
||||
29/01/2018,ECG Facilities Service,Facilities Management Charge,35758
|
||||
05/02/2018,John Graham Construction Ltd,Causewayside Refurbishment,102659
|
||||
27/02/2018,ALDL,Legal Deposit Services,27067
|
||||
07/03/2018,John Graham Construction Ltd,Causewayside Refurbishment,89559
|
||||
14/03/2018,Bernard Quaritch Ltd,Literary & Archival Items,372500
|
||||
14/03/2018,ECG Facilities Service,Facilities Management Charge,35758
|
||||
21/03/2018,Site Sealants Ltd,Causewayside Refurbishment,27747
|
||||
30/03/2018,Private Sale,Literary & Archival Items,100000
|
||||
30/03/2018,ECG Facilities Service,Facilities Management Charge,35758
|
||||
30/04/2018,ECG FACILITIES SERVICE,Causewayside IT Work,25634.7
|
||||
30/04/2018,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
14/05/2018,GLASGOW CITY COUNCIL,Kelvin Hall,90946
|
||||
11/06/2018,ALDL,ALDL Charges,27067
|
||||
11/06/2018,JOHN GRAHAM CONSTRUCTION LTD,Causewayisde Refurbishment,127753.31
|
||||
22/06/2018,BONHAMS - LONDON,Literary & Archival Items,25025
|
||||
22/06/2018,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
22/06/2018,EX LIBRIS,IT equipment,39000
|
||||
30/06/2018,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
16/07/2018,EX LIBRIS,IT equipment,80057.83
|
||||
18/07/2018,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
18/07/2018,Sotheby's,Literary & Archival Items,41600
|
||||
31/08/2018,AUTOMATED DOCUMENT SERVICES,IT equipment,84480
|
||||
31/08/2018,XMA SCOTLAND LTD,IT equipment,313000
|
||||
13/09/2018,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
13/09/2018,CITY OF EDINBURGH COUNCIL,Non Domestic Rates,59303.2
|
||||
13/09/2018,CITY OF EDINBURGH COUNCIL,Non Domestic Rates,146740
|
||||
20/09/2018,FROST AND SULLIVAN LTD,Literary & Archival Items,28125
|
||||
20/09/2018,SJS Property Services,George IV Bridge Work,44684.2
|
||||
20/09/2018,CENGAGE LEARNING (EMEA )LTD,Literary & Archival Items,64791
|
||||
30/09/2018,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
30/09/2018,SJS Property Services,George IV Bridge Work,51635.35
|
||||
24/10/2018,XMA SCOTLAND LTD,IT equipment,35313.48
|
||||
24/10/2018,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
21/11/2018,EX LIBRIS,IT equipment,39000
|
||||
21/11/2018,EX LIBRIS,IT equipment,53327.09
|
||||
26/11/2018,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
26/11/2018,SJS Property Services,George IV Bridge Work,66818.25
|
||||
11/12/2018,CALEDONIAN LIFT SERVICES LTD,Causewayside Work,47944.8
|
||||
31/12/2018,SOFTCAT,IT equipment,37064.3
|
||||
14/01/2019,m-hance,IT Work,33164.4
|
||||
14/01/2019,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
24/01/2019,ARTHUR MCKAY BUILDING SERVICES,Causewayside Work,100235.17
|
||||
31/01/2019,ECG FACILITIES SERVICE,Causewayside Work,32517.45
|
||||
31/01/2019,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
31/01/2019,CENGAGE LEARNING (EMEA )LTD,Literary & Archival Items,66443
|
||||
14/02/2019,Private Sale,Literary & Archival Items,50000
|
||||
27/02/2019,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
31/03/2019,ECG FACILITIES SERVICE,Facilities Management Charge,35757.91
|
||||
31/03/2019,ECG FACILITIES SERVICE,George IV Bridge Work,37320.15
|
||||
31/03/2019,HP INC UK LTD,IT equipment,40746
|
||||
31/03/2019,INSIGHT DIRECT (UK) LTD,IT equipment,56223.35
|
||||
23/04/2019,EX LIBRIS,"IT equipment
|
||||
",129584.58
|
||||
30/04/2019,ECG FACILITIES SERVICE,Facilities Management Charge,36907.14
|
||||
30/04/2019,COMPUTACENTER UK,"IT equipment
|
||||
",139571.14
|
||||
13/05/2019,GLASGOW LIFE,Kelvin Hall Service Charge,120335
|
||||
04/06/2019,ECG FACILITIES SERVICE,Facilities Management Charge,36907.14
|
||||
24/06/2019,Private Sale,Literary & Archival Items,34400
|
||||
25/06/2019,ECG FACILITIES SERVICE,Facilities Management Charge,36907.14
|
||||
31/07/2019,ECG FACILITIES SERVICE,Facilities Management Charge,36907.14
|
||||
26/08/2019,MICROBOX GmbH,Digital equipment,65881.58
|
||||
27/08/2019,ECG FACILITIES SERVICE,Facilities Management Charge,36907.14
|
||||
27/08/2019,FROST AND SULLIVAN LTD,Literary & Archival Items,28687.5
|
||||
18/09/2019,CITY OF EDINBURGH COUNCIL,Annual Property Rates 2019/20 for three buildings,221467.2
|
||||
25/09/2019,LOTHIAN HEATING SERVICES LTD,Payment 1 - GB Boiler replacement ,57114.18
|
||||
25/09/2019,ECG FACILITIES SERVICE,Facilities Management Charge,34021.61
|
||||
25/09/2019,EDF Energy,Electricity,33122.06
|
||||
18/09/2019,INSTITUTE OF CONSERVATION,Bursary Recruitment and Professional Services costs for intern,26805.2
|
||||
10/10/2019,ECG FACILITIES SERVICE,"CB Bolier Replacement (1),USP Batteries,Gutter Works & Cleaning of pigeon fouling",112794
|
||||
23/10/2019,ECG FACILITIES SERVICE,"CB Bolier Replacement (2),Facilities Management Charge October 19, intumescent strips & unblocking toilets",103462.39
|
||||
23/10/2019,Private Sale,Purchase of Manuscripts,45000
|
||||
04/10/2019,ECG FACILITIES SERVICE,Facilities Management Charge September 19,44288.57
|
||||
10/10/2019,GLASGOW LIFE,Service Charges Kelvin Hall,39100.16
|
||||
15/10/2019,EDF ENERGY,Electricity,26805.74
|
||||
04/10/2019,JISC SERVICES LTD SUBSCRIPTION ACCOUNT,Annual Subscription,25731
|
||||
23/10/2019,ALDL,Oct19-Dec19 charge from Agency for Legal Deposit Libraries,25155.6
|
||||
27/11/2019,ECG FACILITIES SERVICE,"Paymnet for 31 invoices including Facilities Managemenr Charge Nov 19, Lift Repairs, replacement refrigerant gas detection system & data cabling and install of WIFI devices",104526.09
|
||||
05/11/2019,LOTHIAN HEATING SERVICES LTD,GB Bolier Replacement - application 2,45728.9
|
||||
27/11/2019,GLASGOW LIFE,Service Charges Kelvin Hall 01/07/19-30/09/19,41541.47
|
||||
19/11/2019,EDF ENERGY,Electricity Oct 2019 3 buildings,26660.9
|
||||
10/12/2019,PRIVATE SALE,Collection of papers of an individual,125000
|
||||
06/12/2019,PROQUEST,Purchase of 9 subscriptions 01/11/19-31/10/20,61638
|
||||
18/12/2019,ECG,"Payment of 19 separate invoice including for service of chiller, re-route return pipes, data cabling and install of WifI devices, sprinkler work",44556.15
|
||||
22/01/2020,ECG,"Payment of 28 separate invoices including for supply and fit aluminium screen, upgrade boilerhouse electrical panels,CCTV components, pump casting & lift repairs",89297.94
|
||||
09/01/2020,ECG,Payment of 18 separate invoices including for December facilities services and boiler replacement CB,78585.73
|
||||
14/01/2020,LM Information Delivery UK LTD,Payment of 18 separate invoice for Online/Print subscriptions Jan 20-Dec 20,27822.54
|
||||
14/01/2020,EDF,Electricity,25172.34
|
||||
14/01/2020,ALDL,Jan20-Mar 20 charge from Agency for Legal Deposit Libraries,25155.6
|
||||
06/02/2020,XMA Scotland,Scality Ring Maintenance,68464.62
|
||||
06/02/2020,Trustmarque,Miscrosoft Software Licenses,38069.66
|
||||
11/02/2020,Studio MB,Concept Design Semi-Permanent Exhibtion,27000
|
||||
11/02/2020,EDF,Electricity,25484.03
|
||||
06/03/2020,British Library,Governance and Management Costs,27766.6
|
||||
10/03/2020,Proquest,Subscriptions,50309.81
|
||||
10/03/2020,ECG,Two months maintance contracts,80041.02
|
||||
17/03/2020,BSI,Subscription,30951.6
|
||||
17/03/2020,Glasgow Life,Kelvin Hall Service Charges,55857.04
|
||||
17/03/2020,Private Collection,Collection of literary papers,60000
|
||||
20/03/2020,EDF,Electricity,25829.65
|
||||
20/03/2020,ECG,This payment covers 16 invoices including upgrade to boiler control panel & remedial works following 5 year test,32025.98
|
||||
06/04/2020,Gardiner and Theobald,GB Feasibility Study,49508
|
||||
06/04/2020,ECG,This payment covers 8 invocies including monthly facilities management fees & site inspection fees,51822.68
|
||||
23/04/2020,OCLC UK,Cataloging and Metadata subscription,26251.2
|
||||
23/04/2020,John Graham,Stonework Retention Payment,25104.56
|
||||
23/04/2020,EDF,Electricity,25025.89
|
||||
23/04/2020,Studio MB,Exhibition design,63000
|
||||
23/04/2020,ECG,"This payment covers 5 invocies including monthly facilities management fees, software and hardware maintenance & Lighting Upgrades",65200.11
|
||||
14/05/2020,GARDINER AND THEOBALD LLP,GB Feasibility Study,26291.48
|
||||
14/05/2020,HP INC UK LTD,IT equipment purchase,30640.32
|
||||
14/05/2020,XMA SCOTLAND LTD,Purchase of IT equipment and renewal of maintenance agreement. This payment covers 2 invoices,139167.6
|
||||
14/05/2020,CENGAGE LEARNING EMEA LTD,Annual hosting fee,28800
|
||||
21/05/2020,ECG FACILITIES SERVICE,CB Boiler replacement plus monthly maintenance fee. This payment covers 2 invoices,47899.83
|
||||
29/05/2020,EDF ENERGY,Electricity for April in Causewayside and George IV Bridge buildings. This payment covers 2 invoices.,30175.09
|
||||
29/05/2020,SOFTCAT,Software Licence,42866.5
|
||||
09/06/2020,Ex Libris,Annual subsriptions. This payment covers 2 invoices.,189036.11
|
||||
09/06/2020,Glasgow Life,Service Charges,49509.2
|
||||
09/06/2020,XMA Scotland Ltd,IT equipment,25371.84
|
||||
18/06/2020,JISC SERVICES LTD SUBSCRIPTION ACCOUNT,Annual subscription,25896
|
||||
25/06/2020,ECG FACILITIES SERVICE,Facility Management fees,49000
|
||||
25/06/2020,GARDINER AND THEOBALD LLP,GB Feasibility Study,26291.48
|
||||
25/06/2020,THE LEARNING POOL,E-Learning Resources,25344
|
||||
07/07/2020,Agency for the Legal Deposit Libraries,Agency services,26007.95
|
||||
07/07/2020,Lyon and Turnball,Various collection items,54094
|
||||
09/07/2020,XMA Scotland Ltd,Computer equipment,33327
|
||||
14/07/2020,EDF Energy,Utilities,25768.85
|
||||
23/07/2020,Computer Centre UK Ltd,Computer equipment,27750.79
|
||||
23/07/2020,ECG Facility Services,Facility Management fees,49000
|
||||
23/07/2020,GARDINER AND THEOBALD LLP,GB Feasibility Study,26291.48
|
||||
13/08/2020,EDF Energy,Utilities. This transaction is made up of 3 invoices.,26688.27
|
||||
13/08/2020,Frost & Sullivan Ltd,Annual subscription,34425
|
||||
27/08/2020,Agency for Legal Deposit Libaries,Agency services,26007.95
|
||||
27/08/2020,ECG Facilities Services,Facility Management fees,49000
|
||||
27/08/2020,Gardiner and Theobald LLP,GB Feasibility Study,26291.48
|
||||
17/09/2020,EDF Energy,This payment covers 3 invoices for utility services,34283.03
|
||||
17/09/2020,JISC Services Ltd,Subscription,26179.72
|
||||
17/09/2020,XMA Scotland Ltd,IT equipment,26533.92
|
||||
24/09/2020,ECG Facilities Services,Facility Management fees,55450.58
|
||||
24/09/2020,Glasgow Life,Service charges,25211.17
|
||||
08/10/2020,EDF Energy,This payment covers 5 invoices for utility services,27625.53
|
||||
08/10/2020,ALDL,Agency services,26007.95
|
||||
08/10/2020,Institute of Conservation,This payment covers 2 invoices for student bursary costs,31654
|
||||
08/10/2020,Studio MB,Exhibition build works,36000
|
||||
22/10/2020,ECG Facilities,This payment covers 11 invoices for facility Management fees,55672.9
|
||||
22/10/2020,Glasgow City Council,Capital works,34802.4
|
||||
19/11/2020,DTEK DIGITAL SOLUTIONS LTD,Computer equipment,39348
|
||||
19/11/2020,ECG FACILITIES SERVICE,This payment covers multiple invoices for facility Management fees,31888.51
|
||||
19/11/2020,GLASGOW LIFE,Builidng service charges,47690.16
|
||||
26/11/2020,ECG FACILITIES SERVICE,This payment covers multiple invoices for facility Management fees,55299.92
|
||||
26/11/2020,LEE BOYD LIMITED,This payment covers 7 invoices for project management fees,26440.98
|
||||
03/12/2020,PROQUEST INFORMATION AND LEARN,This payment covers multiple invoices for collection items,50232.54
|
||||
10/12/2020,STUDIO MB,This payment covers 2 invoices for exhibition services and equipment,55902
|
||||
17/12/2020,ECG FACILITIES SERVICE,Facility Management Fees,49000
|
||||
17/12/2020,LEE BOYD LIMITED,This payment covers multiple invoices for project management fees,28922.8
|
||||
07/01/2021,ECG FACILITIES SERVICE,This payment covers multiple invoices for facility management fees,39150.26
|
||||
14/01/2021,EDF ENERGY,This payment covers multiple invoices for electricity,28711.17
|
||||
14/01/2021,ALDL,Legal deposit services,26007.95
|
||||
14/01/2021,EXCHANGE COMMUNICATIONS INSTALLATIONS LTD,Telecom services,31878
|
||||
21/01/2021,ECG FACILITIES SERVICE,This payment covers multiple invoices for facility management fees,28797.1
|
||||
28/01/2021,ECG FACILITIES SERVICE,This payment covers multiple invoices for facility management fees,54875.74
|
||||
04/02/2021,PROQUEST INFORMATION AND LEARN,One invoice for collection items,40000
|
||||
18/02/2021,ECG FACILITIES SERVICE,This payment covers multiple invoices for facility management fees,54931.68
|
||||
25/02/2021,ECG FACILITIES SERVICE,This payment covers multiple invoices for facility management fees,51283.39
|
||||
25/02/2021,HP INC UK LTD,IT Equipment,37868.04
|
||||
10/03/2021,BSI,BSOL Modular Subscription,30510
|
||||
16/03/2021,PHOENIX SOFTWARE LTD,IT Hardware plus 5 year licence,74432.04
|
||||
16/03/2021,ECG FACILITIES SERVICE,This payment covers multiple invoices for facility management fees,134758.64
|
||||
23/03/2021,ECG FACILITIES SERVICE,Maintenance Contract - March,49000
|
||||
23/03/2021,ICAM ARCHIVE SYSTEMS,Camera System - phase 1,39120
|
||||
25/03/2021,ECG FACILITIES SERVICE,This payment covers multiple invoices for facility management fees,108450.85
|
||||
31/03/2021,GLASGOW LIFE,Oct 20 to Dec 20 service charge - Kelvin Hall,54840.53
|
||||
31/03/2021,ECG FACILITIES SERVICE,Replacement Humidifer units,76751
|
||||
31/03/2021,ECG FACILITIES SERVICE,Cooling and Humidifer system upgrade,26943.84
|
||||
31/03/2021,ECG FACILITIES SERVICE,Installation of CCTV,29404.62
|
||||
29/04/2021,ECG FACILITIES SERVICE,This payment covers April 21 Maintenance Contract and the installation of battery rack and batteries plus smaller maintenance invoices,71604.07
|
||||
29/04/2021,GLASGOW LIFE,Jan 21 to Mar 21 service charge - Kelvin Hall,46657.33
|
||||
20/05/2021,ECG FACILITIES SERVICE,Routine inspection and maintenance of all NLS properties,52584.2
|
||||
27/05/2021,XMA SCOTLAND LTD,2 invoices one for the replacement of obsolete hardware and the other for a new laptop,28587.59
|
||||
13/05/2021,ALDL,"Claiming, receipting and onward distribution of legal deposit on behalf of NLS",26376.68
|
||||
27/05/2021,LYON AND TURNBULL,Purchase of a manuscript,26000
|
||||
27/05/2021,ARNOLD CLARK,Purchase of an electric van,25949.5
|
||||
28/06/2021,XMA Scotland Ltd,Purchase of IT hardware for cloud and maintenance of hardware,72061.92
|
||||
08/07/2021,EX LIBRIS,Subscription April to Oct 21 cloud based library services,95045.31
|
||||
08/07/2021,ECG FACILITIES SERVICE,Maintenance contract - June 21 period,52459.25
|
||||
08/07/2021,XMA SCOTLAND LTD,IT hardware equipment,37620.86
|
||||
22/07/2021,ALDL,Quarterly invoice legal deposit materials - July to Sept 21,26400.68
|
||||
12/08/2021,ECG FACILITIES SERVICE,Maintenance contract - July 21 period,52459.25
|
||||
27/08/2021,ECG FACILITIES SERVICE,Maintenance contract - August 21 period,52459.25
|
||||
27/08/2021,ECG FACILITIES SERVICE,Water penetration works - part 2,28350
|
||||
27/08/2021,ECG FACILITIES SERVICE,Water penetration works - part 3,28350
|
||||
22/09/2021,GLASGOW LIFE,Kelvin Hall Service Charge - April to June 21,35420.45
|
||||
29/09/2021,ECG FACILITIES SERVICE,Maintenance contract - all properties,52459.25
|
||||
29/09/2021,FROST AND SULLIVAN LTD,Annual Subscription - Sept 21 to Oct 22,35147.09
|
||||
21/10/2021,ECG FACILITIES SERVICE,Maintenance contract - October,52459.25
|
||||
31/10/2021,SOFTCAT,It purchases for server,42282.72
|
||||
14/10/2021,ALDL,"Claiming, receipting and onward distribution for quarter Oct to Dec 21",26400.68
|
||||
04/11/2021,Web of Science JISC SHEDL subs ,Subscription 2021 to 2021 SHEDL,28361.78
|
||||
11/11/2021,M and J Kelman Ltd,Literary and personal papers of James Kelman,40000
|
||||
11/11/2021,John Graham Constrution Ltd,External fabric repairs - Causeway Side building,75262.75
|
||||
11/11/2021,Robert Harland,Correspondance and Literary papers - Thomas Carlyle,94000
|
||||
11/11/2021,Jisc Services Ltd,IT Subscription and router service charge,25896
|
||||
25/11/2021,ECG Facilities,Maintenance Contract - November,52459.25
|
||||
25/11/2021,Ex Libris,IT Subscription ,81729.02
|
||||
31/12/2021,ECG FACILITIES SERVICE,Electrical and mechanical works,28071.17
|
||||
16/12/2021,JAMES BRECK LTD,Re-slating of roof LB,28572.28
|
||||
23/12/2021,CENGAGE LEARNING EMEA LTD,Subscription - Historical Archive,32460
|
||||
31/12/2021,GLASGOW LIFE,Quarterly service charge KH,45541.34
|
||||
31/12/2021,ECG FACILITIES SERVICE,Maintenance Contract - December,52459.25
|
||||
16/12/2021,ECG FACILITIES SERVICE,"Electrical, mechanical and building works",82227.96
|
||||
27/01/2022,ECG FACILITIES SERVICE,January maintenance contract,52459.25
|
||||
31/01/2022,ALDL,1st January to 31st March 22 - receipting and onward distribution of UK legal deposit materials on behalf of National Library of Scotland,26388.68
|
||||
03/02/2022,ECG FACILITIES SERVICE,"Monthly maintenance contract, drainage jetting and cctv remedials, patio roofing wash",62411.69
|
||||
10/02/2022,JAMES BRECK LTD,Roof uplifting and re-slating,31890.41
|
||||
10/02/2022,LEE BOYD LIMITED,Various invoices smoke extract system and rateable value review,30552
|
||||
17/02/2022,LEE BOYD LIMITED,"Various invoices for CB smoke extract system, project work - FM maintenance framework, sprinkler system",57766.9
|
||||
24/02/2022,ECG FACILITIES SERVICE,"Carry out tanking works, supply and fit mini drive unit, balustrade repairs",27723.16
|
||||
24/02/2022,ADAM MATTHEW DIGITAL LTD,Resource - slavery abolution and social justice,37080
|
||||
10/03/2022,ECG FACILITIES SERVICE,Maintenance contract - March,52459.25
|
||||
10/03/2022,XMA SCOTLAND LTD,It equipment,61885.56
|
||||
17/03/2022,EDF ENERGY,Electricity bill for various sites,57220.55
|
||||
17/03/2022,ECG FACILITIES SERVICE,Maintenance contract - Feb plus various smaller invoices for maintenance jobs,71653.47
|
||||
17/03/2022,XMA010,IT equipment,77208.77
|
||||
17/03/2022,OXFORD UNIVERSITY PRESS,Annual subscription,28576.89
|
||||
24/03/2022,ECG FACILITIES SERVICE,Various small maintenance jobs around library sites,34055.73
|
||||
24/03/2022,GLASGOW LIFE,Kelvin Hall quarterly service charge,41637.96
|
||||
24/03/2022,LEE BOYD LIMITED,Sprinkler system project and lift refurb George IV,55234
|
||||
24/03/2022,BSI,Annual subscription,31425
|
||||
31/03/2022,ECG FACILITIES SERVICE,Various small maintenance jobs around library sites,28760.32
|
||||
31/03/2022,XMA SCOTLAND LTD,It equipment,47461.25
|
||||
31/03/2022,JAMES BRECK LTD,Roof uplift and reslating,28230.64
|
||||
31/03/2022,LEE BOYD LIMITED,Various small maintenance jobs around library sites,26396.1
|
||||
31/03/2022,UNIVERSITY OF DUNDEE,Salary costs for SCURL Scottish Universities press project,39726.44
|
||||
30/04/2022,JISC Services Ltd,Managed router service charge annual subscription 01/04/22 to 31/03/23,25896
|
||||
30/04/2022,EX Libris,Subscription Alma and Primo 01/04/22 to 31/10/22,114420.65
|
||||
11/05/2022,KENNYS BOOKSHOP&ART GALLERIES,Purchase of Smillie Archive,30000
|
||||
12/05/2022,ECG FACILITIES SERVICE,Inspection and Maintenance of all Library properties,55711.72
|
||||
19/05/2022,CAE TECHNOLOGY SERVICES LIMITED,Subscription renewal,25041.31
|
||||
19/05/2022,GLASGOW LIFE,Kelvin Hall service charge Jan to Mar 22,59084.95
|
||||
31/05/2022,ECG FACILITIES SERVICE,Fit pre-purchased humidifiers,29710.8
|
||||
31/05/2022,ECG FACILITIES SERVICE,Routine inspection and maintenance May 22,55711.72
|
||||
31/05/2022,ALDL,Legal deposit materials April to July 22,27013.18
|
||||
09/06/2022,LEE BOYD LIMITED,Architectural Works,93690
|
||||
16/06/2022,CITY OF EDINBURGH COUNCIL,Rates for 33 Salisbury Place,136240
|
||||
16/06/2022,CITY OF EDINBURGH COUNCIL,Rates 57 George IV Bridge,41920
|
||||
23/06/2022,ECG FACILITIES SERVICE,Maintenance contract - June 22,55711.72
|
||||
21/07/2022,ALDL,"Claiming,receipting and onward distribution of UK legal deposit materials July to Sept 22",27013.16
|
||||
21/07/2022,RICK GEKOSKI,Papers 1970's to 2019 Alisdair Gray,125000
|
||||
28/07/2022,SONYA LEONARD,Literary and personal papers of Tom Leonard 1961 to 2018,40000
|
|
102
examples/data/labelled_transactions.csv
Normal file
@ -0,0 +1,102 @@
|
||||
Date,Supplier,Description,Transaction value (£),Classification
|
||||
15/08/2016,Creative Video Productions Ltd,Kelvin Hall,26866,Other
|
||||
29/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,74806,Building Improvement
|
||||
29/05/2017,Morris & Spottiswood Ltd,George IV Bridge Work,56448,Building Improvement
|
||||
31/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,164691,Building Improvement
|
||||
24/07/2017,John Graham Construction Ltd,Causewayside Refurbishment,27926,Building Improvement
|
||||
24/07/2017,John Graham Construction Ltd,Causewayside Refurbishment,212690,Building Improvement
|
||||
16/08/2017,John Graham Construction Ltd,Causewayside Refurbishment,59021,Building Improvement
|
||||
16/08/2017,John Graham Construction Ltd,Causewayside Refurbishment,136379,Building Improvement
|
||||
23/08/2017,Culture And Sport Glasgow,Kelvin Hall,60503,Building Improvement
|
||||
23/08/2017,XMA Scotland Ltd,Kelvin Hall,31830,Building Improvement
|
||||
31/08/2017,John Graham Construction Ltd,Causewayside Refurbishment,36313,Building Improvement
|
||||
31/08/2017,Insight Direct (UK) Ltd,Causewayside Refurbishment,68222,Building Improvement
|
||||
31/08/2017,Mark Finn Laboratory,George IV Bridge Work,53884,Building Improvement
|
||||
11/09/2017,John Graham Construction Ltd,Causewayside Refurbishment,189483,Building Improvement
|
||||
23/10/2017,John Graham Construction Ltd,Causewayside Refurbishment,151659,Building Improvement
|
||||
23/10/2017,City Building LLP,Causewayside Refurbishment,53147,Building Improvement
|
||||
07/02/2017,John Graham Construction Ltd,Causewayside Refurbishment,52404,Building Improvement
|
||||
13/02/2017,John Graham Construction Ltd,Causewayside Refurbishment,272390,Building Improvement
|
||||
06/03/2017,John Graham Construction Ltd,Causewayside Refurbishment,31781,Building Improvement
|
||||
06/03/2017,John Graham Construction Ltd,Causewayside Refurbishment,198048,Building Improvement
|
||||
31/03/2017,Nicholson Bros(Electrical Contractors) Ltd,Causewayside Refurbishment,33666,Building Improvement
|
||||
31/03/2017,John Graham Construction Ltd,Causewayside Refurbishment,222090,Building Improvement
|
||||
31/03/2017,John Graham Construction Ltd,Causewayside Refurbishment,63971,Building Improvement
|
||||
24/04/2017,Scottish Historic Buildings Trust,Lawnmarket Work,50057,Building Improvement
|
||||
30/04/2017,Morris & Spottiswood Ltd,George IV Bridge Work,63716,Building Improvement
|
||||
15/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,245381,Building Improvement
|
||||
12/09/2016,Flexiform,Kelvin Hall,42623,Building Improvement
|
||||
12/09/2016,John Graham Construction Ltd,Causewayside Refurbishment,228689,Building Improvement
|
||||
26/09/2016,Senator International,Kelvin Hall,35706,Building Improvement
|
||||
26/09/2016,John Graham Construction Ltd,Causewayside Refurbishment,28378,Building Improvement
|
||||
30/09/2016,A McGillivray,Causewayside Refurbishment,44392,Building Improvement
|
||||
10/10/2016,John Graham Construction Ltd,Causewayside Refurbishment,303999,Building Improvement
|
||||
31/10/2016,John Graham Construction Ltd,Causewayside Refurbishment,74245,Building Improvement
|
||||
07/11/2016,CBRE,Kelvin Hall,83736,Building Improvement
|
||||
14/11/2016,University Of Glasgow,Kelvin Hall,188682,Building Improvement
|
||||
14/11/2016,John Graham Construction Ltd,Causewayside Refurbishment,362326,Building Improvement
|
||||
12/12/2016,John Graham Construction Ltd,Causewayside Refurbishment,385310,Building Improvement
|
||||
30/12/2016,John Graham Construction Ltd,Causewayside Refurbishment,253618,Building Improvement
|
||||
30/12/2016,John Graham Construction Ltd,Causewayside Refurbishment,45127,Building Improvement
|
||||
21/04/2016,M & J Ballantyne Ltd,George IV Bridge Work,35098,Building Improvement
|
||||
09/05/2016,John Graham Construction Ltd,Causewayside Refurbishment,64361,Building Improvement
|
||||
09/05/2016,A McGillivray,Causewayside Refurbishment,53690,Building Improvement
|
||||
16/05/2016,John Graham Construction Ltd,Causewayside Refurbishment,365344,Building Improvement
|
||||
10/06/2016,Wavetek Ltd,Kelvin Hall,87589,Building Improvement
|
||||
10/06/2016,John Graham Construction Ltd,Causewayside Refurbishment,381803,Building Improvement
|
||||
30/06/2016,Glasgow City Council,Kelvin Hall,1700000,Building Improvement
|
||||
11/07/2016,Wavetek Ltd,Kelvin Hall,65692,Building Improvement
|
||||
11/07/2016,John Graham Construction Ltd,Causewayside Refurbishment,139845,Building Improvement
|
||||
25/07/2016,A McGillivray,Causewayside Refurbishment,30113,Building Improvement
|
||||
15/08/2016,John Graham Construction Ltd,Causewayside Refurbishment,196807,Building Improvement
|
||||
06/11/2017,John Graham Construction Ltd,Causewayside Refurbishment,134208,Building Improvement
|
||||
31/03/2017,NLS Foundation,Grant Payment,177500,Other
|
||||
09/10/2017,Frost And Sullivan Ltd,Literary & Archival Items,28125,Literature & Archive
|
||||
09/10/2017,JISC Services Ltd ,Literary & Archival Items,43481,Literature & Archive
|
||||
27/02/2017,Cengage Learning (Emea )Ltd,Literary & Archival Items,43302,Literature & Archive
|
||||
06/03/2017,Private Sale,Literary & Archival Items,72500,Literature & Archive
|
||||
31/03/2017,Private Sale,Literary & Archival Items,3422500,Literature & Archive
|
||||
24/04/2017,Cengage Learning (Emea )Ltd,Literary & Archival Items,43302,Literature & Archive
|
||||
22/05/2017,ALDL,Legal Deposit Services,27067,Literature & Archive
|
||||
19/09/2016,Jisc Services Ltd Subscription Account,Literary & Archival Items,42629,Literature & Archive
|
||||
10/10/2016,Cengage Learning (Emea )Ltd,Literary & Archival Items,86604,Literature & Archive
|
||||
24/10/2016,ALDL,ALDL Charges,32317,Literature & Archive
|
||||
26/04/2016,Private Sale,Literary & Archival Items,30000,Literature & Archive
|
||||
30/05/2016,ALDL,ALDL Charges,32317,Literature & Archive
|
||||
15/07/2016,Sotheby'S,Literary & Archival Items,28500,Literature & Archive
|
||||
18/07/2016,Christies,Literary & Archival Items,33800,Literature & Archive
|
||||
31/07/2016,ALDL,ALDL Charges,32317,Literature & Archive
|
||||
08/12/2016,Sothebys,Literary & Archival Items,166000,Literature & Archive
|
||||
08/12/2016,Private Sale,Literary & Archival Items,87500,Literature & Archive
|
||||
26/06/2017,ECG Facilities Service,Facilities Management Charge,33386,Utility Bills
|
||||
26/06/2017,British Library,Legal Deposit Services,50056,Other
|
||||
24/07/2017,ALDL,Legal Deposit Services,27067,Other
|
||||
16/08/2017,ECG Facilities Service,Facilities Management Charge,33386,Utility Bills
|
||||
23/08/2017,ECG Facilities Service,Facilities Management Charge,33386,Utility Bills
|
||||
07/02/2017,ECG Facilities Service,Facilities Management Charge,32795,Utility Bills
|
||||
27/02/2017,ECG Facilities Service,Facilities Management Charge,32795,Utility Bills
|
||||
27/03/2017,ECG Facilities Service,Facilities Management Charge,32795,Utility Bills
|
||||
22/05/2017,ECG Facilities Service,Facilities Management Charge,33386,Utility Bills
|
||||
26/09/2016,ECG Facilities Service,Facilities Management Charge,32795,Utility Bills
|
||||
24/10/2016,ECG Facilities Service,Facilities Management Charge,32795,Utility Bills
|
||||
08/12/2016,ECG Facilities Service,Facilities Management Charge,32795,Utility Bills
|
||||
30/12/2016,ECG Facilities Service,Facilities Management Charge,32795,Utility Bills
|
||||
23/05/2016,ECG Facilities Service,Facilities Management Charge,32777,Utility Bills
|
||||
23/05/2016,ECG Facilities Service,Facilities Management Charge,32777,Utility Bills
|
||||
28/06/2016,ECG Facilities Service,Facilities Management Charge,32832,Utility Bills
|
||||
08/08/2016,ECG Facilities Service,Facilities Management Charge,32795,Utility Bills
|
||||
24/08/2016,ECG Facilities Service,Facilities Management Charge,32795,Utility Bills
|
||||
30/10/2017,ECG Facilities Service,Facilities Management Charge,35758,Utility Bills
|
||||
16/08/2017,Ex Libris,IT equipment,76610,Software/IT
|
||||
31/03/2017,XMA Scotland Ltd,IT equipment,33450,Software/IT
|
||||
31/03/2017,XMA Scotland Ltd,IT equipment,84524,Software/IT
|
||||
24/04/2017,Insight Direct (UK) Ltd,IT equipment,56768,Software/IT
|
||||
09/05/2016,Computacenter Uk,Kelvin Hall,72835,Software/IT
|
||||
23/05/2016,Computacenter Uk,Kelvin Hall,26506,Software/IT
|
||||
15/09/2017,City Of Edinburgh Council,Non Domestic Rates ,57662,Utility Bills
|
||||
15/09/2017,City Of Edinburgh Council,Non Domestic Rates ,142680,Utility Bills
|
||||
08/05/2017,Anglian Water Business,Water,26832,Utility Bills
|
||||
30/04/2016,City Of Edinburgh Council,Non Domestic Rates ,40800,Utility Bills
|
||||
12/09/2016,City Of Edinburgh Council,Non Domestic Rates ,144330,Utility Bills
|
||||
12/09/2016,City Of Edinburgh Council,Non Domestic Rates ,49827,Utility Bills
|
||||
24/07/2017,AM Phillip,Vehicle Purchase,26604,Other
|
|
156
examples/fine-tuned_qa/answers_with_ft.py
Normal file
@ -0,0 +1,156 @@
|
||||
"""
|
||||
Note: To answer questions based on text documents, we recommend the procedure in
|
||||
[Question Answering using Embeddings](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb).
|
||||
Some of the code below may rely on [deprecated API endpoints](https://github.com/openai/openai-cookbook/tree/main/transition_guides_for_deprecated_API_endpoints).
|
||||
"""
|
||||
|
||||
import argparse
|
||||
|
||||
import openai
|
||||
|
||||
|
||||
def create_context(
|
||||
question, search_file_id, max_len=1800, search_model="ada", max_rerank=10
|
||||
):
|
||||
"""
|
||||
Create a context for a question by finding the most similar context from the search file.
|
||||
:param question: The question
|
||||
:param search_file_id: The file id of the search file
|
||||
:param max_len: The maximum length of the returned context (in tokens)
|
||||
:param search_model: The search model to use
|
||||
:param max_rerank: The maximum number of reranking
|
||||
:return: The context
|
||||
"""
|
||||
results = openai.Engine(search_model).search(
|
||||
search_model=search_model,
|
||||
query=question,
|
||||
max_rerank=max_rerank,
|
||||
file=search_file_id,
|
||||
return_metadata=True,
|
||||
)
|
||||
returns = []
|
||||
cur_len = 0
|
||||
for result in results["data"]:
|
||||
cur_len += int(result["metadata"]) + 4
|
||||
if cur_len > max_len:
|
||||
break
|
||||
returns.append(result["text"])
|
||||
return "\n\n###\n\n".join(returns)
|
||||
|
||||
|
||||
def answer_question(
|
||||
search_file_id="<SEARCH_FILE_ID>",
|
||||
fine_tuned_qa_model="<FT_QA_MODEL_ID>",
|
||||
question="Which country won the European Football championship in 2021?",
|
||||
max_len=1800,
|
||||
search_model="ada",
|
||||
max_rerank=10,
|
||||
debug=False,
|
||||
stop_sequence=["\n", "."],
|
||||
max_tokens=100,
|
||||
):
|
||||
"""
|
||||
Answer a question based on the most similar context from the search file, using your fine-tuned model.
|
||||
:param question: The question
|
||||
:param fine_tuned_qa_model: The fine tuned QA model
|
||||
:param search_file_id: The file id of the search file
|
||||
:param max_len: The maximum length of the returned context (in tokens)
|
||||
:param search_model: The search model to use
|
||||
:param max_rerank: The maximum number of reranking
|
||||
:param debug: Whether to output debug information
|
||||
:param stop_sequence: The stop sequence for Q&A model
|
||||
:param max_tokens: The maximum number of tokens to return
|
||||
:return: The answer
|
||||
"""
|
||||
context = create_context(
|
||||
question,
|
||||
search_file_id,
|
||||
max_len=max_len,
|
||||
search_model=search_model,
|
||||
max_rerank=max_rerank,
|
||||
)
|
||||
if debug:
|
||||
print("Context:\n" + context)
|
||||
print("\n\n")
|
||||
try:
|
||||
        # fine-tuned models require the "model" parameter, whereas other models require the "engine" parameter
|
||||
model_param = (
|
||||
{"model": fine_tuned_qa_model}
|
||||
if ":" in fine_tuned_qa_model
|
||||
and fine_tuned_qa_model.split(":")[1].startswith("ft")
|
||||
else {"engine": fine_tuned_qa_model}
|
||||
)
|
||||
response = openai.Completion.create(
|
||||
prompt=f"Answer the question based on the context below\n\nText: {context}\n\n---\n\nQuestion: {question}\nAnswer:",
|
||||
temperature=0,
|
||||
max_tokens=max_tokens,
|
||||
top_p=1,
|
||||
frequency_penalty=0,
|
||||
presence_penalty=0,
|
||||
stop=stop_sequence,
|
||||
**model_param,
|
||||
)
|
||||
return response["choices"][0]["text"]
|
||||
except Exception as e:
|
||||
print(e)
|
||||
return ""
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Rudimentary functionality of the answers endpoint with a fine-tuned Q&A model.",
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
||||
)
|
||||
parser.add_argument(
|
||||
"--search_file_id", help="Search file id", required=True, type=str
|
||||
)
|
||||
parser.add_argument(
|
||||
"--fine_tuned_qa_model", help="Fine-tuned QA model id", required=True, type=str
|
||||
)
|
||||
parser.add_argument(
|
||||
"--question", help="Question to answer", required=True, type=str
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max_len",
|
||||
help="Maximum length of the returned context (in tokens)",
|
||||
default=1800,
|
||||
type=int,
|
||||
)
|
||||
parser.add_argument(
|
||||
"--search_model", help="Search model to use", default="ada", type=str
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max_rerank",
|
||||
help="Maximum number of reranking for the search",
|
||||
default=10,
|
||||
type=int,
|
||||
)
|
||||
parser.add_argument(
|
||||
"--debug", help="Print debug information (context used)", action="store_true"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--stop_sequence",
|
||||
help="Stop sequences for the Q&A model",
|
||||
default=["\n", "."],
|
||||
nargs="+",
|
||||
type=str,
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max_tokens",
|
||||
help="Maximum number of tokens to return",
|
||||
default=100,
|
||||
type=int,
|
||||
)
|
||||
args = parser.parse_args()
|
||||
response = answer_question(
|
||||
search_file_id=args.search_file_id,
|
||||
fine_tuned_qa_model=args.fine_tuned_qa_model,
|
||||
question=args.question,
|
||||
max_len=args.max_len,
|
||||
search_model=args.search_model,
|
||||
max_rerank=args.max_rerank,
|
||||
debug=args.debug,
|
||||
stop_sequence=args.stop_sequence,
|
||||
max_tokens=args.max_tokens,
|
||||
)
|
||||
print(f"Answer:{response}")
|
523
examples/fine-tuned_qa/olympics-1-collect-data.ipynb
Normal file
523
examples/fine-tuned_qa/olympics-1-collect-data.ipynb
Normal file
File diff suppressed because one or more lines are too long
763
examples/fine-tuned_qa/olympics-2-create-qa.ipynb
Normal file
763
examples/fine-tuned_qa/olympics-2-create-qa.ipynb
Normal file
File diff suppressed because one or more lines are too long
647
examples/fine-tuned_qa/olympics-3-train-qa.ipynb
Normal file
647
examples/fine-tuned_qa/olympics-3-train-qa.ipynb
Normal file
@ -0,0 +1,647 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<span style=\"color:orange; font-weight:bold\">Note: To answer questions based on text documents, we recommend the procedure in <a href=\"https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb\">Question Answering using Embeddings</a>. Some of the code below may rely on <a href=\"https://github.com/openai/openai-cookbook/tree/main/transition_guides_for_deprecated_API_endpoints\">deprecated API endpoints</a>.</span>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 3. Train a fine-tuning model specialized for Q&A\n",
|
||||
"This notebook will utilize the dataset of context, question and answer pairs to additionally create adversarial questions and context pairs, where the question was not generated on that context. In those cases the model will be prompted to answer \"No sufficient context for answering the question\". We will also train a discriminator model, which predicts whether the question can be answered based on the context or not.\n",
|
||||
"\n",
|
||||
"We will add hard adversarial examples as well, which will be based either on semantically similar sections, or neighbouring sections, originating from the same article."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>title</th>\n",
|
||||
" <th>heading</th>\n",
|
||||
" <th>content</th>\n",
|
||||
" <th>tokens</th>\n",
|
||||
" <th>context</th>\n",
|
||||
" <th>questions</th>\n",
|
||||
" <th>answers</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>2020 Summer Olympics</td>\n",
|
||||
" <td>Summary</td>\n",
|
||||
" <td>The 2020 Summer Olympics (Japanese: 2020年夏季オリン...</td>\n",
|
||||
" <td>713</td>\n",
|
||||
" <td>2020 Summer Olympics\\nSummary\\n\\nThe 2020 Summ...</td>\n",
|
||||
" <td>1. What is the 2020 Summer Olympics?\\n2. When ...</td>\n",
|
||||
" <td>1. The 2020 Summer Olympics is an internationa...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2020 Summer Olympics</td>\n",
|
||||
" <td>Host city selection</td>\n",
|
||||
" <td>The International Olympic Committee (IOC) vote...</td>\n",
|
||||
" <td>126</td>\n",
|
||||
" <td>2020 Summer Olympics\\nHost city selection\\n\\nT...</td>\n",
|
||||
" <td>1. \\n2. \\n3. \\n4.</td>\n",
|
||||
" <td>1. What is the International Olympic Committee...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2020 Summer Olympics</td>\n",
|
||||
" <td>Impact of the COVID-19 pandemic</td>\n",
|
||||
" <td>In January 2020, concerns were raised about th...</td>\n",
|
||||
" <td>369</td>\n",
|
||||
" <td>2020 Summer Olympics\\nImpact of the COVID-19 p...</td>\n",
|
||||
" <td>1. What was the COVID-19 pandemic?\\n2. How did...</td>\n",
|
||||
" <td>1. The COVID-19 pandemic was a pandemic that o...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>2020 Summer Olympics</td>\n",
|
||||
" <td>Qualifying event cancellation and postponement</td>\n",
|
||||
" <td>Concerns about the pandemic began to affect qu...</td>\n",
|
||||
" <td>298</td>\n",
|
||||
" <td>2020 Summer Olympics\\nQualifying event cancell...</td>\n",
|
||||
" <td>1. What was the original location of the Asia ...</td>\n",
|
||||
" <td>1. The original location of the Asia & Oceania...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>2020 Summer Olympics</td>\n",
|
||||
" <td>Effect on doping tests</td>\n",
|
||||
" <td>Mandatory doping tests were being severely res...</td>\n",
|
||||
" <td>163</td>\n",
|
||||
" <td>2020 Summer Olympics\\nEffect on doping tests\\n...</td>\n",
|
||||
" <td>1. What was the COVID-19 pandemic?\\n2. What di...</td>\n",
|
||||
" <td>1. The COVID-19 pandemic was a pandemic that o...</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" title heading \\\n",
|
||||
"0 2020 Summer Olympics Summary \n",
|
||||
"1 2020 Summer Olympics Host city selection \n",
|
||||
"2 2020 Summer Olympics Impact of the COVID-19 pandemic \n",
|
||||
"3 2020 Summer Olympics Qualifying event cancellation and postponement \n",
|
||||
"4 2020 Summer Olympics Effect on doping tests \n",
|
||||
"\n",
|
||||
" content tokens \\\n",
|
||||
"0 The 2020 Summer Olympics (Japanese: 2020年夏季オリン... 713 \n",
|
||||
"1 The International Olympic Committee (IOC) vote... 126 \n",
|
||||
"2 In January 2020, concerns were raised about th... 369 \n",
|
||||
"3 Concerns about the pandemic began to affect qu... 298 \n",
|
||||
"4 Mandatory doping tests were being severely res... 163 \n",
|
||||
"\n",
|
||||
" context \\\n",
|
||||
"0 2020 Summer Olympics\\nSummary\\n\\nThe 2020 Summ... \n",
|
||||
"1 2020 Summer Olympics\\nHost city selection\\n\\nT... \n",
|
||||
"2 2020 Summer Olympics\\nImpact of the COVID-19 p... \n",
|
||||
"3 2020 Summer Olympics\\nQualifying event cancell... \n",
|
||||
"4 2020 Summer Olympics\\nEffect on doping tests\\n... \n",
|
||||
"\n",
|
||||
" questions \\\n",
|
||||
"0 1. What is the 2020 Summer Olympics?\\n2. When ... \n",
|
||||
"1 1. \\n2. \\n3. \\n4. \n",
|
||||
"2 1. What was the COVID-19 pandemic?\\n2. How did... \n",
|
||||
"3 1. What was the original location of the Asia ... \n",
|
||||
"4 1. What was the COVID-19 pandemic?\\n2. What di... \n",
|
||||
"\n",
|
||||
" answers \n",
|
||||
"0 1. The 2020 Summer Olympics is an internationa... \n",
|
||||
"1 1. What is the International Olympic Committee... \n",
|
||||
"2 1. The COVID-19 pandemic was a pandemic that o... \n",
|
||||
"3 1. The original location of the Asia & Oceania... \n",
|
||||
"4 1. The COVID-19 pandemic was a pandemic that o... "
|
||||
]
|
||||
},
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import openai\n",
|
||||
"import pandas as pd\n",
|
||||
"df = pd.read_csv('olympics-data/olympics_qa.csv')\n",
|
||||
"olympics_search_fileid = \"file-c3shd8wqF3vSCKaukW4Jr1TT\"\n",
|
||||
"df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Split the sections into a training and testing set"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(3014, 754)"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.model_selection import train_test_split\n",
|
||||
"train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)\n",
|
||||
"len(train_df), len(test_df)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"we check that he separator we intend to use isn't present within the contexts"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df.context.str.contains('->').sum()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3.1 Create the fine-tuning datasets for Q&A and discriminator models\n",
|
||||
"The fine-tuning dataset is created in the following way. For every corresponding question, answer and context pair we create:\n",
|
||||
"- Positive example: correct question, answer, context pair\n",
|
||||
"- Negative examples:\n",
|
||||
" - random negative example, where the random context is paired with the question \n",
|
||||
" - two hard negative examples\n",
|
||||
" - one originating from the same wikipedia article\n",
|
||||
" - another, which is most similar to the correct context\n",
|
||||
"\n",
|
||||
"This process is noisy, as sometimes the question might be answerable given a different context, but on average we hope this won't affect the peformance too much.\n",
|
||||
"\n",
|
||||
"We apply the same process of dataset creation for both the discriminator, and the Q&A answering model. We apply the process separately for the training and testing set, to ensure that the examples from the traing set don't feature within the test set."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import random\n",
|
||||
"\n",
|
||||
"def get_random_similar_contexts(question, context, file_id=olympics_search_fileid, search_model='ada', max_rerank=10):\n",
|
||||
" \"\"\"\n",
|
||||
" Find similar contexts to the given context using the search file\n",
|
||||
" \"\"\"\n",
|
||||
" try:\n",
|
||||
" results = openai.Engine(search_model).search(\n",
|
||||
" search_model=search_model, \n",
|
||||
" query=question, \n",
|
||||
" max_rerank=max_rerank,\n",
|
||||
" file=file_id\n",
|
||||
" )\n",
|
||||
" candidates = []\n",
|
||||
" for result in results['data'][:3]:\n",
|
||||
" if result['text'] == context:\n",
|
||||
" continue\n",
|
||||
" candidates.append(result['text'])\n",
|
||||
" random_candidate = random.choice(candidates)\n",
|
||||
" return random_candidate\n",
|
||||
" except Exception as e:\n",
|
||||
" print(e)\n",
|
||||
" return \"\"\n",
|
||||
"\n",
|
||||
"def create_fine_tuning_dataset(df, discriminator=False, n_negative=1, add_related=False):\n",
|
||||
" \"\"\"\n",
|
||||
" Create a dataset for fine tuning the OpenAI model; either for a discriminator model, \n",
|
||||
" or a model specializing in Q&A, where it says if no relevant context is found.\n",
|
||||
"\n",
|
||||
" Parameters\n",
|
||||
" ----------\n",
|
||||
" df: pd.DataFrame\n",
|
||||
" The dataframe containing the question, answer and context pairs\n",
|
||||
" discriminator: bool\n",
|
||||
" Whether to create a dataset for the discriminator\n",
|
||||
" n_negative: int\n",
|
||||
" The number of random negative samples to add (using a random context)\n",
|
||||
" add_related: bool\n",
|
||||
" Whether to add the related contexts to the correct context. These are hard negative examples\n",
|
||||
"\n",
|
||||
" Returns\n",
|
||||
" -------\n",
|
||||
" pd.DataFrame\n",
|
||||
" The dataframe containing the prompts and completions, ready for fine-tuning\n",
|
||||
" \"\"\"\n",
|
||||
" rows = []\n",
|
||||
" for i, row in df.iterrows():\n",
|
||||
" for q, a in zip((\"1.\" + row.questions).split('\\n'), (\"1.\" + row.answers).split('\\n')):\n",
|
||||
" if len(q) >10 and len(a) >10:\n",
|
||||
" if discriminator:\n",
|
||||
" rows.append({\"prompt\":f\"{row.context}\\nQuestion: {q[2:].strip()}\\n Related:\", \"completion\":f\" yes\"})\n",
|
||||
" else:\n",
|
||||
" rows.append({\"prompt\":f\"{row.context}\\nQuestion: {q[2:].strip()}\\nAnswer:\", \"completion\":f\" {a[2:].strip()}\"})\n",
|
||||
"\n",
|
||||
" for i, row in df.iterrows():\n",
|
||||
" for q in (\"1.\" + row.questions).split('\\n'):\n",
|
||||
" if len(q) >10:\n",
|
||||
" for j in range(n_negative + (2 if add_related else 0)):\n",
|
||||
" random_context = \"\"\n",
|
||||
" if j == 0 and add_related:\n",
|
||||
" # add the related contexts based on originating from the same wikipedia page\n",
|
||||
" subset = df[(df.title == row.title) & (df.context != row.context)]\n",
|
||||
" \n",
|
||||
" if len(subset) < 1:\n",
|
||||
" continue\n",
|
||||
" random_context = subset.sample(1).iloc[0].context\n",
|
||||
" if j == 1 and add_related:\n",
|
||||
" # add the related contexts based on the most similar contexts according to the search\n",
|
||||
" random_context = get_random_similar_contexts(q[2:].strip(), row.context, search_model='ada', max_rerank=10)\n",
|
||||
" else:\n",
|
||||
" while True:\n",
|
||||
" # add random context, which isn't the correct context\n",
|
||||
" random_context = df.sample(1).iloc[0].context\n",
|
||||
" if random_context != row.context:\n",
|
||||
" break\n",
|
||||
" if discriminator:\n",
|
||||
" rows.append({\"prompt\":f\"{random_context}\\nQuestion: {q[2:].strip()}\\n Related:\", \"completion\":f\" no\"})\n",
|
||||
" else:\n",
|
||||
" rows.append({\"prompt\":f\"{random_context}\\nQuestion: {q[2:].strip()}\\nAnswer:\", \"completion\":f\" No appropriate context found to answer the question.\"})\n",
|
||||
"\n",
|
||||
" return pd.DataFrame(rows) "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We apply the same process of dataset creation for both the discriminator, and the Q&A answering model. We apply the process separately for the training and testing set, to ensure that the examples from the traing set don't feature within the test set."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": []
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for name, is_disc in [('discriminator', True), ('qa', False)]:\n",
|
||||
" for train_test, dt in [('train', train_df), ('test', test_df)]:\n",
|
||||
" ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)\n",
|
||||
" ft.to_json(f'{name}_{train_test}.jsonl', orient='records', lines=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We formatted the data according to the recommendations from the fine-tuning tool, which is available using\n",
|
||||
"> openai tools fine_tunes.prepare_data -f qa_train.jsonl\n",
|
||||
"\n",
|
||||
"We highly recommend that you use this tool, which suggests improvements in your data formatting for fine-tuning.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3.2 Submit the datasets for fine-tuning"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": []
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"!openai api fine_tunes.create -t \"olympics-data/discriminator_train.jsonl\" -v \"olympics-data/discriminator_test.jsonl\" --batch_size 16 --compute_classification_metrics --classification_positive_class \" yes\" --model ada"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": []
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"!openai api fine_tunes.create -t \"olympics-data/qa_train.jsonl\" -v \"olympics-data/qa_test.jsonl\" --batch_size 16"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3.3 Using the fine-tuned models\n",
|
||||
"\n",
|
||||
"We will now use the fine-tuned discriminator and the fine-tuned Q&A model. By requesting logprobs, we can see how certain the discriminator is in a `yes` vs `no` answer."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[<OpenAIObject at 0x7fe812e602b0> JSON: {\n",
|
||||
" \" no\": -10.819577,\n",
|
||||
" \" yes\": -2.045765e-05\n",
|
||||
" }]"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"ft_discriminator = \"curie:ft-openai-internal-2021-08-23-23-58-57\"\n",
|
||||
"ft_qa = \"curie:ft-openai-internal-2021-08-23-17-54-10\"\n",
|
||||
"\n",
|
||||
"def apply_ft_discriminator(context, question, discriminator_model):\n",
|
||||
" \"\"\"\n",
|
||||
" Apply the fine tuned discriminator to a question, to assess whether it can be answered from the context.\n",
|
||||
" \"\"\"\n",
|
||||
" prompt = f\"{context}\\nQuestion: {question}\\n Related:\"\n",
|
||||
" result = openai.Completion.create(model=discriminator_model, prompt=prompt, max_tokens=1, temperature=0, top_p=1, n=1, logprobs=2)\n",
|
||||
" return result['choices'][0]['logprobs']['top_logprobs']\n",
|
||||
"\n",
|
||||
"apply_ft_discriminator('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', \n",
|
||||
" 'What was the first human-made object in space?', ft_discriminator)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can see that the model can generalize well to different contexts and questions. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"' The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957'"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"def apply_ft_qa_answer(context, question, answering_model):\n",
|
||||
" \"\"\"\n",
|
||||
" Apply the fine tuned discriminator to a question\n",
|
||||
" \"\"\"\n",
|
||||
" prompt = f\"{context}\\nQuestion: {question}\\nAnswer:\"\n",
|
||||
" result = openai.Completion.create(model=answering_model, prompt=prompt, max_tokens=30, temperature=0, top_p=1, n=1, stop=['.','\\n'])\n",
|
||||
" return result['choices'][0]['text']\n",
|
||||
"\n",
|
||||
"apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', \n",
|
||||
" 'What was the first human-made object in space?', ft_qa)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can see that the model can answer the question, when the context is appropriate."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"' The Soviet Union was the first country to successfully launch a satellite into space'"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',\n",
|
||||
" 'What is impressive about the Soviet Union?', ft_qa)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"' No appropriate context found to answer the question'"
|
||||
]
|
||||
},
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',\n",
|
||||
" 'How many cars were produced in the Soviet Union in 1970?', ft_qa)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can see that the model knows when to answer the question, and when to say that insufficient context is present to answer the question."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can also combine a discriminator and a base model, or a fine-tuned Q&A model. Discriminator can essentially serve as a decision whether the question can be answered given the context or not."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"' Weather could cause a sport event to have no crowd'"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"def answer_question_conditionally(answering_model, discriminator_model, context, question, discriminator_logprob_yes_modifier=0):\n",
|
||||
" logprobs = apply_ft_discriminator(context, question, discriminator_model)\n",
|
||||
" yes_logprob = logprobs[' yes'] if ' yes' in logprobs else -100\n",
|
||||
" no_logprob = logprobs[' no'] if ' no' in logprobs else -100\n",
|
||||
" if yes_logprob + discriminator_logprob_yes_modifier < no_logprob:\n",
|
||||
" return \" No appropriate context found to answer the question based on the discriminator.\"\n",
|
||||
" return apply_ft_qa_answer(context, question, answering_model)\n",
|
||||
"answer_question_conditionally(ft_qa, ft_discriminator, \n",
|
||||
" \"Crowdless games are a rare although not unheard-of occurrence in sports. \\\n",
|
||||
" When they do occur, it is usually the result of events beyond the control \\\n",
|
||||
" of the teams or fans, such as weather-related concerns, public health concerns, \\\n",
|
||||
" or wider civil disturbances unrelated to the game. For instance, \\\n",
|
||||
" the COVID-19 pandemic caused many sports leagues around the world \\\n",
|
||||
" to be played behind closed doors.\",\n",
|
||||
" \"Could weather cause a sport event to have no crowd?\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The above function illustrates how to potentially combine a discriminator and a fine-tuned Q&A model. This gives a more fine-grained control over how certain we want the model to be before it answers the question.\n",
|
||||
"\n",
|
||||
"We'll now take a look on how answers endpoint works - combining search to retrieve the relevant context from a knowledge base, and then using the fine-tuned Q&A model to answer the question."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3.4 Answering the question based on a knowledge base\n",
|
||||
"Finally we can use a logic similar to the [/answers](https://beta.openai.com/docs/api-reference/answers) endpoint, where we first search for the relevant context, and then ask a Q&A model to answer the question given that context. If you'd like to see the implementation details, check out the [`answers_with_ft.py`](answers_with_ft.py) file."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"\" Canada won the Women's football tournament at the 2020 Olympic games\""
|
||||
]
|
||||
},
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from answers_with_ft import answer_question\n",
|
||||
"answer_question(olympics_search_fileid, ft_qa, \"Which country won the Women's football tournament at the 2020 Olympic games?\")"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.9.9 64-bit ('3.9.9')",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.9"
|
||||
},
|
||||
"orig_nbformat": 4,
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "cb9817b186a29e4e9713184d901f26c1ee05ad25243d878baff7f31bb1fef480"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
@ -133,7 +133,7 @@ def classifications(
|
||||
{{ an optional instruction }}
|
||||
|
||||
Text: example 1 text
|
||||
Category: example 2 label
|
||||
Category: example 1 label
|
||||
---
|
||||
Text: example 1 text
|
||||
Category: example 2 label
|
||||
|
@ -35,7 +35,7 @@ def get_score(context, query, log_probs, text_offsets) -> float:
|
||||
|
||||
def search(query, documents, engine):
|
||||
|
||||
prompts = [construct_context(query, doc) for doc in [""] + docs]
|
||||
prompts = [construct_context(query, doc) for doc in [""] + documents]
|
||||
|
||||
resps = openai.Completion.create(
|
||||
model=engine,
|
||||
|