Apify Dataset
Apify Dataset is scalable, append-only storage with sequential access, built for storing structured web scraping results, such as a list of products or Google SERPs, and exporting them to various formats like JSON, CSV, or Excel. Datasets are mainly used to save results of Apify Actors, serverless cloud programs for a wide range of web scraping, crawling, and data extraction use cases.
This notebook shows how to load Apify datasets into LangChain.
Prerequisites
You need to have an existing dataset on the Apify platform. This example shows how to load a dataset produced by the Website Content Crawler.
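If you don't have a dataset yet, one way to produce it is to run the Website Content Crawler Actor through ApifyWrapper. The snippet below is a rough sketch, not part of the core walkthrough: the start URL is just a placeholder, the crawl consumes Apify platform credits, and it assumes the packages from the install step below.
import os

from langchain_apify import ApifyWrapper
from langchain_core.documents import Document

# ApifyWrapper reads the APIFY_API_TOKEN environment variable.
os.environ["APIFY_API_TOKEN"] = "your-apify-api-token"

apify = ApifyWrapper()

# Run the Website Content Crawler Actor; it stores the crawled pages in a new
# dataset, and call_actor returns an ApifyDatasetLoader pointing at that dataset.
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)

# The ID of the resulting dataset can be reused later with ApifyDatasetLoader.
print(loader.dataset_id)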
%pip install --upgrade --quiet langchain langchain-apify langchain-openai
First, import ApifyDatasetLoader into your source code:
from langchain_apify import ApifyDatasetLoader
from langchain_core.documents import Document
Find your Apify API token and OpenAI API key and set them as environment variables:
import os
os.environ["APIFY_API_TOKEN"]="your-apify-api-token"
os.environ["OPENAI_API_KEY"]="your-openai-api-key"
Then provide a function that maps Apify dataset record fields to LangChain Document format.
For example, if your dataset items are structured like this:
{
    "url": "https://apify.com",
    "text": "Apify is the best web scraping and automation platform."
}
The mapping function in the code below will convert them to LangChain Document format, so that you can use them further with any LLM model (e.g., for question answering).
loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda dataset_item: Document(
        page_content=dataset_item["text"], metadata={"source": dataset_item["url"]}
    ),
)

data = loader.load()
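As a quick sanity check, you can inspect what was loaded (assuming the dataset contains at least one item with the fields shown above):
print(len(data))  # number of Documents created from the dataset items
print(data[0].page_content[:100])  # beginning of the first document's text
print(data[0].metadata)  # e.g. {'source': 'https://apify.com'}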
An example with question answering
In this example, we use data from a dataset to answer a question.
from langchain.indexes import VectorstoreIndexCreator
from langchain_apify import ApifyWrapper
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
index = VectorstoreIndexCreator(
    vectorstore_cls=InMemoryVectorStore, embedding=OpenAIEmbeddings()
).from_loaders([loader])

llm = ChatOpenAI(model="gpt-4o-mini")

query = "What is Apify?"
result = index.query_with_sources(query, llm=llm)
print(result["answer"])
print(result["sources"])
Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.
https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples
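If you prefer to skip VectorstoreIndexCreator and work with the vector store directly, the same documents can be wired up with a plain retriever. This is only a minimal sketch, assuming the loader, embeddings, and chat model from above; the prompt wording is illustrative.
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Embed the loaded Documents into an in-memory vector store.
vectorstore = InMemoryVectorStore.from_documents(loader.load(), OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Retrieve the most relevant chunks and let the model answer from them.
docs = retriever.invoke("What is Apify?")
context = "\n\n".join(doc.page_content for doc in docs)
llm = ChatOpenAI(model="gpt-4o-mini")
response = llm.invoke(
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: What is Apify?"
)
print(response.content)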
Related
- Document loader conceptual guide
- Document loader how-to guides