langchain.document_loaders.sitemap.SitemapLoader
- class langchain.document_loaders.sitemap.SitemapLoader(web_path: str, filter_urls: Optional[List[str]] = None, parsing_function: Optional[Callable] = None, blocksize: Optional[int] = None, blocknum: int = 0, meta_function: Optional[Callable] = None, is_local: bool = False)
Bases: WebBaseLoader
Loader that fetches a sitemap and loads those URLs.
Initialize with webpage path and optional filter URLs.
- Parameters
web_path – URL of the sitemap. Can also be a local path.
filter_urls – list of strings or regexes used to filter the URLs that are parsed and loaded
parsing_function – Function to parse bs4.Soup output
blocksize – number of sitemap locations per block
blocknum – the number of the block that should be loaded - zero indexed. Default: 0
meta_function – Function to parse bs4.Soup output for metadata. When setting this function, remember to also copy metadata["loc"] to metadata["source"] if you use that field.
is_local – whether the sitemap is a local file. Default: False
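The interaction between filter_urls, blocksize, and blocknum can be sketched in plain Python without touching the network. This is an illustration under stated assumptions, not the library's implementation: select_urls and meta_with_source are hypothetical names, patterns are assumed to be matched from the start of each URL with re.match, filtering is assumed to happen before blocking, and the meta function is assumed to receive the sitemap entry dict plus the scraped content.

```python
import re
from typing import Dict, List, Optional


def select_urls(
    urls: List[str],
    filter_urls: Optional[List[str]] = None,
    blocksize: Optional[int] = None,
    blocknum: int = 0,
) -> List[str]:
    """Hypothetical helper: filter URLs, then return one zero-indexed block."""
    if filter_urls:
        # Assumption: a URL is kept if any pattern matches from its start.
        urls = [u for u in urls if any(re.match(p, u) for p in filter_urls)]
    if blocksize is None:
        return urls
    # Split the remaining URLs into consecutive blocks of `blocksize` items.
    blocks = [urls[i : i + blocksize] for i in range(0, len(urls), blocksize)]
    if blocknum >= len(blocks):
        raise ValueError("sitemap does not contain enough blocks for blocknum")
    return blocks[blocknum]


def meta_with_source(meta: Dict[str, str], _content: object) -> Dict[str, str]:
    """Example meta function: keep the sitemap fields and copy "loc" to "source"."""
    return {**meta, "source": meta["loc"]}


urls = [
    "https://example.com/docs/a",
    "https://example.com/docs/b",
    "https://example.com/blog/c",
    "https://example.com/docs/d",
]
selected = select_urls(
    urls, filter_urls=[r"https://example\.com/docs/"], blocksize=2, blocknum=1
)
print(selected)
print(meta_with_source({"loc": selected[0], "lastmod": "2023-01-01"}, None))
```

A function shaped like meta_with_source would be passed as meta_function=meta_with_source when constructing the loader, provided the assumed two-argument signature holds for your version.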
Methods
__init__(web_path[, filter_urls, ...]) – Initialize with webpage path and optional filter URLs.
aload() – Load text from the URLs in web_path asynchronously into Documents.
fetch_all(urls) – Fetch all URLs concurrently with rate limiting.
lazy_load() – Lazy load text from the URL(s) in web_path.
load() – Load sitemap.
load_and_split([text_splitter]) – Load Documents and split into chunks.
parse_sitemap(soup) – Parse sitemap XML and load into a list of dicts.
scrape([parser]) – Scrape data from webpage and return it in BeautifulSoup format.
scrape_all(urls[, parser]) – Fetch all URLs, then return soups for all results.
Attributes
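bs_get_text_kwargs – kwargs for beautifulsoup4 get_text
default_parser – Default parser to use for BeautifulSoup.
raise_for_status – Raise an exception if the HTTP status code denotes an error.
requests_kwargs – kwargs for requests
requests_per_second – Max number of concurrent requests to make.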
- async fetch_all(urls: List[str]) → Any
Fetch all urls concurrently with rate limiting.
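One common way to fetch many URLs concurrently while capping the number of requests in flight is an asyncio.Semaphore. The sketch below illustrates that general technique only; it is not taken from the library's source. fake_fetch is a hypothetical stand-in for a real HTTP call, and fetch_all_limited and max_concurrent are illustrative names.

```python
import asyncio
from typing import List


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP GET; performs no network access."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_all_limited(urls: List[str], max_concurrent: int = 2) -> List[str]:
    """Fetch every URL, allowing at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url: str) -> str:
        # Each task waits here until a semaphore slot is free.
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_one(u) for u in urls))


pages = asyncio.run(
    fetch_all_limited(
        ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    )
)
print(len(pages))
```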
- load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
- Parameters
text_splitter – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns
List of Documents.
- parse_sitemap(soup: Any) → List[dict]
Parse sitemap XML and load into a list of dicts.
- Parameters
soup – BeautifulSoup object.
- Returns
List of dicts.
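To make the "list of dicts" shape concrete, here is a minimal sketch that turns a sitemap document into one dict per url entry. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, so it is an analogue and not the method's actual code. parse_sitemap_xml is a hypothetical name, and the choice of keys (the loc, lastmod, changefreq, and priority tags from the sitemap protocol) is an assumption about what a caller would want to keep.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace defined by the sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
FIELDS = ("loc", "lastmod", "changefreq", "priority")

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2023-01-01</lastmod></url>
  <url><loc>https://example.com/b</loc><priority>0.5</priority></url>
</urlset>"""


def parse_sitemap_xml(xml_text: str) -> List[Dict[str, str]]:
    """Return one dict per <url> entry, keeping only the tags that are present."""
    root = ET.fromstring(xml_text)
    entries: List[Dict[str, str]] = []
    for url in root.iter(f"{SITEMAP_NS}url"):
        entry: Dict[str, str] = {}
        for field in FIELDS:
            node = url.find(f"{SITEMAP_NS}{field}")
            if node is not None and node.text:
                entry[field] = node.text.strip()
        # Skip entries without a location, since there is nothing to load.
        if "loc" in entry:
            entries.append(entry)
    return entries


entries = parse_sitemap_xml(SAMPLE)
print(entries)
```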
- scrape(parser: Optional[str] = None) → Any
Scrape data from webpage and return it in BeautifulSoup format.
- scrape_all(urls: List[str], parser: Optional[str] = None) → List[Any]
Fetch all URLs, then return soups for all results.
- bs_get_text_kwargs: Dict[str, Any] = {}
kwargs for beautifulsoup4 get_text
- default_parser: str = 'html.parser'
Default parser to use for BeautifulSoup.
- raise_for_status: bool = False
Raise an exception if the HTTP status code denotes an error.
- requests_kwargs: Dict[str, Any] = {}
kwargs for requests
- requests_per_second: int = 2
Max number of concurrent requests to make.
- property web_path: str
- web_paths: List[str]