Ingest/Index static and dynamic web pages


What is the recommended method to index/ingest both classic static HTML pages and pages whose content is rendered client-side by JavaScript? Is there a native web crawler/indexer for "dynamic" web page content?

4 Replies

@Search720 There's no built-in indexer for crawling web pages, so customers often use an open-source crawler such as Apache Nutch to extract content from web pages. From there, you can land the content in a supported data source such as Blob storage, Cosmos DB, or ADLS Gen2 and index it with the corresponding indexer. Alternatively, you can push the data directly to the index via the Push API as described here.
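For anyone who wants to see what the push route might look like, here is a minimal sketch. It assumes the service is Azure Cognitive Search and uses its REST documents endpoint (`/indexes/{index}/docs/index`) with API version `2020-06-30`. The service name, index name, and the `id`/`url`/`content` fields are hypothetical and would have to match your own index schema. The HTML passed in is assumed to be already fetched; for JavaScript-rendered pages it would need to be the post-render HTML produced by your crawler.

```python
import base64
import json
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Reduce an HTML string to whitespace-joined visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_push_request(service, index, api_key, pages, api_version="2020-06-30"):
    """Build (but do not send) a POST that uploads crawled pages.

    `pages` maps URL -> HTML. Field names below are hypothetical.
    """
    docs = []
    for url, html in pages.items():
        # Document keys only allow a limited character set, so the URL is
        # encoded with URL-safe base64 before being used as the key.
        key = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
        docs.append({
            "@search.action": "mergeOrUpload",
            "id": key,
            "url": url,
            "content": html_to_text(html),
        })
    body = json.dumps({"value": docs}).encode("utf-8")
    endpoint = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/index?api-version={api_version}"
    )
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )
```

To actually send it, you would pass the returned request to `urllib.request.urlopen(...)` with a real service name and admin key. Building the request separately makes it easy to inspect the payload before anything goes over the wire.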

 

@Search720 You can use the Norconex HTTP Collector to crawl dynamic web pages.

 

https://opensource.norconex.com/collectors/http/

 

Cheers.

Thanks!