Logicloop Tech

ABOUT

TRITON

As part of the Triton project the client wanted a meta search engine for the real estate home search eco-system. The meta search engine was to be built in a location agnostic manner.

DURATION:

2021 - Continuous

TECHNOLOGIES USED:

Java, Spring Boot, Linux, Microservices, Apache Nutch, Elasticsearch, Web Crawling.

TITLE

As part of the Triton project the client wanted a meta search engine for the real estate home search eco-system. The meta search engine was to be built in a location agnostic manner.

PROBLEM STATEMENT

The biggest problem statement was to be able to build a distributed meta search engine for the real estate home search eco-system, allowing the client to be able to create a portal where property seekers could then come and search for properties which were originally posted across a variety of different property portals.

Another problem statement was the ability to extract structured information from semi-structured or unstructured pages available on the internet, and also to be able to keep up with the changing structures of the different portals which were being supported.

This meant that the platform would have to be

Highly maintainable allowing easy extensions.
Designed using OOP best practices to allow easy extensions / modifications.
Highly scalable, as the platform had to be able to index millions of properties.
Distributed in nature.

Also the client wanted to build a user interface on top of the normalized and indexed information.

OUR SOLUTION

We chose the following tech stack

Java

Spring Boot

Microservices

Elasticsearch - for the search index.

Apache Nutch for distributed crawling.

MySQL as the RDBMS

React for the frontend portal.

We were able to set up a distributed crawler, which would pre-fetch all the properties from 15+ portals, we would then setup extraction pipelines based on this data which would then extract structured data from each of the fetched properties.

This structured data was then finally indexed into Elasticsearch to allow for super fast retrieval.

We built APIs using Spring Boot on top of the Elasticsearch index so created allowing a React user interface to be created on top of the database.

To facilitate ease of maintenance each crawler was further divided into the following sub-modules.

Generator - The generator service would start from a seed page, and crawl the page expanding across the target website to fill in pages of interest, these would be pages that we are finally interested in.
Fetcher - the service responsible for fetching the pages. The fetcher itself would require a transport using which the pages would be fetched. We created the following types of transport
- Nutch - relies on HTTP to fetch the pages.
- PhantomJS - we created a special transport using the PhantomJS headless browser, allowing us to fetch pages built using SPA based technologies also.
- Selenium - another special transport created to act like a headless browser.
Parser - The parser service would take the fetched page and extract structured information out of it. This was usually done so that we are able to get the data we are looking for from different kinds of pages.
Store - The store service would then finally store the generated, fetched & parsed pages into a normalized Elasticsearch index.

Using the above structure we were able to make a highly modular scalable platform for the real estate meta search engine.

DURATION

The initial build of this project was done in 6 months. Post which this project was handled under our "Managed Services" business unit, hence the arrangement is on a continuous basis.

RESULT

We were able to set up a distributed crawler in place that targeted 15+ real estate portal across 3 geographies. Were able to fetch property information, extra, normalize & index data in a unified format.

HOW LOGICLOOP TECH BECOMES
YOUR UNDUE ADVANTAGE