A tool designed to gather data from AgMarkNet, a government service that collates agricultural market information. The current implementation employs a direct URL substitution approach that minimizes interactions with the website for greater stability, and loads the collected data into a Postgres database (as default behavior).
- Ensure Selenium is operational using the Chrome WebDriver. Selenium 4.6 and later downloads the appropriate driver without any manual intervention. However, if you're on the latest version and still run into issues, troubleshoot using the instructions here.
- Tested on Windows, can work on other OSs with a little tinkering.
- Can likewise use WebDrivers other than Chrome.
- The process to get Selenium with the Chrome WebDriver working seems to be a bit elaborate on Linux, so I would recommend sticking with Windows.
Clone this git repository using:

```
git clone https://github.com/raashidsalih/AgMarkNet-Scraper.git
```
- Rename `sample.env` to `.env` and replace the contents with the connection information required for your Postgres database.
- These details are separated from the code for security purposes.
- If you want to name the `.env` file anything else, you can specify so in the `dotenv` call when it loads the environment variables.
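For instance, a minimal sketch using python-dotenv, assuming the file is instead named `database.env` and defines a `DB_HOST` variable (both names are illustrative):

```python
import os

from dotenv import load_dotenv

# Point python-dotenv at a custom-named env file instead of the default ".env".
# "database.env" and "DB_HOST" are illustrative names, not the project's.
load_dotenv(dotenv_path="database.env")

db_host = os.getenv("DB_HOST")
```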
I suggest running the program in a Python venv for stability and reproducibility. Although optional, it is highly recommended to avoid potential dependency-related issues. Navigate to the cloned directory and create a Python virtual environment using the following command (Windows):
```
python -m venv myenv
```
Activate your Python venv using:
```
myenv\Scripts\activate
```
If you get a permission error about not being able to run scripts, run this command first:

```
Set-ExecutionPolicy Unrestricted -Scope Process
```

It lifts the script-execution restriction for the rest of the session only.
Install the dependencies for the program to run via the handy requirements.txt:
```
pip install -r requirements.txt
```
With all the requirements satisfied, you are now ready to use the program. Make a note of the states, commodities, and dates you'd like to query, and ensure that they adhere to the formats explained below.
- The CLI was developed using `argparse`. You can use `-h` to get the following details while using the interface.
- The general syntax of the command is:
```
python scrape_agmarket.py [-h] --commodity COMMODITY --state STATE [STATE ...] --from_date FROM_DATE --to_date TO_DATE [--time_agg {daily,monthly,yearly}]
```
- You can query multiple states at once by separating them with spaces after the flag.
- The general format of the arguments after the flag is provided below:
| Parameter | Description | Format |
|---|---|---|
| `--commodity` | Commodity name | Make sure the first letter is capitalized. Example: `Potato` |
| `--state` | State name | Make sure the first letter is capitalized. Multiple states can be passed at once, separated by spaces. Example: `Kerala Bihar` |
| `--from_date` | Start date of results | Make sure the date adheres to the following format, with the first letter of the month capitalized. Example: `20-Sep-2023` |
| `--to_date` | End date of results | Make sure the date adheres to the following format, with the first letter of the month capitalized. Example: `20-Sep-2023` |
| `--time_agg` | Time aggregate | How to group the data over time. Options are `daily`, `monthly`, and `yearly`. Default is `monthly`. |
- Note that if the program fails for one state in a list, it will not automatically continue with the states that follow it.
- Any argument that contains a space must be surrounded by quotes. See: `"chili red"`, `"Tamil Nadu"`
- Keeping all of this in mind, an example:
```
python scrape_agmarket.py --commodity Onion --state Gujarat Haryana Punjab "Tamil Nadu" --from_date 01-Sep-2023 --to_date 24-Sep-2023 --time_agg monthly
```
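For reference, a minimal sketch of how such an interface can be wired up with `argparse` (the argument names match the CLI above; everything else is illustrative):

```python
import argparse

# Argument names mirror the usage string above; help texts are illustrative.
parser = argparse.ArgumentParser(description="Scrape commodity data from AgMarkNet.")
parser.add_argument("--commodity", required=True, help="Commodity name, first letter capitalized")
parser.add_argument("--state", required=True, nargs="+", help="One or more state names")
parser.add_argument("--from_date", required=True, help="Start date, e.g. 01-Sep-2023")
parser.add_argument("--to_date", required=True, help="End date, e.g. 24-Sep-2023")
parser.add_argument("--time_agg", choices=["daily", "monthly", "yearly"],
                    default="monthly", help="Time aggregate to group data by")
args = parser.parse_args()

print(args.commodity, args.state, args.time_agg)
```

The columns of the scraped output are documented in the data dictionary below.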
| Column Name | Description |
|---|---|
| `state` | State name |
| `district` | District name |
| `market` | Market name |
| `commodity` | Commodity name |
| `com_variety` | Commodity variety |
| `com_group` | Commodity group |
| `arrival` | Arrival amount in tonnes |
| `min_price` | Minimum price (Rs./quintal) |
| `max_price` | Maximum price (Rs./quintal) |
| `modal_price` | Modal (average) price (Rs./quintal) |
| `date_arrival` | Reported date of arrival |
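Since the program loads results into Postgres by default, here is a rough sketch of what that step can look like (not the project's actual code), using `psycopg2`, illustrative environment variable names, and a made-up sample row:

```python
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # read connection details from .env

# The environment variable names below are illustrative, not the project's.
conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)

# Made-up sample row matching the data dictionary above.
row = ("Tamil Nadu", "Coimbatore", "Pollachi", "Banana", "Other", "Fruits",
       12.5, 1500.0, 2200.0, 1800.0, "2023-09-20")

with conn, conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO schema_project.agmarket_monthly
            (state, district, market, commodity, com_variety, com_group,
             arrival, min_price, max_price, modal_price, date_arrival)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """,
        row,
    )
conn.close()
```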
We use venv for the isolation, and therefore reproducibility, it provides. However, given the chance, the better method (which some would even consider the de facto standard) is to use Docker for deployment.
This would also take into account:
- Potential issues with Chrome and Selenium compatibility, and
- A good chunk of the installation steps necessary to get the program running.
This is what such a solution would look like in practice:
- Dockerize the pipeline
- Host it on an image registry
- Pull from the image registry whenever the program needs to be run
- Every GitHub push to the repo updates the image in the registry via a CD (Continuous Deployment) implementation
- I organized the Python project following the Model-View-Controller (MVC) architectural pattern for better separation of concerns. The controller has been renamed to `scrape_agmarket.py` to reflect the naming in the question. The other components remain unchanged.
- The straightforward approach would be to use Selenium to navigate from the AgMarkNet homepage, input the parameters we want, and finally scrape the results.
- The issue is that the site is irregular in its availability. On the off chance that one does gain access, it often returns a server error or is very slow to load and respond.
- This makes rapid prototyping and experimentation rather cumbersome. It also brings into question the reliability of whatever solution is developed.
- Thus, a conscious design decision was to use Selenium as sparingly as possible to bypass such outages.
- An alternative approach is to inject the parameters straight into the URL to get the results directly. However, certain parameters (commodity and state, for instance) are encoded, and the mapping was unknown. You can see this in part of the URL below:

```
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=19&Tx_State=TN&Tx_District=0....
```
- Notice `Tx_Commodity` and `Tx_State`. The number 19 represents the commodity "Banana", while TN represents the state of "Tamil Nadu".
- I was able to find the mappings by looking into the source code of the website, and generate CSVs of said mappings from the HTML. The results are the data sources that you can observe in the repo.
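To make the idea concrete, a minimal sketch of the URL substitution, assuming the mappings have been loaded from those CSVs (the dictionaries, the date parameter names, and the helper function are all illustrative; the real URL carries further parameters, elided above):

```python
# Illustrative mappings; the real project loads these from the CSV data sources.
COMMODITY_CODES = {"Banana": "19"}
STATE_CODES = {"Tamil Nadu": "TN"}

BASE_URL = "https://agmarknet.gov.in/SearchCmmMkt.aspx"

def build_results_url(commodity: str, state: str, from_date: str, to_date: str) -> str:
    """Substitute the encoded parameters straight into the results URL."""
    # DateFrom/DateTo are placeholder parameter names, not confirmed from the site.
    return (
        f"{BASE_URL}?Tx_Commodity={COMMODITY_CODES[commodity]}"
        f"&Tx_State={STATE_CODES[state]}&Tx_District=0"
        f"&DateFrom={from_date}&DateTo={to_date}"
    )

print(build_results_url("Banana", "Tamil Nadu", "01-Sep-2023", "24-Sep-2023"))
```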
- Employed a `.env` file for security with regards to the database connection credentials. GitHub typically ignores `.env` files, hence the need to commit it as `sample.env`.
- Error handling implemented for the inputs using a variety of methods, and in general using try-except.
- The rate-determining step here is how quickly AgMarkNet is able to load. I implemented the timeout using a function bundled with Selenium, and had to use `time.sleep()` to provide enough time for the download to begin after the "download excel" button was clicked (see the sketch after the table definition below).
- Downloading the data as CSV/TSV was not possible (perhaps they stopped offering said service). The alternative would be to scrape the resulting table page by page using Selenium, an option I was not willing to consider.
- The CLI was developed using `argparse`, although various alternatives exist, like `Click` and `docopt`. However, `argparse` seems to be the most popular framework for the use case.
- The SQL commands were tested on a table defined with the following SQL query:
```sql
DROP TABLE IF EXISTS schema_project.agmarket_monthly;

CREATE TABLE IF NOT EXISTS schema_project.agmarket_monthly
(
    id bigint NOT NULL GENERATED ALWAYS AS IDENTITY ( INCREMENT 1 START 1 MINVALUE 1 MAXVALUE 9223372036854775807 CACHE 1 ),
    state character varying(50) COLLATE pg_catalog."default",
    district character varying(50) COLLATE pg_catalog."default",
    market character varying(50) COLLATE pg_catalog."default",
    commodity character varying(50) COLLATE pg_catalog."default",
    com_variety character varying(50) COLLATE pg_catalog."default",
    com_group character varying(50) COLLATE pg_catalog."default",
    arrival numeric(10,4),
    min_price numeric(10,4),
    max_price numeric(10,4),
    modal_price numeric(10,4),
    date_arrival date,
    CONSTRAINT agmarket_monthly_pkey PRIMARY KEY (id)
)
TABLESPACE pg_default;

ALTER TABLE IF EXISTS schema_project.agmarket_monthly
    OWNER to postgres;
```
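As referenced earlier, a minimal sketch of the Selenium timeout and `time.sleep()` pattern around the "download excel" click, reusing `build_results_url` from the sketch above (the element locator and timing values are illustrative, not the project's actual ones):

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # Selenium 4.6+ fetches the driver automatically
driver.get(build_results_url("Banana", "Tamil Nadu", "01-Sep-2023", "24-Sep-2023"))

# Timeout via Selenium's bundled explicit wait: block (up to 60 s) until the
# export button becomes clickable. The element id here is illustrative.
button = WebDriverWait(driver, 60).until(
    EC.element_to_be_clickable((By.ID, "cphBody_ButtonExcel"))
)
button.click()

# Selenium gives no built-in signal that a download has finished,
# hence the fixed pause before closing the browser.
time.sleep(10)
driver.quit()
```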
- The things that can potentially break this implementation are:
- A change in the AgMarkNet interface, and
- A change in how the backend processes user inputs (think source data files for commodity and state encoding)
- Therefore, it would be wise to monitor for such changes to pre-emptively catch any potential issues and avoid broken pipelines.
- In the next iteration, I'd like to:
- Find an alternative to Selenium, due to both the external browser dependency and the lag incurred by actually clicking the button instead of directly triggering the underlying JavaScript.
- Implement unit and integration tests as a part of a CI (Continuous Integration) pipeline to promote robust updates.
A single SQL query to return the top 5 states for 4 commodities (Potato, Onion, Wheat, Tomato). As an integrity check, the output should contain 20 records.
- The query employs a CTE and a window function.
- The state's rank is included in the final table for readability.
- Modal price was chosen (over min and max) since it is the average. Refer to the data dictionary above to understand the available columns.
```sql
-- Rank states by average modal price within each commodity,
-- then keep the top 5 states for the four commodities of interest.
WITH top_5 AS (
    SELECT
        commodity,
        state,
        ROUND(AVG(modal_price), 2) AS average,
        ROW_NUMBER() OVER (PARTITION BY commodity ORDER BY ROUND(AVG(modal_price), 2) DESC) AS rank
    FROM schema_project.agmarket_monthly
    GROUP BY state, commodity
)
SELECT *
FROM top_5
WHERE rank <= 5
  AND commodity IN ('Potato', 'Onion', 'Wheat', 'Tomato')
ORDER BY commodity, average DESC;
```