The Dataset Download Problem: Three Practical Approaches
If you are new to machine learning and have worked on a project of your own, you know how tedious it can be to put together a dataset. If this problem still bothers you, don't worry: here are three of the most popular ways to gather data for your next machine learning project.
1. Downloading the Dataset
This is the obvious and easiest option, although finding the right dataset to download can take some searching. Numerous websites provide readily available datasets; I am listing a few of them below.
i) Kaggle
ii) data.world
iii) IMDb
iv) UCI Machine Learning Repository
v) Awesome-Public-Datasets on GitHub
vi) Data.gov
2. Scraping Data
I) Scraping with Excel: Yes, you read that right: you can scrape data using Excel. Since this approach is not widely known, here is a step-by-step guide to downloading data from a web source.
To extract data from a webpage on timeanddate.com without images and save it in a format like CSV, follow these steps in Microsoft Excel:
- Copy the URL of the webpage containing the desired data.
- Open a new Excel workbook.
- Go to the "Data" tab in Excel.
- Click on the "From Web" option. A dialogue box will appear, asking for the URL.
- Paste the copied URL into the dialogue box and click "OK".
- The Navigator window will open, displaying the available data on the webpage.
- If you see images or unnecessary data in the preview, select and exclude them by unchecking the corresponding boxes.
- Use the "Web View" option to get a better understanding of how the data is presented on the website.
- Once you are satisfied with the selected data, click on the "Load" button. Excel will fetch and load the data into the workbook.
- The loaded data will be displayed in a new worksheet in Excel.
- You can now save this data in a format like CSV by going to "File" > "Save As" and selecting the CSV file format.
The result is the data from the timeanddate.com page, without images, saved as a CSV file.
[Image: Power Query Editor]
II) Scraping with Python: You can scrape data from the web in Python using a library called Beautiful Soup. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
Here is a link to the Beautiful Soup documentation, which has a step-by-step guide and a good number of examples for easy understanding.
Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
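To give a feel for the library, here is a minimal sketch that pulls the rows out of an HTML table and turns them into CSV text. The HTML string, the table id "cities", and the helper names table_to_rows and rows_to_csv are made up for illustration; on a real page you would first download the HTML and check the page's own structure in the browser.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded web page (an assumption for this sketch).
SAMPLE_HTML = """
<html><body>
<table id="cities">
  <tr><th>City</th><th>Time</th></tr>
  <tr><td>London</td><td>10:00</td></tr>
  <tr><td>Tokyo</td><td>18:00</td></tr>
</table>
</body></html>
"""


def table_to_rows(html, table_id):
    """Return the cell text of every row in the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (th) and data cells (td) are treated the same way.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = table_to_rows(SAMPLE_HTML, "cities")
csv_text = rows_to_csv(rows)
print(csv_text)
```

Writing csv_text to a file gives you roughly the same result as the Excel steps in the previous section, only scripted.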
3. Downloading Data using an API
While working on real-world projects, it is quite common that the dataset is not readily available, so we usually query the data using an API. There are many APIs, each serving its own purpose. I will discuss two popular ones as examples.
Before going further, I suggest you read about the requests package, which helps you make HTTP requests using Python.
Here is a detailed guide on the requests package: https://medium.com/@janhavidahihande/requests-library-for-http-f3c5b8747e47
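As a quick taste of requests, the sketch below prepares a GET request without sending it, so you can see how the query parameters end up in the URL, and defines a small helper for fetching JSON. The URL https://example.com/api and the parameter names are placeholders I made up, not a real API.

```python
import requests


def build_get(url, params):
    """Prepare a GET request without sending it, so the final URL can be inspected."""
    return requests.Request("GET", url, params=params).prepare()


def fetch_json(url, params, timeout=10):
    """Send a GET request and return the decoded JSON body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.json()


# Placeholder endpoint and parameters (assumptions for illustration only).
prepared = build_get("https://example.com/api", {"q": "london", "format": "json"})
print(prepared.url)
```

Once you know the real endpoint and its parameters, calling fetch_json with them is usually all the code an API download needs.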
I) BBC Weather API
BBC Weather has a huge collection of weather data on its website. Weather-related datasets are widely used because weather affects many outcomes. For example, a model that estimates travel duration might take weather as an input, since rain and fog can drastically affect it. The data isn't available as a download from the website, but it can be fetched through API calls. You can find these API calls by inspecting the website.
Click Inspect on the website -> select Network in the developer tools -> select Fetch/XHR
As you type the city name, you can see the API calls being made in the table on the right. The request starts with location? and looks something like the one shown below.
[Image: API Request]
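Once you have copied a request URL from the Network tab, it helps to split it into its query parameters so you can see which one carries the city name and change it in code. Here is a minimal sketch using only the standard library. The URL below is a placeholder I made up, not the actual BBC endpoint, and the parameter names s and format are assumptions too; use whatever your browser shows.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Placeholder standing in for a URL copied from the Network tab (not the real BBC URL).
COPIED_URL = "https://example.com/location?s=lond&format=json"


def query_params(url):
    """Return the query parameters of a URL as a dict (first value of each key only)."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


def with_param(url, name, value):
    """Return a copy of the URL with one query parameter set to a new value."""
    parts = urlsplit(url)
    params = query_params(url)
    params[name] = value
    return urlunsplit(parts._replace(query=urlencode(params)))


params = query_params(COPIED_URL)
new_url = with_param(COPIED_URL, "s", "paris")
print(params)
print(new_url)
```

The new URL can then be fetched with the requests package in the same way the browser fetched the original one.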
II) Geocoding API of OpenStreetMap (OSM): Nominatim
Nominatim uses OpenStreetMap data to find locations on Earth by name and address (geocoding). It can also do the reverse and find an address for any location on the planet. With it, we can fetch the latitude and longitude of a particular location along with its full address.
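Here is a minimal sketch of both directions against the public Nominatim server, using its /search and /reverse endpoints with format=json. It uses only the standard library so it runs without extra installs; the requests package would work just as well. The helper names are my own, and the User-Agent value is something you must choose yourself, since the public server expects requests to identify the application making them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://nominatim.openstreetmap.org"


def search_url(place):
    """Build a geocoding URL: place name -> coordinates."""
    return f"{BASE_URL}/search?" + urlencode({"q": place, "format": "json", "limit": 1})


def reverse_url(lat, lon):
    """Build a reverse-geocoding URL: coordinates -> address."""
    return f"{BASE_URL}/reverse?" + urlencode({"lat": lat, "lon": lon, "format": "json"})


def fetch_json(url, user_agent):
    """Fetch a URL and decode the JSON body, sending an identifying User-Agent."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return json.load(response)


def first_coordinates(results):
    """Pick latitude, longitude and address out of a /search response, or None if empty."""
    if not results:
        return None
    top = results[0]
    # Nominatim returns lat and lon as strings, so convert them to floats.
    return float(top["lat"]), float(top["lon"]), top["display_name"]


print(search_url("Eiffel Tower"))
print(reverse_url(48.8584, 2.2945))
```

A typical call would be first_coordinates(fetch_json(search_url("Eiffel Tower"), "my-dataset-script")), where "my-dataset-script" is a placeholder for your own application name. If you need many lookups, keep the request rate low, because the public server is a shared resource.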
Thank you for reading!!