IT Experience: Scrapy

Thursday, December 22, 2016

How to deploy a Scrapy project to scrapinghub.com and save the crawled data to MongoDB

Required pip module:
pymongo

ubuntu@nutthaphongmail:~$ docker exec -it --user root scrapyd-server bash
root@6b82490da131:/home/scrapyd# pip install shub
Collecting shub
  Downloading shub-2.5.0-py2.py3-none-any.whl (47kB)
    100% |################################| 51kB 773kB/s 
Collecting requests (from shub)
  Downloading requests-2.12.4-py2.py3-none-any.whl (576kB)
    100% |################################| 583kB 1.3MB/s 
Requirement already satisfied: six in /usr/local/lib/python2.7/dist-packages (from shub)
Collecting scrapinghub (from shub)
  Downloading scrapinghub-1.9.0-py2-none-any.whl
Requirement already satisfied: pip in /usr/local/lib/python2.7/dist-packages (from shub)
Collecting PyYAML (from shub)
  Downloading PyYAML-3.12.tar.gz (253kB)
    100% |################################| 256kB 2.5MB/s 
Collecting click (from shub)
  Downloading click-6.6-py2.py3-none-any.whl (71kB)
    100% |################################| 71kB 4.2MB/s 
Collecting retrying (from shub)
  Downloading retrying-1.3.3.tar.gz
Collecting docker-py (from shub)
  Downloading docker_py-1.10.6-py2.py3-none-any.whl (50kB)
    100% |################################| 51kB 4.2MB/s 
Collecting backports.ssl-match-hostname>=3.5; python_version < "3.5" (from docker-py->shub)
  Downloading backports.ssl_match_hostname-3.5.0.1.tar.gz
Collecting websocket-client>=0.32.0 (from docker-py->shub)
  Downloading websocket_client-0.40.0.tar.gz (196kB)
    100% |################################| 204kB 2.7MB/s 
Requirement already satisfied: ipaddress>=1.0.16; python_version < "3.3" in /usr/local/lib/python2.7/dist-packages (from docker-py->shub)
Collecting docker-pycreds>=0.2.1 (from docker-py->shub)
  Downloading docker_pycreds-0.2.1-py2.py3-none-any.whl
Building wheels for collected packages: PyYAML, retrying, backports.ssl-match-hostname, websocket-client
  Running setup.py bdist_wheel for PyYAML ... done
  Stored in directory: /root/.cache/pip/wheels/2c/f7/79/13f3a12cd723892437c0cfbde1230ab4d82947ff7b3839a4fc
  Running setup.py bdist_wheel for retrying ... done
  Stored in directory: /root/.cache/pip/wheels/d9/08/aa/49f7c109140006ea08a7657640aee3feafb65005bcd5280679
  Running setup.py bdist_wheel for backports.ssl-match-hostname ... done
  Stored in directory: /root/.cache/pip/wheels/5d/72/36/b2a31507b613967b728edc33378a5ff2ada0f62855b93c5ae1
  Running setup.py bdist_wheel for websocket-client ... done
  Stored in directory: /root/.cache/pip/wheels/d1/5e/dd/93da015a0ecc8375278b05ad7f0452eff574a044bcea2a95d2
Successfully built PyYAML retrying backports.ssl-match-hostname websocket-client
Installing collected packages: requests, retrying, scrapinghub, PyYAML, click, backports.ssl-match-hostname, websocket-client, docker-pycreds, docker-py, shub
Successfully installed PyYAML-3.12 backports.ssl-match-hostname-3.5.0.1 click-6.6 docker-py-1.10.6 docker-pycreds-0.2.1 requests-2.12.4 retrying-1.3.3 scrapinghub-1.9.0 shub-2.5.0 websocket-client-0.40.0
root@6b82490da131:/home/scrapyd# exit
exit
ubuntu@nutthaphongmail:~$ docker exec -it scrapyd-server bash
scrapyd@6b82490da131:~$ 
scrapyd@6b82490da131:~$ shub
Usage: shub [OPTIONS] COMMAND [ARGS]...

  shub is the Scrapinghub command-line client. It allows you to deploy
  projects or dependencies, schedule spiders, and retrieve scraped data or
  logs without leaving the command line.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  copy-eggs     Sync eggs from one project with other project
  deploy        Deploy Scrapy project to Scrapy Cloud
  deploy-egg    [DEPRECATED] Build and deploy egg from source
  deploy-reqs   [DEPRECATED] Build and deploy eggs from requirements.txt
  fetch-eggs    Download project eggs from Scrapy Cloud
  image         Manage project based on custom Docker image
  items         Fetch items from Scrapy Cloud
  log           Fetch log from Scrapy Cloud
  login         Save your Scrapinghub API key
  logout        Forget saved Scrapinghub API key
  migrate-eggs  Migrate dash eggs to requirements.txt and project's directory
  requests      Fetch requests from Scrapy Cloud
  schedule      Schedule a spider to run on Scrapy Cloud
  version       Show shub version

  For usage and help on a specific command, run it with a --help flag, e.g.:

      shub schedule --help
scrapyd@6b82490da131:~$ 
scrapyd@6b82490da131:~$ 
scrapyd@6b82490da131:~$ pwd
/home/scrapyd

scrapyd@6b82490da131:~/projects/PyLearning$ cd stack/
scrapyd@6b82490da131:~/projects/PyLearning/stack$ cat requirements.txt 
pymongo==3.4.0
scrapyd@6b82490da131:~/projects/PyLearning/stack$ 
scrapyd@6b82490da131:~/projects/PyLearning/stack$ 
scrapyd@6b82490da131:~/projects/PyLearning/stack$ 
scrapyd@6b82490da131:~/projects/PyLearning/stack$ cat scrapinghub.yml 
projects:
  default: 136494
requirements_file: requirements.txt
scrapyd@6b82490da131:~/projects/PyLearning/stack$ 
scrapyd@6b82490da131:~/projects/PyLearning/stack$ shub login

-------------------------------------------------------------------------------
Welcome to shub version 2!

This release contains major updates to how shub is configured, as well as
updates to the commands and shub's look & feel.

Run 'shub' to get an overview over all available commands, and
'shub command --help' to get detailed help on a command. Definitely try the
new 'shub items -f [JOBID]' to see items live as they are being scraped!

From now on, shub configuration should be done in a file called
'scrapinghub.yml', living next to the previously used 'scrapy.cfg' in your
Scrapy project directory. Global configuration, for example API keys, should be
done in a file called '.scrapinghub.yml' in your home directory.

But no worries, shub has automatically migrated your global settings to
~/.scrapinghub.yml, and will also automatically migrate your project settings
when you run a command within a Scrapy project.

Visit http://doc.scrapinghub.com/shub.html for more information on the new
configuration format and its benefits.

Happy scraping!
-------------------------------------------------------------------------------

Enter your API key from https://app.scrapinghub.com/account/apikey
API key: e4bfa1fd7f8d4d9da817aa112bb82095
Validating API key...
API key is OK, you are logged in now.
scrapyd@6b82490da131:~/projects/PyLearning/stack$ 
scrapyd@6b82490da131:~/projects/PyLearning/stack$ shub deploy
Target project ID: 136494
Save as default [Y/n]: Y
Project 136494 was set as default in scrapinghub.yml. You can deploy to it via 'shub deploy' from now on.
Packing version e7d7f6c-inet1
Deploying to Scrapy Cloud project "136494"
{"spiders": 1, "status": "ok", "project": 136494, "version": "e7d7f6c-inet1"}
Run your spiders at: https://app.scrapinghub.com/p/136494/

* Get your API key from https://app.scrapinghub.com/account/apikey
** For a new project, the project ID is shown at https://app.scrapinghub.com/p/PROJECT_ID/deploy
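The transcript above covers the deployment, and requirements.txt pulls in pymongo, but the pipeline that writes items to MongoDB is not shown. A minimal sketch of such an item pipeline follows. The class name MongoPipeline, the setting names MONGO_URI and MONGO_DATABASE, and the collection name "items" are assumptions for illustration and are not taken from the original project.

```python
class MongoPipeline(object):
    """Item pipeline sketch: stores each scraped item in a MongoDB collection."""

    def __init__(self, mongo_uri, mongo_db, collection_name="items"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name  # assumed collection name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed setting names that would
        # have to be defined in the project's settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "scrapy"),
        )

    def open_spider(self, spider):
        # Imported here so the module can be loaded where pymongo is missing.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Store a plain dict copy and pass the item on to later pipelines.
        self.collection.insert_one(dict(item))
        return item
```

To take effect, the class would have to be listed in ITEM_PIPELINES in settings.py, for example under a module path such as stack.pipelines.MongoPipeline (that path is a guess based on the project directory name). The MongoDB server also has to be reachable from Scrapy Cloud, not only from the local machine.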

Tuesday, November 22, 2016

Deploy the dataset project to Scrapyd

scrapyd@a4c2642d74db:~/projects$ git clone https://github.com/nutthaphon/PyLearning.git
Cloning into 'PyLearning'...
remote: Counting objects: 323, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 323 (delta 0), reused 0 (delta 0), pack-reused 317
Receiving objects: 100% (323/323), 61.27 KiB | 0 bytes/s, done.
Resolving deltas: 100% (146/146), done.
Checking connectivity... done.
scrapyd@a4c2642d74db:~/projects$ 
scrapyd@a4c2642d74db:~/projects$ 
scrapyd@a4c2642d74db:~/projects$ 
scrapyd@a4c2642d74db:~/projects$ cd PyLearning/
scrapyd@a4c2642d74db:~/projects/PyLearning$ ls
Animals  CherryPy  DJango  README.md  Scraping  dataset  decor  foo  serial  test  tutorial
scrapyd@a4c2642d74db:~/projects/PyLearning$ cd dataset/
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ ls
__init__.py  __init__.pyc  build  project.egg-info  scrapinghub.yml  scrapy.cfg  settrade  setup.py
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ sed -i 's/#url/url/g' scrapy.cfg
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ cat scrapy.cfg
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = settrade.settings

[deploy]
url = http://localhost:6800/
project = settrade
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ ls
__init__.py  __init__.pyc  build  project.egg-info  scrapinghub.yml  scrapy.cfg  settrade  setup.py
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ cd settrade/
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset/settrade$ ls
DailyStockQuote.py   QueryStockSymbol.py   ThinkSpeakChannels.py   __init__.py   items.py      scrapinghub.yml  settings.pyc
DailyStockQuote.pyc  QueryStockSymbol.pyc  ThinkSpeakChannels.pyc  __init__.pyc  pipelines.py  settings.py      spiders
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset/settrade$ cd spiders/
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset/settrade/spiders$ ls
SettradeSpider.py  SettradeSpider.pyc  __init__.py  __init__.pyc
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset/settrade/spiders$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset/settrade/spiders$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset/settrade/spiders$ ls
SettradeSpider.py  SettradeSpider.pyc  __init__.py  __init__.pyc
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset/settrade/spiders$ cd ..
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset/settrade$ cd ..
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ ls
__init__.py  __init__.pyc  build  project.egg-info  scrapinghub.yml  scrapy.cfg  settrade  setup.py
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ ~/scrapyd-client-master/scrapyd-client/scrapyd-deploy -l
default              http://localhost:6800/
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ ~/scrapyd-client-master/scrapyd-client/scrapyd-deploy default -p dataset
Packing version 1479804720
Deploying to project "dataset" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "dataset", "version": "1479804720", "spiders": 1, "node_name": "a4c2642d74db"}

scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ ~/scrapyd-client-master/scrapyd-client/scrapyd-deploy -L default
tutorial
dataset
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ curl http://localhost:6800/listprojects.json
{"status": "ok", "projects": ["tutorial", "dataset"], "node_name": "a4c2642d74db"}
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ curl http://localhost:6800/listspiders.json?project=dataset
{"status": "ok", "spiders": ["settrade_dataset"], "node_name": "a4c2642d74db"}
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ 
scrapyd@a4c2642d74db:~/projects/PyLearning/dataset$ curl http://localhost:6800/schedule.json -d project=dataset -d spider=settrade_dataset -d setting=DOWNLOAD_DELAY=5 -d start_urls=http://www.settrade.com/servlet/IntradayStockChartDataServlet?symbol=INTUCH,http://www.settrade.com/servlet/IntradayStockChartDataServlet?symbol=CPF,http://www.settrade.com/servlet/IntradayStockChartDataServlet?symbol=ADVANC
{"status": "ok", "jobid": "c60e4566b09111e68c380242ac110002", "node_name": "a4c2642d74db"}
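The final curl call posts form fields to Scrapyd's schedule.json endpoint: project and spider, a setting field of the form NAME=VALUE, and any other field (here start_urls) passed to the spider as an argument. The same request can be sketched in Python 3 with only the standard library. The helper names build_schedule_request and schedule are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(base_url, project, spider, settings=None, **spider_args):
    """Build (but do not send) a POST request for Scrapyd's schedule.json."""
    fields = [("project", project), ("spider", spider)]
    # Each Scrapy setting becomes its own "setting=NAME=VALUE" form field.
    for name, value in sorted((settings or {}).items()):
        fields.append(("setting", "%s=%s" % (name, value)))
    # Remaining keyword arguments are forwarded as spider arguments.
    fields.extend(sorted(spider_args.items()))
    body = urlencode(fields).encode("ascii")
    return Request(base_url.rstrip("/") + "/schedule.json", data=body)


def schedule(base_url, project, spider, settings=None, **spider_args):
    """Send the request and return Scrapyd's decoded JSON reply."""
    request = build_schedule_request(base_url, project, spider, settings, **spider_args)
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    request = build_schedule_request(
        "http://localhost:6800/",
        "dataset",
        "settrade_dataset",
        settings={"DOWNLOAD_DELAY": 5},
        start_urls="http://www.settrade.com/servlet/IntradayStockChartDataServlet?symbol=INTUCH",
    )
    print(request.full_url)
    print(request.data.decode("ascii"))
```

Unlike the unquoted curl command, urlencode percent-encodes the field values, so characters such as ? and = inside start_urls cannot be misread by the shell or the server.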

Deploying your project with Scrapyd

ubuntu@node2:~/Docker/nutthaphon/scrapyd/user$ docker run --name scrapyd-server --user scrapyd -it -P nutthaphon/scrapyd:1.1.1 bash
scrapyd@a4c2642d74db:~$ pwd
/home/scrapyd
scrapyd@a4c2642d74db:~$ ls
master.zip  scrapyd-client-master  setuptools-28.8.0.zip
scrapyd@a4c2642d74db:~$ mkdir projects
scrapyd@a4c2642d74db:~$ cd projects/
scrapyd@a4c2642d74db:~/projects$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /home/scrapyd/projects/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
scrapyd@a4c2642d74db:~/projects$ cd tutorial/
scrapyd@a4c2642d74db:~/projects/tutorial$ ls
scrapy.cfg  tutorial
scrapyd@a4c2642d74db:~/projects/tutorial$ vi scrapy.cfg 
bash: vi: command not found
scrapyd@a4c2642d74db:~/projects/tutorial$ cat scrapy.cfg 
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = tutorial.settings

[deploy]
#url = http://localhost:6800/
project = tutorial
scrapyd@a4c2642d74db:~/projects/tutorial$ sed -i 's/#utl/url/g' scrapy.cfg 
scrapyd@a4c2642d74db:~/projects/tutorial$ cat scrapy.cfg 
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = tutorial.settings

[deploy]
#url = http://localhost:6800/
project = tutorial
scrapyd@a4c2642d74db:~/projects/tutorial$ sed -i 's/#url/url/g' scrapy.cfg 
scrapyd@a4c2642d74db:~/projects/tutorial$ cat scrapy.cfg 
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = tutorial.settings

[deploy]
url = http://localhost:6800/
project = tutorial
scrapyd@a4c2642d74db:~/projects/tutorial$ ~/scrapyd-client-master/scrapyd-client/scrapyd-deploy -l
default              http://localhost:6800/
scrapyd@a4c2642d74db:~/projects/tutorial$ ~/scrapyd-client-master/scrapyd-client/scrapyd-deploy default -p tutorial
Packing version 1479799362
Deploying to project "tutorial" in http://localhost:6800/addversion.json
Deploy failed: <urlopen error [Errno 111] Connection refused>
scrapyd@a4c2642d74db:~/projects/tutorial$ ~/scrapyd-client-master/scrapyd-client/scrapyd-deploy default -p tutorial
Packing version 1479799403
Deploying to project "tutorial" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "tutorial", "version": "1479799403", "spiders": 0, "node_name": "a4c2642d74db"}

scrapyd@a4c2642d74db:~/projects/tutorial$ ~/scrapyd-client-master/scrapyd-client/scrapyd-deploy -L default
tutorial
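The first scrapyd-deploy attempt above fails with "Connection refused" and the second succeeds about 40 seconds later, which suggests that nothing was listening on port 6800 yet (the transcript does not show the daemon being started, so this is an inference). A small pre-flight check against listprojects.json can be sketched in Python 3 as follows. The function names parse_projects and list_projects are made up for this example.

```python
import json
from urllib.request import urlopen


def parse_projects(body):
    """Return the project list from a listprojects.json reply, or None on error."""
    data = json.loads(body)
    if data.get("status") != "ok":
        return None
    return data.get("projects", [])


def list_projects(base_url="http://localhost:6800/", timeout=3):
    """Ask Scrapyd for its projects; return None if it cannot be reached."""
    url = base_url.rstrip("/") + "/listprojects.json"
    try:
        with urlopen(url, timeout=timeout) as response:
            return parse_projects(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts,
        # ValueError covers replies that are not valid JSON.
        return None


if __name__ == "__main__":
    projects = list_projects()
    if projects is None:
        print("Scrapyd is not reachable; start it before running scrapyd-deploy")
    else:
        print("Scrapyd is up, projects:", projects)
```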

Monday, August 22, 2016

Scrapy Tutorial

Prerequisites

# Install pip, the Python package installer
sudo apt-get install python-pip

pip search scrapy
pip install scrapy
pip show scrapy
#pip uninstall scrapy

Following the tutorial from Scrapy.org

nutt@nutt-pc:~/OneDrive/Scrapy$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /home/nutt/OneDrive/Scrapy/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com

nutt@nutt-pc:~/OneDrive/Scrapy$ tree tutorial/
tutorial/
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 6 files

Create a new file, dmoz_spider.py, in the spiders directory
and modify items.py in the tutorial package directory

nutt@nutt-pc:~/OneDrive/Scrapy/tutorial$ scrapy crawl dmoz
2016-08-22 23:08:01 [scrapy] INFO: Scrapy 1.1.2 started (bot: tutorial)
2016-08-22 23:08:01 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2016-08-22 23:08:02 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-08-22 23:08:02 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-22 23:08:02 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-22 23:08:02 [scrapy] INFO: Enabled item pipelines:
[]
2016-08-22 23:08:02 [scrapy] INFO: Spider opened
2016-08-22 23:08:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-22 23:08:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-08-22 23:08:03 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/robots.txt> (referer: None)
2016-08-22 23:08:04 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-08-22 23:08:04 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-08-22 23:08:04 [scrapy] INFO: Closing spider (finished)
2016-08-22 23:08:04 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 734,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 16908,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 8, 22, 16, 8, 4, 322225),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 8, 22, 16, 8, 2, 550009)}
2016-08-22 23:08:04 [scrapy] INFO: Spider closed (finished)
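The stats dump records start_time and finish_time as datetime objects at 16:08, while the log lines are stamped 23:08. The stats values appear to be in UTC and the log lines in local time, which would be UTC+7 here; that reading is an inference from the numbers above. A short sketch of the arithmetic:

```python
from datetime import datetime

# Values copied from the stats dump above.
start_time = datetime(2016, 8, 22, 16, 8, 2, 550009)
finish_time = datetime(2016, 8, 22, 16, 8, 4, 322225)

# Duration of the crawl.
elapsed = finish_time - start_time
print(elapsed.total_seconds())

# The "Spider opened" log line is stamped 23:08:02 on the same day.
log_opened = datetime(2016, 8, 22, 23, 8, 2)
offset = log_opened - start_time.replace(microsecond=0)
print(offset)
```

The first value is the crawl duration in seconds (a little under two seconds) and the second is the seven-hour gap between the log timestamps and the stats timestamps.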

nutt@nutt-pc:~/OneDrive/Scrapy$ tree tutorial/
tutorial/
├── Books.html
├── Resources.html
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── __init__.pyc
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    ├── settings.pyc
    └── spiders
        ├── dmoz_spider.py
        ├── dmoz_spider.pyc
        ├── __init__.py
        └── __init__.pyc

2 directories, 13 files

The crawl creates two new files, Books.html and Resources.html, which hold the bodies of the scrapy.http.Response objects returned for the spider's scrapy.Request objects.
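The post does not show dmoz_spider.py, but the file names match the naming scheme of the Scrapy 1.1 tutorial spider, which takes the second-to-last path segment of the response URL and appends ".html". A minimal sketch of that step follows, written as plain functions so it runs without Scrapy. The names filename_for and save_body are made up for this example.

```python
import os


def filename_for(url):
    """Derive the output file name from a URL that ends with a slash."""
    # ".../Python/Books/".split("/") ends with ["Python", "Books", ""],
    # so index -2 is the last real path segment.
    return url.split("/")[-2] + ".html"


def save_body(url, body, directory="."):
    """Write the raw response body (bytes) to a file named after the URL."""
    path = os.path.join(directory, filename_for(url))
    with open(path, "wb") as handle:
        handle.write(body)
    return path


if __name__ == "__main__":
    for url in (
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ):
        print(filename_for(url))
```

In the spider itself, the parse callback would call something like save_body(response.url, response.body), which is why the files land in the directory where scrapy crawl was run.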
