Posts

How to extract cookies from Google Chrome

I have a site crawler. The site is protected from robots, and to bypass this protection I pass cookies from a browser session to the Java URL request. Chrome stores cookies in an SQLite database in encrypted form. Before starting the crawler, my program opens the Chrome browser with the specified URL:

Runtime.getRuntime().exec(new String[] { "chromium-browser", "http://somesite.com" });
TimeUnit.SECONDS.sleep(40);

Once the browser is open and the cookies are written to the SQLite database, I read them using the CookieMonster library:

ChromeBrowser chrome = new ChromeBrowser();
Set<Cookie> cookies = chrome.getCookiesForDomain("somesite.com");

After the cookies are extracted, they have to be decrypted. If you use a Chrome version lower than 53, you should use the password "peanuts" to decrypt your cookies. But if you use version 53 or higher, you should get the password from the OS secure storage....
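The decryption step for the pre-53 case can be sketched as follows. This is a minimal sketch, assuming the Linux scheme: Chrome derives an AES-128 key via PBKDF2 from the fixed password "peanuts" with the fixed salt "saltysalt" and a single iteration, prefixes each encrypted value with "v10", and uses CBC mode with an IV of 16 space characters. The class name is illustrative, not part of the article's code.

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ChromeCookieDecryptor {

    // Chrome on Linux (before v53) derives the AES key from the fixed
    // password "peanuts", fixed salt "saltysalt", and one PBKDF2 iteration.
    private static final char[] PASSWORD = "peanuts".toCharArray();
    private static final byte[] SALT = "saltysalt".getBytes(StandardCharsets.UTF_8);
    private static final int ITERATIONS = 1;
    private static final int KEY_BITS = 128;

    public static byte[] deriveKey() throws Exception {
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
        return factory.generateSecret(
                new PBEKeySpec(PASSWORD, SALT, ITERATIONS, KEY_BITS)).getEncoded();
    }

    // Encrypted values start with the 3-byte version prefix "v10";
    // the remainder is AES-128-CBC with an IV of 16 space characters.
    public static String decrypt(byte[] encryptedValue) throws Exception {
        byte[] body = Arrays.copyOfRange(encryptedValue, 3, encryptedValue.length);
        byte[] iv = new byte[16];
        Arrays.fill(iv, (byte) ' ');
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE,
                new SecretKeySpec(deriveKey(), "AES"), new IvParameterSpec(iv));
        return new String(cipher.doFinal(body), StandardCharsets.UTF_8);
    }
}
```

Each encrypted value read from the SQLite cookies table would be passed through decrypt(); for version 53+ the same cipher applies, only the PBKDF2 password comes from the OS keyring instead of "peanuts".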
How to replicate files from FTP to the local file system with Apache NiFi

For tests we need an SFTP server. In my case I created it on Google Cloud. To do this we create a Compute Engine instance based on CentOS (because CentOS has a built-in SFTP server).

After creating the compute instance, log in using the browser pop-up terminal.

In the opened terminal window create a user with sudo adduser myuser and set a password with sudo passwd myuser. Then open the sshd config for editing with sudo vi /etc/ssh/sshd_config and uncomment PasswordAuthentication yes. Then simply stop and start the compute engine.

The next step is to create the NiFi workflow. Just add the processors to the canvas as on the screenshot below and connect them.

Then open the ListSFTP processor and set the host, username, password and path to the source directory...
Bulk Data Load Using Apache Beam JDBC

Sometimes we have to load data from a relational database into BigQuery. In this article I will describe how to do this using MySQL as a data source, Apache Beam and Google Dataflow. Moreover, we will do it in parallel. Let's start.

First of all we will create an Apache Beam project: Apache Beam quick start guide. Then we will go step by step:

1) Add a dependency on the Beam JDBC IO module:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-jdbc</artifactId>
  <version>${beam.version}</version>
</dependency>

2) Because we will load data in parallel, it is a good idea to have a connection pool. I chose the c3p0 connection pool:

<dependency>
  <groupId>c3p0</groupId>
  <artifactId>c3p0</artifactId>
  <version>0.9.1</version>
</dependency>
...
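The parallel load promised above is commonly achieved by splitting the table's primary-key range into non-overlapping chunks, so each chunk becomes one bounded SELECT that workers can run concurrently. A minimal sketch of such a splitter (class and method names are illustrative, not the article's code):

```java
import java.util.ArrayList;
import java.util.List;

public class IdRangeSplitter {

    /** Inclusive [from, to] primary-key range. */
    public static class Range {
        public final long from;
        public final long to;
        public Range(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    // Split [minId, maxId] into at most `chunks` roughly equal,
    // non-overlapping inclusive ranges covering the whole interval.
    public static List<Range> split(long minId, long maxId, int chunks) {
        List<Range> ranges = new ArrayList<>();
        long total = maxId - minId + 1;
        long step = (total + chunks - 1) / chunks;  // ceiling division
        for (long from = minId; from <= maxId; from += step) {
            ranges.add(new Range(from, Math.min(from + step - 1, maxId)));
        }
        return ranges;
    }
}
```

Each Range then maps to a query such as SELECT ... WHERE id BETWEEN ? AND ?, issued through the pooled c3p0 DataSource, so the number of chunks directly controls the degree of parallelism and the load on MySQL.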