programming

Fuzzy matching with many to many matches without loops

As a computer science graduate, I always strive to reduce my computational cost through parallelization or vectorization! Explicit loops in data science are the root of all evil! For loops and while loops have their place, but rarely in the data science space (a fairly broad statement, I know). In this post, I hope to show a really cool example that avoids writing the dreaded O(n²) nested loop. I will be using fuzzy matching to find the closest match for each string in data frame 2 (df2) against data frame 1 (df1).
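To make the idea concrete, here is a minimal sketch of the "pair everything, score every pair, keep the best" pattern, using only the standard library. It is not the method from the post: `closest_matches`, the sample strings, and the choice of `difflib.SequenceMatcher` as the similarity score are all assumptions for illustration. The plain Python iteration here stands in for what a dataframe library would do with a cross join followed by a group-wise maximum.

```python
from difflib import SequenceMatcher
from itertools import product


def closest_matches(df1_names, df2_names):
    """Return {df2 string: (closest df1 string, similarity score)}.

    Hypothetical helper: the inputs are plain lists standing in for one
    string column from each data frame.
    """
    best = {}
    # Cross join: every df2 string is paired with every df1 string.
    for right, left in product(df2_names, df1_names):
        # Similarity in [0, 1]; 1.0 means the strings are identical.
        score = SequenceMatcher(None, right, left).ratio()
        # Keep only the highest-scoring df1 candidate per df2 string.
        if right not in best or score > best[right][1]:
            best[right] = (left, score)
    return best


if __name__ == "__main__":
    df1_names = ["apple inc", "microsoft corp", "alphabet"]
    df2_names = ["aple inc", "microsft"]
    for name, (match, score) in closest_matches(df1_names, df2_names).items():
        print(name, "->", match, round(score, 2))
```

The number of comparisons is still len(df1) × len(df2). What vectorization changes is where that work runs, not how much of it there is.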

Latest lessons learnt from crawling

I just realised that there’s a quick way to understand XPath patterns. In the past, I would manually eyeball the page source or the inspect panel to infer the patterns. Silly me! One quick way to understand the pattern is as follows:

1. Right-click on an element in a web page that you are interested in and click ‘Inspect’.
2. Right-click on the node and click ‘Copy’.
3. Choose ‘Copy full XPath’.
4. Paste it into a notepad.

Asset allocation notification

I’m in the midst of automating and guiding my life with algorithms (largely inspired by Ray Dalio), and one of the guidelines that I have set is on asset allocation:

- Emerging markets and developed markets should be held in the same proportion.
- The bonds + cash proportion (as a percentage of the portfolio) should be equivalent to my age. This can deviate in times of crisis, when I want to be more opportunistic.

If my portfolio deviates from the portfolio policy statement, a Pushover notification is sent to my phone :)
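The two guidelines above can be sketched as a small check. This is only an illustration under stated assumptions: the function name `check_allocation`, the bucket names, and the 5-point `tolerance` are all hypothetical, and the step that actually sends the Pushover notification is left out.

```python
def check_allocation(weights, age, tolerance=5.0):
    """Return a list of alert messages for deviations from the policy.

    weights: percent of the portfolio in each bucket (hypothetical keys:
             "emerging", "developed", "bonds", "cash").
    age: investor's age in years.
    tolerance: allowed drift in percentage points (an assumed value).
    """
    alerts = []

    # Guideline 1: emerging and developed markets in equal proportion.
    gap = weights["emerging"] - weights["developed"]
    if abs(gap) > tolerance:
        alerts.append(
            "Emerging vs developed markets differ by {:.1f} points".format(abs(gap))
        )

    # Guideline 2: bonds + cash (in percent) should match the investor's age.
    defensive = weights["bonds"] + weights["cash"]
    if abs(defensive - age) > tolerance:
        alerts.append(
            "Bonds + cash is {:.1f}% but the target is {}%".format(defensive, age)
        )

    return alerts


if __name__ == "__main__":
    portfolio = {"emerging": 20.0, "developed": 50.0, "bonds": 20.0, "cash": 10.0}
    for message in check_allocation(portfolio, age=30):
        # In the real workflow, each message would go out as a push notification.
        print(message)
```

Returning the messages instead of sending them keeps the rule logic testable on its own. The notification call can then be a thin wrapper around it.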

ETF watchlist email notification through Python

I finally bit the bullet and updated my previously hideous email notification! You may find the updated email notification template here, alongside the code. Feel free to ping me if you are keen to be on the email list too. ~ Jirong

    import smtplib
    import ssl
    import datetime
    import pandas as pd
    from email.mime.text import MIMEText
    from email.mime.multipart import MIMEMultipart

    # Format text
    data = pd.read_csv('/home/jirong/Desktop/github/ETF_watchlist/Output/yahoo_crawled_data.csv')
    data['Change_fr_52_week_high'] = round(100 * data['Change_fr_52_week_high'], 1)
    data = data[['Name', 'Price', 'Change_fr_52_week_high']].

Convert NAs to Obscure Number in Data Frame to Aid in Recoding/Feature Engineering

One issue that I encounter while data munging is that NAs in the data seem to mess up my recoding. Here’s a neat Swiss Army knife utility function I developed recently.

    suppressMessages(library(dplyr))

    # Converting NA to obscure number to prevent awkward recoding situations that require & !is.na(<variable>)
    # Doesn't work for factors

    #' @title Convert NA to obscure number
    #' @param dp_dataframe Dataframe in consideration
    #' @param np_obscure_num Numeric - Obscure number
    #' @param bp_na_to_num Boolean if TRUE, convert NA to num.

Loading excel data with correct variable types

When reading static files into R or Python, most of the time we are lazy and load the data with no regard for the data types. But in mission-critical ETL jobs or data analytics workflows, data types are essential, and there’s a fine line between life and death. OK, I’m exaggerating here. What I’ve written below is a Swiss Army knife function to read an Excel file in which the 1st tab holds the data and the 2nd tab holds the variable types (e.

Function to describe clusters derived from unsupervised learning

As data scientists / analysts, besides doing cool modelling stuff, we’re often asked to churn out descriptive statistics. Yes, we know. It’s part of the process. I chanced upon this really nifty concept at work for describing the clusters derived from unsupervised learning. Here’s how it goes. Say it’s a nominal or ordinal variable:

1. First, I find the proportion of the feature in each of the X clusters.
2. Second, I rank these proportions as percentiles across the X values.
3. The cluster with the highest percentile earns the right to be represented by the feature.

And if it’s a scale variable, you may find the mean of the feature for each cluster and repeat the steps.
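The steps for a nominal variable can be sketched roughly as follows. This is a minimal standard-library illustration, not the function from the post: `describe_clusters`, the `(cluster, category)` input shape, and the tie-breaking rule are assumptions.

```python
from collections import Counter


def describe_clusters(rows):
    """Map each category to the cluster that best represents it.

    rows: list of (cluster_label, category) pairs, one per observation
          (a hypothetical input shape chosen for this sketch).
    """
    rows = list(rows)
    cluster_sizes = Counter(cluster for cluster, _ in rows)
    pair_counts = Counter(rows)
    clusters = sorted(cluster_sizes)
    categories = sorted({category for _, category in rows})

    representative = {}
    for category in categories:
        # Step 1: proportion of this category within each cluster.
        proportions = {
            c: pair_counts[(c, category)] / cluster_sizes[c] for c in clusters
        }
        # Step 2: percentile rank of each cluster's proportion across clusters.
        percentiles = {
            c: sum(p <= proportions[c] for p in proportions.values()) / len(clusters)
            for c in clusters
        }
        # Step 3: the cluster with the highest percentile represents the category
        # (ties go to the first cluster in sorted order, an arbitrary choice).
        representative[category] = max(clusters, key=lambda c: percentiles[c])
    return representative


if __name__ == "__main__":
    rows = [("A", "cat")] * 3 + [("A", "dog")] + [("B", "cat")] + [("B", "dog")] * 3
    print(describe_clusters(rows))
```

For a scale variable, the same ranking would apply with the per-cluster mean in place of the proportion.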

Playing with Google Place API

I was playing around with the API to obtain lat-long coordinates for my geo analytics work. I entered my credit card info, but it seems that I’m not charged even after 9000+ API calls. Unsure if it’s because I have 400+ dollars of free cloud credit? Anyway, what I did here was to make API calls and store the data in my local database. If you’re interested, you may visit this stackoverflow link (https://stackoverflow.

Some thoughts on Reinforcement Learning - Q Learning

I just completed a Reinforcement Learning assignment, in particular on Q-learning. According to Wikipedia here, it’s a model-free RL algorithm. The goal of the algorithm is to learn a policy, which tells an agent what action to take under different circumstances. Here’s my confession. What I’m doing in this post is summarising what I’ve just learnt so that I can come back to it at any point in the future.
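As a reminder of the core idea, here is a minimal sketch of the standard tabular Q-learning update, Q(s, a) ← Q(s, a) + α · (r + γ · max Q(s′, ·) − Q(s, a)). It is not the assignment code: the function name `q_update`, the nested-list table layout, and the default values for `alpha` (learning rate) and `gamma` (discount factor) are assumptions for illustration.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q-value.

    q_table: list of lists, q_table[state][action] -> estimated value
             (a hypothetical layout for small, discrete problems).
    alpha: learning rate; gamma: discount factor (both assumed defaults).
    """
    # Best value achievable from the next state, according to current estimates.
    best_next = max(q_table[next_state])
    # Temporal-difference target and error.
    target = reward + gamma * best_next
    td_error = target - q_table[state][action]
    # Move the current estimate a fraction alpha towards the target.
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


if __name__ == "__main__":
    # Two states, two actions, all estimates start at zero.
    q = [[0.0, 0.0], [0.0, 0.0]]
    q_update(q, state=0, action=1, reward=1.0, next_state=1)
    print(q)
```

The update needs no model of the environment's transitions, only the observed (state, action, reward, next state) tuple. That is what "model-free" refers to.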

Hosting a Flask App on Heroku

Following the steps here -> https://realpython.com/flask-by-example-part-1-project-setup/ I managed to deploy my Python Flask app on Heroku.

    from flask import Flask

    app = Flask(__name__)

    # Root URL returns a fixed greeting
    @app.route('/')
    def hello():
        return "Hello World!"

    # Any URL suffix is captured as `name` and echoed back
    @app.route('/<name>')
    def hello_name(name):
        return "Hello {}!".format(name)

    if __name__ == '__main__':
        app.run()

You may visit the following link -> https://jirong-stage.herokuapp.com/ and add a suffix to it. For example, https://jirong-stage.herokuapp.com/jirong will return Hello jirong! The possibilities are immense! I can easily create APIs or host a dashboard here.