```bash
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
export NVM_DIR="$([ -z "${XDG_CONFIG_HOME-}" ] && printf %s "${HOME}/.nvm" || printf %s "${XDG_CONFIG_HOME}/nvm")"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
nvm install --lts
nvm use --lts
```
```bash
sudo apt install git build-essential python
npm install -g npm
```
```bash
mkdir ~/.npm-global
npm config set prefix '~/.npm-global'
export PATH=~/.npm-global/bin:$PATH
source ~/.profile
```
npm install ganache -g
npm install truffle -g
git clone https://github.com/adorsys/p2p-lending.git
ganache -p 8545 &
```bash
cd p2p-lending
rm package-lock.json
npm install
truffle compile
npm run migrate:dev
```
npm config set python /usr/bin/python
```bash
cd frontend
npm uninstall node-sass
npm i -D sass
rm -rf node_modules package-lock.json && npm install
```
npm start
That’s it. I hope this provides some insight if you try to use P2P-Lending, though I strongly don’t recommend it. Since I’m not familiar with blockchain, I’m not sure whether there is any alternative to P2P-Lending. If anyone knows of one, please share it in the comments and I will let my friend know. Thanks in advance.
This comparison is written from the perspective of a developer who is fresh to AWS and a (self-claimed) expert in GCP.
With both, you can start coding without reading the documentation on first use. I feel this is a great advantage when moving a product to the cloud, saving tons of development time.
However, I feel Cloud Functions is simpler, more agile, and more intuitive than Lambda, based on the following:
So the winner is obvious: Cloud Functions is the better choice if you don’t want to deal with package installation or HTTP trigger settings.
Here I have to say that AWS impressed me. Every time I click save, the code is already deployed, although this comes with the drawback of less customizability. I enjoy this fast, uninterrupted deployment process. Almost every service can be instantly provisioned, modified, and deleted. (So far our codebase is relatively small, around 500 lines of code per service.)
So what does GCP lack?
All in all, GCP feels slower in many aspects, so AWS has a smoother user experience.
Lambda automatically wins this round, due to an ongoing logging issue in Cloud Functions that produces no logs when a function crashes. That issue aside, both provide detailed trace logs. Moreover, Lambda provides a configurable test event to run the function on the fly, which is really neat.
This won’t affect the user experience for experts, but it is definitely a plus for newcomers. Maybe it’s just because I’m a newbie to AWS, but I always find its official documentation difficult to read.
Actually, cloud technologies are pretty similar across providers. Although the underlying infrastructure may differ, the user-level implementations do not vary much. Better documentation would attract more people who are new to the cloud, and from this perspective, GCP has done a very good job of providing tutorials and tools.
This matters least to developers. It is neither good nor bad; it all depends on the traffic volume and the boss’s preferences.
Provider | Lambda | Cloud Functions |
---|---|---|
Pricing | 1M requests/month and 400K GB-sec/month of compute time free, then $0.20/1M requests and $0.00001667/GB-sec | 2M invocations/month and (400K GB-sec/month + 200K GHz-sec/month) of compute time free, then $0.40/1M invocations, plus $0.0000025/GB-sec and $0.00001/GHz-sec |
Cloud Functions has a more complex formula for calculating compute-time pricing: basically, it adds CPU usage (GHz-sec) into the calculation. So Lambda is cheaper if you are not using provisioned concurrency, which incurs extra cost.
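To make the two formulas concrete, here is a back-of-the-envelope sketch using the rates from the table above. The example workload (10M requests, each running 1 s at 1 GB memory and an assumed 1.4 GHz) is made up for illustration; real bills differ.

```python
# Rough monthly cost sketch using the quoted rates (illustrative only).
def lambda_cost(requests_m, gb_sec):
    req = max(requests_m - 1, 0) * 0.20               # first 1M requests free
    compute = max(gb_sec - 400_000, 0) * 0.00001667   # 400K GB-sec free
    return req + compute

def cloud_functions_cost(requests_m, gb_sec, ghz_sec):
    req = max(requests_m - 2, 0) * 0.40               # first 2M invocations free
    mem = max(gb_sec - 400_000, 0) * 0.0000025        # memory component
    cpu = max(ghz_sec - 200_000, 0) * 0.00001         # CPU component
    return req + mem + cpu

# 10M requests/month, each 1 s at 1 GB; assume ~1.4 GHz of CPU per GB-sec.
print(lambda_cost(10, 10_000_000))                    # ~ $161.8
print(cloud_functions_cost(10, 10_000_000, 14_000_000))  # ~ $165.2
```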
Here comes the question: does Cloud Functions have a similar provisioned-concurrency feature?
This is a general talk about how to integrate services on different cloud platforms, not specific to Cloud Functions or Lambda.
Both products integrate well with other products from the same cloud provider. Unless you need a unique cloud service, most development can be done within a single provider. If you feel your developers do not have enough work, distribute your services across two or three cloud providers; they will spend most of their time figuring out how to integrate those services instead of doing actual development.
This comparison is based on my personal experience. Both have advantages and disadvantages; please choose based on your use cases. But keep in mind: don’t try to integrate them! This is the one piece of advice I learned from using both.
If you have used Google Cloud Functions, you have probably been stuck with the limited system packages [1]. There is no `curl` or `wget`, and you cannot customize the runtime system. As the documentation states, it is a fully managed environment. Someone may jump out and yell the name Cloud Run. Yes, Cloud Run will be the successor to Cloud Functions in many aspects. However, it does not support triggers from a Cloud Storage bucket. Yes, yes, yes, you can use Pub/Sub with Cloud Run to implement a bucket trigger. But why not keep it simple?
During local testing, we normally use `gcloud auth application-default print-access-token` [2] to get the credentials for calling Google API endpoints. It can be integrated into a `curl` command within a script, or into a subprocess in your code. You may see the following command in many GCP API tutorials:
curl -H "Content-Type: application/x-www-form-urlencoded" -d "access_token=$(gcloud auth application-default print-access-token)" https://www.googleapis.com/oauth2/v1/tokeninfo
After testing that everything works fine on our local machine, it is time to move to the cloud. We assume Google Cloud will handle the credentials for us, because we are accessing resources within GCP using the same service account. This assumption is busted by the brutal reality: we still need to explicitly create credentials when calling other GCP services.
This becomes a barrier when moving to Cloud Functions. The initial thought would be to use a Python subprocess to call the `curl` command. As stated above, `curl` does not exist and cannot be installed in the Cloud Functions system. Luckily, the `curl` command can easily be replaced by the Python `requests` package. But how about `gcloud auth`? It is also not included in the system packages. So here is the solution.
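For illustration, here is roughly what that earlier tokeninfo `curl` call looks like with `requests`. This is only a sketch: the access token still has to come from somewhere, which is exactly the problem at hand.

```python
import requests

def token_info(access_token):
    # Equivalent of the curl command above: form-encoded POST to tokeninfo.
    resp = requests.post(
        'https://www.googleapis.com/oauth2/v1/tokeninfo',
        data={'access_token': access_token})
    return resp.json()
```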
We know there is a google-auth Python package [3] to handle GCP-related authentication. `google.auth.default()` returns a credentials object that has a token field. Looks promising, doesn’t it? How about getting the token from google-auth? So I wrote the following code (it can be run directly in a Cloud Function):
```python
import google.auth

def get_token(request):
    cred, project_id = google.auth.default()
    return f'{cred.__dict__}'
```
Well, the output shows no token:
{'token': None, 'expiry': None, '_scopes': None, '_service_account_email': 'default'}
I could not find the logic for when the token field gets populated in the google-auth package. Therefore, I came up with the assumption that the token field is populated during usage. I will use Document AI as an example; you could use other GCP services, and I think the logic behind them should be the same. I rewrote the sample code [4] to fit Cloud Functions:
```python
import google.cloud.documentai as gcd
import google.auth

def get_token(request):
    cred, project_id = google.auth.default()
    gcd_client = gcd.DocumentUnderstandingServiceClient(credentials=cred)
    req = gcd.ProcessDocumentRequest(
        parent=f"projects/{project_id}",
        input_config={
            'gcs_source': {'uri': 'gs://cloud-samples-data/documentai/form.pdf'},
            'mime_type': "application/pdf"},
        document_type="general",
        form_extraction_params={'enabled': True})
    return f'{cred.__dict__}'
```
Still, the `token` is None. Okay, it seems it has not been updated at all. I only initialized the Document AI client to avoid the extra charge of invoking the actual processing. However, if we actually process the document by adding `response = gcd_client.process_document(request=req)` before the return statement, the magic happens.
```python
import google.cloud.documentai as gcd
import google.auth

def get_token(request):
    cred, project_id = google.auth.default()
    gcd_client = gcd.DocumentUnderstandingServiceClient(credentials=cred)
    req = gcd.ProcessDocumentRequest(
        parent=f"projects/{project_id}",
        input_config={
            'gcs_source': {'uri': 'gs://cloud-samples-data/documentai/form.pdf'},
            'mime_type': "application/pdf"},
        document_type="general",
        form_extraction_params={'enabled': True})
    response = gcd_client.process_document(request=req)
    return f'{cred.__dict__}'
```
To avoid a security breach, I will not post the output here. You will see the `token` field has the value we are seeking. Well, we do have to pay for the Document AI API calls.
We could instead use the Cloud Logging SDK, which does not need many details in the request body. Compared to the previous method, the cost is even lower, almost free [5].
```python
import google.auth
import google.cloud.logging as cloud_logging

def get_token(request):
    cred, _ = google.auth.default()
    cloud_client = cloud_logging.Client(credentials=cred)
    log_name = 'cloudfunctions.googleapis.com%2Fcloud-functions'
    cloud_logger = cloud_client.logger(log_name)
    all_entries = cloud_logger.list_entries(page_size=1)
    entries = next(all_entries.pages)
    return f"{cred.__dict__}"
```
If we can use different Cloud Python SDKs, is there an SDK that needs a minimal number of lines of code? Yes, here is what I found: Cloud Translate, which charges based on the number of translated characters [6]. So we only need to process one character to obtain the token. The free-tier quota (500,000 characters) is big enough for testing purposes.
```python
from google.cloud import translate_v3
import google.auth

def get_token(request):
    credentials, project_id = google.auth.default()
    client = translate_v3.TranslationServiceClient(credentials=credentials)
    parent = client.location_path(project_id, 'us-central1')
    response = client.translate_text(
        contents=['a'], target_language_code='en', parent=parent)
    return f"{credentials.__dict__}"
```
The pattern is: get the default credentials, pass them into a Cloud SDK client, make one real (and cheap) API call, and then read the now-populated token field from the credentials object.
You can come up with your own solution using different Python SDKs [7].
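Wrapped up as a hypothetical helper (the Translate variant from above; the function name and structure are mine, not an official API):

```python
import google.auth
from google.cloud import translate_v3

def fetch_access_token():
    cred, project_id = google.auth.default()
    client = translate_v3.TranslationServiceClient(credentials=cred)
    parent = f"projects/{project_id}/locations/us-central1"
    # One cheap, real API call forces google-auth to refresh the credentials...
    client.translate_text(contents=['a'], target_language_code='en', parent=parent)
    return cred.token  # ...and the token field is now populated
```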
If there are Python SDKs for the GCP services you need, why bother getting the access token at all? Call them directly in your code.
Anyway, this is a great example of reinventing the wheel; I hope it provides some insight!
[1] https://cloud.google.com/functions/docs/reference/python-system-packages
[2] https://cloud.google.com/sdk/gcloud/reference/auth/application-default/print-access-token
[3] https://google-auth.readthedocs.io/en/latest/
[5] https://cloud.google.com/stackdriver/pricing
TL;DR: I chose Qt as the GUI toolkit for my side project, for its strong community and WebAssembly support.
Before I start mumbling, I would like to state my view on GUI vs. CLI. I like to use a CLI, as it is easy to integrate with pipes and automation. Developing a CLI also lets you focus on functionality instead of aligning pixels. Moreover, it feels geeky and cool. However, I have also found that the most easily understood applications usually have a good GUI. A picture is worth a thousand words, and the GUI is that picture in a program. The best applications, in my mind, are those that have both a GUI and a CLI.
Recently, I was looking for a GUI library that is open source and cross-platform. The first thing that came to mind was web development, like React and Vue. However, JS has a bad reputation for performance, and my side project is based on video processing. Then I googled a little; it seems WebAssembly is the best way to overcome the performance concerns. But Qt can also generate WebAssembly applications.
After trying both, I have summarized their pros and cons below:
In summary, WebAssembly is still not mature enough for building a full project. If you don’t want to miss this trend, use Qt instead: it can also generate WebAssembly applications, and you don’t need to worry about the low-level implementation.
Locate the Python Gym package folder. In my case, it is under `~/anaconda3/envs/openai-gym/lib/python3.5/site-packages/gym`.
Download the Gym 0.9.5 source code, which contains the board game environments.
Copy the `board_game` folder from the 0.9.5 source code (under `/gym-0.9.5/gym/envs/`) to your local Gym package environment folder (in my case, `~/anaconda3/envs/openai-gym/lib/python3.5/site-packages/gym/envs`).
Add the following code into `__init__.py` (`~/anaconda3/envs/openai-gym/lib/python3.5/site-packages/gym/envs/__init__.py`). It will register those envs.
```python
# Board games
# ----------------------------------------
register(
    id='Go9x9-v0',
    entry_point='gym.envs.board_game:GoEnv',
    kwargs={
        'player_color': 'black',
        'opponent': 'pachi:uct:_2400',
        'observation_type': 'image3c',
        'illegal_move_mode': 'lose',
        'board_size': 9,
    },
    # The pachi player seems not to be deterministic given a fixed seed.
    # (Reproduce by running "import gym; h = gym.make('Go9x9-v0'); h.seed(1);
    # h.reset(); h.step(15); h.step(16); h.step(17)" a few times.)
    # This is probably due to a computation time limit.
    nondeterministic=True,
)
register(
    id='Go19x19-v0',
    entry_point='gym.envs.board_game:GoEnv',
    kwargs={
        'player_color': 'black',
        'opponent': 'pachi:uct:_2400',
        'observation_type': 'image3c',
        'illegal_move_mode': 'lose',
        'board_size': 19,
    },
    nondeterministic=True,
)
register(
    id='Hex9x9-v0',
    entry_point='gym.envs.board_game:HexEnv',
    kwargs={
        'player_color': 'black',
        'opponent': 'random',
        'observation_type': 'numpy3c',
        'illegal_move_mode': 'lose',
        'board_size': 9,
    },
)
```
Run `pip install pachi-py` for the Go env. Then you’re all set to use the board game environments.
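As a quick sanity check, a sketch assuming the registration above succeeded (using the old Gym 0.9-era step/reset API):

```python
import gym

env = gym.make('Go9x9-v0')
obs = env.reset()
obs, reward, done, info = env.step(15)  # play one move on the 9x9 board
print(reward, done)
```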
All in all, Gym was built for testing reinforcement learning, and reinforcement learning gained fame from DeepMind’s AlphaGo. Personally, I think removing the Go env from Gym was not a smart marketing move.
Here is a list of problems that appear in both Leetcode and EPI (Elements of Programming Interviews). In my opinion, these commonly appearing questions are more important. I hope this helps job seekers prepare for their final interviews.
Leetcode | EPI | Difficulty | Problem | Company |
---|---|---|---|---|
21 | 7.1 | Easy | Merge Two Sorted Lists | Amazon |
20 | 8.3 | Easy | Valid Parentheses | Amazon |
8 | 6.1 | Medium | String to Integer (atoi) | Amazon |
48 | 5.19 | Medium | Rotate Image | Amazon |
15 | 17.4 | Medium | 3-Sum | Amazon |
42 | 24.32 | Hard | Trapping Rain Water | Amazon |
121 | 5.6 | Easy | Best Time to Buy and Sell Stock | Amazon |
89 | 15.10 | Medium | Gray Code | Amazon |
235 | 9.3 | Easy | Lowest Common Ancestor of a Binary Search Tree | Amazon |
98 | 14.1 | Medium | Validate Binary Search Tree | Amazon |
141 | 7.3 | Easy | Linked List Cycle | Amazon |
240 | 11.6 | Medium | Search a 2D Matrix II | Amazon |
234 | 7.11 | Easy | Palindrome Linked List | Amazon |
215 | 24.17 | Medium | Kth Largest Element in an Array | Amazon |
579 | 13.12 | Hard | Find Cumulative Salary of an Employee | Amazon |
Here is the question:
A company receives thousands of documents every day uploaded by our users. Generally these documents are invoices or bills. We would like to extract the vendor and amount from these documents automatically (i.e. using software rather than human inspection).
They store the following pieces of information for each document:
- The pdf document uploaded by the user (please see example.pdf attached)
- The text extracted from that pdf (please see example.txt attached - Note: often the extracted text would not be in an order that seems natural to a human reader)
- Labels of what the vendor and amount should be for each document (in the attached example, the vendor would be “Marketing Fuel Biz.” and the amount would be “747.50”).
Question: Describe a machine learning solution to this problem.
Addition: Some percentage of the stored labels may be incorrect. What would you change to mitigate this problem?
The sample pdf and OCR output txt are downloadable.
As the OCR result loses the invoice position information (see the sample txt file), traditional NLP methods, which expect sequential structure, would not work on such a text corpus [1]. So my proposed solution focuses on rebuilding the invoice structure information.
Based on my understanding, invoice structure follows certain patterns, such as the top-left area holding the vendor logo/name and the total amount sitting at the bottom right. There are definitely special cases, but this statement is a core assumption of my solution.
In order to track positions in a PDF file (which can easily be converted to an image), a convolutional neural network (CNN) [2] fits this task; CNNs have proven successful on many image processing tasks [3, 4, 5]. Although one paper [6] extracts invoice info with a recurrent neural network (RNN), its input is words plus positions (in our case, we do not have positions). So I propose using Faster R-CNN [4] or YOLO [7] to solve the problem; both are mature object detection models applied in many products.
The CNN model input should be images, and the outputs are labels and region coordinates (in a format like {vendor, 5, 15, 20, 40}).
Therefore, we need a dataset to train the CNN model. Since we already have the original PDF files and the vendor/amount labels, we can generate an image dataset for training. Each training entry contains an image converted from a PDF and the region info for the vendor/amount. The region info is the pair of coordinates defining a rectangle (e.g. (x1, y1) and (x2, y2) in figure 1). The dataset generation can be done by combining OCR and image processing: crop the image into multiple rectangles (moving windows), then apply OCR to each rectangle. Based on the text output of each rectangle, an area containing only the vendor name is labeled as vendor, and areas containing the amount are labeled as amount.
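Here is a rough sketch of that moving-window labeling idea. The window and stride sizes are made-up, and pytesseract stands in for whatever OCR engine is actually used:

```python
import pytesseract
from PIL import Image

def label_regions(image_path, vendor, amount, win=200, stride=100):
    """Scan the invoice image with a moving window and tag windows whose
    OCR text matches the known vendor/amount labels."""
    img = Image.open(image_path)
    w, h = img.size
    regions = []
    for x in range(0, w - win + 1, stride):
        for y in range(0, h - win + 1, stride):
            crop = img.crop((x, y, x + win, y + win))
            text = pytesseract.image_to_string(crop)
            if vendor in text:
                regions.append(('vendor', x, y, x + win, y + win))
            if amount in text:
                regions.append(('amount', x, y, x + win, y + win))
    return regions
```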
After we have the dataset, we split it into training and testing datasets. The split ratio could be 80/20 [8].
The evaluation metric is mean average precision (mAP) at different intersection over union (IoU) thresholds. The IoU of a proposed set of object pixels A and the set of true object pixels B is calculated as IoU(A, B) = |A ∩ B| / |A ∪ B| (see [9]). The metric sweeps over a range of IoU thresholds, at each point calculating an average precision value. The threshold values range from 0.5 to 0.95 with a step size of 0.05: (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95). In other words, at a threshold of 0.5, a predicted object is considered a “hit” if its IoU with a ground-truth object is greater than 0.5. At each threshold t we check whether the prediction “hit” the ground truth, and the mAP is then (1/|thresholds|) Σₜ hit(t).
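To make the metric concrete, here is a small sketch under the simplifying assumption of one predicted box and one ground-truth box per image, each given as (x1, y1, x2, y2):

```python
def iou(a, b):
    # Intersection rectangle, clipped to zero if the boxes don't overlap.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def mean_avg_precision(pairs):
    """pairs: list of (predicted_box, true_box); implements the simplified
    hit-rate-averaged-over-thresholds metric described above."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.5 .. 0.95
    hits = [sum(iou(p, t) > th for p, t in pairs) / len(pairs)
            for th in thresholds]
    return sum(hits) / len(thresholds)
```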
When the model training is finished (let’s assume it meets our expectations), we can apply a post-processing step to convert the result to our labels. We get the region coordinates from the model output, crop the corresponding rectangle, and feed it to OCR; that is the final result. Then we can compare the predicted labels with our ground truth in the database to evaluate the model performance (the evaluation metric could be precision on both labels).
Here is the overview of the proposed solution:
Convert PDF to Image -> Dataset Preparation -> CNN Model -> OCR -> Results
To mitigate the impact of incorrect labels, we can add an extra step to the dataset preparation. By calculating the frequency of each vendor label in the dataset, we can remove entries whose label frequency is lower than a threshold. This is based on the assumption that an incorrect label will not occur many times (more than the threshold) with the same value.
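A tiny sketch of that filter (the entry format and the threshold value are assumptions):

```python
from collections import Counter

def filter_rare_labels(entries, threshold=3):
    """entries: list of (image, vendor_label) pairs; drop labels seen
    fewer than `threshold` times, on the assumption they are noise."""
    counts = Counter(vendor for _, vendor in entries)
    return [(img, vendor) for img, vendor in entries
            if counts[vendor] >= threshold]
```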
I set up an experiment for the solution, training YOLOv3 [10] (pre-trained on ImageNet [11]) on 30 manually labeled invoice images (Google-searched images, each containing vendor, logo, and amount labels). Although the predicted labels on the validation dataset look promising, the mAP is almost zero on the test dataset. The low performance is probably due to a basic property of CNNs: a CNN can only learn features that appear in the training set. The way to improve the model would be to train on a larger dataset and assume it covers all test cases. Therefore, I would like to propose two new solutions for the ML question.
I reviewed the example.txt file; it is not fully unorganized. We can recognize some patterns in it: it reads column by column, not row by row as a human does. Although RNNs are good at sequential data, due to the vanishing gradient problem they don’t work well on long sentences. So the LSTM method came along, bringing RNNs the ability to remember long-distance relationships. For example, in “A cat jumps on the table, it breaks a cup, so we chase it off the table”, “it” represents the “cat” in the previous phrase. It may be easy to identify the first “it” as the cat because they are close, but for the second “it”, it is hard to tell what it represents (cat, table, or cup).
As we are processing text data, we need a preprocessing step to clean it up. First, remove punctuation marks like semicolons, colons, and exclamation marks, but keep periods and commas, because they may be used in numbers. Second, tokenize the words: build up a vocabulary dictionary and convert each word into its index in the dictionary. For unknown words and numbers, we use [UNK] and [NUM] instead. Finally, remove common words that do not help our task, like the word “invoice”, which appears in every invoice.
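A rough sketch of this preprocessing (the vocabulary handling, [UNK]/[NUM] tokens, and stop-word list are illustrative assumptions):

```python
import re

STOP_WORDS = {'invoice'}  # words that appear in every document

def preprocess(text, vocab):
    # Drop punctuation except periods and commas (they may appear in numbers).
    text = re.sub(r'[;:!?"()\[\]]', ' ', text)
    tokens = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        if re.fullmatch(r'[\d.,]+', word):
            tokens.append(vocab['[NUM]'])     # numbers collapse to one token
        else:
            tokens.append(vocab.get(word, vocab['[UNK]']))
    return tokens
```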
Then we can feed the data to the RNN model in a many-to-many setup: the input is the word sequence, and the output is, for each position, the possible vendor/amount label.
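For illustration, a minimal many-to-many tagger of this kind could be sketched in Keras as follows (the vocabulary size, layer sizes, and the three-tag O/VENDOR/AMOUNT scheme are assumptions, not the paper’s setup):

```python
from tensorflow.keras import layers, models

VOCAB, EMB, HIDDEN, N_TAGS = 10_000, 64, 128, 3  # tags: O / VENDOR / AMOUNT

model = models.Sequential([
    layers.Embedding(VOCAB, EMB, mask_zero=True),          # token indices -> vectors
    layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(N_TAGS, activation='softmax')),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```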
The evaluation metric would be the F1 measure [14] (F1 = 2 · precision · recall / (precision + recall)), which combines precision and recall.
When AlphaGo [15] defeated the world champion Lee Sedol, reinforcement learning became a hot topic in the AI domain. Reinforcement learning makes AI compete with AI, and the best set of policies is searched out during the competition. It has been successfully applied to robotics, game playing, fintech, etc. [16]
The reason I picked RL is the intuitive thought in my previous email: for a human, it is easy to identify vendor and amount at a glance, so the images contain all the info we need. Therefore, I think it is not necessary to take an extra step to convert the images to text, which loses information and creates an extra layer to process.
So the idea is to find the areas in the invoice images that represent the vendor and amount, then apply OCR to those areas to get the final text output.
Preprocess input documents to convert them into greyscale images.
Before we feed images into the RL model, we need to set up reward rules for the agent: identifying the correct item earns points, identifying the logo earns some points, and outputting a wrong result costs points. The RL model can then brute-force its way to the best policies, using cues like font size, bold style, etc.
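As a toy illustration of such reward rules (the point values are arbitrary assumptions, not tuned):

```python
def reward(predicted_label, true_label, found_logo):
    r = 0.0
    if found_logo:
        r += 0.5   # identifying the logo earns some points
    if predicted_label == true_label:
        r += 1.0   # correct item: plus points
    else:
        r -= 1.0   # wrong output: minus points
    return r
```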
The RL model output should be the rectangular areas of the vendor/amount. Then apply OCR to convert them to text.
The evaluation metric should be the same as above: the F1 score.
In addition to the above methods, I also thought about generative adversarial networks (GANs) [17], but their tuning process is more of a mystery compared with other models. Moreover, I found a paper [18] that uses a deep CNN model to classify document images based on their structure. In our case, I think we could use a similar approach to identify vendors, but we would still need more info to retrieve the amounts.
BTW, besides machine learning models, I wonder whether we could also improve the OCR to include structure information in the output, like PDF-to-HTML [18] and zonal OCR [19], if the company mainly deals with PDF files. As the PDF format specification [20] is open to the public, we could also analyze PDF files directly, but that would be another story.
Here I will record all the pitfalls, caveats, and tricks from the build process. I hope it provides some help or hints for people who would like to build their dream gaming (or amateur machine learning) machine.
Here is the components list for my build on PCPartPicker. The currency is Canadian dollars, since I’m living in the cold North (but without the Wall, of course).
Parts | Brand | Purchase Price | Comment |
---|---|---|---|
CPU | Intel Core i7-7700K 4.2GHz Quad-Core | $427.95 | From Memory Express. |
GPU | 2 X MSI GeForce GTX 1080 SEA HAWK (DirectX 12) | $732.99 X 2 | From NewEgg. Got $40 total in mail-in rebates |
CPU Cooling | Corsair H100i v2 Hydro Liquid CPU Cooler | $124.99 | From Memory Express |
Motherboard | ASUS PRIME Z270-A LGA 1151 | $199.99 | From NewEgg |
Disk | Kingston SSD A400 480GB | $194.10 | From Memory Express |
Memory | Kingston HyperX Fury 32GB DDR4 2666MHZ 4X8GB | $269.99 | From Memory Express |
Case | Fractal Design Define S ATX Mid Tower Window Case | $89.99 | From Canada Computers |
Power | EVGA SuperNOVA 750W G2L Modular PSU | $139.99 | From Memory Express |
Extra | Asus USB-AC51 Dual-Band Wireless AC600 Wireless Adapter | $39.99 | From Memory Express. Got $10 mail-in rebates |
I spent about a month collecting all the parts. The reason it took so long is that I wanted to get the prices as cheap as possible. To do that, I developed several “techniques”.
The tricks used here are suitable for Canada, or at best North America, because these online services/stores ship only within North America or Canada. However, you can definitely find alternatives in your region/country (check out the countries PCPartPicker supports).
First of all, don’t forget to use ebates. It’s an online rebate website: users get an instant rebate when purchasing through the vendors ebates supports. The rebate rate runs from 1% to 5%. Not much, but better than nothing.
Second, use price matching. Many online stores support price matching, and some even give an extra 10% off (I only found one). Memory Express is one of my favourite computer-component online shops. It offers the lowest prices most of the time; moreover, its price match is better: it matches the price and takes a further 10% off the difference. E.g., a CPU costs $500 at ME and another store sells the same CPU for $400; the ME price after the price match is 400 - (500 - 400) * 10% = $390. The only downside of shopping at ME is that they charge a shipping fee and have no physical stores in eastern Canada.
As mentioned above, the shipping fee can also cost a lot, especially as item weight increases. So if an online store has a physical location, use in-store pickup wisely. I purchased several parts from NCIX and Canada Computers just to save the shipping fee. Even better, both also offer price matching, which makes them more affordable.
Amazon also has great deals occasionally, but due to the currency exchange rate, Amazon Canada always has a higher price regardless of how close the physical distance is. Therefore, using Amazon US is also a good option if you live near the border. Sometimes, even including the import fee, purchasing from Amazon US is still cheaper than buying in Canada. What a life Canadians have!
Finally, wait for holidays or promotional events. Especially on Boxing Day, you can get the best prices. However, quantities are normally limited; one of my friends bought the GPU he loved by lining up at 5 am in front of the store.
It is usually easy to assemble the parts (just follow the manual); however, my build is a little special in that it has three radiators: one for the CPU, two for the GPUs. Not many people use dual GPUs with AIO liquid radiators, so there was not much info online to help me choose the right case.
After googling, I concluded the Define S could be the best fit for my needs. But I was only half right. The Define S is well designed for water-cooling systems and has a beautiful look; I’m definitely satisfied with the choice. However, I wish the case were a little wider, to fit the GPU AIO radiator on the front panel.
As the above pic shows, I had no option to put the radiator into the tiny space beside the PSU. Here is the part where I love the Define S: it supports a bottom fan mount, which saved my day. Otherwise, I would have had to disassemble all the parts, exchange the case for another one, and do it all again.
At the bottom of the front panel, another fan is installed to pull air in and provide positive air pressure. The airflow inside the case is illustrated in the following pic:
After a day and a night, the PC was finally up and running. Here are some conclusions I drew:
Here is the complete shot:
Update (2017-Aug-14): I also bought an M.2 SSD to install Ubuntu, as most machine learning frameworks are built on Linux. Moreover, the official driver for the Asus USB-AC51 is not compatible with Ubuntu 16.04 LTS (I also tried several community drivers; they may fit 14.04 LTS but not 16.04). So be careful when purchasing a USB Wi-Fi adapter; there is a list of adapters that “work out of the box” for Ubuntu 16.04 and above.
Update (2019-Jun-24): Recently I bought another M.2 SSD as a shared disk between Ubuntu and Windows. Due to the recent SSD price drop, the 960GB SSD cost only $150 (tax included), even cheaper and faster than the 500GB SATA SSD I bought two years ago. How fast technology changes! To mount the new SSD in Ubuntu, I followed the top answer in the StackOverflow post.
Although those three options work like a “charm” for a small number of repos, they become chores when maintaining hundreds of forked repos.
Then I came up with the idea of using Travis-CI to sync those repos automatically. The mechanism is simple and straightforward: run a script periodically that updates the forked repos from their source repos. The script could be bash, JavaScript, Python, or any language that can call the git command on Linux.
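The core sync step could look like the following Python sketch (the repo URLs and branch name are placeholders; the script I actually wrote, mentioned below, is in JS):

```python
import subprocess

def sync_fork(fork_url, upstream_url, branch='master'):
    # Clone the fork, pull in the upstream changes, and push them back.
    subprocess.run(['git', 'clone', fork_url, 'repo'], check=True)
    subprocess.run(['git', 'remote', 'add', 'upstream', upstream_url], cwd='repo', check=True)
    subprocess.run(['git', 'fetch', 'upstream'], cwd='repo', check=True)
    subprocess.run(['git', 'merge', f'upstream/{branch}'], cwd='repo', check=True)
    subprocess.run(['git', 'push', 'origin', branch], cwd='repo', check=True)
```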
Based on the above idea, I wrote a JS script to do the job. But this script only solves my particular problems:
If your needs match the above, then simply fork my script and modify the `org` in `.config.yml`.
When you visit my website, you may not see https in the URL, which means you have been directed to a CDN node rather than my VPS server. That doesn’t mean the method doesn’t work. Anyway, let’s begin.
The purpose of this post is to help people avoid the pitfalls that I encountered, and it serves as a note for future reference.
You have many VPS choices, e.g. coupon code `ACTIVATE10`, Vultr [get $50 (expires after 6 months) with coupon code `DOMORE`], Linode [get $20 with coupon code `PodcastInIt20`], etc.

Only two Docker images are used:
The nginx server has to start up before running letsencrypt, because letsencrypt needs to access the server to finish the certificate-generation process.
Create `docker-compose.yml` and paste the following into it.
```yaml
nginx:
  image: bringnow/nginx-letsencrypt
  volumes:
    - ./nginx.conf:/etc/nginx/nginx.conf
    - /etc/letsencrypt:/etc/letsencrypt
    - /var/acme-webroot:/var/acme-webroot
    - /srv/docker/nginx/dhparam:/etc/nginx/dhparam
  ports:
    - "80:80"
    - "443:443"
  net: "host"
  dns_search:
    - "example.com"
```
Modify it accordingly to fit your environment.
Although the nginx container will create DH parameters on initial startup, generating 4096-bit DH parameters is time-consuming (more than an hour on my VPS). Run the following command ahead of time and copy the generated file to the `/srv/docker/nginx/dhparam` folder (as set in docker-compose.yml).
openssl dhparam -out RSA4096.pem -5 4096
In order to complete the letsencrypt challenge, the server has to open port 80. The nginx-letsencrypt image already comes with the config snippets: `snippets/letsencryptauth.conf` and `snippets/sslconfig.conf`.
Here is the sample config file:
```nginx
events {
    worker_connections 1024;
}

http {
    include snippets/letsencryptauth.conf;
    include snippets/sslconfig.conf;

    server {
        listen 443 ssl default_server;
        server_name example.com www.example.com;
        ssl_certificate /etc/letsencrypt/live/www.example.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/www.example.com/privkey.pem;
        add_header Strict-Transport-Security "max-age=31536000; includeSubdomains" always;

        location / {
            # Just return a blank response
            return 200;
        }
    }
}
```
NOTE: Please comment out the two lines starting with ssl_certificate until the certificate has been generated.
Now run the following command to bring Nginx online:
docker-compose up -d
To confirm that the container is running correctly, we can check the logs:
docker-compose logs
If there are error messages, check the Nginx config file and restart the container.
In another folder, create a docker-compose.yml:
```yaml
cli:
  image: bringnow/letsencrypt-manager:latest
  env_file: config.env
  volumes:
    - /etc/letsencrypt:/etc/letsencrypt
    - /var/lib/letsencrypt:/var/lib/letsencrypt
    - /var/acme-webroot:/var/acme-webroot

cron:
  image: bringnow/letsencrypt-manager:latest
  env_file: config.env
  volumes:
    - /etc/letsencrypt:/etc/letsencrypt
    - /var/lib/letsencrypt:/var/lib/letsencrypt
    - /var/acme-webroot:/var/acme-webroot
  command: cron-auto-renewal
  restart: always
```
Modify it accordingly. Make sure the folders `/var/lib/letsencrypt` and `/var/acme-webroot` have been created and exist.
Then create a config.env file in the same folder and input your email:
```
LE_EMAIL=
LE_RSA_KEY_SIZE=4096
```
Finally, we can create our HTTPS certificate. Run the command:
docker-compose run cli add <domain> [alternative domains]
If it fails, check that Nginx is running and the DNS settings are correct.
NOTE: Once the certificate is generated, don’t forget to uncomment the ssl_certificate lines in the Nginx config file and restart it.
Now your website should be up and running with HTTPS. Enjoy.
~ EOF ~