CURATE

Curate Solutions

Curate Solutions scrapes minutes and agenda documents from local government meetings in hundreds of counties across the country each week. The Madison-based startup scans the collected data with a mixture of machine learning/natural language processing models and a human data analysis team to surface information valuable to customers, such as construction companies looking for new projects to bid on. During my time at the company, I worked on building, training, and deploying machine learning models on large-scale textual datasets to extract data valuable to customers.

  • Time period: May 2018 - May 2019
  • Location: Madison, WI

Software Engineer

Built a text vectorization framework using the word2vec distributed representation algorithm

  • W2V
  • SKIP-GRAM

One of the fundamental challenges in applying machine learning to textual data was representing that data in a machine-readable numeric format that preserves context. To do this, each word was converted to a 300-dimensional vector using pre-trained GloVe vectors, which are learned in an unsupervised fashion from global word co-occurrence statistics. The closely related word2vec algorithm, used later to fine-tune these vectors, relies on the continuous Skip-Gram architecture to predict the window of contextually suitable words around a specific pivot word: every word in the corpus is first assigned a random 300-dimensional vector, a window size is chosen, and each pivot word is used to predict its neighboring words within that window. A shallow neural network performs this prediction task and is trained with backpropagation. As a result, words that occur in similar contexts, or close to each other, end up placed close to one another in vector space.
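
To illustrate the Skip-Gram idea described above, here is a minimal Gensim sketch (written against the current Gensim 4 API); the tiny corpus and hyperparameter values are placeholders, not the production configuration.

```python
from gensim.models import Word2Vec

# Each document is a list of tokens; in practice these came from meeting minutes.
corpus = [
    ["county", "board", "approved", "the", "road", "construction", "bid"],
    ["the", "committee", "reviewed", "the", "construction", "project", "agenda"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=300,  # match the 300-dimensional GloVe vectors
    window=5,         # context window around each pivot word
    sg=1,             # sg=1 selects the Skip-Gram architecture
    min_count=1,
    epochs=10,
)

# Words that appear in similar contexts end up close together in vector space.
print(model.wv.most_similar("construction", topn=3))
```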

spaCy provides efficient access to these pre-trained word vectors along with many other language processing functions such as tokenization, lemmatization, and part-of-speech tagging, and was therefore used to convert text into 300-dimensional word vectors in production. The pre-trained 300-dimensional spaCy GloVe vectors were improved upon by further training on a custom corpus of the company's textual data. This was done with Gensim, which implements the word2vec Skip-Gram technique discussed above. Instead of initializing training with random vectors, each word vector in the corpus was initialized with its pre-trained 300-dimensional spaCy GloVe vector. The corpus was then trained with the Skip-Gram technique, which produced word vectors that better represented the contexts of the company's data.
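
A rough sketch of that fine-tuning step is shown below, assuming a spaCy model that ships with 300-dimensional vectors (e.g. en_core_web_lg) and the Gensim 4 API; the sample documents are placeholders for the company corpus.

```python
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_lg")  # ships with pre-trained 300-dimensional vectors

# Placeholder corpus standing in for the company's minutes/agenda text.
raw_docs = [
    "The county board approved the road construction bid.",
    "The planning commission reviewed the construction project agenda.",
]
corpus = [[tok.lower_ for tok in nlp(doc) if tok.is_alpha] for doc in raw_docs]

# Build the vocabulary first, then overwrite the random initial vectors with
# the pre-trained spaCy vectors before continuing Skip-Gram training.
model = Word2Vec(vector_size=300, window=5, sg=1, min_count=1)
model.build_vocab(corpus)
for word in model.wv.index_to_key:
    lexeme = nlp.vocab[word]
    if lexeme.has_vector:
        model.wv.vectors[model.wv.key_to_index[word]] = lexeme.vector

model.train(corpus, total_examples=len(corpus), epochs=5)
```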

Built a distributed Spark-Kubernetes framework to run machine learning algorithms on large datasets

KUBE

Once the input vector creation method was ready, the best scikit-learn classifier with the right hyperparameters had to be chosen. This was done by running a grid search across multiple models with varying hyperparameters. The list of viable models was trained and tested on Kubernetes running Apache Spark. The models and their corresponding hyperparameters were split by the Spark driver pod into a list of jobs distributed among the available executor pods. Each executor pod retrieved labeled data stored in MongoDB, vectorized it using the customized 300-dimensional spaCy vectors, trained a single model on a training set, and generated precision-recall scores on a test set. Once the model with the best precision-recall scores was identified, it was trained on a larger dataset using the Spark-Kubernetes framework: the Spark driver pod split the training data into jobs and each executor pod batch-trained the classifier. AWS S3 buckets were used to store and load the trained models.
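
The sketch below shows one way such a distributed grid search can be expressed with PySpark and scikit-learn; the candidate models, parameter grids, and the synthetic stand-in for the MongoDB-backed vectorized data are illustrative assumptions, not the production setup.

```python
from itertools import product

from pyspark.sql import SparkSession
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

def load_vectorized_data():
    # Stand-in for fetching labeled documents from MongoDB and vectorizing
    # them with the customized 300-dimensional spaCy vectors.
    return make_classification(n_samples=500, n_features=300, random_state=0)

def candidate_jobs():
    """Expand each estimator's parameter grid into one job per combination."""
    grids = [
        (LogisticRegression, {"C": [0.1, 1.0, 10.0]}),
        (RandomForestClassifier, {"n_estimators": [100, 300], "max_depth": [None, 20]}),
    ]
    for estimator, grid in grids:
        keys, values = zip(*grid.items())
        for combo in product(*values):
            yield estimator, dict(zip(keys, combo))

def evaluate(job):
    """Runs on an executor pod: train one candidate model, score precision/recall."""
    estimator, params = job
    X, y = load_vectorized_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    preds = estimator(**params).fit(X_train, y_train).predict(X_test)
    return estimator.__name__, params, precision_score(y_test, preds), recall_score(y_test, preds)

spark = SparkSession.builder.appName("model-grid-search").getOrCreate()
results = spark.sparkContext.parallelize(list(candidate_jobs())).map(evaluate).collect()
best = max(results, key=lambda r: r[2] + r[3])  # highest combined precision + recall
print(best)
```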

Trained and deployed a URL classifier to find valuable minutes and agenda webpages

The first step of the entire data gathering system involved finding the right webpages (URLs) to scrape. A dataset of URLs containing minutes and agenda data versus ones that do not (good vs. bad URLs) was used to build a binary classifier. To identify the parts of a URL that could be used as features, the good and bad URLs were analyzed with basic statistical analysis techniques. In the data scrubbing step, each URL was further preprocessed using spaCy's tokenization, lemmatization, and various regular expressions. The data was then vectorized using pre-trained spaCy 300-dimensional GloVe vectors, customized by further training on the URL corpus with Gensim as discussed above. A grid search was conducted using the distributed Spark-Kubernetes framework to find the best scikit-learn classifier, which was then trained and deployed in production using the same framework.
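
A simplified sketch of the URL preprocessing and vectorization step is shown below; the regular expression, the averaged-vector scheme, the sample URLs, and the choice of logistic regression are illustrative assumptions rather than the production feature set.

```python
import re

import numpy as np
import spacy
from sklearn.linear_model import LogisticRegression

nlp = spacy.load("en_core_web_lg")

def url_to_tokens(url):
    """Split a URL on delimiters and digits, then lemmatize the word-like parts."""
    url = re.sub(r"^https?://", "", url.lower())
    raw = re.split(r"[/\-_.?=&\d]+", url)
    doc = nlp(" ".join(t for t in raw if t))
    return [tok.lemma_ for tok in doc if tok.is_alpha]

def url_to_vector(url):
    """Average the 300-dimensional vectors of the URL's tokens."""
    vectors = [nlp.vocab[t].vector for t in url_to_tokens(url) if nlp.vocab[t].has_vector]
    return np.mean(vectors, axis=0) if vectors else np.zeros(300)

# Tiny illustrative dataset: 1 = contains minutes/agenda content, 0 = does not.
urls = [
    ("https://example-county.gov/board/meeting-minutes-2018.pdf", 1),
    ("https://example-county.gov/agendas/planning-commission", 1),
    ("https://example-county.gov/contact-us", 0),
    ("https://example-county.gov/parks/summer-events", 0),
]
X = np.array([url_to_vector(u) for u, _ in urls])
y = np.array([label for _, label in urls])
clf = LogisticRegression().fit(X, y)
```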

Trained and deployed a classifier to eliminate syntactic junk data

The next task was to eliminate data that was corrupted by the OCR (optical character recognition) extractor. OCR junk consisted of data points (chunks of textual data) that contained junk characters and therefore made no grammatical sense. For the vectorization step, the previously discussed 300-dimensional spaCy vectors could not be used, since those vectors represent words in context and cannot interpret junk text meaningfully. Instead, a custom feature vector was constructed to numerically distinguish junk from non-junk text. Once the data was vectorized, a simple binary classifier could not be trained directly because there was no labeled OCR junk data to train on. To generate a large labeled dataset, the k-means unsupervised learning algorithm was first used to separate junk and non-junk data into large, homogeneous clusters. The clusters were analyzed using TensorFlow's visualization tool TensorBoard, and each cluster was then hand-labeled as junk or not junk using a custom in-house Angular data labeling web application, with MongoDB used to set data point labels in real time. The newly labeled data was vectorized with the custom vectors, and a grid search was conducted using the distributed Spark framework to find the best scikit-learn classifier with optimal hyperparameters. The best model was trained and deployed in production using the same distributed framework.

TENSORBOARD
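
The character-statistics features below are an assumed stand-in for the custom junk vector (the exact production features are not described here); the sketch only shows the overall shape of the approach: hand-crafted features followed by k-means clustering of unlabeled chunks, which were then inspected and hand-labeled before training a supervised classifier.

```python
import string

import numpy as np
from sklearn.cluster import KMeans

def junk_features(text):
    """Ratios of character classes that tend to separate OCR junk from clean text."""
    n = max(len(text), 1)
    alpha = sum(c.isalpha() for c in text)
    digits = sum(c.isdigit() for c in text)
    punct = sum(c in string.punctuation for c in text)
    non_ascii = sum(ord(c) > 127 for c in text)
    words = text.split()
    avg_word_len = np.mean([len(w) for w in words]) if words else 0.0
    return np.array([alpha / n, digits / n, punct / n, non_ascii / n, avg_word_len])

# Illustrative chunks: one clean sentence and one garbled OCR extract.
chunks = [
    "The board approved the resolution to repave Main Street.",
    "%#@q2!! x$Øñ 0f ,,;;; t3xt 9arbl3d $$",
]
X = np.array([junk_features(c) for c in chunks])

# Cluster unlabeled chunks into candidate junk / not-junk groups.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(kmeans.labels_)
```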

Trained and deployed a classifier to identify data points valuable to customers

This step involved building a classifier to rate data points on how valuable they would be to clients. With the junk data removed to prevent bad samples in the dataset, training this classifier was relatively straightforward, since labels already existed for data points previously sent to customers: those data points made up the positive samples, while data points not sent made up the negative samples. The 300-dimensional spaCy vectors customized with Gensim word2vec were used to vectorize the data points; customizing the pre-trained GloVe vectors on the data point corpus greatly improved the contextual representation of each word vector and helped train classification models with superior precision-recall scores. Once the custom vectorization technique was in place, a grid search was conducted using the distributed Spark framework to find the best scikit-learn classifier with optimal hyperparameters, and the best model was trained and deployed in production using the same distributed framework.
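
Since the stack notes below mention that trained models lived in AWS S3, here is a minimal sketch of serializing a scikit-learn model with joblib and uploading it with boto3; the bucket and key names are placeholders, and the synthetic data stands in for the real vectorized training set.

```python
import io

import boto3
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a stand-in model on synthetic 300-dimensional features.
X, y = make_classification(n_samples=200, n_features=300, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize to an in-memory buffer and upload to a placeholder S3 location.
buffer = io.BytesIO()
joblib.dump(model, buffer)
buffer.seek(0)

s3 = boto3.client("s3")
s3.upload_fileobj(buffer, "example-model-bucket", "classifiers/data-point-classifier.joblib")
```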

  • The back-end code for model training and deployment was written in Python 3.

  • Apache Spark and Kubernetes were used to train models on large datasets.

  • MongoDB and AWS S3 were used to store data and machine learning models respectively.

  • Scikit-learn, spaCy, and Gensim were used to build the machine learning framework.

  • Angular 5 was used for the front-end components of the data labeling tools.