Day 1: My plan was to collect data from an online database and code a Naive Bayes Classifier . . . . So I found several data sources, here are some good ones, by the way:
However, I ran into a brick wall: the data I wanted to use, when I found it, was in a HUGE table, with thousands of columns and rows! This cost me a lot of time, first searching for a smaller dataset, and eventually concluding that the real solution was either to find a way to reduce the size of the dataset, or to find a smaller dataset on a different subject than what I was originally looking for. (I wanted to do something original.) Of the two solutions, I wanted to discover how to reduce the size of a dataset, to perhaps two to three columns, only the interesting ones, and perhaps 100 rows. On my own, I figured out that you can open a CSV (comma separated values) file in OpenOffice and filter or delete rows and columns. By that point it was later in the day, I had reached the point of saturation, and I was done for the day. But there is one more thing I want to add. Siraj Raval has a video on YouTube about how to reduce datasets; I would suppose there is also more info on this subject via a YouTube or Google search, but here is his video on it: https://www.youtube.com/watch?v=0xVqLJe9_CY&index=4&list=PL2-dafEMk2A7EEME489DsI468AB0wQsMV By the way, Siraj is an excellent source for tutorials on AI! Spent the rest of the day watching more AI tutorials on YouTube . . . . loading information . . . loading . . . . loading . . . . loading . . .
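Besides OpenOffice, the same column-and-row trimming can be scripted with Python's built-in csv module. Here is a minimal sketch of the idea; the dataset and its column names (id, age, income, junk1, junk2) are made up just for illustration:

```python
import csv
import io

# A pretend "huge" dataset: five columns, 1000 rows (all values fabricated).
raw = "id,age,income,junk1,junk2\n" + "\n".join(
    f"{i},{20 + i % 50},{30000 + i * 7},x,y" for i in range(1000)
)

keep = ["age", "income"]   # only the interesting columns
max_rows = 100             # and only the first 100 rows

# DictReader lets us pull columns out by name instead of position.
reader = csv.DictReader(io.StringIO(raw))
# zip with a range stops the iteration after max_rows rows.
reduced = [[row[c] for c in keep] for _, row in zip(range(max_rows), reader)]

print(len(reduced), reduced[0])
```

For a real file you would pass `open("data.csv", newline="")` instead of the `io.StringIO` stand-in, then write `reduced` back out with `csv.writer`.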
Day 2. I have two servers, with 16 quad-core processors, so it was time to install Python and TensorFlow on them. No problem . . . . ha ha. Because this is cutting-edge stuff, there are a lot of dependencies! And because you can't just install a pre-built version of TensorFlow on a multiprocessor server and expect it to use all of the processors, you have to compile TensorFlow from source! This is not a concern on a single-processor PC. So far it has been a battle so epic that I am going to start setting up the second server tomorrow, while what I have done so far is still fresh in my mind; repeating the process will help me solidify my knowledge of what needs to be done. (The first server is almost there, but it appears I will need to uninstall TensorFlow, then compile.) Watching more of Siraj's videos, and I found a pretty clear explanation of what exactly a neural network is here: Introduction To Tensorflow. So a tensor, in this context, is really just a weight on a connection between nodes, and the weight matrix is built from those tensors. While I have done some coding, I have not been pushing that end of it very hard, because I want to keep loading more information, get a better grasp on how it all works, and learn the tools better, so Siraj's videos help a lot . . . . still loading . . .
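Before (and after) fighting the TensorFlow build, it's worth confirming how many logical processors Python itself can see on the box. This is a stdlib-only sanity check, nothing TensorFlow-specific; `square` is just a made-up stand-in for real per-core work:

```python
import os
from multiprocessing import Pool

def square(n):
    # stand-in for real per-core work
    return n * n

if __name__ == "__main__":
    # how many logical CPUs this machine exposes to Python
    print("logical CPUs visible:", os.cpu_count())
    # Pool() defaults to one worker process per logical CPU,
    # so this map gets spread across all of them
    with Pool() as pool:
        print(pool.map(square, range(8)))
```

If this reports all your cores but TensorFlow still only loads one, that points at the TensorFlow build rather than the OS or Python.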
Day 3. Windows and TensorFlow are like archenemies . . . they seem to hate each other. I have yet to get them to work together in PyCharm, although I did succeed in getting the console to work with TensorFlow. For TensorFlow in PyCharm, I have so far resorted to my Linux machine (Fedora 25, to be exact), and that was easier to work with. Working through Siraj's videos, I am discovering that some of the code is broken, so I have been spending some time cleaning it up to get it to work, which is OK, as it gets me to look at the code and consider what it is doing, so I am learning more about Python and its packages. The packages are not straightforward in some cases. First, if you are required to use pip to install them, it appears that you need to make sure you use the version of pip that matches the version of Python you are using. (The various versions of pip are located in the /usr/bin folder, as far as I know.) And some of the projects you download may be written in Python 2.7, some in Python 3.5.2, and some in Python 3.6! A valuable thing I learned previously is that you can search for packages, and use wildcards. Interestingly, many references use yum to install, but Fedora 25 uses dnf. I will not comment on yum, other than that I don't quite understand why some Linux distributions use it and some use dnf. What I do know is that to do a search with dnf, the syntax is: dnf search keyword . . . and you can use wildcards, or quotation marks. If you are searching for something elusive, like cv2, you can also search using keywords you would expect to find in the package description, if you put them in quotes; that is how "computer vision" found the cv2 package for me. I was able to get Siraj's web scraper up and running in PyCharm, although it seemed to balk at loading Russian text, which was a disappointment. I want to address that issue, but not now, as I will move on to another project so I can progress as fast as possible.
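On the pip-versus-Python mismatch: the most reliable trick I know of is to invoke pip through the interpreter you actually run, via `python -m pip`, so the two can never disagree. A small sketch of the idea (the printed paths will of course differ on your machine):

```python
import subprocess
import sys

# The interpreter currently running this script:
print(sys.executable)

# Running pip as a module of that same interpreter guarantees packages
# land in this Python's site-packages, not some other /usr/bin/pipX.Y's.
result = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True, text=True,
)
print(result.stdout.strip())  # e.g. "pip 9.0.1 from ... (python 3.5)"
```

The same pattern works for installs: `python3.5 -m pip install somepackage` instead of guessing which `pip`, `pip3`, or `pip3.5` binary is the right one.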
Day 4. If you want to get up and running in this field, I have a book recommendation for you: Python for Data Science For Dummies, John Wiley & Sons, 2015. This book has a lot of useful information, including how to get and format data, what to look out for in data (like duplicate records), and a lot of code in simple-to-understand form. Additionally, I want to mention that you can register with and contact Robert Half to gain free access to Books 24x7, which then gives you access to a lot of books for free, so you can study online. The instructions on how to access Skillport are here, but you will likely need to contact Robert Half to get registered, using the email address in the following document: https://www.roberthalf.ca/sites/roberthalf.ca/files/RH-PDFs/logon_instructions_113007_2.pdf At this point, I am reading a lot of information from the aforementioned book, still loading more information . . . .
Day 5. Data! My, my . . . you'll be wondering what you got yourself into when you start looking at the data. Right now, I am working on a program that can scrape data out of a PDF file, and it has proven to be quite difficult to implement, although certainly not impossible. The difficult part has been tweaking the data to land it in either a comma separated value (CSV) file or an Excel file. The problem with CSV files seems to be a lack of standardization, and that really slowed things down when it came to learning about them and how to create them. I can't yet say it has been a total success, but progress has definitely been made. Compared to CSV files, Excel files look much easier to deal with, and there is a Python package you can download to help you out with that, along with instructions . . . XlsxWriter . . . While I am working on the data end of things, I am also learning more linear algebra and taking an online Machine Learning course . . . . Coursera . . . from Stanford University, which seems quite interesting, and I would recommend it so far. In conclusion, clearly, the biggest challenge appears to be dealing with data. For that reason, having a working Python program that can import data and create both CSV files and (preferably) Excel files will be one of my priorities. Without a good way to wrangle data, not much is likely to happen.
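On the CSV standardization headache: one thing that helped me was realizing that Python's built-in csv module lets you pin down the dialect (delimiter, quoting, line endings) explicitly, so at least your own files are consistent. A small self-contained sketch, with made-up sample rows; note the field that contains a comma, which is exactly where ad-hoc CSV writing goes wrong:

```python
import csv
import io

rows = [["name", "score"], ["Ivan, Jr.", 42], ["Anna", 37]]

# Writing with explicit dialect settings pins down the "standard" you are
# using, which sidesteps most CSV inconsistencies.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=",", quoting=csv.QUOTE_MINIMAL,
                    lineterminator="\n")
writer.writerows(rows)
text = buf.getvalue()
print(text)

# Reading back with the csv module round-trips cleanly, even for the
# field containing the delimiter ("Ivan, Jr." stays one field).
back = list(csv.reader(io.StringIO(text)))
print(back)
```

One thing to notice: everything comes back as strings, so numeric columns need converting after the read. For real files, swap the `io.StringIO` buffers for `open(..., newline="")`.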
Day 6. Square Error Function. I decided to do a tutorial on the square error function, because it took me a long time to figure it out, and when I did, I realized that the reason I had so much trouble with it was that the tutorials were just BAD. First, they called the function the "Cost Function", when the function really measures error, and then, in the example they used, they did the presentation on housing prices of all things, which really confused the issue. So if you need help with understanding the square error function, go here and watch the video I posted; it should help you out, and I have the Python code for it posted below the video. Square Error Function
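For reference, here is a minimal stand-alone version of the idea (a sketch, not the exact code from my post): the cost J(theta0, theta1) = (1/2m) * sum of (h(x_i) - y_i)^2 for the linear hypothesis h(x) = theta0 + theta1 * x, as presented in the Coursera course. The sample points are made up.

```python
def squared_error_cost(theta0, theta1, xs, ys):
    """Halved average squared error of the line theta0 + theta1*x against
    the points (xs, ys) -- the 'cost' is really an error measure."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
print(squared_error_cost(0.0, 1.0, xs, ys))  # the line y = x fits perfectly, cost 0.0
print(squared_error_cost(0.0, 0.5, xs, ys))  # residuals 0.5, 1.0, 1.5, so cost > 0
```

The 1/2 factor is only there so the derivative comes out cleaner during gradient descent; it doesn't change which theta values minimize the cost.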