Glossary

While every effort has been made to communicate clearly, this report uses some domain-specific terms. Common terms are defined below as they are used in this paper.

Word | Definition
Dry Bulb Temperature | Air temperature in Kelvin.
Feature | An attribute of the dataset, in the database-theory sense of the word, commonly seen in the form of a column. Some features are inherent in the data; others must be computed (these are often called labels).
Foreign Key (FK) | An attribute in a table that ties a record uniquely to a record in another table in a one-to-one or one-to-many relationship.
Mean Absolute Error (MAE) | A performance statistic computed as $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $y_i$ is an observed value, $\hat{y}_i$ is the corresponding predicted value, and $n$ is the number of predictions.
Primary Key (PK) | An attribute in a table that uniquely identifies a record. Commonly implemented as an unsigned integer for low storage and retrieval costs.
Relation | In database theory, a set of tuples that share a set of labeled attributes. Commonly viewed as a table.
Tuple | In database theory, a set of related facts, commonly viewed as a row. Also referred to as a record, which is an atomic piece of information.
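
To illustrate the Mean Absolute Error entry in the glossary, the following minimal sketch computes MAE as the average absolute difference between observed and predicted values. The function name and the sample values are hypothetical and are not taken from the project's code.

```python
def mean_absolute_error(observed, predicted):
    """Return the mean of |observed_i - predicted_i| over paired values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one pair of values is required")
    total = sum(abs(o - p) for o, p in zip(observed, predicted))
    return total / len(observed)


# Hypothetical observed vs. predicted values; the absolute errors are 1, 0 and 2.
example_mae = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
```

Because every error contributes in proportion to its size, a single very large miss raises MAE less sharply than it would raise a squared-error statistic.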

Database Design

1st Iteration Intra Model Performance Table

subset model operation projection mae stddev iqr
100 linreg average 24 4.631831543 7.267661744 7.376935522
1000 linreg average 24 2.143428653 3.899018518 3.351450163
10000 linreg average 24 2.030554289 2.657135622 3.2303772
100000 linreg average 24 1.994362735 2.637388048 3.142809138
100 linreg average 168 5.913616302 8.252230627 9.359491752
1000 linreg average 168 2.572654771 5.303453545 4.023931935
10000 linreg average 168 2.338230608 3.025636025 3.790127042
100000 linreg average 168 2.32681975 3.00391192 3.781893953
100 linreg max 24 50.64446932 67.76835863 77.90457765
1000 linreg max 24 14.13080713 25.57563651 18.26589205
10000 linreg max 24 13.32337275 24.25696593 17.37206546
100000 linreg max 24 12.58745831 23.29352607 16.12725384
100 linreg max 168 55.96865409 78.26706052 80.61073331
1000 linreg max 168 20.35751872 52.43354441 26.1918654
10000 linreg max 168 17.64412576 27.90933131 23.93082071
100000 linreg max 168 17.5438013 28.00114329 23.63088294
100 linreg min 24 8.94821248 11.8499813 14.48369101
1000 linreg min 24 3.630860988 12.80957702 5.53195469
10000 linreg min 24 3.317300065 4.279692232 5.36504644
100000 linreg min 24 3.286721203 4.247615987 5.312396743
100 linreg min 168 10.40786872 13.34390973 16.92560316
1000 linreg min 168 826951148.6 33334747900 6.951128047
10000 linreg min 168 4.040246507 5.175617132 6.678711019
100000 linreg min 168 4.009690602 5.121298401 6.603335997
100 rf average 24 2.760442915 3.531526854 4.399495536
1000 rf average 24 2.232302384 2.914016873 3.564670586
10000 rf average 24 2.11029389 2.801957525 3.301544268
100 rf average 168 2.735508517 3.517131784 4.450532264
1000 rf average 168 2.480721127 3.187435833 4.0349717
10000 rf average 168 2.480108425 3.205892087 3.992762653
100 rf max 24 14.22043586 27.76593003 13.11885
1000 rf max 24 9.674497262 20.62727379 9.433325
10000 rf max 24 8.130446313 16.22976069 8.78005
100 rf max 168 27.00145684 43.0775496 33.286
1000 rf max 168 14.86749773 27.88180684 15.347525
10000 rf max 168 13.05649929 23.04437046 14.00775
100 rf min 24 3.633476838 4.615422392 5.9898
1000 rf min 24 3.3779545 4.376241856 5.448625
10000 rf min 24 3.279107259 4.262930004 5.274225
100 rf min 168 4.70533 6.064141173 7.57625
1000 rf min 168 4.16066327 5.368543541 6.71275
10000 rf min 168 4.05572153 5.271970466 6.51225
100 ridge average 24 3.098310021 4.147110535 5.160014631
1000 ridge average 24 2.15449149 4.769758985 3.31390058
10000 ridge average 24 2.234714442 2.986644779 3.605391636
100000 ridge average 24 2.007740807 2.628262112 3.19764219
100 ridge average 168 3.079077471 4.269716253 4.80456551
1000 ridge average 168 2.418976865 3.427145717 3.872898081
10000 ridge average 168 2.356801016 3.038834447 3.809819451
100000 ridge average 168 2.336227296 3.011133991 3.800794591
100 ridge max 24 24.47592318 38.14257757 35.00257022
1000 ridge max 24 12.5754371 37.926706 14.65336718
10000 ridge max 24 14.36215771 26.3561325 17.8562204
100000 ridge max 24 12.50123122 23.20131167 15.97997729
100 ridge max 168 34.54528896 49.97080019 50.25217266
1000 ridge max 168 19.28877169 49.93754326 25.24681336
10000 ridge max 168 17.7943851 28.01418666 24.21251711
100000 ridge max 168 17.47328446 27.87214109 23.60672248
100 ridge min 24 4.106146543 5.199621811 6.824983235
1000 ridge min 24 3.488770104 4.476913878 5.636792866
10000 ridge min 24 3.348207229 4.348413345 5.402443727
100000 ridge min 24 3.275361169 4.227956557 5.314580584
100 ridge min 168 4.949920094 6.284755109 8.321237635
1000 ridge min 168 4.206105848 5.48801513 6.8713299
10000 ridge min 168 4.039781869 5.154984591 6.67984543
100000 ridge min 168 4.018035449 5.133184139 6.618504286
100 sgd average 24 3.056592705 3.890190526 4.913693062
1000 sgd average 24 2.144314574 2.995426857 3.373765784
10000 sgd average 24 1.969138831 2.859141633 3.028862511
100000 sgd average 24 2.016884972 2.645805447 3.206370431
100 sgd average 168 3.09022719 4.258520113 4.943657472
1000 sgd average 168 2.517679952 3.482076453 4.072744993
10000 sgd average 168 2.334856811 3.119236849 3.707557699
100000 sgd average 168 2.337246015 3.011933936 3.795810279
100 sgd max 24 18.23503843 34.29850597 23.89585424
1000 sgd max 24 13.47896046 24.62646673 17.30176546
10000 sgd max 24 10.67221511 29.67377625 10.22257095
100 sgd max 168 35.23448871 47.5295067 54.83831383
1000 sgd max 168 19.08934225 30.49607477 25.80554427
10000 sgd max 168 972.385479 1355.847489 1543.708972
100 sgd min 24 3.924853796 5.005220747 6.527904439
1000 sgd min 24 3.638021589 4.676746577 6.033199488
10000 sgd min 24 3.28900021 4.656878363 5.186712738
100 sgd min 168 4.58477292 5.894275553 7.51592956
1000 sgd min 168 4.503050078 5.707805696 7.477669474
10000 sgd min 168 4.044766103 5.214737283 6.639052405
100000 sgd min 168 4.012253118 5.126297319 6.596457717
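
One possible way to read the table programmatically is sketched below: it parses a few rows copied from the table and reports, for each model, operation and projection, the subset size with the lowest MAE. The helper names (`parse_rows`, `best_subset_per_model`) are hypothetical, and only an excerpt of the rows is included for brevity.

```python
# Column names follow the header line of the table.
COLUMNS = ["subset", "model", "operation", "projection", "mae", "stddev", "iqr"]

# Excerpt of rows copied from the table (linreg and sgd, average, projection 24).
ROWS = """\
100 linreg average 24 4.631831543 7.267661744 7.376935522
1000 linreg average 24 2.143428653 3.899018518 3.351450163
10000 linreg average 24 2.030554289 2.657135622 3.2303772
100000 linreg average 24 1.994362735 2.637388048 3.142809138
100 sgd average 24 3.056592705 3.890190526 4.913693062
1000 sgd average 24 2.144314574 2.995426857 3.373765784
10000 sgd average 24 1.969138831 2.859141633 3.028862511
100000 sgd average 24 2.016884972 2.645805447 3.206370431
"""


def parse_rows(text):
    """Turn whitespace-delimited table rows into a list of typed dicts."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = dict(zip(COLUMNS, line.split()))
        record["subset"] = int(record["subset"])
        record["projection"] = int(record["projection"])
        for name in ("mae", "stddev", "iqr"):
            record[name] = float(record[name])
        records.append(record)
    return records


def best_subset_per_model(records):
    """Map (model, operation, projection) to the subset size with the lowest MAE."""
    best = {}
    for record in records:
        key = (record["model"], record["operation"], record["projection"])
        if key not in best or record["mae"] < best[key]["mae"]:
            best[key] = record
    return {key: rec["subset"] for key, rec in best.items()}


best = best_subset_per_model(parse_rows(ROWS))
for key, subset in sorted(best.items()):
    print(key, "lowest MAE at subset", subset)
```

Selecting on MAE alone ignores the stddev and iqr columns, so rows with extreme spread may still deserve a separate look.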

Code Repository

The code, along with example files, can be found on GitHub at the following public repository:

https://github.com/danieledwardknudsen/weather

Summary of Tools Used

Tool | Brief Description | Summary and Use
Python3 (Python) | Programming Language | Coordination and analytics code was written in Python 3.6. Python was chosen because it is intuitive and allows rapid scripting, and because it has mature data science and ML libraries.
SQL Server Express (MSSQL) | Database | MSSQL is a database management system (DBMS) used to store and manipulate bulk data.
Transact-SQL (T-SQL) | Programming Language | Code that directly manipulated the database was written in T-SQL, a dialect of SQL.
Pandas | Python Library | Pandas is an open-source data science library that allows processing of large datasets.
Scikit-Learn (sklearn) | Python Library | Sklearn is an open-source Python library that allows customization and training of dozens of types of regressors and other models.
Jupyter Notebook | Scripting Tool | Jupyter Notebook is a tool that allows rapid script development and data manipulation from within a web browser, and was used for data exploration and analysis.
Microsoft Excel | Spreadsheet Tool | Excel was used for flat-file spreadsheets and data visualization.

Table 1 – Summary of Software Tools Used. Python tools were used for analytics, and SQL tools were used for data movement and manipulation.
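
As a rough illustration of how Scikit-Learn regressors can be trained and scored, the sketch below fits four estimators on synthetic data and records each one's MAE. The mapping from the model labels in the performance table (linreg, ridge, sgd, rf) to these estimator classes is an assumption, as are the hyperparameters and the data; none of it is taken from the project's repository.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only; this is not the weather data from the report.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed mapping from the table's model labels to sklearn estimators.
models = {
    "linreg": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "sgd": SGDRegressor(random_state=0),
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in results.items():
    print(f"{name}: MAE = {mae:.4f}")
```

All four estimators share the same `fit` and `predict` interface, which is what makes it practical to swap models inside a single loop.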