COMP 330 Assignment #5
1 Description
In this assignment, you will be implementing a regularized logistic regression model to classify text documents. The implementation will be in Python, on top of Spark. To handle the large data set that we will be giving you, it is necessary to use Amazon AWS.
You will be asked to perform three subtasks: (1) data preparation, (2) learning (which will be done via
gradient descent) and (3) evaluation of the learned model.
Note: It is important to complete HW 5 and Lab 5 before you really get going on this assignment. HW
5 will give you an opportunity to try out gradient descent for learning a model, and Lab 5 will give you
some experience with writing efficient NumPy code, both of which will be important for making your A5
experience less challenging!
2 Data
You will be dealing with a data set that consists of around 170,000 text documents and a test/evaluation
data set that consists of 18,700 text documents. All but around 6,000 of these text documents are Wikipedia
pages; the remaining documents are descriptions of Australian court cases and rulings. At the highest level,
your task is to build a classifier that can automatically figure out whether a text document is an Australian
court case.
We have prepared three data sets for your use.
1. The Training Data Set (1.9 GB of text). This is the set you will use to train your logistic regression
model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
2. The Testing Data Set (200 MB of text). This is the set you will use to evaluate your model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
3. The Small Data Set (37.5 MB of text). This is for you to use for training and testing of your model on
a smaller data set:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/SmallTrainingDataOneLinePerDoc.txt
Some Data Details to Be Aware Of. You should download and look at the SmallTrainingData.txt
file before you begin. You’ll see that the contents are sort of a pseudo-XML, where each text document
begins with a <doc id = ... > tag, and ends with </doc>. All documents are contained on a single
line of text.
Note that all of the Australian legal cases begin with something like <doc id = "AU1222" ...>;
that is, the doc id for an Australian legal case always starts with AU. You will be trying to figure out whether a
document is an Australian legal case by looking only at the contents of the document.
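As a concrete illustration of the format, here is a minimal Python sketch of how one might pull apart one such line; the regular expressions and the helper name parse_line are assumptions about the file format, not part of the assignment:

    import re

    def parse_line(line):
        # Pull the doc id out of the opening <doc id = "..."> tag (quoting style assumed).
        match = re.search(r"id\s*=\s*\W*(\w+)", line)
        doc_id = match.group(1) if match else ""
        # Keep only the document body, without the <doc ...> and </doc> wrappers.
        text = re.sub(r"^<doc[^>]*>", "", line).replace("</doc>", "").strip()
        # Label: 1 if this is an Australian legal case (doc id starts with AU), else 0.
        label = 1 if doc_id.startswith("AU") else 0
        return doc_id, text, label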
3 The Tasks
There are three separate tasks that you need to complete to finish the assignment. As usual, it makes
sense to implement these and run them on the small data set before moving to the larger one.
3.1 Task 1
First, you need to write Spark code that builds a dictionary that includes the 20,000 most frequent words
in the training corpus. This dictionary is essentially an RDD that has the word as the key, and the relative
frequency position of the word as the value. For example, the value is zero for the most frequent word, and
19,999 for the least frequent word in the dictionary.
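A minimal PySpark sketch of the sort of pipeline this task needs (the tokenization regex, the variable names, and the SparkContext setup are all assumptions; adapt your own A4-style solution rather than treating this as the required approach):

    import re
    from pyspark import SparkContext

    sc = SparkContext(appName="A5-Task1")   # assumed setup; on EMR a context may already exist

    corpus = sc.textFile("s3://chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt")

    # Tokenize each document into lowercase words (this regex is one reasonable choice).
    words = corpus.flatMap(lambda line: re.findall(r"[a-z]+", line.lower()))

    # Count word occurrences and keep the 20,000 most frequent.
    top_words = (words.map(lambda w: (w, 1))
                      .reduceByKey(lambda a, b: a + b)
                      .top(20000, key=lambda pair: pair[1]))

    # Dictionary RDD: word -> frequency position (0 = most frequent, 19,999 = least frequent kept).
    dictionary = sc.parallelize(top_words).map(lambda pair: pair[0]).zipWithIndex()

    def frequency_position(word):
        hit = dictionary.lookup(word)    # empty list if the word is not in the top 20,000
        return hit[0] if hit else -1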
To get credit for this task, give us the frequency position of the words “applicant”, “and”, “attack”,
“protein”, and “car”. These should be values from 0 to 19,999, or -1 if the word is not in the dictionary
because it is not in the top 20,000.
Note that accomplishing this will require you to use a variant of your A4 solution. If you do not trust
your A4 solution and would like mine, you can post a private request on Piazza.
3.2 Task 2
Next, you will convert each of the documents in the training set to a TF-IDF vector. You will then use
a gradient descent algorithm to learn a logistic regression model that can decide whether a document is
describing an Australian court case or not. Your model should use l2 regularization; you can experiment
a bit to determine the parameter controlling the extent of the regularization. We have enough
data that you may find the regularization is not too important (that is, you may get good
results with a very small weight given to the regularization constant).
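A minimal NumPy sketch of how a tokenized document might be turned into a TF-IDF vector given the Task 1 dictionary; the function names, the plain Python dict word_to_pos, and the particular IDF formula are assumptions, since several TF-IDF variants would be acceptable here:

    import numpy as np

    NUM_WORDS = 20000

    def tf_vector(doc_words, word_to_pos):
        # Term frequency: count dictionary words in this document, normalized by document length.
        counts = np.zeros(NUM_WORDS)
        for w in doc_words:
            pos = word_to_pos.get(w, -1)
            if pos >= 0:
                counts[pos] += 1.0
        total = counts.sum()
        return counts / total if total > 0 else counts

    def tfidf_vector(doc_words, word_to_pos, idf):
        # idf is a length-20,000 array, e.g. idf[j] = log(num_docs / num_docs_containing_word_j),
        # computed once over the training corpus (one common formulation; yours may differ).
        return tf_vector(doc_words, word_to_pos) * idf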
I am going to ask that you not just look up the gradient descent algorithm on the Internet and implement
it. Start with the LLH function from class, and then derive your own gradient descent algorithm. We can
help with this if you get stuck.
At the end of each iteration, compute the LLH of your model. You should run your gradient descent
until the change in LLH across iterations is very small.
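For reference, one standard way of writing the l2-regularized LLH for this model is

$$\mathrm{LLH}(\mathbf{r}) \;=\; \sum_{i=1}^{n}\Big[\, y_i\,\mathbf{x}_i^{\top}\mathbf{r} \;-\; \log\big(1 + e^{\mathbf{x}_i^{\top}\mathbf{r}}\big) \Big] \;-\; \lambda\,\lVert \mathbf{r}\rVert_2^{2}, \qquad y_i \in \{0,1\},$$

though the version from class may differ in sign conventions or in how the penalty is weighted, and the gradient derivation itself is still yours to do. A reasonable stopping rule is to halt when $|\mathrm{LLH}^{(t)} - \mathrm{LLH}^{(t-1)}|$ falls below a small tolerance.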
Once you have completed this task, you will get credit by (a) writing up your gradient update formula,
and (b) giving us the fifty words with the largest regression coefficients. That is, the fifty words that are
most strongly associated with an Australian court case.
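As a small illustration of part (b), one way to pull out the fifty words with the largest coefficients (the names coefs and pos_to_word are hypothetical; they stand in for your learned weight vector and the inverse of your Task 1 dictionary):

    import numpy as np

    def top_coefficient_words(coefs, pos_to_word, k=50):
        # Indices of the k largest regression coefficients, in descending order.
        top_positions = np.argsort(coefs)[::-1][:k]
        return [pos_to_word[int(i)] for i in top_positions]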
3.3 Task 3
Now that you have trained your model, it is time to evaluate it. Here, you will use your model to predict
whether or not each of the testing points corresponds to an Australian court case. To get credit for this task,
you need to compute for us the F1 score obtained by your classifier; we will use the F1 score obtained as
one of the ways in which we grade your Task 3 submission.
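A minimal sketch of the F1 computation, assuming y_true and y_pred are NumPy arrays of 0/1 labels with 1 meaning “Australian court case” (the function name and argument conventions are illustrative):

    import numpy as np

    def f1_score(y_true, y_pred):
        # Treat label 1 (Australian court case) as the positive class.
        tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
        fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
        fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)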
Also, I am going to ask you to actually look at the text of three of the false positives that your model
produced (that is, Wikipedia articles that your model thought were Australian court cases). Write a paragraph
describing why you think your model was fooled. Were the bad documents about Australia? The
legal system?
If you don’t have three false positives, just use the ones that you had (if any).
4 Important Considerations
Some notes regarding training and implementation. As you implement and evaluate your gradient descent algorithm, here are a few things to keep in mind.
1. To get good accuracy, you will need to center and normalize your data. That is, transform your data so
that the mean of each dimension is zero and the standard deviation is one: subtract the mean
vector from each data point, and then divide the result by the vector of standard deviations computed
over the data set (see the sketch after this list).
2. When classifying new data, a data point whose dot product with the vector of regression coefficients is positive
is a “yes”, and a negative one is a “no” (see slide 15 in the GLM lecture). You will be trying to maximize the
F1 of your classifier, and you can often increase the F1 by choosing a cutoff between “yes”
and “no” other than zero. Another thing that you can do is to add another dimension whose value is
one in each data point (we discussed this in class). The learning process will then choose a regression
coefficient for this special dimension that tends to balance the “yes” and “no” nicely at a cutoff of zero.
However, some students in the past have reported that this can increase the training time.
3. Students sometimes face overflow problems, both when computing the LLH and when computing the
gradient update. Some things that you can do to avoid this are: (1) use np.exp(), which seems to
be quite robust, and (2) transform your data so that the standard deviation is smaller than one; if you
have problems with a standard deviation of one, you might try 10^-2 or even 10^-5. You may need to
experiment a bit. Such are the wonderful aspects of implementing data science algorithms in the real
world!
4. If you find that your training takes more than a few hours to run to convergence on the largest data set,
it likely means that you are doing something inherently slow that you can speed up by looking
at your code carefully. One thing: there is no problem with first training your model on a small sample
of the large data set (say, 10% of the documents), then using the result as an initialization and continuing
training on the full data set. This can speed up the process of reaching convergence.
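Here is the sketch referred to in note 1, assuming the training data has been collected into a NumPy matrix X with one row per document; the target_std argument is a hypothetical knob for the rescaling trick mentioned in note 3:

    import numpy as np

    def center_and_scale(X, target_std=1.0):
        # Shift each dimension to mean zero, then rescale it to the requested standard deviation.
        # A target_std smaller than one (say 1e-2) can help avoid overflow, per note 3 above.
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        std[std == 0] = 1.0      # leave constant dimensions alone rather than dividing by zero
        return (X - mean) / std * target_std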
Big data, small data, and grading. The first two tasks are worth three points; the last is worth four points. Since it
can be challenging to run everything on a large data set, we’ll offer you a small data option. If you train your
data on TestingDataOneLinePerDoc.txt, and then test your data on SmallTrainingDataOneLinePerDoc.txt,
we’ll take off 0.5 points on Task 2 and 0.5 points on Task 3. This means you can still get an A, and
you don’t have to deal with the big data set. For the possibility of getting full credit, you can train
your data on the quite large TrainingDataOneLinePerDoc.txt data set, and then test your data
on TestingDataOneLinePerDoc.txt.
4.1 Machines to Use
If you decide to try for full credit on the big data set, you will need to run your Spark jobs on three to five
machines as workers, each having around 8 cores. If you are not trying for the full credit, you can likely
get away with running on a smaller cluster. Remember, the costs WILL ADD UP QUICKLY IF YOU
FORGET TO SHUT OFF YOUR MACHINES. Be very careful, and shut down your cluster as soon as
you are done working. You can always create a new one easily when you begin your work again.
4.2 Turnin
Create a single document that has results for all three tasks. Make sure to be very clear about whether you
tried the big data or small data option. Turn in this document as well as all of your code. Please zip up all
of your code and your document (use .gz or .zip only, please!), or else attach each piece of code as well as
your document to your submission individually. Do NOT turn in anything other than your Python code and
your document.