For this data mining assignment, the data preparation part works as follows: based on a configuration file, open the corresponding web pages and save them; the contents of those files are then analyzed: text extraction, matrix transformation, clustering.
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CsdnBlogMining {

    public static void main(String[] args) {
        final int THREAD_COUNT = 5;
        // String category = null;
        // Read base URL, output directory and the blog list from the classpath.
        InputStream inputStream =
                CsdnBlogMining.class.getClassLoader().getResourceAsStream("config.properties");
        Properties p = new Properties();
        try {
            p.load(inputStream);
            String baseUrl = p.getProperty("baseUrl");
            String fileDir = p.getProperty("fileDir");
            String searchBlogs = p.getProperty("searchBlogs");
            String[] blogs = new String[0];
            // The original tested searchBlogs != "", which compares references in
            // Java; isEmpty() is the correct emptiness check, plus a null guard.
            if (searchBlogs != null && !searchBlogs.isEmpty()) {
                blogs = searchBlogs.split(";");
            }
            // Download all pages concurrently on a fixed-size thread pool.
            ExecutorService pool = Executors.newFixedThreadPool(THREAD_COUNT);
            for (String s : blogs) {
                pool.submit(new SaveWeb(baseUrl + s, fileDir + "/" + s + ".html"));
            }
            pool.shutdown();
            // category = new String(p.getProperty("category").getBytes("ISO-8859-1"), "UTF-8");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
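The code expects a config.properties file on the classpath with at least the three keys read above. A minimal sketch of such a file, with made-up values (real blog IDs and paths would go here):

# Hypothetical config.properties; the keys match what main() reads.
baseUrl=https://blog.csdn.net/
searchBlogs=user_a;user_b;user_c
fileDir=/tmp/blogs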
The module that opens a web page and saves it:
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class SaveWeb implements Runnable {
    private final String url;
    private final String filename;

    public SaveWeb(String url, String filename) {
        this.url = url;
        this.filename = filename;
    }

    @Override
    public void run() {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        // Present a desktop browser User-Agent so the server serves the page.
        httpGet.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        try {
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            // try-with-resources closes the stream even if writing fails
            // (the original closed it outside any finally block).
            try (BufferedOutputStream outputStream =
                         new BufferedOutputStream(new FileOutputStream(filename))) {
                if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK
                        && entity != null) {
                    String res = EntityUtils.toString(entity, "UTF-8");
                    outputStream.write(res.getBytes("UTF-8"));
                    outputStream.flush();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
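One design note: SaveWeb decodes the whole response into a String and re-encodes it before writing, which holds the full page in memory and converts the charset twice. A smaller alternative (my suggestion, not the original code) streams the entity's raw bytes straight to disk with HttpEntity.writeTo:

if (entity != null) {
    try (BufferedOutputStream out =
            new BufferedOutputStream(new FileOutputStream(filename))) {
        entity.writeTo(out);  // copy the response body byte-for-byte
    }
}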
Postscript: the assignment is finished, but it turned out to have almost nothing to do with the content above, and I originally meant to delete it all. On second thought, none of it is wrong; it simply went unused, so I'll leave it here.
In the end, the Java code just loops over a list of addresses and fetches them concurrently, saving each page to a file.
The mining work itself was done in R, covering fetching the pages, extracting the body text, word segmentation, clustering, and outputting the results. R is wonderfully economical: a few dozen lines of code handled all of it. The final clustering results, however, were not ideal: clusters computed over the full text come out quite inaccurate, so improvements have to be considered.
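The R script is not included in the post; purely as an illustration of the same pipeline in Java, here is a minimal, self-contained sketch (all names are hypothetical, and naive whitespace splitting stands in for real Chinese word segmentation) that builds a term-frequency matrix and clusters the rows with plain k-means:

import java.util.*;

// Illustrative sketch of the pipeline the post describes doing in R:
// documents -> term-frequency matrix -> k-means clustering.
public class MiniCluster {

    public static void main(String[] args) {
        List<String[]> docs = new ArrayList<>();
        for (String text : new String[]{
                "java thread pool http crawler",
                "java http client download page",
                "cluster matrix term frequency"}) {
            docs.add(text.split("\\s+"));  // stand-in for word segmentation
        }

        // Shared vocabulary: term -> column index.
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String[] d : docs)
            for (String t : d) vocab.putIfAbsent(t, vocab.size());

        // Term-frequency matrix: one row per document.
        double[][] m = new double[docs.size()][vocab.size()];
        for (int i = 0; i < docs.size(); i++)
            for (String t : docs.get(i)) m[i][vocab.get(t)]++;

        System.out.println(Arrays.toString(kMeans(m, 2, 20)));
    }

    // Plain k-means: first k rows seed the centroids, then
    // alternate assignment and centroid update for a fixed number of rounds.
    static int[] kMeans(double[][] m, int k, int iters) {
        double[][] c = new double[k][];
        for (int j = 0; j < k; j++) c[j] = m[j].clone();
        int[] label = new int[m.length];
        for (int it = 0; it < iters; it++) {
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int i = 0; i < m.length; i++) {
                double best = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double d = 0;
                    for (int x = 0; x < m[i].length; x++) {
                        double diff = m[i][x] - c[j][x];
                        d += diff * diff;
                    }
                    if (d < best) { best = d; label[i] = j; }
                }
            }
            // Update step: each centroid becomes the mean of its members.
            double[][] sum = new double[k][m[0].length];
            int[] n = new int[k];
            for (int i = 0; i < m.length; i++) {
                n[label[i]]++;
                for (int x = 0; x < m[i].length; x++) sum[label[i]][x] += m[i][x];
            }
            for (int j = 0; j < k; j++)
                if (n[j] > 0)
                    for (int x = 0; x < sum[j].length; x++) c[j][x] = sum[j][x] / n[j];
        }
        return label;
    }
}

With raw full-text term frequencies like these, long documents and common words dominate the distances, which matches the post's observation that full-text clusters come out inaccurate; TF-IDF weighting or feature selection would be the usual improvements.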
Copyright notice: this is an original article by the blog author; please keep a link to the original when reposting.